Must be a US Citizen who can work at the client's site.
The National Renewable Energy Laboratory (NREL) is the U.S. Department of Energy's primary national laboratory for renewable energy and energy efficiency research and development. NREL develops renewable energy and energy efficiency technologies and practices, advances related science and engineering, and transfers knowledge and innovations to address the nation's energy and environmental goals.
The current NREL HPC system is the largest HPC system in the world dedicated to advancing renewable energy and energy efficiency technologies. Research and development projects that are funded by EERE Offices or aligned with the EERE mission are eligible to use these resources. To support the continuation and expansion of this type of research, NREL aims to procure a highly productive, robust system that will handle a diverse workload and can accommodate new workloads associated with energy efficiency and renewable energy research and development. To meet these needs, NREL intends to procure a computational resource that comprises two different types of compute nodes, data analysis and visualization nodes, data transfer nodes, and a storage system that provides shared file systems, all connected by a single high-performance interconnect fabric.
The on-site Senior HPC System Administrator will primarily be responsible for full-spectrum storage system management and administration for the HPC systems, including a file system migration from Lustre to Spectrum Scale. The System Administrator is expected to carry out projects of moderate to large size and complexity with minimal supervision, to have expertise with large-scale integrated HPC computing systems, and to apply a wide range of problem-solving and resource management techniques to work of moderate to complex difficulty.
Duties and Responsibilities
•Responsible for implementing the migration strategy from Lustre to IBM Spectrum Scale and for related support of storage and backup infrastructure.
•Configure and maintain storage systems for use in an HPC environment.
•Install, configure, and maintain the parallel file system (IBM Spectrum Scale, also known as GPFS) and its interaction with tiered-performance NFS.
•Monitor usage of storage resources and make recommendations to maintain service standards.
•Interact with integrated systems in an HPC environment, including job schedulers, InfiniBand fabrics, etc.
•Make effective use of configuration, monitoring, and notification tools.
•Provide excellent customer technical support, including system tuning, troubleshooting, and maintenance, and/or coordination of maintenance activities with related system hardware or software vendors.
•Create and maintain clear and effective technical documentation and provide status updates to leadership as required.
•Interact with vendors, assessing products and making purchasing recommendations.
•Performs related duties as required.
•Plans and implements technical work, including system updates, as needed.
•Coordinates technical work with other system support personnel and customer key stakeholders as required.
•Performs storage allocations and switch-related tasks.
•Contributes to and develops best practices, methodologies, templates, and documentation.
•Performs management and capacity planning for infrastructure components.
•Performs root cause analysis for storage- and infrastructure-related issues.
•Provides recommendations for the improvement of the infrastructure.
•Responsible for test planning and execution.
•Ensures compliance with policies and procedures for change and incident management.
•Maintains up-to-date knowledge of industry trends and new technologies and recommends future initiatives.
1. Bachelor's degree in Computer Science, Computer Engineering, or a closely related field, or an equivalent combination of training and experience, and two or more years of experience in storage administration in a large-scale computing environment.
2. Two or more years of recent experience managing large-scale file systems, preferably parallel systems such as Spectrum Scale (GPFS), Lustre, or Storage and Archive Manager.
3. Experience installing and troubleshooting enterprise data storage hardware platforms.
4. Demonstrated experience using scripting/programming (Bash, Python, etc.) in support of systems operations.
5. Demonstrated commitment to providing excellent technical support to a diverse user base.
6. Good organizational skills and attention to detail along with good written and oral communications.
7. The ability to work with minimal supervision.
8. The ability to work effectively with staff and users at all levels, vendors and other technical staff as a member of a team.
9. Demonstrated ability to meet deadlines and work under pressure.
10. Demonstrated ability to lead projects.
11. Experience in the hands-on management and troubleshooting of Linux operating systems (Red Hat, etc.) and applications.
Other required skills:
Excellent verbal and written communication skills.
1. Experience working in an HPC operations team.
2. Experience in systems automation (DevOps) using tools such as Ansible, Puppet, Chef, etc.
3. Maintenance and troubleshooting of an InfiniBand fabric.
4. Certifications relevant to this position.