What You Will Do
Reporting to the HPC-SYS group management, the HPCST Team Leader will be responsible for managing and supervising a computer operations team consisting of a mix of Computer System Technicians (CST) and Computing System Professionals (CSP). The team is responsible for around-the-clock monitoring, operations, and production support of the Laboratory’s state-of-the-art centralized computing systems, including data storage. CSPs provide system maintenance and support, including on-call support that covers a full spectrum of system, network, and file system administration duties. The team is also responsible for providing hardware maintenance for all current and future high-performance computing platforms. These systems are critical to the Laboratory’s mission and support all major programs throughout the institution, including Advanced Simulation and Computing (ASC), as well as the Tri-lab user community.
Team Leader responsibilities include providing leadership, which involves working closely with the technical staff and computer users; serving as the day-to-day supervisor and principal authority on the monitoring, operations, and hardware maintenance of computer resources related to high-performance computing and data storage; managing and scheduling staff; assigning work; hiring, mentoring, and training; and planning the operations and support activities of the team in the performance of its duties on a 24/7 work schedule. The successful candidate will provide counsel, advice, and assistance to group management in support of decision processes and will be responsible for planning and executing the function, including input to the group office planning and discussion activities, development of a strategic vision for the team, budget execution, staffing plans, development of performance standards and objectives for the team, and input for employee performance assessments, disciplinary actions, reclassifications, salary management, and employee skill development, as required. The Team Leader frequently interacts with vendors, subordinate project leads, users, and functional peer groups. The successful candidate will have full first-line manager safety and security responsibility and authority to develop procedures, make decisions, and take corrective action to ensure the reliability and security of the high-performance computing resources at the Laboratory. This responsibility is especially critical during the off shifts and weekends, when the computer operations staff is the first line of contact for users, system managers, and facility staff. The Team Leader also serves as a working member of high-level division projects.
The successful candidate should have a comprehensive understanding and wide application of technical principles, theories, and concepts in the field; will work independently and interactively with computer operators when managing day-to-day operations and solving problems on various supercomputing systems; and will provide technical solutions to a wide range of difficult problems.
What You Need
Minimum Job Requirements:
- Strong interpersonal and communication skills.
- Proven record of success in leading, directing or supervising staff in a computing organization.
- Demonstrated knowledge of and ability to implement and comply with LANL safety and security practices.
- Advanced technical knowledge, training, and experience in the monitoring, operation, hardware maintenance, and support of the Lab’s state-of-the-art high-performance computing systems.
- Demonstrated knowledge and experience with LANL business and administrative policies, practices and tools.
- Demonstrated ability to lead and manage projects.
- Demonstrated professional-level expertise as a team member interacting with system engineers, analysts, team members, customers, management, and vendors in the deployment and integration of computing systems.
- Proven record of success in meeting customer needs.
- Demonstrated advanced knowledge of systems utilizing diagnostic tests, system messages, scripts and monitoring tools.
- Demonstrated record of successful collaboration with direct reports, technical leaders or vendor representatives as required to monitor, maintain and repair compute and storage systems.
- Active DOE Q clearance required.
- Expert knowledge of the UNIX, Linux, and Microsoft computer operating systems.
- Record of teamwork and consensus-building and demonstrated aptitude for mentoring, developing and motivating employees.
- Demonstrated skills in presentation, negotiation and written/oral communications.
- Extensive knowledge of system shutdown procedures for a range of systems in order to prevent the loss of critical data and/or physical damage to extremely expensive equipment.
- Demonstrated ability to develop and implement complex policies and procedures such as shift management, team-specific training management, employee work plans, etc.
- Broad knowledge of hardware/software and communications issues related to high-performance computing.
- Advanced skills in identifying the need for diagnostic tools to monitor and diagnose system failures and the ability to develop these tools.
- Budget management experience.
- A bachelor’s degree in computer science or information systems from an accredited college or university or an equivalent combination of education and experience.
Clearance: Q (Position will be cleared to this level). Applicants selected will be subject to a Federal background investigation and must meet eligibility requirements* for access to classified matter.
*Eligibility requirements: To obtain a clearance, an individual must be at least 18 years of age; U.S. citizenship is required except in very limited circumstances. See DOE Order 472.2 for additional information.
New-Employment Drug Test: The Laboratory requires successful applicants to complete a new-employment drug test and maintains a substance abuse policy that includes random drug testing.
Regular Position: Term-status Laboratory employees applying for regular-status positions are converted to regular status.
Equal Opportunity: Los Alamos National Laboratory is an equal opportunity employer and supports a diverse and inclusive workforce. All employment practices are based on qualification and merit, without regard to race, color, national origin, ancestry, religion, age, sex, gender identity, sexual orientation or preference, marital status or spousal affiliation, physical or mental disability, medical conditions, pregnancy, status as a protected veteran, genetic information, or citizenship within the limits imposed by federal laws and regulations. The Laboratory is also committed to making our workplace accessible to individuals with disabilities and will provide reasonable accommodations, upon request, for individuals to participate in the application and hiring process. To request such an accommodation, please send an email to [email protected] or call 1-505-665-4444 option 1.
Where You Will Work
Located in northern New Mexico, Los Alamos National Laboratory (LANL) is a multidisciplinary research institution engaged in strategic science on behalf of national security. LANL enhances national security by ensuring the safety and reliability of the U.S. nuclear stockpile, developing technologies to reduce threats from weapons of mass destruction, and solving problems related to energy, environment, infrastructure, health, and global security concerns.
The High Performance Computing (HPC) Division provides production high performance computing systems services to the Laboratory. HPC Division serves all Laboratory programs requiring a world-class high performance computing capability to enable solutions to complex problems of strategic national interest. Our work starts with the early phases of acquisition, development, and production readiness of HPC platforms, and continues through the maintenance and operation of these systems and the facilities in which they are housed. The Division directly supports the Laboratory’s HPC user base and aids, at multiple levels, in the effective use of HPC resources to generate science. Additionally, we engage in research activities that we deem important to our mission. The HPC Systems Group (HPC-SYS) manages the network, parallel file systems, and storage, and provides system administration of LANL’s production HPC platforms.
The High Performance Computing Support Team (HPCST) within HPC-SYS operates on a work schedule of twenty-four hours a day, seven days a week, 365 days a year. The selected candidate may occasionally work off shifts (grave and swing), weekends, and holidays. This position is considered essential personnel and, as such, the HPCST Lead (i.e., IT Manager) will be required to report to work during inclement weather and winter closures.