About the role
The HPC System Administrator will manage the day-to-day operations of high-performance computing (HPC) systems, ensuring optimal stability, security, and performance for NSCC's supercomputing environment.
Research | Onsite | National Supercomputing Centre
Key Responsibilities
- Administer HPC compute nodes, storage systems, and internal networks.
- Monitor system health using tools such as Grafana, Prometheus, and custom scripts.
- Apply patches, updates, and configuration changes to ensure system stability.
- Manage user accounts, access controls, and authentication mechanisms.
- Monitor job queues and assist users with job submission and scheduling issues.
- Respond to system alerts and user-reported incidents, documenting resolutions.
- Perform regular security checks and vulnerability assessments to ensure compliance.
- Maintain system operation logs and configuration documentation.
Requirements
- Degree in Computer Science, Engineering, IT, or a related field.
- Minimum 2 years of experience in Linux system administration, preferably in HPC environments.
- Proficiency in scripting using Python or Bash.
- Familiarity with cluster management tools (xCAT, BCM, HPCM).
- Experience with job schedulers such as PBS Pro or Slurm.
- Basic understanding of parallel file systems (Lustre, GPFS, BeeGFS).
- Understanding of basic network protocols (DHCP, DNS, TFTP, SMTP).