A*STAR

HPC System Engineer, System, NSCC

Research | Singapore | Onsite | Posted 4 weeks ago

About the role

The HPC System Engineer will design, optimize, and maintain the high-performance computing (HPC) system architecture, including compute, interconnect, and storage components. This role is crucial for ensuring the scalability, reliability, and security of the supercomputing infrastructure through advanced performance tuning and resource planning.

Research | Onsite | National Supercomputing Centre

Key Responsibilities

  • Evaluate HPC system architecture, including compute, interconnect, and storage components, collaborating with System Administrators to ensure reliability.
  • Assist in performance tuning and root-cause analysis for complex system-level issues.
  • Develop and maintain utility tools for system diagnostics and performance profiling.
  • Configure and optimize job schedulers (e.g., Slurm, PBS Pro) to maximize resource utilization and throughput.
  • Develop and enforce policies for resource allocation and workload prioritization.
  • Assess future computational requirements and contribute to HPC system architecture design.
  • Evaluate emerging technologies such as processors, accelerators, interconnects, and storage solutions.
  • Define and implement security policies in collaboration with administrators and conduct regular compliance checks.

Requirements

  • Degree in Computer Science, Engineering, IT, or a relevant field.
  • At least 3 years of experience managing HPC systems.
  • High proficiency in UNIX/Linux environments and the command-line interface (CLI).
  • Strong knowledge of HPC storage principles and experience managing parallel file systems (Lustre, GPFS, BeeGFS).
  • Strong knowledge of RDMA-based interconnects (InfiniBand, RoCE).
  • Experience with job scheduling and workload management software (Slurm or PBS Pro).
  • Good knowledge of scripting languages such as Python, Bash, or Perl.
  • Ability to analyze complex issues and develop effective solutions.