A*STAR

HPC AI Engineer, Frontier, NSCC

Research | Singapore | Full-time | 1 month ago

About the role

HPC AI Engineer at NSCC, responsible for supporting and optimising large-scale AI workloads on HPC systems, collaborating with researchers, and developing HPC software best practices. Requires a bachelor's degree in computer science or a related field, ideally 3 years of experience developing code for AI training and inference, and strong Python skills.

Research | Full-time | National Supercomputing Centre

Key Responsibilities

  • Provide HPC and scientific domain advice to users of NSCC systems.
  • Engage and collaborate with new researchers, communities, and disciplines with computationally intensive requirements.
  • Support and optimise large-scale AI application workloads.
  • Work with HPC performance engineers to profile and build performance models of the AI applications and workflows.
  • Design, develop and implement HPC software best practices for AI applications and workflows.
  • Assist in the planning and design of future HPC systems, including benchmarking AI workloads on various platforms and recommending the most suitable architecture for the research community.
  • Analyse system and user job data for efficient resource allocation and management.
  • Develop HPC utilities, dashboards and automated testing tools for NSCC HPC systems.
  • Develop HPC user and best practice guides for NSCC HPC systems.
  • Keep up to date with developments in scientific domain research, HPC systems and software technology.

Requirements

  • Bachelor's degree in computer science, computer engineering, or another relevant field.
  • Proven working knowledge of models and algorithms in at least one area of generative models, computer vision, graph neural networks, or AI for Science applications.
  • Ideally, 3 years of experience in developing code for AI training and inference.
  • Experience in setting up AI software stacks, and familiarity with a diverse range of them.
  • Good knowledge of AI application performance optimisation and troubleshooting.
  • Strong programming skills in Python; familiarity with C/C++ programming is a plus.
  • Familiarity with the workings and use of AI frameworks (e.g. PyTorch, TensorFlow, JAX) for research.
  • Familiarity with GPU architectures and programming is highly desired.
  • Familiarity with the Linux environment, scripting languages, and profiling and debugging tools.
  • Familiarity with HPC job schedulers and container technologies.
  • Familiarity with object storage (S3); familiarity with HPC storage (Lustre) is a plus.
  • Demonstrated team player with strong problem-solving skills.
  • Demonstrated effective communication skills including the ability to articulate technical concepts to a diverse range of audiences.
  • Demonstrated ability and willingness to contribute novel ideas and approaches in support of the research community.
  • Demonstrated passion for continuous learning and exploring new technologies or domains.