About the role
The HPC Storage Engineer will manage the storage infrastructure within High-Performance Computing (HPC) environments, focusing on performance monitoring, optimization, and troubleshooting to ensure high availability and reliability.
Research | Onsite | National Supercomputing Centre
Key Responsibilities
- Administer and support HPC storage infrastructure in collaboration with Managed Services teams.
- Ensure high availability and reliability of all storage systems.
- Provide technical support and resolve complex, storage-related issues.
- Implement best practices for monitoring, alerting, and performance reporting.
- Track utilization and allocation trends to support capacity planning initiatives.
- Conduct performance testing and in-depth analysis of storage systems.
- Enhance system performance and scalability through collaboration with cross-functional teams.
- Maintain comprehensive documentation of infrastructure setup and operational processes.
- Optimize data placement strategies for maximum performance and efficiency.
- Support future storage expansion planning and HPC system design.
Requirements
- Degree in Computer Science, Engineering, IT, or a related field.
- At least 2 years of experience managing parallel file systems (Lustre, GPFS, BeeGFS, or similar).
- Strong proficiency in Linux and comfort with the command-line interface.
- Solid understanding of various Linux file systems (local: ext4, XFS; shared: NFS; parallel: Lustre, GPFS, BeeGFS).
- Proficiency in scripting using Bash and/or Python.
- Familiarity with RDMA-based interconnects such as InfiniBand or RoCE.
- Strong problem-solving abilities for troubleshooting complex storage issues.