About the role
The HPC Storage Engineer manages storage infrastructure within high-performance computing environments at NSCC. This role involves optimizing parallel file systems like Lustre and GPFS, ensuring high availability, and conducting performance analysis to support future storage expansion.
Research | Onsite | National Supercomputing Centre
Key Responsibilities
- Collaborate with Managed Services teams to administer and support HPC storage infrastructure.
- Ensure high availability and reliability of storage systems.
- Provide technical support and resolve storage-related issues.
- Implement best practices for monitoring, alerting, and reporting.
- Track utilization and allocation trends to support capacity planning.
- Conduct performance testing and analysis of storage systems.
- Work with cross-functional teams to enhance performance and scalability.
- Maintain comprehensive documentation of infrastructure and operational processes.
- Optimize data placement strategies for performance and efficiency.
- Support future storage expansion and HPC system design.
Requirements
- Degree in Computer Science, Engineering, IT, or a related field.
- Strong Linux skills and comfort with the command line.
- Solid understanding of Linux file systems, including local (e.g., ext4, XFS), shared (e.g., NFS), and parallel (e.g., Lustre, GPFS, BeeGFS) file systems.
- At least 2 years of experience managing parallel file systems such as Lustre, GPFS, BeeGFS, or similar technologies.
- Familiarity with RDMA-based interconnects such as InfiniBand or RoCE.
- Proficient in scripting with Bash and/or Python.
- Strong problem-solving skills and ability to troubleshoot complex issues.