About the role
The HPC Systems Engineer designs and optimizes compute cluster configurations for KLA systems, focusing on performance, reliability, and scalability. This role involves selecting and validating hardware components such as CPUs, memory, storage, networking, and accelerators, while collaborating across hardware, software, and systems engineering teams to ensure seamless integration. Responsibilities include documenting design decisions, participating in design reviews, and supporting cross-functional problem-solving efforts.
Equipment, Onsite
Key Responsibilities
- Design and develop compute cluster configurations optimized for performance, reliability, and scalability in KLA systems
- Select and validate hardware components including CPUs, memory, storage, networking, and specialized accelerators
- Collaborate with hardware, software, and systems engineering teams to ensure seamless integration of compute clusters into broader system architectures
- Document hardware design decisions, integration procedures, and diagnostic workflows for internal and cross-team use
- Participate in design reviews, integration planning, and collaborative problem-solving sessions with cross-functional teams
Requirements
- Doctorate (academic) degree with 0 years of related work experience, or
- Master's degree with 3 years of related work experience, or
- Bachelor's degree with 5 years of related work experience
- Strong experience in computer hardware design, particularly in compute cluster or server environments
- Experience in network design, including InfiniBand and Ethernet switches, with expertise in port mapping and configuration
- Familiarity with modern memory technologies (e.g., DDR4/DDR5, DIMM, LPDDR, HBM)
- Familiarity with Linux system administration and OS customization (preferably SUSE Linux)
- Understanding of system-level performance tuning and hardware-software interaction
- Excellent documentation and communication skills
- Experience with hardware validation and troubleshooting tools
- Knowledge of high-performance computing (HPC) or distributed systems
- Ability to work effectively in a collaborative, cross-functional engineering environment
- Test-driven development mindset and attention to detail
- Self-starter with a proactive approach to problem-solving and continuous improvement