About the role


Senior Infrastructure Reliability Engineer at AWS, responsible for driving reliability risk identification, assessment, and mitigation for datacenter infrastructure equipment. The role involves root cause analysis of critical failures, continuous improvement, and collaboration with internal teams and suppliers to enhance datacenter availability.

Business, Full-time

Key Responsibilities

Proactively drive reliability risk identification, assessment, and mitigation for datacenter infrastructure equipment (for example, air handling units, LV generators, MV transformers, LV switchgear (SWGR), breakers, UPS, and chillers)
Perform root cause analysis of critical equipment failures and drive continuous improvement to raise datacenter availability for AWS customers
Work closely with internal and external partners, including suppliers, to drive key aspects of product specification and of risk identification planning and execution
Use a physics-of-failure approach to develop and implement both analytical and empirical methods for product quality and reliability risk identification and assessment during the product design, manufacturing, and deployment stages
Drive AWS application-specific requirements when carrying out lifecycle risk analysis driven by both environmental and operational stresses (thermal, electrical, chemical, and mechanical) to identify overstress- and fatigue-related product weaknesses
Evaluate product design quality and reliability risks, and assess quality and reliability issues related to the electronics manufacturing process
Drive critical component identification and the associated vendor selection and qualification requirements
Develop a datacenter system-level reliability model, with the related reliability quantification and risk analysis, for datacenter configuration optimization
Monitor product performance in the field and drive root cause analysis of any critical failures and the associated corrective and preventive actions
Run an effective vendor auditing and quarterly review process to drive continuous improvement of datacenter availability

Requirements

Bachelor's degree in Electrical or Mechanical Engineering, Engineering Technology, or Reliability Engineering, or 8+ years of experience managing, analyzing, and communicating results to senior leadership
5+ years of experience in root cause analysis and troubleshooting or problem solving
5+ years of product validation experience (shock/drop, cycle testing, environmental testing)
Experience in supply chain, commodity, and supplier management in a high-volume, global sourcing and operations manufacturing environment with a global supply base of contract manufacturers
Knowledge of critical data center mechanical and electrical equipment
Experience managing multiple projects, prioritizing, planning, and managing time

Sr. Infrastructure Reliability Engineer, Infrastructure Reliability & Quality
