About the role
Senior Infrastructure Reliability Engineer at AWS, responsible for driving reliability risk identification, assessment, and mitigation for datacenter infrastructure equipment. The role involves root cause analysis of critical failures, continuous improvement, and collaboration with internal teams and suppliers to enhance datacenter availability.
Business, Full-time
Key Responsibilities
- Proactively drive reliability risk identification, assessment, and mitigation for datacenter infrastructure equipment (for example, air handling units, LV generators, MV transformers, LV switchgear (SWGR), breakers, UPS, and chillers)
- Perform root cause analysis of critical equipment failures and drive continuous improvements in datacenter availability for AWS customers
- Work closely with internal and external partners, including suppliers, to drive key aspects of product specification, risk identification planning, and execution
- Use a Physics-of-Failure based approach to develop and implement analytical and empirical methods for product quality/reliability risk identification and assessment during the product design, manufacture, and deployment stages
- Drive AWS application-specific requirements when carrying out lifecycle risk analysis driven by environmental and operational stresses (thermal, electrical, chemical, and mechanical) to identify overstress and fatigue-related product weaknesses
- Evaluate product design quality/reliability risks and assess quality/reliability issues related to the electronics manufacturing process
- Drive critical component identification and the associated vendor selection and qualification requirements
- Develop a datacenter system-level reliability model, with the related reliability quantification and risk analysis, for datacenter configuration optimization
- Monitor product performance in the field and drive root cause analysis of any critical failures and the associated corrective and preventive actions
- Run effective vendor audits and a quarterly review process to drive continuous improvement of datacenter availability
Requirements
- Bachelor's degree in Electrical or Mechanical Engineering, Engineering Technology, or Reliability Engineering, or 8+ years of experience managing, analyzing, and communicating results to senior leadership
- 5+ years of root cause analysis and troubleshooting or problem solving experience
- 5+ years of product validation (shock/drop, cycle testing, environmental testing) experience
- Experience in supply chain, commodity, and supplier management in a high-volume global sourcing and manufacturing operations environment with a global supply base of contract manufacturers
- Knowledge of critical data center mechanical and electrical equipment
- Experience managing multiple projects, prioritizing, planning, and managing time