About the role

The Site Reliability Engineer ensures the stability, availability, and performance of enterprise systems within a managed services environment. The role focuses on governing operational activities, leading incident response as an incident commander, and driving automation to enhance system resilience across application and infrastructure layers.

Electronics | Onsite | Information Technology

Key Responsibilities

Own end-to-end service reliability, including availability, performance, and system stability
Define and track reliability metrics such as uptime, latency, and error rates
Design and implement monitoring, logging, and alerting frameworks across application and infrastructure layers
Lead major incident management (P1/P2) as incident commander
Perform end-to-end root cause analysis (RCA) across application, infrastructure, and vendor domains
Assess operational risks associated with changes, releases, and patching activities
Govern Go/No-Go decisions and support rollback planning in case of service degradation
Monitor and optimise system performance and conduct capacity planning and forecasting
Drive automation of operational processes, including monitoring, recovery, and validation
Work closely with application engineers and vendors to execute operational activities and resolve defects

Requirements

Experience in Site Reliability Engineering, DevOps, or production operations in enterprise environments
Strong understanding of cloud platforms (preferably AWS)
Experience with monitoring and observability tools
Strong troubleshooting capability across application and infrastructure layers
Experience in incident management and root cause analysis
Familiarity with ITIL processes (incident, problem, change management)
Experience in a system integrator or managed services (Day 2 operations) environment
Exposure to enterprise applications (e.g., IWMS platforms such as Archibus or similar)
Experience with automation and scripting (Python, PowerShell, etc.)
Knowledge of performance tuning and capacity planning
Strong analytical and problem-solving skills
Ability to lead during high-pressure incidents
Structured and governance-driven mindset
Proactive approach to reliability and continuous improvement

[LPS] Sr Operation Mgmt Specialist
