Lenovo

[LPS] Sr Operation Mgmt Specialist

Electronics | SINGAPORE, Central Singapore, Singapore | Onsite | Posted 1 week ago

About the role

The Site Reliability Engineer ensures the stability, availability, and performance of enterprise systems within a managed services environment. The role focuses on governing operational activities, leading incident response as an incident commander, and driving automation to enhance system resilience across application and infrastructure layers.

Electronics | Onsite | Information Technology

Key Responsibilities

  • Own end-to-end service reliability, including availability, performance, and system stability
  • Define and track reliability metrics such as uptime, latency, and error rates
  • Design and implement monitoring, logging, and alerting frameworks across application and infrastructure layers
  • Lead major incident management (P1/P2) as incident commander
  • Perform end-to-end root cause analysis (RCA) across application, infrastructure, and vendor domains
  • Assess operational risks associated with changes, releases, and patching activities
  • Govern Go/No-Go decisions and support rollback planning in case of service degradation
  • Monitor and optimise system performance and conduct capacity planning and forecasting
  • Drive automation of operational processes, including monitoring, recovery, and validation
  • Work closely with application engineers and vendors to execute operational activities and resolve defects

Requirements

  • Experience in Site Reliability Engineering, DevOps, or production operations in enterprise environments
  • Strong understanding of cloud platforms (preferably AWS)
  • Experience with monitoring and observability tools
  • Strong troubleshooting capability across application and infrastructure layers
  • Experience in incident management and root cause analysis
  • Familiarity with ITIL processes (incident, problem, change management)
  • Experience in a system integrator or managed services (Day 2 operations) environment
  • Exposure to enterprise applications (e.g., IWMS platforms such as Archibus or similar)
  • Experience with automation and scripting (Python, PowerShell, etc.)
  • Knowledge of performance tuning and capacity planning
  • Strong analytical and problem-solving skills
  • Ability to lead during high-pressure incidents
  • Structured and governance-driven mindset
  • Proactive approach to reliability and continuous improvement