About the role
AI summarised: This role is a senior technical lead position focused on owning and driving the diagnostics strategy for AMD's next-generation Data Center GPU products. The individual will provide end-to-end ownership of diagnostics quality, coverage, and completeness across pre-silicon and post-silicon phases, lead cross-functional investigations into system-level failures, and collaborate with global teams to ensure high-quality product delivery. The role requires deep SoC architecture knowledge, strong debugging skills, and experience in software development and technical leadership.
Fabless, Onsite, Engineering
Key Responsibilities
- Serve as the SoC Diagnostics Technical Lead for DCGPU programs, providing primary local ownership and global technical leadership for silicon and manufacturing quality issues across Singapore/Tai and other sites, with end‑to‑end accountability for the quality, coverage, and completeness of diagnostics solutions
- Work closely with the Diagnostics PM to define and drive end‑to‑end diagnostics strategy by translating program and customer requirements into clear priorities and execution plans across pre‑silicon and post‑silicon phases
- Proactively articulate diagnostics objectives, strategic direction, risks, and tooling/framework requirements to PMs, managers, and IP and framework architects to influence test coverage strategy, planning, and cross‑team alignment
- Own the diagnostics pre‑silicon emulation strategy and planning across software‑based and FPGA‑based emulation models, including RTL coverage requirements before silicon tape‑out and diagnostics verification requirements before silicon back
- Own the SoC system‑level feature validation methodology and planning for diagnostics
- Drive the technical requirements needed to achieve feature coverage and hardware bug capture targets, ensuring that diagnostics content supports both engineering debug and manufacturing/field health checks
- Lead and coordinate complex SoC/system‑level investigations (e.g., SLT/Board Production failures, field issues), analyze logs and symptoms, form hypotheses, and work with IP, platform, firmware and software teams to converge on root cause and corrective action
- Exercise horizontal leadership and collaboration with cross‑functional teams such as platform validation, ROCm/SW, HW architects, product engineering, manufacturing, and other stakeholders to achieve key program milestones (bring‑up, feature enablement, performance profiling, production support) with the desired coverage metrics from diagnostics
- Collaborate with the Product Engineering Organization to deliver a high-quality product to customers; debug defects and help improve yield, coverage, and test time during NPI and volume production
- Provide diagnostics support to contract manufacturers and board engineering teams, particularly for SLT/BP and system‑level test flows, and ensure that diagnostics content is usable and effective in manufacturing environments
Requirements
- Proven experience with IP and SoC validation, diagnostics, and system bring-up, with the ability to interact closely with hardware design, validation, manufacturing, and software teams
- Excellent understanding of SoC architecture, including processor, GPU compute, system IO and memory/HBM, and security blocks, to identify critical areas for SoC & IP verification and diagnostics focus
- Strong system‑level debugging and testing skills, with the capability to quickly identify problems, perform structured root‑cause analysis, and provide robust solutions
- Excellent communication and interpersonal skills, with the ability to collaborate effectively across global teams and to clearly explain complex technical issues to both technical and non‑technical stakeholders
- Demonstrated ability to work under pressure and manage competing priorities in tight project timelines while maintaining professionalism and quality
- Knowledge and experience in developing or enabling applications on industry compute platforms such as ROCm, OpenCL, or CUDA is an asset
- Familiarity with Linux; knowledge and experience of device driver or software development is preferred
- Knowledge and experience with Manufacturing ATE/Wafer Sort Test and System Level Test is a bonus
- Experience with source control systems such as Perforce and Git
- Hands‑on experience with SoC bring-up and working in lab environments is a plus
- Prior experience in software development (e.g., object‑oriented C++, modern C++, system software or drivers) and the software development lifecycle; able to read and review code, understand architecture, and guide engineers in debug
- Experience developing machine learning, HPC or general‑purpose GPU compute applications is a bonus
- BS or MS in Computer Science, Computer Engineering or Electrical Engineering preferred