Google

Senior Customer Engineer, AI Infrastructure, Google Cloud

Business · Singapore · Full-time · 5 days ago

About the role

AI summary

Senior Customer Engineer for AI Infrastructure at Google Cloud, responsible for designing and implementing AI training and inference solutions on Google Cloud TPUs, optimizing performance, and advising customers on best practices. The role involves supporting sales teams, deploying AI/ML accelerators, and guiding customers on network topologies and compute/storage deployments, including visits to customer data centers.


Key Responsibilities

  • Design and implement complex, multi-host AI training and inference solutions on Google Cloud TPUs, focusing on scalability and performance tuning.
  • Conduct in-depth performance profiling and optimization of customer models and data pipelines specifically for the TPU architecture, identifying and resolving bottlenecks.
  • Advise customers on best practices for integrating their ML operations workflows with the Google Cloud AI platform ecosystem for seamless TPU utilization.
  • Support Google Cloud Sales teams in deploying AI/ML accelerators (e.g., TPUs and GPUs) for AI innovators, large enterprises, and early-stage AI startups.
  • Help customers innovate faster with solutions using Google Cloud's flexible and open AI infrastructure.
  • Work with Google customers on AI Infrastructure server and networking deployments.
  • Guide customer discussions on network topologies and compute/storage deployments, and support bring-up of server, network, cluster, and cooling infrastructure, including visits to the customer data center during the bring-up phase.
  • Liaise with product marketing management and engineering teams to stay on top of industry trends and devise enhancements to Google Cloud products.
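To make the performance-profiling responsibility above concrete, here is a hedged, back-of-envelope sketch of the kind of bottleneck analysis a customer engineer might walk through: a roofline check deciding whether a matmul is compute- or memory-bound. The accelerator numbers (100 TFLOP/s, 1 TB/s) are illustrative assumptions, not real TPU specifications.

```python
def roofline_time_s(flops, bytes_moved, peak_flops, peak_bw):
    """Lower bound on step time: the slower of compute and data movement."""
    return max(flops / peak_flops, bytes_moved / peak_bw)

# An (M, K) x (K, N) matmul in bf16 (2 bytes per element).
M = K = N = 4096
flops = 2 * M * K * N                      # one multiply-add per output element per K step
bytes_moved = 2 * (M * K + K * N + M * N)  # read A and B once, write C once

# Hypothetical accelerator (NOT real TPU specs): 100 TFLOP/s, 1 TB/s HBM.
peak_flops, peak_bw = 100e12, 1e12

intensity = flops / bytes_moved  # FLOPs per byte moved
balance = peak_flops / peak_bw   # machine balance point (FLOPs/byte)
bound = "compute" if intensity > balance else "memory"
print(bound, roofline_time_s(flops, bytes_moved, peak_flops, peak_bw))
```

At this size the arithmetic intensity (~1365 FLOPs/byte) far exceeds the assumed machine balance (100 FLOPs/byte), so the kernel is compute-bound; shrinking the batch or fusing poorly would push it toward the memory-bound regime, which is exactly the trade-off profiling tools surface.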

Requirements

  • Bachelor's degree or equivalent practical experience.
  • 10 years of experience in developing and deploying models using deep learning frameworks (e.g., TensorFlow, PyTorch, or JAX) specifically on TPU hardware.
  • Experience with networking principles, including collective communication, inter-chip interconnects, and their impact on distributed AI training.
  • Experience with lower-level performance tools and techniques (e.g., custom kernel development, XLA compiler familiarity) relevant to optimizing code for Google's TPU chips.
  • Experience leveraging AI hardware and software stacks and platforms to bring up and deploy AI compute clusters.
  • Knowledge of AI accelerator hardware (e.g., specific GPU generations) to effectively articulate the architectural differentiation and value proposition of cloud TPUs.
  • Knowledge of the AI infrastructure market, including main technology providers, differentiators, and trends.
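The collective-communication requirement above can be illustrated with a toy simulation (not Google's implementation) of the bandwidth-optimal ring all-reduce used in distributed training: a reduce-scatter phase followed by an all-gather, each taking n−1 hops over the inter-chip links.

```python
def ring_all_reduce(data):
    """data[i] is worker i's vector; returns each worker's copy of the elementwise sum.

    Toy model of the ring algorithm: vectors are split into n segments,
    partial sums circulate one hop per step (reduce-scatter), then the
    fully reduced segments circulate again (all-gather).
    """
    n = len(data)
    length = len(data[0])
    assert length % n == 0, "toy version: vector length must be divisible by n"
    seg = length // n
    # Per-worker buffers, split into n segments.
    bufs = [[list(v[j * seg:(j + 1) * seg]) for j in range(n)] for v in data]

    # Reduce-scatter: after n-1 steps, worker i owns the full sum of segment (i+1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, bufs[i][(i - step) % n]) for i in range(n)]
        for i, j, payload in sends:
            dst = (i + 1) % n
            bufs[dst][j] = [a + b for a, b in zip(bufs[dst][j], payload)]

    # All-gather: circulate each fully reduced segment around the ring.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, bufs[i][(i + 1 - step) % n]) for i in range(n)]
        for i, j, payload in sends:
            bufs[(i + 1) % n][j] = list(payload)

    return [[x for segment in b for x in segment] for b in bufs]

print(ring_all_reduce([[1.0, 2.0], [3.0, 4.0]]))  # → [[4.0, 6.0], [4.0, 6.0]]
```

Each worker sends and receives only 2(n−1)/n of the vector in total, which is why the algorithm is bandwidth-optimal — and why interconnect topology and link bandwidth, not just chip FLOPs, bound distributed training throughput.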