Micron Technology

Principal Engineer, Machine Learning, SMAI

Integrated Device Manufacturing | Singapore, Singapore | Onsite | Posted 1 week ago

About the role

AI summarised

The Principal Engineer, Machine Learning role on Micron Technology's Smart Manufacturing and AI (SMAI) team focuses on developing and deploying scalable AI/ML solutions, including large language models and autonomous AI agents, to improve manufacturing processes. The position requires deep expertise in GPU-accelerated computing, distributed training, MLOps, and full-stack AI system development in cloud environments.

IDM | Onsite | Smart MFG/AI

Key Responsibilities

  • Architect and execute large-scale custom model training and fine-tuning jobs (SFT, RLHF) on multi-node, multi-GPU clusters
  • Optimize training throughput and memory efficiency using distributed training strategies (FSDP, DeepSpeed, Megatron-LM) and mixed-precision techniques (FP16/BF16)
  • Design and develop autonomous AI Agents capable of multi-step reasoning, planning, and tool execution to automate complex manufacturing workflows
  • Implement Agentic frameworks (e.g., LangChain, LangGraph, CrewAI) to orchestrate LLM interactions with internal APIs, databases, and software tools
  • Profile and debug GPU performance bottlenecks using tools like Nsight Systems or PyTorch Profiler to maximize hardware utilization
  • Build and maintain data/solution pipelines that feed machine learning models and GenAI applications
  • Design and optimize data structures in data management systems (Snowflake and Google Cloud platforms) to enable AI/ML and Agentic solutions
  • Create and maintain CI/CD pipelines for machine learning and AI Agent solutions in the cloud

Requirements

  • Technical Degree required. Computer Science or Statistics background highly desired
  • Deep understanding of GPU architecture (memory hierarchy, tensor cores, interconnects like NVLink) and experience managing GPU resources in both cloud and on-premises environments
  • Hands-on experience with Distributed Data Parallel (DDP), Fully Sharded Data Parallel (FSDP), and model parallelism techniques
  • Proficiency in fine-tuning Large Language Models using PEFT techniques (LoRA, QLoRA) and optimizing inference engines (vLLM, TensorRT-LLM)
  • Experience developing GenAI applications and AI Agents using frameworks like LangChain, LangGraph, LlamaIndex, or AutoGen
  • Proficiency with Large Language Models (LLMs), including prompt engineering, function calling/tool use, and Chain-of-Thought (CoT) reasoning
  • Experience building and running end-to-end ML systems that automate the training, testing, and deployment of machine learning models
  • Familiarity with machine learning frameworks (PyTorch required; also TensorFlow, scikit-learn, etc.)
  • Software development skills and the desire to work on cutting-edge development in a cloud environment
  • Strong scripting and programming skills in Python or Java (Python preferred)
  • Experience with continuous integration/continuous delivery (CI/CD) tools (Jenkins, Git, Docker, Kubernetes)
  • 9+ years of experience building scalable ETL pipelines
  • 9+ years of experience with big data processing and/or developing applications and data sources
  • Outstanding analytical thinking, interpersonal, oral and written communication skills
  • Ability to prioritize and meet critical project timelines in a fast-paced environment