About the role
The ML Evaluation & Insights Engineer at Apple Services Engineering designs and develops automated benchmarking methodologies to evaluate AI/ML models, particularly LLMs and program synthesis systems. The role involves creating evaluation frameworks, analyzing model behavior, and collaborating with cross-functional teams to ensure AI experiences are safe, reliable, and aligned with human expectations. The engineer will generate benchmark datasets and insights to drive product and engineering improvements.
Technology · Onsite · Software and Services
Key Responsibilities
- Lead the design and continuous development of automated benchmarking methodologies
- Investigate the behavior of media-related agents
- Craft rigorous evaluation frameworks and techniques
- Establish scientific standards for assessing quality features
- Generate benchmark datasets and evaluation methodologies for model and application outputs at scale
- Enable engineering teams to translate insights into actionable engineering and product improvements
- Work cross-functionally with Engineering, Product, Project Management, Safety, and Editorial teams
- Develop a suite of technologies to ensure AI experiences are reliable, safe, and aligned with human expectations
- Take a proactive approach to working independently and collaboratively on a wide range of projects
- Collaborate with ML and data scientists, software developers, project managers, and other teams to understand requirements and translate them into scalable, reliable, and efficient evaluation frameworks
Requirements
- Advanced degree (MS or PhD) in Computer Science, Software Engineering, or equivalent research/work experience
- At least 1 year of work experience, either as a postdoc or in industry
- Strong research background in empirical evaluation, experimental design, or benchmarking
- Strong proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)
- Deep familiarity with software engineering workflows and developer tools
- Experience working with or evaluating AI/ML models, preferably LLMs or program synthesis systems
- Strong analytical and communication skills, including the ability to write clear reports
- Experience working with large datasets, annotation tools, and model evaluation pipelines
- Familiarity with evaluations specific to responsible AI and safety, hallucination detection, and/or model alignment concerns
- Ability to design taxonomies, categorization schemes, and structured labeling frameworks
- Ability to interpret unstructured data (text, transcripts, user sessions) and derive meaningful insights
- Strong ability to stitch together qualitative and quantitative insights into actionable guidance
- Strong ability to communicate complex architectures and systems to a variety of stakeholders
- Education in Data Science, Linguistics, Cognitive Science, HCI, Psychology, Social Science, or a related field
- Fluent in English and at least one of Korean, Chinese, Japanese, French, Spanish, Portuguese, Hindi, or Tamil