About the role
The ML Evaluation & Insights Engineer at Apple Services Engineering designs and develops automated benchmarking methodologies to evaluate AI/ML models, particularly LLMs and program synthesis systems. The role involves creating evaluation frameworks, analyzing model behavior, and collaborating with cross-functional teams to ensure AI experiences are safe, reliable, and aligned with human expectations. The engineer will generate benchmark datasets and insights to drive product and engineering improvements.
Technology · Onsite · Software and Services
Key Responsibilities
- Lead the design and continuous development of automated benchmarking methodologies
- Investigate the behavior of media-related agents
- Craft rigorous evaluation frameworks and techniques
- Establish scientific standards for assessing quality features
- Generate benchmark datasets and evaluation methodologies for model and application outputs at scale
- Enable engineering teams to translate insights into actionable engineering and product improvements
- Work cross-functionally with Engineering, Product, Project Management, Safety, and Editorial teams
- Develop a suite of technologies to ensure AI experiences are reliable, safe, and aligned with human expectations
- Take a proactive approach to working independently and collaboratively on a wide range of projects
- Collaborate with ML and data scientists, software developers, project managers, and other teams to understand requirements and translate them into scalable, reliable, and efficient evaluation frameworks
Requirements
- Advanced degree (MS or PhD) in Computer Science, Software Engineering, or equivalent research/work experience
- At least 1 year of work experience, either as a postdoc or in industry
- Strong research background in empirical evaluation, experimental design, or benchmarking
- Strong proficiency in Python (pandas, NumPy, Jupyter, PyTorch, etc.)
- Deep familiarity with software engineering workflows and developer tools
- Experience working with or evaluating AI/ML models, preferably LLMs or program synthesis systems
- Strong analytical and communication skills, including the ability to write clear reports
- Experience working with large datasets, annotation tools, and model evaluation pipelines
- Familiarity with evaluations specific to responsible AI and safety, hallucination detection, and/or model alignment concerns
- Ability to design taxonomies, categorization schemes, and structured labeling frameworks
- Ability to interpret unstructured data (text, transcripts, user sessions) and derive meaningful insights
- Strong ability to stitch together qualitative and quantitative insights into actionable guidance
- Strong ability to communicate complex architectures and systems to a variety of stakeholders
- Education in Data Science, Linguistics, Cognitive Science, HCI, Psychology, Social Science, or a related field
- Fluent in English and at least one of Korean, Chinese, Japanese, French, Spanish, Portuguese, Hindi, or Tamil