AI Evaluations Engineer

Role summary

This role sits at the centre of how we measure and improve AI systems in production.

You’ll define what good performance means across LLMs, ASR, TTS, and full speech-to-speech pipelines, and build the datasets, metrics, and evaluation systems that make AI quality measurable and comparable in the real world.

You’ll work closely with engineering and product teams to ensure model changes lead to real improvements in user experience, not just better offline benchmarks.

What you’ll do

  • Design and run evaluations across LLM, ASR, TTS, and speech-to-speech systems
  • Build real-world datasets and test cases from production behaviour and edge cases
  • Define metrics and scorecards for model and system quality
  • Benchmark internal models against external and frontier systems
  • Evaluate full pipelines (ASR → LLM → TTS), not just individual models
  • Build Python tools to automate evaluation workflows
  • Create internal leaderboards, red-teaming setups, and regression tests
  • Work with engineers and product teams to diagnose system failures
  • Turn vague product goals into measurable evaluation frameworks
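To make the last few bullets concrete, the day-to-day tooling might look something like this minimal Python sketch: run a batch of test cases through a scoring function and roll the results up into a scorecard. All names here (`EvalCase`, `word_error_rate`, `scorecard`) are hypothetical illustrations, not an existing internal API; word error rate is shown as one common ASR metric.

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    case_id: str
    reference: str   # expected transcript / answer
    hypothesis: str  # model output

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance, normalised by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Single-row dynamic-programming edit distance over words.
    dist = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dist[0] = dist[0], i
        for j, h in enumerate(hyp, 1):
            prev, dist[j] = dist[j], min(
                dist[j] + 1,         # deletion
                dist[j - 1] + 1,     # insertion
                prev + (r != h),     # substitution (free on a match)
            )
    return dist[-1] / max(len(ref), 1)

def scorecard(cases: list[EvalCase]) -> dict[str, float]:
    """Aggregate per-case scores into a small summary scorecard."""
    wers = [word_error_rate(c.reference, c.hypothesis) for c in cases]
    return {
        "mean_wer": sum(wers) / len(wers),
        "exact_match_rate": sum(w == 0.0 for w in wers) / len(wers),
    }
```

In practice the same harness shape extends to LLM and TTS scoring by swapping the metric function, which is why the role centres on defining metrics rather than any single one.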

What this role is about

  • Defining and measuring AI quality in production systems
  • Turning real user behaviour into structured evaluation signals
  • Ensuring model changes improve real-world performance
  • Understanding why AI systems fail, not just whether they do

What good looks like

  • You can translate fuzzy notions of quality into measurable metrics
  • You think in terms of system impact (before vs after), not just accuracy
  • You’re comfortable working across code, data, and production systems
  • You care about real-world behaviour, not just benchmarks

Core skills

  • Strong Python (scripting, data analysis, tooling)
  • Experience with ML systems, evaluation, or experimentation
  • Understanding of LLMs or speech systems (ASR / TTS)
  • Ability to design test cases and structured datasets
  • Comfortable working with engineers and product teams

Nice to have

  • Experience with LLM evaluation or benchmarking
  • Exposure to speech or multimodal systems
  • Familiarity with production APIs or ML systems
  • Experience with automated testing or CI-style workflows

Job Details

Company
ConnexAI
Location
Manchester Area, United Kingdom