ML Infrastructure Engineer

This AI/ML infrastructure role is with a seriously exciting scale-up. The robots are deployed, the data is flowing, and the team is growing. They need someone to own the operational infrastructure that powers the Robot Learning team, from demonstration capture through to deployed behaviour on the fleet.

This is not generic MLOps. The infrastructure you build directly determines how fast the team can iterate on new manipulation capabilities.

The Role

You will own training infrastructure, data pipelines, model deployment, and developer experience for the Robot Learning team.

Key Responsibilities

  • Reproducible, containerised training environments for diffusion policies, VLAs, and transformer-based architectures
  • Cloud GPU orchestration with spot fault tolerance and cost guardrails, scaling from single-GPU to multi-node distributed training
  • Full provenance from dataset version and training configuration through to deployed checkpoint
  • On-robot edge inference: optimised model export (ONNX, TensorRT), latency profiling, deployed-policy monitoring
  • Staged rollout to the robot fleet with rollback capability

What We're Looking For

Essential:

  • 3+ years building ML infrastructure or MLOps pipelines in production
  • Strong Python; PyTorch training pipelines and distributed training (DDP/FSDP, DeepSpeed, or similar)
  • Docker containerisation and multi-stage builds
  • Cloud infrastructure (GCP preferred) and Infrastructure-as-Code (Terraform preferred)
  • Experiment tracking (MLflow preferred) and CI/CD (GitHub Actions preferred)
  • Multi-modal data pipelines

Useful:

  • ML infrastructure for robotics, autonomous vehicles, or embodied AI
  • Cloud GPU orchestration tools (SkyPilot, Kubeflow, or similar)
  • Edge GPU deployment (ONNX, TensorRT)
  • Familiarity with behaviour cloning, diffusion policies, or VLA architectures (as a consumer of these models, not a researcher)
  • Event-driven data architectures, serverless compute
  • Simulation environments (MuJoCo, Isaac Sim) and sim-to-real data pipelines

Key Contribution Areas

Training Infrastructure

  • Containerised training environments for policy learning workloads
  • GPU orchestration: spot tolerance, cost control, multi-node scaling
  • Experiment tracking and model registry with full provenance
  • Mixed precision, FSDP, checkpoint management, cold-start reduction
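To give a flavour of the spot-tolerance problem, here is a toy resumable loop in plain Python (names and checkpoint format are illustrative only; the real stack would checkpoint PyTorch state, not a JSON counter):

```python
import json
import os
import tempfile

def train(total_steps, ckpt_path, interrupt_at=None):
    """Toy resumable training loop: progress is checkpointed each step,
    so a spot preemption costs at most one step of work."""
    step = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            step = json.load(f)["step"]  # resume from last checkpoint
    while step < total_steps:
        step += 1  # stand-in for one real training step
        with open(ckpt_path, "w") as f:
            json.dump({"step": step}, f)  # persist progress
        if interrupt_at is not None and step == interrupt_at:
            return step  # simulated spot preemption
    return step

ckpt = os.path.join(tempfile.mkdtemp(), "state.json")
reached = train(10, ckpt, interrupt_at=4)  # preempted mid-run
resumed = train(10, ckpt)                  # second attempt resumes and finishes
```

In production the same shape applies, only with sharded model/optimizer state and a checkpoint interval tuned against cold-start cost.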

Data Pipelines

  • Automated pipelines from raw robot demonstrations to training-ready datasets
  • Data versioning so every model traces back to its source data
  • Quality monitoring: episode scoring, diversity analysis, outlier detection, failure-mode clustering
  • Triggers connecting collection, conversion, validation, and training into a cohesive workflow
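To make the provenance idea concrete, a minimal sketch in Python (all names and fields are illustrative, not the team's actual schema): fingerprint the dataset manifest and the training config, and attach both hashes to every checkpoint.

```python
import hashlib
import json

def fingerprint(obj):
    """Stable short SHA-256 of a JSON-serialisable object (sorted keys)."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]

def provenance_record(dataset_manifest, train_config):
    """Tie a training run to the exact dataset version and config,
    so any checkpoint traces back to its inputs."""
    return {
        "dataset_version": fingerprint(dataset_manifest),
        "config_version": fingerprint(train_config),
    }

record = provenance_record(
    {"episodes": ["ep001", "ep002"], "schema": 2},
    {"lr": 3e-4, "batch_size": 64, "arch": "diffusion_policy"},
)
```

Because the hashes are deterministic, two runs with identical data and config produce identical records, and any change to either input is immediately visible.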

Model Deployment & Serving

  • Evaluation harnesses benchmarking manipulation success rates across tasks
  • A/B comparison of model versions before deployment
  • Optimised export for edge devices, latency profiling
  • Staged rollout with rollback
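One way to picture staged rollout with rollback, as a hedged sketch in plain Python (stage fractions, version names, and robot IDs are all hypothetical, not the team's actual process):

```python
from dataclasses import dataclass, field

STAGES = [0.05, 0.25, 1.0]  # fraction of the fleet running the candidate

@dataclass
class Rollout:
    """Tracks which policy version each robot runs, with instant rollback."""
    fleet: list
    candidate: str
    baseline: str
    stage: int = -1
    assigned: dict = field(default_factory=dict)

    def advance(self):
        """Expand the candidate to the next, larger slice of the fleet."""
        self.stage += 1
        n = int(len(self.fleet) * STAGES[self.stage])
        for i, robot in enumerate(self.fleet):
            self.assigned[robot] = self.candidate if i < n else self.baseline

    def rollback(self):
        """Revert the whole fleet to the known-good baseline."""
        self.stage = -1
        for robot in self.fleet:
            self.assigned[robot] = self.baseline

rollout = Rollout(fleet=[f"robot-{i}" for i in range(20)],
                  candidate="v2", baseline="v1")
rollout.advance()  # canary stage: 5% of the fleet on v2
```

The point of the design is that rollback is a single assignment pass, not a redeploy, so a bad policy can be pulled from the fleet in one step.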

Developer Experience & CI/CD

  • Tooling that lets researchers launch training, evaluate checkpoints, and compare experiments with minimal friction
  • GPU CI testing and nightly regression pipelines that catch inference regressions before they reach robots
  • ML-specific CI: model format checks, latency regression tests, checkpoint compatibility
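A CI latency gate can be as simple as comparing a measured percentile against a stored baseline; a minimal sketch (percentile method, thresholds, and names are hypothetical):

```python
def p95(samples_ms):
    """95th-percentile latency from per-inference timings (nearest-rank)."""
    ordered = sorted(samples_ms)
    return ordered[int(0.95 * (len(ordered) - 1))]

def latency_gate(samples_ms, baseline_ms, tolerance=0.10):
    """CI gate: fail the build if p95 latency regresses more than
    `tolerance` (10% by default) against a stored baseline."""
    measured = p95(samples_ms)
    return measured <= baseline_ms * (1 + tolerance), measured

# 95 fast inferences and 5 slow ones: p95 stays at 12.0 ms, gate passes
ok, measured = latency_gate([12.0] * 95 + [14.0] * 5, baseline_ms=12.5)
```

Run against timings collected on representative hardware, a gate like this catches inference regressions in CI, before a checkpoint ever reaches a robot.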

What's On Offer

  • Join a team of world-class applied research scientists, ML engineers, and robotics software engineers
  • A genuinely interesting technical problem at the frontier of embodied AI
  • Competitive compensation

Apply, or send your CV to Imogen@waverecruitment.co.uk

Job Details

Company
Wave Recruitment
Location
Greater Bristol Area, United Kingdom