ML Infrastructure Engineer
This AI/ML infrastructure role is with a seriously exciting scale-up. The robots are deployed, the data is flowing, and the team is growing. They need someone to own the operational infrastructure that powers the Robot Learning team, from demonstration capture through to deployed behaviour on the fleet.
This is not generic MLOps. The infrastructure you build directly determines how fast the team can iterate on new manipulation capabilities.
The Role
You will own training infrastructure, data pipelines, model deployment, and developer experience for the Robot Learning team.
Key responsibilities
- Develop reproducible, containerised training environments for diffusion policies, VLAs, and transformer-based architectures
- Orchestrate cloud GPUs with spot fault tolerance and cost guardrails, scaling from single-GPU to multi-node distributed training
- Maintain full provenance from dataset version and training configuration through to deployed checkpoint
- Build on-robot edge inference: optimised model export (ONNX, TensorRT), latency profiling, and deployed-policy monitoring
- Run staged rollouts to the robot fleet with rollback capability
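To give a flavour of the provenance responsibility above, here is a minimal sketch of tying a checkpoint to its dataset version and training configuration. All names and fields are hypothetical, not a real schema:

```python
import hashlib
import json

def provenance_record(dataset_version: str, train_config: dict, checkpoint_path: str) -> dict:
    """Build a record linking a checkpoint to the exact inputs that produced it.

    Field names are illustrative only; a real system would also capture
    code revision, container image digest, and hardware details.
    """
    # Canonicalise the config so the same settings always hash identically.
    config_blob = json.dumps(train_config, sort_keys=True).encode()
    return {
        "dataset_version": dataset_version,
        "config_sha256": hashlib.sha256(config_blob).hexdigest(),
        "checkpoint": checkpoint_path,
    }

record = provenance_record("demos-v12", {"lr": 3e-4, "batch_size": 64}, "ckpt/policy_0420.pt")
```

Storing such a record alongside every checkpoint is what makes "which data trained the policy on robot 7?" answerable in seconds rather than hours.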
What We're Looking For
Essential:
- 3+ years building ML infrastructure or MLOps pipelines in production
- Strong Python; PyTorch training pipelines and distributed training (DDP/FSDP, DeepSpeed, or similar)
- Docker containerisation and multi-stage builds
- Cloud infrastructure (GCP preferred) and Infrastructure-as-Code (Terraform preferred)
- Experiment tracking (MLflow preferred) and CI/CD (GitHub Actions preferred)
- Multi-modal data pipelines
Useful:
- ML infrastructure for robotics, autonomous vehicles, or embodied AI
- Cloud GPU orchestration tools (SkyPilot, Kubeflow, or similar)
- Edge GPU deployment (ONNX, TensorRT)
- Familiarity with behaviour cloning, diffusion policies, or VLA architectures (as a consumer)
- Event-driven data architectures, serverless compute
- Simulation environments (MuJoCo, Isaac Sim) and sim-to-real data pipelines
Key contribution areas
Training Infrastructure
- Containerised training environments for policy learning workloads
- GPU orchestration: spot tolerance, cost control, multi-node scaling
- Experiment tracking and model registry with full provenance
- Mixed precision, FSDP, checkpoint management, cold-start reduction
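One slice of the checkpoint-management work, sketched with stdlib calls only (the layout and retention policy are assumptions, not a prescribed design): write each checkpoint atomically so a preempted spot instance never leaves a corrupt file, then prune beyond a retention window.

```python
import os
import tempfile

def save_checkpoint(state: bytes, ckpt_dir: str, step: int, keep: int = 3) -> str:
    """Atomically write a checkpoint, then keep only the newest `keep` files."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step:08d}.ckpt")
    # Write to a temp file first; os.replace makes the final rename atomic,
    # so a crash mid-write never leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(state)
    os.replace(tmp, path)
    # Zero-padded step numbers sort lexicographically, so pruning is a slice.
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.endswith(".ckpt"))
    for old in ckpts[:-keep]:
        os.remove(os.path.join(ckpt_dir, old))
    return path
```

The same atomic-rename pattern matters for spot-instance training, where preemption can strike mid-save.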
Data Pipelines
- Automated pipelines from raw robot demonstrations to training-ready datasets
- Data versioning so every model traces back to its source data
- Quality monitoring: episode scoring, diversity analysis, outlier detection, failure-mode clustering
- Triggers connecting collection, conversion, validation, and training into a cohesive workflow
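The outlier-detection bullet above can be as simple as a z-score over per-episode quality scores. A sketch, assuming episodes have already been scored by some upstream metric (the threshold and scoring scheme are illustrative):

```python
from statistics import mean, stdev

def flag_outliers(episode_scores: dict[str, float], z_thresh: float = 2.0) -> list[str]:
    """Return ids of episodes whose quality score deviates strongly from the mean."""
    scores = list(episode_scores.values())
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return []  # all episodes scored identically; nothing to flag
    return [eid for eid, s in episode_scores.items() if abs(s - mu) / sigma > z_thresh]
```

Flagged episodes would then be routed to human review or excluded from training, rather than silently degrading the dataset.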
Model Deployment & Serving
- Evaluation harnesses benchmarking manipulation success rates across tasks
- A/B comparison of model versions before deployment
- Optimised export for edge devices, latency profiling
- Staged rollout with rollback
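The staged-rollout-with-rollback pattern, in skeleton form. The `deploy`, `rollback`, and `healthy` callables are hypothetical hooks into a fleet API, and the stage fractions are an assumption:

```python
def staged_rollout(robot_ids, deploy, rollback, healthy, stages=(0.05, 0.25, 1.0)):
    """Deploy a new policy to growing fractions of the fleet, rolling back on failure.

    Returns the ids left running the new policy, or [] if it was rolled back.
    """
    done = []
    for frac in stages:
        cutoff = max(1, int(len(robot_ids) * frac))
        batch = [r for r in robot_ids[:cutoff] if r not in done]
        deploy(batch)
        done += batch
        # Health-check everything updated so far before widening the blast radius.
        if not healthy(done):
            rollback(done)
            return []
    return done
```

The point of the structure is that a bad checkpoint only ever reaches a small canary slice of robots before the health check stops it.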
Developer Experience & CI/CD
- Tooling that lets researchers launch training, evaluate checkpoints, and compare experiments with minimal friction
- GPU CI testing and nightly regression pipelines that catch inference regressions before they reach robots
- ML-specific CI: model format checks, latency regression tests, checkpoint compatibility
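A latency regression gate of the kind described above can be a few lines in CI. A sketch, where `run_inference` stands in for a single call to the exported model and the tolerance is an assumed budget:

```python
import time

def latency_within_budget(run_inference, baseline_ms: float,
                          tolerance: float = 0.10, trials: int = 50) -> bool:
    """Return True if median inference latency stays within `tolerance` of baseline.

    `run_inference` is a hypothetical zero-argument callable; a real gate would
    warm the model up first and pin the device clocks.
    """
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    median = samples[len(samples) // 2]  # median is robust to scheduler jitter
    return median <= baseline_ms * (1 + tolerance)
```

Wired into a nightly pipeline, a `False` here fails the build before the slower model ever reaches a robot.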
What's On Offer
- Join a team of world-class applied research scientists, ML engineers, and robotics software engineers
- A genuinely interesting technical problem at the frontier of embodied AI
- Competitive compensation
Apply, or send your CV to Imogen@waverecruitment.co.uk