ML Infrastructure Engineer
This AI/ML infrastructure role is with a seriously exciting scale-up. The robots are deployed, the data is flowing, and the team is growing. They need someone to own the operational infrastructure that powers the Robot Learning team, from demonstration capture through to deployed behaviour on the fleet.
This is not generic MLOps. The infrastructure you build directly determines how fast the team can iterate on new manipulation capabilities.
The Role
You will own training infrastructure, data pipelines, model deployment, and developer experience for the Robot Learning team.
Key responsibilities
- Develop reproducible, containerised training environments for diffusion policies, VLAs, and transformer-based architectures
- Orchestrate cloud GPUs with spot fault tolerance and cost guardrails, scaling from single-GPU to multi-node distributed training
- Maintain full provenance from dataset version and training configuration through to deployed checkpoint
- Build on-robot edge inference: optimised model export (ONNX, TensorRT), latency profiling, and deployed-policy monitoring
- Run staged rollouts to the robot fleet with rollback capability
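To give a flavour of the provenance responsibility above, here is a minimal sketch of tying a checkpoint to its dataset version and training configuration. All names and fields are hypothetical, not a real schema:

```python
import hashlib
import json

def provenance_record(dataset_version: str, train_config: dict, checkpoint_path: str) -> dict:
    """Build a record linking a checkpoint to the exact inputs that produced it.

    Field names are illustrative only; a real system would also capture
    code revision, container image digest, and hardware details.
    """
    # Canonicalise the config so the same settings always hash identically.
    config_blob = json.dumps(train_config, sort_keys=True).encode()
    return {
        "dataset_version": dataset_version,
        "config_sha256": hashlib.sha256(config_blob).hexdigest(),
        "checkpoint": checkpoint_path,
    }

record = provenance_record("demos-v12", {"lr": 3e-4, "batch_size": 64}, "ckpt/policy_0420.pt")
```

Storing such a record alongside every checkpoint is what makes "which data trained the policy on robot 7?" answerable in seconds rather than hours.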
What We're Looking For
Essential:
- 3+ years building ML infrastructure or MLOps pipelines in production
- Strong Python; PyTorch training pipelines and distributed training (DDP/FSDP, DeepSpeed, or similar)
- Docker containerisation and multi-stage builds
- Cloud infrastructure (GCP preferred) and Infrastructure-as-Code (Terraform preferred)
- Experiment tracking (MLflow preferred) and CI/CD (GitHub Actions preferred)
- Multi-modal data pipelines
Useful:
- ML infrastructure for robotics, autonomous vehicles, or embodied AI
- Cloud GPU orchestration tools (SkyPilot, Kubeflow, or similar)
- Edge GPU deployment (ONNX, TensorRT)
- Familiarity with behaviour cloning, diffusion policies, or VLA architectures (as a consumer)
- Event-driven data architectures, serverless compute
- Simulation environments (MuJoCo, Isaac Sim) and sim-to-real data pipelines
Key contribution areas
Training Infrastructure
- Containerised training environments for policy learning workloads
- GPU orchestration: spot tolerance, cost control, multi-node scaling
- Experiment tracking and model registry with full provenance
- Mixed precision, FSDP, checkpoint management, cold-start reduction
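One slice of the checkpoint-management work, sketched with stdlib calls only (the layout and retention policy are assumptions, not a prescribed design): write each checkpoint atomically so a preempted spot instance never leaves a corrupt file, then prune beyond a retention window.

```python
import os
import tempfile

def save_checkpoint(state: bytes, ckpt_dir: str, step: int, keep: int = 3) -> str:
    """Atomically write a checkpoint, then keep only the newest `keep` files."""
    os.makedirs(ckpt_dir, exist_ok=True)
    path = os.path.join(ckpt_dir, f"step_{step:08d}.ckpt")
    # Write to a temp file first; os.replace makes the final rename atomic,
    # so a crash mid-write never leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(state)
    os.replace(tmp, path)
    # Zero-padded step numbers sort lexicographically, so pruning is a slice.
    ckpts = sorted(p for p in os.listdir(ckpt_dir) if p.endswith(".ckpt"))
    for old in ckpts[:-keep]:
        os.remove(os.path.join(ckpt_dir, old))
    return path
```

The same atomic-rename pattern matters for spot-instance training, where preemption can strike mid-save.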
Data Pipelines
- Automated pipelines from raw robot demonstrations to training-ready datasets
- Data versioning so every model traces back to its source data
- Quality monitoring: episode scoring, diversity analysis, outlier detection, failure-mode clustering
- Triggers connecting collection, conversion, validation, and training into a cohesive workflow
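The outlier-detection bullet above can be as simple as a z-score over per-episode quality scores. A sketch, assuming episodes have already been scored by some upstream metric (the threshold and scoring scheme are illustrative):

```python
from statistics import mean, stdev

def flag_outliers(episode_scores: dict[str, float], z_thresh: float = 2.0) -> list[str]:
    """Return ids of episodes whose quality score deviates strongly from the mean."""
    scores = list(episode_scores.values())
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return []  # all episodes scored identically; nothing to flag
    return [eid for eid, s in episode_scores.items() if abs(s - mu) / sigma > z_thresh]
```

Flagged episodes would then be routed to human review or excluded from training, rather than silently degrading the dataset.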
Model Deployment & Serving
- Evaluation harnesses benchmarking manipulation success rates across tasks
- A/B comparison of model versions before deployment
- Optimised export for edge devices, latency profiling
- Staged rollout with rollback
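The staged-rollout-with-rollback pattern, in skeleton form. The `deploy`, `rollback`, and `healthy` callables are hypothetical hooks into a fleet API, and the stage fractions are an assumption:

```python
def staged_rollout(robot_ids, deploy, rollback, healthy, stages=(0.05, 0.25, 1.0)):
    """Deploy a new policy to growing fractions of the fleet, rolling back on failure.

    Returns the ids left running the new policy, or [] if it was rolled back.
    """
    done = []
    for frac in stages:
        cutoff = max(1, int(len(robot_ids) * frac))
        batch = [r for r in robot_ids[:cutoff] if r not in done]
        deploy(batch)
        done += batch
        # Health-check everything updated so far before widening the blast radius.
        if not healthy(done):
            rollback(done)
            return []
    return done
```

The point of the structure is that a bad checkpoint only ever reaches a small canary slice of robots before the health check stops it.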
Developer Experience & CI/CD
- Tooling that lets researchers launch training, evaluate checkpoints, and compare experiments with minimal friction
- GPU CI testing and nightly regression pipelines that catch inference regressions before they reach robots
- ML-specific CI: model format checks, latency regression tests, checkpoint compatibility
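A latency regression gate of the kind described above can be a few lines in CI. A sketch, where `run_inference` stands in for a single call to the exported model and the tolerance is an assumed budget:

```python
import time

def latency_within_budget(run_inference, baseline_ms: float,
                          tolerance: float = 0.10, trials: int = 50) -> bool:
    """Return True if median inference latency stays within `tolerance` of baseline.

    `run_inference` is a hypothetical zero-argument callable; a real gate would
    warm the model up first and pin the device clocks.
    """
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_inference()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    median = samples[len(samples) // 2]  # median is robust to scheduler jitter
    return median <= baseline_ms * (1 + tolerance)
```

Wired into a nightly pipeline, a `False` here fails the build before the slower model ever reaches a robot.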
What's On Offer
- Join a team of world-class applied research scientists, ML engineers, and robotics software engineers
- A genuinely interesting technical problem at the frontier of embodied AI
- Competitive compensation
Apply, or send your CV to Imogen@waverecruitment.co.uk