ML Infrastructure Engineer

ML Infrastructure Engineer (Senior / Staff / Principal / Lead)

Hybrid (2 days in Oxford / 1 day in London / 2 days WFH)

The Client:

We're partnering with a well-funded AI research company poised to build the largest and most advanced AI team in Europe in the coming years. There aren't many opportunities to work on the problems of tomorrow in an environment where you're encouraged to push boundaries and not be afraid to fail. Competing at a DeepMind-esque level, you'll be addressing some of humanity's most pressing and enduring challenges, including next-generation drug discovery, combating climate change, the future of sustainable agriculture, and various other humanity-positive missions! By joining their team, you'll have the opportunity to contribute to research that directly shapes a better, more sustainable future for humanity. You'll be joining at an early stage, which means very few opportunities can compete with this on a personal impact level!

The Role:

You'll be joining the team that powers the core of their research. This isn't a support role. This is the group that builds the compute backbone behind every major breakthrough. You'll shape how their scientists train models, test ideas, and push their work forward at scale. And because you're joining early, your impact will be felt across the whole organisation.

You'll work on problems that matter. You'll help build fast, reliable GPU systems that let researchers move from idea to result without friction. You'll have room to experiment, try new approaches, and design systems in a place that backs bold thinking.

Key Responsibilities:

Build, run, and improve high-performance GPU training and inference clusters with a focus on reliability and automation
Design and implement high-throughput data paths, including work on caching, I/O, and data locality across compute and storage
Benchmark, profile, and fix performance issues across compute, network, and orchestration layers
Set up clear observability, resilience, and security controls for sensitive research environments
Work with Research, Data, and Applied teams to plan GPU and storage capacity and support smoother ML experimentation

Technical Skills:

Strong experience designing and operating large-scale ML compute clusters
Good understanding of GPU architecture, high-speed networking, and performance tuning for distributed training
Experience with modern containerised systems and migrations from traditional schedulers
Knowledge of high-throughput storage systems for ML or HPC workloads
Solid experience with IaC and CI/CD (Terraform, Argo CD, or similar)

What's on Offer:

Salary packages competitive with FAANG businesses
An opportunity to work on projects that will make a difference in the world: all projects are multi-decade programs oriented towards improving society and people's lives
A rare opportunity to take part in shaping the core ML infra team as it grows from the ground up
State-of-the-art resources, enabling you to push the boundaries of AI research and development quickly and ethically

Benefits:

Enhanced holiday pay
Pension
Life Assurance
Income Protection
Private Medical Insurance
Hospital Cash Plan
Therapy Services
Perkbox
Electric Car Scheme

Apply now or drop me a message if you'd like to hear more.

Keywords: ML Infra, ML Infrastructure, Machine Learning Infrastructure, Machine Learning Infra, ML Infrastructure Engineer, GPU, HPC, AI, Terraform, Argo CD, CI/CD, IaC, Containers, Container Orchestration, Kubernetes, Linux

Company: Cubiq Recruitment
Location: United Kingdom, UK
Hybrid / WFH Options
Employment Type: Part-time
Posted: 12 hours ago
