ML Infrastructure Engineer

ML Infrastructure Engineer (Senior / Staff / Principal / Lead)

Hybrid (2 days in Oxford / 1 day in London / 2 days WFH)

The Client:

We're partnering with a highly funded AI research company, poised to build the largest and most advanced AI team in Europe in the coming years. There aren't many opportunities to work on the problems of tomorrow in a "don't be afraid to push boundaries and fail" environment. Competing on a DeepMind-esque level, you'll be addressing some of humanity's most pressing and enduring challenges, including next-generation drug discovery, combating climate change, the future of sustainable agriculture, and various other humanity-positive missions! By joining their team, you'll have the opportunity to contribute to research that directly shapes a better, more sustainable future for humanity. You'll be joining at an early stage, which means very few opportunities can compete with this on a personal impact level!

The Role:

You'll be joining the team that powers the core of their research. This isn't a support role. This is the group that builds the compute backbone behind every major breakthrough. You'll shape how their scientists train models, test ideas, and push their work forward at scale. And because you're joining early, your impact will be felt across the whole organisation.

You'll work on problems that matter. You'll help build fast, reliable GPU systems that let researchers move from idea to result without friction. You'll have room to experiment, try new approaches, and design systems in a place that backs bold thinking.

Key Responsibilities:

  • Build, run, and improve high-performance GPU training and inference clusters with a focus on reliability and automation
  • Design and implement high-throughput data paths, including work on caching, I/O, and data locality across compute and storage
  • Benchmark, profile, and fix performance issues across compute, network, and orchestration layers
  • Set up clear observability, resilience, and security controls for sensitive research environments
  • Work with Research, Data, and Applied teams to plan GPU and storage capacity and support smoother ML experimentation

Technical Skills:

  • Strong experience designing and operating large-scale ML compute clusters
  • Good understanding of GPU architecture, high-speed networking, and performance tuning for distributed training
  • Experience with modern containerised systems and migrations from traditional schedulers
  • Knowledge of high-throughput storage systems for ML or HPC workloads
  • Solid experience with IaC and CI/CD (Terraform, Argo CD, or similar)

What's on Offer:

  • Salary packages competitive with FAANG businesses
  • An opportunity to work on projects that will make a difference in the world: all projects are multi-decade programs oriented towards improving society and people's lives
  • A rare opportunity to take part in shaping the core ML infra team as it grows from the ground up
  • State-of-the-art resources, enabling you to push the boundaries of AI research and development quickly and ethically

Benefits:

  • Enhanced holiday pay
  • Pension
  • Life Assurance
  • Income Protection
  • Private Medical Insurance
  • Hospital Cash Plan
  • Therapy Services
  • Perk Box
  • Electric Car Scheme

Apply now or drop me a message if you'd like to hear more.

Keywords: ML Infra, ML Infrastructure, Machine Learning Infrastructure, Machine Learning Infra, ML Infrastructure Engineer, GPU, HPC, AI, Terraform, Argo CD, CI/CD, IaC, Containers, Container Orchestration, Kubernetes, Linux

Company
Cubiq Recruitment
Location
United Kingdom, UK
Hybrid / WFH Options
Employment Type
Part-time
Posted