ML Ops Engineer (£100k + Stock)

This is one of those rare roles where the infrastructure problem is just as hard and interesting as the machine learning itself.

There is already a working system pushing the boundaries of AI-driven engineering.

There are real users, real datasets, and real performance constraints.

There is momentum.

What’s missing is an engineer to build the infrastructure that lets the ML scale, perform, and run reliably in production.

We are building AI systems that replace traditional physics simulations with high-speed generative models, solving complex problems across advanced engineering domains. The models are only part of the story. The real challenge is building the infrastructure that makes them usable, scalable, and production-ready.

The Opportunity

This is a foundational role focused on ML infrastructure.

You will not be tuning models in isolation. You will build the systems that enable training, deployment, monitoring, and iteration at scale.

You will work closely with ML researchers and engineers to turn experimental models into robust, production-grade systems.

If you enjoy working at the intersection of machine learning and distributed systems, this role is for you.

What You Will Build

• Scalable training pipelines for large, complex datasets

• Infrastructure for distributed training and high-performance compute workloads

• Model deployment systems, including inference services and batch pipelines

• Data pipelines for ingestion, transformation, and versioning of large datasets

• Monitoring, observability, and evaluation systems for ML performance in production

This is a deeply technical, hands-on role focused on building systems that make ML actually work in the real world.

What Success Looks Like

First few months:

• You take ownership of key parts of the ML infrastructure stack

• Training and inference workflows become faster, more reliable, and easier to iterate on

• Researchers can move from idea to production significantly faster

First year:

• You define the core ML infrastructure architecture

• You improve system performance, scalability, and cost efficiency

• You become a key bridge between research and production engineering

What We’re Looking For

You are an engineer who:

• Has strong Python skills and experience building backend or distributed systems

• Has worked with ML frameworks such as PyTorch, TensorFlow, or JAX

• Understands training pipelines, data workflows, and model deployment

• Has experience with cloud infrastructure (AWS/GCP) and containerization (Docker/Kubernetes)

• Thinks in terms of systems, scalability, and reliability, not just models

• Enjoys working in fast-moving environments with high ownership

Nice to Have

• Experience with distributed training (multi-GPU, multi-node systems)

• Familiarity with MLOps tooling and workflows

• Background in high-performance computing or data-intensive systems

• Exposure to scientific computing or simulation workloads

Why Join

• Full ownership: you will shape how ML is built, deployed, and scaled

• High impact: your work directly unlocks real-world applications of advanced AI

• Elite team: work alongside engineers and researchers solving genuinely hard problems

If you are interested, apply or reach out to learn more.

Job Details

Company
Nexa
Location
City of London, London, United Kingdom