MLOps Engineer | Python | Airflow | AWS | MLflow | Docker | Kubernetes | London, Hybrid

Position Overview

We are seeking an experienced MLOps Engineer to own the infrastructure and operational lifecycle of machine learning systems powering a large-scale clinical monitoring platform. You will build and maintain production ML pipelines, deployment infrastructure, and monitoring systems that enable predictive models to identify early signs of clinical deterioration.

Working closely with ML, backend, data, and clinical teams, you will ensure models are reliably trained, versioned, deployed, and monitored across both cloud and edge environments. You will help elevate ML engineering practices across the organisation, including reproducibility, experiment tracking, CI/CD for models, and operational observability.

This is a high-ownership role within a fast-paced environment where production reliability, rapid iteration, and pragmatic engineering are essential. Your work will directly contribute to improving patient outcomes through reliable and scalable machine learning systems.

Key Responsibilities

ML Pipeline Orchestration & Automation

Own and extend ML pipeline orchestration in Apache Airflow, covering training, evaluation, and deployment workflows.
Build and maintain automated pipelines for model retraining, validation, and promotion across development, staging, and production environments.
Implement pipeline monitoring, alerting, and failure recovery mechanisms to ensure operational reliability.
Design pipeline architectures that support rapid experimentation while maintaining reproducibility.

Model Deployment & Serving

Deploy and manage ML models on AWS infrastructure for batch and production inference workloads.
Support deployment of models to edge devices in collaboration with firmware and embedded engineering teams.
Manage model versioning, promotion, and rollback workflows using MLflow or equivalent tooling.
Evaluate and implement strategies for safe model rollouts, such as shadow deployments and canary releases.

Experiment Tracking & Model Registry

Maintain and improve experiment tracking and model registry infrastructure.
Establish conventions for experiment logging, artifact storage, metadata management, and lineage tracking.
Enable seamless workflows from experimentation to production deployment.

Data & Model Versioning

Implement and maintain data versioning and dataset management practices to ensure reproducibility.
Track dataset lineage, labelling provenance, and feature dependencies alongside model versions.
Collaborate with ML and data engineering teams to formalise dataset release and validation workflows.

Monitoring, Observability & Data Quality

Build monitoring systems for model performance in production, including drift detection and prediction quality tracking.
Implement operational dashboards for pipeline health, compute utilisation, and deployment status.
Collaborate with data engineering teams to ensure upstream data quality and pipeline reliability.
Develop incident response procedures and operational runbooks for ML system failures.

Infrastructure & Cost Optimisation

Manage and optimise AWS compute resources used for model training and inference.
Design infrastructure-as-code solutions for reproducible ML environments.
Drive cost optimisation initiatives across ML compute, storage, and data transfer.
Support integrations with cloud data warehouse platforms for feature generation and training pipelines.

Elevating ML Practice

Champion ML engineering best practices including CI/CD for models, automated testing, and reproducible training workflows.
Build internal tooling and templates that accelerate the ML development lifecycle.
Document operational processes, architectural decisions, and onboarding materials.
Participate in architecture discussions and technical planning to ensure scalability.

Security & Compliance

Ensure ML pipelines and infrastructure meet healthcare security and privacy requirements.
Apply best practices for handling sensitive healthcare data in training, deployment, and inference workflows.
Maintain audit trails for model decisions, data access, and deployment history.

Required Qualifications

4+ years of experience in MLOps, ML Engineering, DevOps, or related infrastructure roles.
Strong proficiency in Python for ML pipeline development, tooling, and automation.
Hands-on experience with ML pipeline orchestration tools, particularly Apache Airflow.
Experience with model registries and experiment tracking platforms such as MLflow.
Experience deploying and operating ML workloads on AWS.
Strong understanding of the ML lifecycle, including training, evaluation, deployment, monitoring, and retraining.
Experience with containerisation technologies such as Docker, and with infrastructure-as-code practices.
Proficiency with Git and version control workflows.
Familiarity with SQL and modern data warehousing platforms.
Experience implementing monitoring, logging, and alerting for production systems.
Strong debugging and incident response skills for distributed systems.

Preferred Qualifications

Experience deploying models to edge or embedded devices.
Background in healthcare, medical devices, or clinical data systems.
Familiarity with model serving frameworks such as TorchServe, TensorFlow Serving, or Triton.
Experience with CI/CD systems such as GitHub Actions, Jenkins, or similar tools.
Experience with data versioning tools such as DVC or LakeFS.
Experience supporting data science or ML research teams in production environments.
Exposure to healthcare compliance and security best practices.
Experience with distributed compute frameworks such as Apache Spark or Dask.
Experience with streaming or real-time inference architectures.

What You Bring

Strong ownership mindset across the full ML infrastructure lifecycle.
A focus on reliability, reproducibility, and operational excellence.
Pragmatic thinking and a desire to build scalable ML platforms.
Comfort operating in a fast-paced, high-growth environment.
Strong communication skills across engineering, data science, and clinical stakeholders.
Motivation to work on technology that positively impacts patient care.

Why Join Us

You will have the opportunity to:

Work on real-world healthcare challenges with measurable patient impact.
Build data systems that support clinical-grade AI and ML applications.
Take ownership within a fast-growing, mission-driven environment.
Collaborate with a highly skilled, multidisciplinary team.
