HPC Engineer

About the Role

We are seeking a Senior HPC Engineer to design, implement, and scale the infrastructure that supports high-performance machine learning and AI-driven research workflows. You will play a critical role in bridging the gap between data science, bioinformatics, and engineering, ensuring seamless, secure, and reproducible deployment of ML models in production and research environments.

You'll collaborate closely with AI Scientists, Data Engineers, and DevSecOps teams, building automation pipelines that accelerate model development and deployment across distributed, cloud-native systems.

Key Responsibilities

Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.

Essential Skills and Experience

Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
A proactive, autonomous approach to systems design, with the proven ability and desire to ideate, co-create, and implement optimal solutions
Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
Expertise with high-throughput storage systems for ML/HPC workloads
Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)

Terms of Appointment

Applicants must have the right to work permanently in the UK and be within commuting distance of Oxford.

Hiring Organisation: Hlx Life Sciences
Location: Oxford, Oxfordshire, UK
Employment Type: Full-time
