HPC Engineer

About the Role

We are seeking a Senior HPC Engineer to design, implement, and scale the infrastructure that supports high-performance machine learning and AI-driven research workflows. You will play a critical role in bridging the gap between data science, bioinformatics, and engineering — ensuring seamless, secure, and reproducible deployment of ML models in production and research environments.

You'll collaborate closely with AI Scientists, Data Engineers, and DevSecOps teams, building automation pipelines that accelerate model development and deployment across distributed, cloud-native systems.

Key Responsibilities

  • Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
  • Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
  • Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
  • Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
  • Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.

Essential Skills and Experience

  • Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
  • A proactive, autonomous approach to systems design, with a proven ability and desire to ideate, co-create, and implement optimal solutions
  • Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
  • Expertise with high-throughput storage systems for ML/HPC workloads
  • Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
  • A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)

Terms of Appointment

Applicants must have the right to work permanently in the UK and be within commuting distance of Oxford.

Job Details

Company: Hlx Life Sciences
Location: Banbury, Oxfordshire, UK
Employment Type: Full-time
Posted