Lead HPC & AI Infrastructure Engineer

Your new company

Step into the future of computing with a trailblazing organisation at the intersection of AI innovation and High Performance Computing (HPC). This company is redefining scalable infrastructure, building GPU-optimised environments that power advanced research and enterprise workloads. With a strong commitment to ethical computing and technical excellence, they're shaping the next generation of AI platforms.

Your new role

This is a fully remote, hands-on technical leadership role where you'll architect and deliver large-scale HPC and AI infrastructure from the ground up. You'll be the driving force behind the design, deployment, and optimisation of high-performance clusters, collaborating with internal engineering teams, OEMs, and external suppliers to build robust, scalable systems.

Key responsibilities include:

  • Designing end-to-end infrastructure solutions across compute, storage, and networking
  • Producing detailed technical documentation: hardware specs, data centre layouts, cabling, power and cooling
  • Installing and tuning Linux-based operating systems and configuring SLURM job schedulers
  • Optimising high-speed networking technologies (InfiniBand, RoCE)
  • Automating deployments and maintenance using Ansible, Terraform, Bash, and Python
  • Troubleshooting complex distributed systems and mentoring junior engineers

This is a rare opportunity to lead infrastructure projects that directly support cutting-edge AI research and development. If you thrive in technically challenging environments and enjoy building systems that scale, this role is for you.

What you'll need to succeed

  • Proven experience designing and scaling large HPC clusters (hundreds to thousands of nodes)
  • Strong SLURM configuration skills - partitions, priorities, resource management
  • Advanced Linux administration and performance tuning
  • Expertise in high-performance networking (InfiniBand, RoCE, RDMA)
  • Experience with distributed file systems (Lustre, Ceph, WEKA, VAST)
  • Proficiency in automation and scripting (Ansible, Terraform, Bash, Python)
  • A solid understanding of monitoring, resilience, and security compliance
  • Excellent documentation skills and a passion for mentoring and knowledge sharing

Desirable experience

  • Containerisation in HPC (Singularity, Docker, Apptainer)
  • Familiarity with AI/ML workflows, GPU-aware MPI, NVLink
  • Experience in cloud, academic, or research environments
  • Vendor hardware validation and data centre planning

What you'll get in return

  • Share options and long-term incentives
  • Unlimited holiday policy
  • 100% remote working with flexible hours
  • A culture of internal promotion and career development
  • A collaborative, forward-thinking team
  • Enhanced family-friendly policies
  • A truly flexible and supportive workplace

What you need to do now

If you're interested in this role, click 'apply now' to forward an up-to-date copy of your CV, or call us now. If this job isn't quite right for you, but you are looking for a new position, please contact us for a confidential discussion about your career.

Hays Specialist Recruitment Limited acts as an employment agency for permanent recruitment and an employment business for the supply of temporary workers. By applying for this job you accept the T&Cs, Privacy Policy and Disclaimers, which can be found at hays.co.uk

Company
Hays Specialist Recruitment Limited
Location
Dorset, England, United Kingdom
Hybrid / WFH Options
Employment Type
Full-Time
Salary
£130,000 per annum
Posted