Lead HPC & AI Infrastructure Engineer

Your new company

Step into the future of computing with a trailblazing organisation at the intersection of AI innovation and High Performance Computing (HPC). This company is redefining scalable infrastructure, building GPU-optimised environments that power advanced research and enterprise workloads. With a strong commitment to ethical computing and technical excellence, they're shaping the next generation of AI platforms.

Your new role

This is a fully remote, hands-on technical leadership role where you'll architect and deliver large-scale HPC and AI infrastructure from the ground up. You'll be the driving force behind the design, deployment, and optimisation of high-performance clusters - collaborating with internal engineering teams, OEMs, and external suppliers to build robust, scalable systems.

Key responsibilities include:

Designing end-to-end infrastructure solutions across compute, storage, and networking
Producing detailed technical documentation: hardware specs, data centre layouts, cabling, power and cooling
Installing and tuning Linux-based operating systems and configuring SLURM job schedulers
Optimising high-speed networking technologies (InfiniBand, RoCE)
Automating deployments and maintenance using Ansible, Terraform, Bash, and Python
Troubleshooting complex distributed systems and mentoring junior engineers

This is a rare opportunity to lead infrastructure projects that directly support cutting-edge AI research and development. If you thrive in technically challenging environments and enjoy building systems that scale, this role is for you.

What you'll need to succeed

Proven experience designing and scaling large HPC clusters (hundreds to thousands of nodes)
Strong SLURM configuration skills - partitions, priorities, resource management
Advanced Linux administration and performance tuning
Expertise in high-performance networking (InfiniBand, RoCE, RDMA)
Experience with distributed file systems (Lustre, Ceph, WEKA, VAST)
Proficiency in automation and scripting (Ansible, Terraform, Bash, Python)
A solid understanding of monitoring, resilience, and security compliance
Excellent documentation skills and a passion for mentoring and knowledge sharing

Desirable experience

Containerisation in HPC (Singularity, Docker, Apptainer)
Familiarity with AI/ML workflows, GPU-aware MPI, NVLink
Experience in cloud, academic, or research environments
Vendor hardware validation and data centre planning

What you'll get in return

Share options and long-term incentives
Unlimited holiday policy
100% remote working with flexible hours
A culture of internal promotion and career development
A collaborative, forward-thinking team
Enhanced family-friendly policies
A truly flexible and supportive workplace

What you need to do now

If you're interested in this role, click 'apply now' to forward an up-to-date copy of your CV, or call us now. If this job isn't quite right for you, but you are looking for a new position, please contact us for a confidential discussion about your career.

Hays Specialist Recruitment Limited acts as an employment agency for permanent recruitment and as an employment business for the supply of temporary workers. By applying for this job you accept the T&Cs, Privacy Policy and Disclaimers, which can be found at hays.co.uk

Company: Hays Specialist Recruitment Limited
Location: Dorset, England, United Kingdom
Hybrid / WFH Options
Employment Type: Full-Time
Salary: £130,000 per annum
Posted: 5 hours ago
