servers and a working knowledge of the Windows operating system. - Familiar with HPC technologies, including provisioning, job schedulers, and low-latency interconnects. - Experience with HPC Job Schedulers such as SLURM, PBS, and IBM LSF. - Knowledge and experience in using one or more scripting languages, such as Bash, Python and SQL. - Experience with any cloud services such as Microsoft Azure
Hampshire, England, United Kingdom Hybrid / WFH Options
Hays Specialist Recruitment Limited
It's a great opportunity for someone who thrives in project-led infrastructure work and wants to help shape cutting-edge HPC solutions. What you'll need to succeed: Slurm: Proven experience managing and tuning HPC job schedulers. InfiniBand and RoCE: Deep knowledge of high-speed networking technologies. Ansible: Proficiency in using Ansible for automation and configuration management. Networking
systems across thousands of GPUs. Proven ability to architect and maintain large-scale distributed systems for data processing and delivery. Deep expertise in orchestration frameworks such as Kubernetes and SLURM, with hands-on experience deploying and managing high-throughput workloads. Preferred Qualifications: Practical experience in building pipelines and infrastructure with visual and multimodal datasets, including image/video pipelines.
London. Department: Advanced Research Computing & AI. The order of skillset/desirability for a candidate for this role is as follows: Linux System Administration (any flavour); Cluster computing/Slurm; GPU/CUDA; Cloud computing. A rare opportunity has emerged to become a founding member of a newly established AI and high-performance computing (HPC) division at one of
We are seeking a highly skilled Senior HPC & DevOps Engineer with experience in managing both high-performance computing clusters and modern DevOps infrastructure. The ideal candidate combines expertise in Slurm-managed HPC clusters, GPU compute environments, CI/CD pipelines, and Kubernetes-based orchestration. This person thrives in collaborative, fast-paced environments and drives technical execution with minimal oversight. They bring a problem-solving mindset, strong communication skills, and a passion for building reliable, scalable systems. KEY RESPONSIBILITIES: Deploy, configure, and maintain HPC clusters using Slurm. Manage GPU compute nodes, high-speed interconnects, and parallel storage systems. Design and maintain CI/CD pipelines using Buildkite, GitHub Actions, and Jenkins. Automate infrastructure provisioning and configuration … Monitor cluster health and performance; build dashboards with Grafana, Prometheus, Checkmk. Collaborate across teams to optimize workflows, troubleshoot issues, and document best practices. PREFERRED EXPERIENCE: Strong experience with Slurm or equivalent HPC schedulers. CI/CD, DevOps tools, and automation expertise. GPU compute and lifecycle management (CUDA/ROCm). Linux administration, shell scripting, and distributed systems