on-call rotations to support high-priority incidents and escalations. About You Skills & Experience Proven experience supporting HPC and/or AI workloads in production environments. Strong expertise with the Slurm workload manager, including tuning and troubleshooting. Proficiency with system-level debugging, including kernel modules and network interfaces. Experience with GPU compute platforms (NVIDIA and/or AMD … settings. Comfort operating in fast-paced, ambiguous, high-growth environments. Nice to have Experience with OpenStack and troubleshooting infrastructure in cloud environments. Kubernetes expertise, particularly in HPC or AI workload contexts. Familiarity with distributed file systems and advanced storage configurations. Understanding of GPU virtualization and multi-tenant HPC architecture. Exposure to machine learning frameworks and AI optimization workflows. Scripting …
or equivalent proven track record) 4+ years working on large-scale ML codebases Hands-on with PyTorch, JAX or TensorFlow; comfortable with distributed training (DeepSpeed/FSDP/SLURM/K8s) Experience in deep learning, NLP or LLMs; bonus for CUDA or data-pipeline chops Strong software-design instincts: testing, code review, CI/CD Self-starter, low …
capacity partners, balancing deep engineering discussions with high-level business context. Qualifications: Technical depth in GPU-cloud infrastructure: Experience with large-scale GPU clusters using Kubernetes and/or SLURM over InfiniBand; deep understanding of the NVIDIA driver stack, NCCL performance tuning, and benchmarking. Strong customer or partner-facing experience: Able to bridge technical and business conversations, explain complex …
Hampshire, England, United Kingdom Hybrid / WFH Options
Hays Specialist Recruitment Limited
It's a great opportunity for someone who thrives in project-led infrastructure work and wants to help shape cutting-edge HPC solutions. What you'll need to succeed Slurm: Proven experience managing and tuning HPC job schedulers. InfiniBand and RoCE: Deep knowledge of high-speed networking technologies. Ansible: Proficiency in using Ansible for automation and configuration management. Networking …
and apply today! Responsibilities: Design scalable and secure infrastructure across Azure, on-prem, and possibly other cloud platforms. Architect and guide the setup/configuration of HPC clusters (e.g., SLURM) to support large-scale statistical workloads. Design and support environments for Python, R, and SAS that meet compliance, reproducibility, and performance standards. Implement security, access control, and compliance practices … in life sciences). Skills/Must have: Hands-on experience with Azure and hybrid cloud environments, including understanding of infrastructure architecture and deployment. Proficient in HPC systems like SLURM, including installation, configuration, and optimization for performance-heavy workloads. Strong Python knowledge with experience in installation, configuration, and scripting in Linux environments. Experience working with R and Python environments …