at scale Experienced in Linux performance benchmarking, tuning, and troubleshooting Familiarity with distributed storage solutions like Lustre and Ceph Knowledgeable in networking technologies and protocols, including Ethernet and ideally Infiniband Proactive and solution-oriented mindset Excellent problem-solving skills Initiative-driven and able to take ownership What we offer Diverse and internationally distributed team : joining our team means becoming part More ❯
networking, virtualization, cloud, etc.). Strong technical troubleshooting in multi-platform, distributed environments. Strong understanding of distributed storage systems. Expertise in Linux/Unix administration. Deep understanding of networking (Infiniband, Ethernet, DPDK, UCX), cloud computing, and distributed storage. Proficiency in Python, Bash, and experience with automation scripting for system monitoring and troubleshooting. Knowledge of POSIX, NFS, S3 protocols, log management More ❯
Hampshire, England, United Kingdom Hybrid / WFH Options
Hays Specialist Recruitment Limited
in project-led infrastructure work and wants to help shape cutting-edge HPC solutions. What you'll need to succeed Slurm: Proven experience managing and tuning HPC job schedulers. Infiniband and RoCE: Deep knowledge of high-speed networking technologies. Ansible: Proficiency in using Ansible for automation and configuration management. Networking: Strong networking fundamentals, ideally with experience in complex environments. Data More ❯
Experience with Linux virtualization, networking or graphics stacks Experience with Docker/OCI containers/K8s Experience with one or more of the following technologies: confidential computing, RDMA, Infiniband and high performance computing. Performance engineering, benchmarking and profiling What We Offer You We consider geographical location, experience, and performance in shaping compensation worldwide. We revisit compensation annually (and more More ❯
PyTorch, or Hugging Face Transformers. Good understanding of programming/scripting: (e.g., Python, Go) for customizing solutions, creating scripts, or automating tasks. Experience with AI relevant infrastructure, including Networking (InfiniBand and RoCE), Storage (FC, IP and scale out) and AI accelerators (GPUs etc). Excellent presentation skills - ability to value-sell and deliver engaging workshops to both technical and non More ❯
ensure compatibility and efficiency. Significant previous datacenter experience in deployment, design or operations. Familiarity with CMDB tooling such as NetBox. Nice to have: Working knowledge and experience of using Infiniband fabrics Working knowledge of fat tree or rail-optimised designs for AI workloads Ability to perform performance level diagnostics on AI fabric Please Note: This role will require 50%+ More ❯
CUTLASS, CUB, Thrust, cuDNN and cuBLAS Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization and asynchronous memory loads Background in Infiniband, RoCE, GPUDirect, PXN, rail optimisation and NVLink, and how to use these networking technologies to link up GPU clusters An understanding of the collective algorithms supporting distributed GPU training in More ❯
environments. This is a hands-on role requiring deep technical acumen, exceptional problem-solving ability, and comfort working across a diverse set of technologies including GPUs (NVIDIA and AMD), InfiniBand networking, and orchestration systems like Slurm. What You'll Be Doing Provide expert-level support for customer HPC and AI workloads running in production. Troubleshoot complex system-level issues across … with system-level debugging, including kernel modules and network interfaces. Experience with GPU compute platforms (NVIDIA and/or AMD) and associated libraries. Familiarity with MPI libraries (e.g., OpenMPI), InfiniBand, and high-speed Ethernet networking. Solid Linux administration skills and troubleshooting experience. Working knowledge of HPC container runtimes (e.g., Singularity, Apptainer). Exposure to provisioning and automation tools (e.g., Ansible More ❯
efforts: Manage multiple concurrent infrastructure validation cycles, define and track KPIs, and build repeatable processes. Monitor and troubleshoot distributed systems: Perform end-to-end diagnostics across compute, fabric (e.g., InfiniBand), and storage layers. Stay current with cutting-edge trends in AI infrastructure such as NVIDIA Hopper/Blackwell architectures, model-serving patterns, and emerging ML system designs and disseminate insights … balancing deep engineering discussions with high-level business context. Qualifications: Technical depth in GPU-cloud infrastructure: Experience with large-scale GPU clusters using Kubernetes and/or SLURM over InfiniBand; deep understanding of the NVIDIA driver stack, NCCL performance tuning, and benchmarking. Strong customer or partner-facing experience: Able to bridge technical and business conversations, explain complex systems to mixed More ❯