PyTorch, or Hugging Face Transformers. Good understanding of programming/scripting (e.g., Python, Go) for customizing solutions, creating scripts, or automating tasks. Experience with AI-relevant infrastructure, including Networking (InfiniBand and RoCE), Storage (FC, IP and scale-out) and AI accelerators (GPUs etc.). Excellent presentation skills - ability to value-sell and deliver engaging workshops to both technical and non …
YOUR QUALIFICATIONS: 3+ years of experience in infrastructure operations, system administration, or technical support, ideally within HPC or GPU-accelerated environments. Strong troubleshooting skills with high-performance networking technologies (InfiniBand, RDMA, or similar). Familiarity with NVIDIA GPU technology, HPC architectures, storage solutions and high-performance file systems. Hands-on experience with monitoring tools and system management for large-scale …
at scale. Experienced in Linux performance benchmarking, tuning, and troubleshooting. Familiarity with distributed storage solutions like Lustre and Ceph. Knowledgeable in networking technologies and protocols, including Ethernet and ideally InfiniBand. Proactive and solution-oriented mindset. Excellent problem-solving skills. Initiative-driven and able to take ownership. What we offer: Diverse and internationally distributed team: joining our team means becoming part …
CUTLASS, CUB, Thrust, cuDNN and cuBLAS. Intuition about the latency and throughput characteristics of CUDA graph launch, tensor core arithmetic, warp-level synchronization and asynchronous memory loads. Background in InfiniBand, RoCE, GPUDirect, PXN, rail optimisation and NVLink, and how to use these networking technologies to link up GPU clusters. An understanding of the collective algorithms supporting distributed GPU training in …
ensure compatibility and efficiency. Significant previous datacenter experience in deployment, design or operations. Familiarity with CMDB tooling such as NetBox. Nice to have: Working knowledge and experience of using InfiniBand fabrics. Working knowledge of fat-tree or rail-optimised designs for AI workloads. Ability to perform performance-level diagnostics on AI fabric. Please note: This role will require 50%+ …
environments. This is a hands-on role requiring deep technical acumen, exceptional problem-solving ability, and comfort working across a diverse set of technologies including GPUs (NVIDIA and AMD), InfiniBand networking, and orchestration systems like Slurm. What You'll Be Doing: Provide expert-level support for customer HPC and AI workloads running in production. Troubleshoot complex system-level issues across … with system-level debugging, including kernel modules and network interfaces. Experience with GPU compute platforms (NVIDIA and/or AMD) and associated libraries. Familiarity with MPI libraries (e.g., OpenMPI), InfiniBand, and high-speed Ethernet networking. Solid Linux administration skills and troubleshooting experience. Working knowledge of HPC container runtimes (e.g., Singularity, Apptainer). Exposure to provisioning and automation tools (e.g., Ansible …
efforts: Manage multiple concurrent infrastructure validation cycles, define and track KPIs, and build repeatable processes. Monitor and troubleshoot distributed systems: Perform end-to-end diagnostics across compute, fabric (e.g., InfiniBand), and storage layers. Stay current with cutting-edge trends in AI infrastructure, such as NVIDIA Hopper/Blackwell architectures, model-serving patterns, and emerging ML system designs, and disseminate insights … balancing deep engineering discussions with high-level business context. Qualifications: Technical depth in GPU-cloud infrastructure: Experience with large-scale GPU clusters using Kubernetes and/or SLURM over InfiniBand; deep understanding of the NVIDIA driver stack, NCCL performance tuning, and benchmarking. Strong customer or partner-facing experience: Able to bridge technical and business conversations, explain complex systems to mixed …