15 of 15 Permanent Slurm Workload Manager Jobs in London

Senior Engineering Lead, Chem-Bio London, UK

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

record of leading technical work in a team. Strong infrastructure and platform skills — experience with cloud environments (AWS), container orchestration (Kubernetes), and job scheduling (Slurm or similar). Demonstrated experience leading or managing engineers — whether through formal line management, tech-leading a team, or running hiring pipelines. ...

Enterprise Architect - AI

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

continuous, declarative platform delivery. Pipeline orchestration: Kubeflow Pipelines, Apache Airflow, or Argo Workflows to orchestrate multi-stage training, fine-tuning, and inference pipelines. Cluster & workload scheduling: Slurm, Run:ai, and NVIDIA Base Command Manager for GPU job scheduling; Kubernetes-native GPU scheduling including device plugins ...

Founding AI Infrastructure Engineer

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

technical direction of the company from day one. What we’re looking for: Large‐scale distributed training. PyTorch and modern deep‐learning frameworks. Kubernetes, Slurm or GPU orchestration platforms. AWS and specialist GPU cloud providers. High‐performance computing and distributed systems. Training optimisation, memory management and networking. MLOps tooling ...

ML Infrastructure Engineer

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

Claude Code, Codex, Kimi Code, Pi Agent, Droid, or similar agentic coding systems as a development surface. Experience with GPU clusters on Kubernetes, Slurm, Ray, custom schedulers, or cloud GPU orchestration. NCCL, UCX, NVSHMEM, RDMA, InfiniBand, RoCE, or EFA. Rust, C++, CUDA, Go, or systems‐level performance work ...

HPC Specialist Architect - Energy Industry (AWS)

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

more of the following programming languages: C++, Python, CUDA, or Bash. Experience in architecting an HPC platform with scheduling middleware (e.g., Slurm, Torque, Symphony or GridServer) and in deployment, tuning and management of HPC technologies in a multi‐user environment. High-level understanding of the underlying infrastructure platform ...

Research Engineer, Pre-Training

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

Background in numerical computing, HPC, or distributed systems, including familiarity with GPUs/TPUs, high-performance networking (NVLink/InfiniBand), Kubernetes/Slurm, and OS internals. Expertise in Python and deep experience with modern deep learning frameworks (PyTorch and/or JAX). Advanced degree (MS or PhD) in Computer ...

Lead AI Infrastructure & Distributed Systems Engineer

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

engineer who thrives in early stage startup environments and prefers broad systems ownership over narrow specialisation. Technical Expertise: Strong production background with AWS, Kubernetes, Slurm, PyTorch, and distributed training frameworks. Deep hands‐on experience with GPU compute optimisation, cluster scheduling, and high performance networking is essential. Relevant Background: Experience ...

Research Engineer, Machine Learning – Paris/London/Zurich/Warsaw

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

+ years working on large‐scale ML codebases. Hands‐on with PyTorch, JAX or TensorFlow; comfortable with distributed training (DeepSpeed/FSDP/SLURM/K8s). Experience in deep learning, NLP or LLMs; bonus for CUDA or data‐pipeline chops. Strong software‐design instincts: testing, code review ...

Senior AI Infrastructure Engineer - Scale Multi-GPU Training

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

will architect and optimize distributed training across multiple GPUs and machines in AWS, eliminate bottlenecks in the data path, and manage cluster orchestration with Slurm and Kubernetes. The role requires deep PyTorch expertise, familiarity with transformer models, and experience deploying production AI systems. ...

AI Infrastructure Engineer

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

will eliminate bottlenecks in the data path to ensure training is as fast and capital-efficient as possible, alongside managing cluster orchestration using Slurm and Kubernetes while preparing to expand into specialised GPU providers. And finally, you will master the stack from PyTorch-based learning libraries to complex data ...

AI Inference Engineer

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

multi-tenant serving or SLA-driven infrastructure. Background at a hyperscaler, frontier AI lab, or large-scale distributed inference system. Familiarity with Kubernetes/Slurm for cluster orchestration. Interest or experience in energy markets, grid systems, or sustainability-focused compute. Benefits Competitive salary and an equity sign-on bonus. ...

High Performance Computer Scientist / HPC Developer

Hiring Organisation: IT Graduate Recruitment
Location: London, South East, England, United Kingdom
Employment Type: Full-Time
Salary: £50,000 per annum

large-scale distributed systems. Research experience involving computational workloads. Experience with parallel programming (MPI, OpenMP, CUDA). Knowledge of scheduling systems such as Slurm, PBS or LSF. Contributions to technical projects, open source or research communities. Experience working with advanced computing environments. Academic Focus We are particularly interested … Parallel Programming, MPI, OpenMP, Multithreading, Concurrency, Algorithms, Data Structures, Systems Design, Kernel Development, Networking, Storage Systems, Distributed Storage, Automation, Infrastructure Automation, Shell Scripting, Bash, Slurm, PBS, LSF, Workload Scheduling, Resource Management, Linux Administration, Server Infrastructure, Cloud Infrastructure, AWS HPC, Azure HPC, Data Processing, Machine Learning Infrastructure, AI Infrastructure ...

Solution Architect - GPU & HPC

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

winning them requires more than a great sales team. The Solutions Architect sits at the intersection of sales, infrastructure, and the customer, translating complex workload requirements into technically sound, commercially viable solutions on the Hyperstack platform. You’ll be the primary technical authority through the sales cycle: engaging directly … proposal, and delivery handover — acting as the primary technical authority for GPU cloud solution design. Engage directly with prospective and existing customers to understand workload requirements, technical constraints, and commercial objectives, producing detailed solution designs including architecture diagrams, network topology, storage configurations, and GPU resource allocation models. Collaborate closely ...

Senior Software Engineer, Inference Platform

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

model lifecycle management is highly desirable. Bonus/Good to Have: HPC & Cluster Management: Experience handling large‐scale HPC clusters using Kubernetes and Slurm for job scheduling, resource allocation, and workload orchestration. Data Engineering: Expertise with data pipelines, ETL systems, and large‐scale data processing frameworks. Systems-Level ...

Senior Staff+ Software Engineer (Kubernetes Platform)

Hiring Organisation: Jobleads-UK
Location: Greater London, England, United Kingdom

controllers, so it stays responsive as object counts and node counts grow by orders of magnitude. And we build the core cluster services every workload depends on, like service discovery, so they hold up under the same pressure. We make sure the control plane is fast, correct, and always … accelerator fleets, including custom scheduling plugins and policies for gang scheduling, topology awareness, and preemption. Scale the Kubernetes control plane (apiserver, etcd, controller-manager) to support clusters far beyond typical limits, and find the next bottleneck before it finds us. Design, build, and operate core cluster services such ...