Ethernet), processors (Intel/AMD/ARM/NVIDIA), parallel file systems, and data center infrastructure. Additional skills in MPI, parallel job scheduling (e.g., SLURM), and management & monitoring tools (e.g., Icinga, Prometheus, Grafana) are advantageous. Requirements: Eligible and willing to undergo UK Govt. security clearance. Proven experience as a …
with Kubernetes. Familiar with inference servers such as multi-LoRA, LoRA Exchange, TitanML etc. Experience creating/managing multi-node HPC clusters. Experience with Workload Managers like SLURM, Kepler, Moab etc. Experience working with some of the more recent LLMs (OpenAI, Mistral, Claude, LLaMA etc). What's in it …
Saffron Walden, Essex, South East, United Kingdom Hybrid / WFH Options
EMBL-EBI
or more modalities. Experience developing or integrating image visualisation systems. Experience with NoSQL databases, such as MongoDB. Experience with batch scheduling systems such as SLURM. Experience with containerisation (e.g. Docker) and container orchestration (e.g. Kubernetes). Infrastructure-as-code deployment tools such as Ansible or Terraform. Experience working in an …
systems, CI/CD, etc.) Attention to detail needed to manage and debug production services. Experience with research clusters and implementing tools such as Slurm workload manager. Job Duties: Own the lifecycle of our Linux-based servers and applications across our multiple business environments. Automate and troubleshoot a …
magnitude of training runs. Explore novel synthetic data generation techniques. Engineer robust, high-performance inference. Experience Technical: Have experience operating orchestration systems such as SLURM, Ray, or similar. Experience in creating and managing multi-instance clusters for data and model parallel training across GPUs/TPUs, preferably using PyTorch …
high-performance inference platforms. Collaborate in defining and steering their evolving inference and training stack. Experience Technical: Have experience operating orchestration systems such as SLURM, Ray, or similar. Experience in creating and managing multi-instance clusters for data and model parallel training across GPUs/TPUs, preferably using PyTorch …
of your team, ideally for a leading AI research laboratory, or a pioneering AI business. Key Requirements: Python and PyTorch expertise. Experience in SLURM, Ray, or similar. Graphics Processing Units (GPUs). Experience in creating and managing HPC clusters for ML models. Experience in efficiently serving large ML models …
with key stakeholders for enterprise customers. Technical Experience: High Performance Computers (Supporting Users); Configuration and management of HPC Infrastructure; Linux; MPI; InfiniBand; Job schedulers; SLURM. Contract Details: PAYE Contract - Competitive Rate; 18 Months Contract; Remote - UK Based; Including Training and Upskilling. It’s an amazing opportunity to be a …
Python and Bash, expertise in automation tools like Ansible, and experience with operating platforms at scale using cluster management systems like Kubernetes, OpenStack and Slurm. Additionally, you will be actively involved in troubleshooting networking issues, and deploying infrastructure as code using CI/CD pipelines. Key Responsibilities: Linux Administration … provisioning, configuration management, and application deployment. Platform Operations at Scale: Experience in operating platforms at scale, utilising cluster management systems such as Kubernetes or Slurm to manage high-performance computing workloads efficiently. Networking Skills: Strong networking skills including troubleshooting network issues, understanding network topology, protocols, and ensuring efficient traffic … adapt to a fast-paced, dynamic work environment and prioritise tasks effectively. Certifications such as Certified Kubernetes Administrator (CKA), Certified OpenStack Administrator or Certified Slurm Administrator (CSA) would be advantageous. …
provisioning, configuration management, and application deployment. Platform Operations at Scale: Experience in operating platforms at scale, utilising cluster management systems such as Kubernetes or Slurm to manage high-performance computing workloads efficiently. Networking Skills: Strong networking skills including troubleshooting network issues, understanding network topology, protocols, and ensuring efficient traffic … environment. Capacity to adapt to a fast-paced, dynamic work environment and prioritise tasks effectively. Certifications such as Certified Kubernetes Administrator (CKA) or Certified Slurm Administrator (CSA) would be advantageous. …