or equivalent proven track record) 4+ years working on large-scale ML codebases Hands-on with PyTorch, JAX or TensorFlow; comfortable with distributed training (DeepSpeed/FSDP/SLURM/K8s) Experience in deep learning, NLP or LLMs; bonus for CUDA or data-pipeline chops Strong software-design instincts: testing, code review, CI/CD Self-starter, low …
Chemistry and Biology Can communicate with ML engineers Demonstrates competence and rigor in software development. Has experience working with scientific computing/lab environments (e.g. has used or administered SLURM) Conversant with cloud computing; able to provide requirements to DevOps engineers ABOUT IAMBIC THERAPEUTICS Iambic is a clinical-stage life-science and technology company developing novel medicines using its …
and microbial genomics. Clear communicator, curious learner, and team-oriented problem solver. Desirable Knowledge, Skills and Experience: Experience with cloud platforms (e.g. OCI, AWS, GCP) or HPC environments (e.g. Slurm). Familiarity with both long- and short-read technologies (e.g. ONT, Illumina). Basic knowledge of metagenomics, antimicrobial resistance analytics or public health monitoring. Interest in data visualisation or machine …
the future of healthcare today. This company is on the hunt for HPC Engineers to power their 25-petabyte system. Sound good? Well, there's more! Imagine working with Slurm clusters and GPFS storage, all while being an integral part of groundbreaking translational research. You will work in a dynamic team of five, where your hands-on expertise will support …
research engineer, you will play a pivotal role in managing and optimising a large-scale infrastructure. Your expertise in Linux systems, along with experience in High-Performance Computing (HPC), Slurm workload management, and advanced storage solutions, will be essential to ensuring smooth and efficient operations. You'll be working alongside some of the brightest minds in research, directly …
if you have: Extremely strong software engineering skills. Proficiency in Python and related ML frameworks such as JAX, PyTorch and XLA/MLIR. Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray). Experience using large-scale distributed training strategies. Hands-on experience training large models at scale. Hands-on experience with the post-training phase …
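Several of these listings ask for hands-on experience with distributed training strategies. As an illustrative sketch only (plain Python, no framework, names hypothetical and not taken from any listing), the core idea of synchronous data-parallel training is: each worker computes a gradient on its own data shard, the gradients are averaged across workers (the all-reduce step that NCCL or Gloo performs across GPUs or nodes), and every worker then applies the identical update:

```python
# Toy synchronous data-parallel training on y = w * x.
# Each "worker" is simulated in-process; real systems run one
# process per GPU and replace all_reduce_mean with a collective.

def local_gradient(weights, shard):
    # Gradient of mean squared error for y = w * x on one shard.
    w = weights[0]
    g = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return [g]

def all_reduce_mean(grads):
    # Element-wise average of per-worker gradients.
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]

def train_step(weights, shards, lr=0.05):
    grads = [local_gradient(weights, s) for s in shards]
    avg = all_reduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, avg)]

# Two workers, shards consistent with y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0]
for _ in range(200):
    weights = train_step(weights, shards)
print(round(weights[0], 3))  # → 3.0
```

Because every worker sees the same averaged gradient, all replicas stay bit-identical after each step — which is why frameworks like DDP or FSDP only need to synchronize gradients (or shards of them), not the full model state.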
based on Nvidia and AWS infrastructure - Acceleration Technologies - Optimization of Nvidia and Trainium-based GPU cluster architectures for ML/AI applications using CUDA/Neuron/EKS/Slurm - Performance Tuning - Maximizing efficiency and throughput across compute-intensive tasks, with knowledge of Nvidia NVLink, AWS Neuron and AWS EFA technologies - Cost Optimization - Strategic resource allocation on a range …
and apply today! Responsibilities: Design scalable and secure infrastructure across Azure, on-prem, and possibly other cloud platforms. Architect and guide the setup/configuration of HPC clusters (e.g., SLURM) to support large-scale statistical workloads. Design and support environments for Python, R, and SAS that meet compliance, reproducibility, and performance standards. Implement security, access control, and compliance practices … in life sciences). Skills/Must have: Hands-on experience with Azure and hybrid cloud environments, including understanding of infrastructure architecture and deployment. Proficient in HPC systems like SLURM, including installation, configuration, and optimization for performance-heavy workloads. Strong Python knowledge with experience in installation, configuration, and scripting in Linux environments. Experience working with R and Python environments …
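To make "SLURM clusters supporting large-scale statistical workloads" concrete, here is a minimal sketch of the kind of batch script such jobs are submitted with. The helper, job name, and directive values are illustrative defaults invented for this example, not taken from any listing or site; the `#SBATCH` options themselves (`--job-name`, `--cpus-per-task`, `--mem`, `--time`) are standard Slurm directives:

```python
# Hypothetical helper that renders a minimal Slurm batch script for a
# Python statistical workload. All values are illustrative defaults.

def render_sbatch(job_name, script, cpus=4, mem_gb=16, hours=2):
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={hours:02d}:00:00",
        "module load python/3.11",  # module names vary by site
        f"srun python {script}",
    ]
    return "\n".join(lines)

print(render_sbatch("glm-fit", "fit_model.py"))
```

The rendered script would be handed to `sbatch`; the scheduler queues it until the requested CPUs, memory, and walltime are available, which is what lets one cluster multiplex many statisticians' R and Python jobs.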
support for customers who require specialized security solutions for their cloud services. At AWS AI, we want to make it easy for our customers to train their deep learning workloads in the cloud. With Amazon SageMaker Training, we are building customer-facing services to empower data scientists and software engineers in their deep learning endeavors. As our customers rapidly … next-generation AI compute platform that's optimized for LLMs and distributed training. … are essential to success in this role. You have solid experience in multi-threaded asynchronous C++ or Go development. You have prior experience in one of: resource orchestrators like Slurm/Kubernetes, high-performance computing, building scalable systems, or large language model training. This is a great team in which to have a huge impact on AWS and …
compute, storage, and interconnects. Technologies involved include RDMA fabrics, parallel filesystems, HPC batch schedulers, FUSE filesystems, internal Jump software, multi-vendor hardware, cybersecurity requirements, a challenging and unpredictable client workload, and high user expectations. Solve problem reports and questions posed by members of Jump's research community, escalating as needed and managing the entire problem lifecycle. Respond to alerts … desire for operational work as a primary job function 2+ years of professional experience with Linux systems Experience with high-performance computing (HPC), including parallel filesystems (e.g., Lustre, GPFS), batch systems (e.g., Slurm, Grid Engine), and high-performance network interconnects is a plus, but not required High proficiency with at least one programming/scripting language (e.g., Go, Python, C) and …
a Lead HPC Engineer, you'll be at the forefront of designing, optimising, and managing advanced computational infrastructure. You'll have a solid grasp of all things HPC: Linux, Slurm, and storage systems (bonus points if you're familiar with GPFS). Your expertise will ensure the systems are reliable, scalable, and high-performing, ready to support researchers in … about emerging technologies will be key to keeping our infrastructure at the forefront of innovation. We're looking for someone with deep expertise in HPC environments, including: Linux systems, workload management, parallel storage, and high-speed networking. You'll also bring strong leadership skills, inspiring and managing teams, while rolling up your sleeves to tackle technical challenges. Clear communication …