GPU/TPU acceleration, high-speed interconnects, and parallel file systems. Manage HPC environments, including Linux-based clusters, schedulers (e.g., Slurm), and high-performance storage systems (e.g., Lustre, BeeGFS, GPFS). Implement robust monitoring, fault-tolerance, and capacity management for high availability and reliability. Develop automation scripts and tools (Python, Bash, Ansible, Terraform, Go, Helm, etc.) for provisioning, configuration, and …

Proficiency in Linux
system administration, networking, and parallel computing (MPI, OpenMP, CUDA, or ROCm). Experience using HPC job schedulers (Slurm preferred) and parallel file systems (Lustre, BeeGFS, GPFS).

At the senior level: Extensive experience designing, deploying, and managing HPC clusters (or cloud computing) in scientific or research settings. Strong proficiency in Linux
system administration, networking, and parallel … computing (MPI, OpenMP, CUDA, or ROCm). Extensive expertise in administering HPC job schedulers (Slurm preferred) and parallel file systems (Lustre, BeeGFS, GPFS).

At all levels: Familiarity with containerization, workflow automation, and orchestration tools used in bioinformatics and AI/ML. Skilled in scripting and automation using Python, Bash, and configuration management tools (Ansible, Terraform). Demonstrated experience profiling