Platform Manager
Guernsey, UK
World Wide Technology
Nvidia GPU cluster.
High availability and resilience: implement failover strategies and manage maintenance events to minimize downtime.
Resource allocation and optimization: resource partitioning (GPU resources), workload scheduling, capacity planning.
Performance monitoring and troubleshooting: performance analysis and real-time monitoring with available Nvidia and HPE tools.
Incident response: node failure management, network issues, driver issues, troubleshooting common issues and … processes (ticketing, escalation, troubleshooting)
Familiarity with cloud-based platforms, APIs, and distributed systems
Understanding of AI/ML concepts and tooling (model training, inference, data pipeline basics)
Experience with monitoring/logging tools (e.g., Grafana, Kibana, Splunk)
Excellent communication skills to interface with both customers and internal/vendor teams
Good understanding of tooling requirements for ML engineers and … skills:
System administration experience with operating systems such as RHEL/CentOS and Ubuntu, including Linux kernel tuning
Proficiency with Ansible, Nvidia and CUDA toolkits, Kubernetes, and container orchestration
Understanding of automation, monitoring, and security with GPU as a service
Preferred experience:
Experience supporting HPE PCAI or other AI/HPC infrastructure and platforms.
Experience with GPU resource allocation (across instances, GPUs …
Employment Type: Part-time