targets are met. W ork with customers and multiple stakeholders to understand requirements and challenges, provide reporting on usage, workflow and billing. Technical responsibilities Cluster Infrastructure management: Managing the Nvidia GPU cluster.? High availability and resilience: Implement failover strategies and manage maintenance events to minimize downtime. Resource allocation and optimization: Resource partitioning (GPU resources), workload scheduling, capacity planning Performance monitoring … and troubleshooting: Performance analysis, monitoring ( Realtime) with available Nvidia and HPE tools? Incident response: node failure management, network issues, driver issues, troubleshooting common issues and then working with vendor support to resolve any critical issues. Security and access control: Manage user permissions, RBAC, security hardening, data protection.? Required Skills & Experience 10 years of experience (or equivalent) in technical support, system … and data scientists , and how to optimize the experience. Core Technical skills: System administration experience with OS's like RHEL/CentOS, Ubuntu, tuning Linux kernel Proficiency with Ansible, Nvidia and CUDA toolkits, Kubernetes, and container orchestration Understanding of automation, monitoring, and security with GPU as a service Preferred experience Experience supporting HPE PCAI or other AI/HPC infrastructure More ❯
practices Support platform onboarding, configuration, and usage guidance Collaborate with infrastructure and AI engineering teams on integrations and performance improvements Maintain SLAs and ensure customer satisfaction Technical Focus Manage Nvidia GPU clusters and related infrastructure Implement failover, resilience, and resource optimization strategies Oversee capacity planning and workload scheduling Monitor performance using Nvidia and HPE tools Manage incident response, node failures More ❯
practices Support platform onboarding, configuration, and usage guidance Collaborate with infrastructure and AI engineering teams on integrations and performance improvements Maintain SLAs and ensure customer satisfaction Technical Focus Manage Nvidia GPU clusters and related infrastructure Implement failover, resilience, and resource optimization strategies Oversee capacity planning and workload scheduling Monitor performance using Nvidia and HPE tools Manage incident response, node failures More ❯