Senior AI Infrastructure Engineer (OpenStack & Kubernetes)
Location: Remote (UK or EU preferred)
Sector: High-Performance GPU Cloud Computing
The Opportunity
I am representing a fast-growing, international scale-up that is building next-generation GPU cloud infrastructure. This company is the powerhouse behind a high-performance platform designed specifically for the most demanding AI, Machine Learning, and HPC workloads.
As they scale their global footprint to meet growing demand, they are seeking a Senior AI Infrastructure Engineer who values deep technical autonomy. This is a role for a specialist who wants to move fast, solve complex problems, and take direct ownership of the stability and scalability of business-critical systems.
What You’ll Be Doing
- Owning Infrastructure: Designing, deploying, and operating OpenStack and Kubernetes clusters optimized for multi-tenant GPU workloads.
- Driving Automation: Building and maintaining infrastructure-as-code and GitOps practices to ensure seamless scalability.
- Optimizing Performance: Enabling reliable workload scheduling through Kubernetes-native tooling, container runtime optimization, and NVIDIA integrations.
- Ensuring Resilience: Maintaining high availability and observability through proactive monitoring, logging, and incident response.
- Strengthening Security: Implementing strong controls, including RBAC and network policies, to ensure tenant isolation.
- Collaborating Across Teams: Working closely with DevOps, AI, and Product teams to align infrastructure capabilities with customer needs.
The Ideal Profile
- OpenStack Expert: Significant hands-on experience operating OpenStack in a production environment.
- K8s Specialist: Strong experience running production-grade Kubernetes, ideally in bare-metal or private cloud setups.
- Systems Generalist: A solid grounding in Linux, networking, and storage with a practical approach to troubleshooting.
- Modern Workflows: Experience with infrastructure automation, CI/CD, and Git-based workflows.
- Scale-up Mindset: The ability to thrive in a fast-moving environment with a strong sense of accountability.
Nice to Have
- Exposure to GPU-based infrastructure, large-scale compute platforms, or HPC.
- Familiarity with advanced networking technologies.
- Contributions to open-source or cloud-native communities.
What’s on Offer?
- Impact: The opportunity to make a visible, meaningful impact on a platform used by teams running compute-heavy applications.
- Flexibility: Flexible working arrangements, including remote or hybrid options.
- Growth: Clear career progression and the chance to help shape the company's culture and future.
- Culture: A collaborative, transparent, and international culture built on trust.
- Benefits: Competitive salary, annual discretionary bonus, 25 days holiday (plus public holidays), and wellbeing benefits.