Senior DevOps Engineer – AI & Cloud Infrastructure
Type: Permanent / Full-Time (Employment or Contract considered)
Location: Remote or Hybrid
Time Zones: UK, Europe, North America–friendly
The Opportunity
We’re working with a high-growth tech start-up building a next-generation AI cloud platform focused on fast, reliable inference for large language models and other compute-intensive workloads.
The platform combines modern cloud infrastructure, Kubernetes, GPU clusters, and developer-first tooling to support mission-critical AI systems operating across multiple regions.
They’re now looking for a Senior DevOps Engineer to take ownership of the infrastructure backbone — someone who enjoys operating complex systems at scale and working closely with infrastructure, ML, and product engineering teams.
What You’ll Be Doing
AI Cloud Infrastructure
- Design, build, and operate highly available, secure infrastructure supporting AI inference, fine-tuning, and data processing workloads
- Manage multi-region Kubernetes clusters, including GPU-heavy environments
- Implement autoscaling strategies across heterogeneous compute fleets
- Own and evolve infrastructure-as-code using tools such as Terraform and Helm
- Automate provisioning of compute, networking, and storage
- Build tooling to spin environments up and down for experiments, benchmarks, and customer deployments
- Design and maintain CI/CD pipelines across backend, infrastructure, and ML components
- Implement safe deployment strategies (e.g. blue/green, canary releases)
- Partner with engineers to improve build speed, test reliability, and deployment confidence
- Build and operate observability stacks (metrics, logging, tracing)
- Define and monitor SLOs / SLAs for latency, availability, and reliability
- Create runbooks, playbooks, and incident response processes for production systems
- Implement best practices around secrets management, access control, and network security
- Support secure, multi-tenant environments for enterprise customers
- Help foster a culture of operational excellence, ownership, and reliability
Requirements
- 4–8+ years’ experience in DevOps, SRE, Platform, or Infrastructure Engineering
- Strong experience running production systems on major cloud platforms (AWS, GCP, or Azure)
- Deep hands-on experience with Kubernetes in production
- Strong Infrastructure-as-Code skills (Terraform or equivalent)
- Proficiency in at least one scripting or programming language (e.g. Python, Go, Bash)
- Solid understanding of networking, security fundamentals, and distributed systems
- Proven experience building reliable, observable, automated systems
Nice to Have
- Experience supporting GPU-based workloads or ML infrastructure
- Exposure to AI / ML platforms, inference systems, or data pipelines
- Familiarity with modern CI/CD tooling and GitOps approaches
- Experience with observability tooling (metrics, logs, tracing)
- Background in cloud platforms, AI infrastructure, or high-scale SaaS environments
What’s on Offer
- Work on core infrastructure powering cutting-edge AI systems
- High impact and ownership over architecture and tooling decisions
- Collaboration with senior engineers across infrastructure, ML, and product
- Competitive compensation, equity, and long-term growth potential
- Flexible remote / hybrid working