Senior DevOps Engineer

Senior DevOps Engineer – AI & Cloud Infrastructure

Type: Permanent / Full-Time (Employment or Contract considered)

Location: Remote or Hybrid

Time Zones: UK, Europe, North America–friendly



The Opportunity

We’re working with a high-growth tech start-up building a next-generation AI cloud platform, focused on fast, reliable inference for large language models and other compute-intensive workloads.

The platform combines modern cloud infrastructure, Kubernetes, GPU clusters, and developer-first tooling to support mission-critical AI systems operating across multiple regions.

They’re now looking for a Senior DevOps Engineer to take ownership of the infrastructure backbone — someone who enjoys operating complex systems at scale and working closely with infrastructure, ML, and product engineering teams.

What You’ll Be Doing

AI Cloud Infrastructure
  • Design, build, and operate highly available, secure infrastructure supporting AI inference, fine-tuning, and data processing workloads
  • Manage multi-region Kubernetes clusters, including GPU-heavy environments
  • Implement autoscaling strategies across heterogeneous compute fleets
Infrastructure as Code & Automation
  • Own and evolve infrastructure-as-code using tools such as Terraform and Helm
  • Automate provisioning of compute, networking, and storage
  • Build tooling to spin environments up and down for experiments, benchmarks, and customer deployments
CI/CD & Release Engineering
  • Design and maintain CI/CD pipelines across backend, infrastructure, and ML components
  • Implement safe deployment strategies (e.g. blue/green, canary releases)
  • Partner with engineers to improve build speed, test reliability, and deployment confidence
Observability, Reliability & SRE
  • Build and operate observability stacks (metrics, logging, tracing)
  • Define and monitor SLOs / SLAs for latency, availability, and reliability
  • Create runbooks, playbooks, and incident response processes for production systems
Security & Best Practices
  • Implement best practices around secrets management, access control, and network security
  • Support secure, multi-tenant environments for enterprise customers
  • Help foster a culture of operational excellence, ownership, and reliability


What They’re Looking For

Essential
  • 4–8+ years’ experience in DevOps, SRE, Platform, or Infrastructure Engineering
  • Strong experience running production systems on major cloud platforms (AWS, GCP, or Azure)
  • Deep hands-on experience with Kubernetes in production
  • Strong Infrastructure-as-Code skills (Terraform or equivalent)
  • Proficiency in at least one scripting or programming language (e.g. Python, Go, Bash)
  • Solid understanding of networking, security fundamentals, and distributed systems
  • Proven experience building reliable, observable, automated systems
Nice to Have
  • Experience supporting GPU-based workloads or ML infrastructure
  • Exposure to AI / ML platforms, inference systems, or data pipelines
  • Familiarity with modern CI/CD tooling and GitOps approaches
  • Experience with observability tooling (metrics, logs, tracing)
  • Background in cloud platforms, AI infrastructure, or high-scale SaaS environments

Why Join
  • Work on core infrastructure powering cutting-edge AI systems
  • High impact and ownership over architecture and tooling decisions
  • Collaboration with senior engineers across infrastructure, ML, and product
  • Competitive compensation, equity, and long-term growth potential
  • Flexible remote / hybrid working

Job Details

Company: True North Group
Location: London, South East, England, United Kingdom (Hybrid / Remote Options)
Employment Type: Full-Time
Salary: £80,000 - £130,000 per annum
Posted