Senior Site Reliability Engineer

High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.

Role Overview

  • Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.
  • Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
  • Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.

Responsibilities

  • Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements
  • Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors
  • Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost
  • Troubleshooting across the full stack, including hardware, networking, and distributed systems
  • Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency

Participation in an on-call rotation required (approximately one week per month).

Key Attributes

  • Strong ownership mindset with focus on delivery and accountability
  • Experience building maintainable, well-documented systems in complex environments
  • Ability to operate effectively in ambiguous and rapidly evolving contexts
  • Clear and effective communication skills with collaborative, low-ego approach

Minimum Requirements

  • 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing
  • Strong written and verbal communication skills in English
  • Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)
  • Programming or scripting experience in Go, Python, or Bash
  • Familiarity with infrastructure automation and infrastructure-as-code tools
  • Strong technical foundation in computing or a related discipline

Preferred Experience

  • Experience operating large-scale machine learning or AI-compute workloads
  • Background in multi-tenant distributed systems at scale
  • Hands-on experience with data centre or bare-metal infrastructure
  • Knowledge of high-performance networking technologies
  • Experience managing large-scale storage systems (commercial or open-source)

Compensation & Benefits

  • Competitive salary and equity package
  • Retirement or pension contributions aligned with local standards
  • Health coverage including medical, dental, and vision
  • Generous paid time off policy

Job Details

Company: Realm
Location: City of London, London, United Kingdom