Site Reliability Engineer

Role Overview

We are seeking highly skilled Site Reliability Engineers (SREs) to join a fast-paced infrastructure team supporting enterprise-scale platforms. This role sits at the intersection of Development and Operations, focusing on building scalable, resilient, and automated infrastructure systems.

The ideal candidate will be automation-first, comfortable working in production environments, and experienced in container orchestration, CI/CD pipelines, and Infrastructure as Code.

Key Responsibilities

  • Design, implement, and maintain scalable, highly available production systems
  • Automate operational tasks using Shell scripting (Bash/Zsh)
  • Contribute to and support Python-based application components
  • Manage and optimise Kubernetes clusters and containerised deployments
  • Build and maintain CI/CD pipelines using Spinnaker and GitHub Actions
  • Implement Infrastructure as Code (IaC) using Pulumi
  • Perform system monitoring, troubleshooting, and root cause analysis
  • Participate in the on-call rotation and incident response
  • Improve system reliability, performance, and observability
  • Collaborate with development teams to enhance deployment and release processes

Required Skills & Experience

Programming & Scripting

  • Strong experience with Shell scripting (Bash/Zsh)
  • Solid Python programming experience
  • Automation mindset with experience eliminating manual processes

Containerisation & Orchestration

  • Strong hands-on experience with Kubernetes (K8s)
  • Docker containerisation expertise
  • Experience managing production-grade clusters

CI/CD & Deployment

  • Experience with Spinnaker
  • Hands-on experience with GitHub Actions
  • Strong understanding of modern DevOps practices

Infrastructure & Cloud

  • Infrastructure as Code using Pulumi
  • Strong understanding of cloud-native architecture principles
  • Experience managing scalable distributed systems

Version Control

  • Git
  • GitHub workflows and branching strategies

Preferred Experience

  • Experience working in large-scale enterprise or high-availability environments
  • Strong troubleshooting and production support experience
  • Familiarity with monitoring and observability tooling
  • Experience with high-traffic, performance-sensitive systems

Job Details

Company: ALOIS UK
Location: City of London, London, United Kingdom