Site Reliability Engineer (Mid / Senior)

South West London (Hybrid – 1–2 days onsite)
Salary: Competitive + Benefits

We are looking for a Site Reliability Engineer to join a small, well-established infrastructure team supporting a highly available production environment. This is an opportunity to work across a modern, self-hosted platform spanning Kubernetes, physical infrastructure and automation, with a strong focus on Ubuntu-based systems.

The Role

As an SRE, you will play a key role in ensuring the availability, performance, security and resilience of production systems. Working in a small, collaborative team, you’ll take ownership of day-to-day platform operations, incident response and continuous improvement, while partnering closely with development teams to deliver reliable and scalable services.

Key Responsibilities

Administer and maintain Linux (Ubuntu) server environments
Manage self-hosted Kubernetes clusters and supporting infrastructure
Support on-premise infrastructure including physical servers and virtualisation platforms
Administer storage solutions including NFS, iSCSI and object storage
Build and maintain automation using Ansible or similar IaC tools
Develop operational tooling using Bash and Python
Monitor system health using tools such as Prometheus, Grafana, Zabbix or Nagios
Investigate and resolve production incidents (on-call rota involved)
Implement security hardening and infrastructure best practices
Manage backup and disaster recovery processes, including regular testing
Support and improve CI/CD pipelines and deployment processes
Collaborate with engineering teams to improve reliability and performance

Essential Skills

Strong Linux systems administration (Ubuntu preferred)
Experience running production Kubernetes environments
Solid understanding of networking (TCP/IP, DNS, routing, firewalls)
Experience with physical servers and virtualisation platforms
Hands-on experience with Ansible or other IaC tools
Scripting skills in Bash and Python
Experience with monitoring and alerting platforms
Knowledge of Linux storage technologies (NFS, iSCSI)
Experience with backup and disaster recovery
Exposure to Active Directory / Entra ID / endpoint management
Strong troubleshooting and problem-solving skills

Desirable Experience

Object storage, MariaDB or other database administration
CI/CD tools such as Jenkins
AWS (S3, Lambda, CloudFront) exposure
Terraform or additional IaC tooling
Experience with Harvester or similar platforms
Knowledge of security, compliance or GDPR

Why Apply?

Work on complex, real-world infrastructure (not just cloud-native)
High ownership in a small, collaborative team
Exposure to a broad modern tech stack across infra, Kubernetes and automation
Hybrid working with a competitive salary package
