Site Reliability Engineer (Mid / Senior)
Location: South West London (Hybrid, 1–2 days onsite)
Salary: Competitive + Benefits
We are looking for a Site Reliability Engineer to join a small, well-established infrastructure team supporting a highly available production environment. This is an opportunity to work across a modern, self-hosted platform spanning Kubernetes, physical infrastructure and automation, with a strong focus on Ubuntu-based systems.
The Role
As an SRE, you will play a key role in ensuring the availability, performance, security and resilience of production systems. Working in a small, collaborative team, you’ll take ownership of day-to-day platform operations, incident response and continuous improvement, while partnering closely with development teams to deliver reliable and scalable services.
Key Responsibilities
- Administer and maintain Linux (Ubuntu) server environments
- Manage self-hosted Kubernetes clusters and supporting infrastructure
- Support on-premises infrastructure, including physical servers and virtualisation platforms
- Administer storage solutions including NFS, iSCSI and object storage
- Build and maintain automation using Ansible or similar IaC tools
- Develop operational tooling using Bash and Python
- Monitor system health using tools such as Prometheus, Grafana, Zabbix or Nagios
- Investigate and resolve production incidents (on-call rota involved)
- Implement security hardening and infrastructure best practices
- Manage backup and disaster recovery processes, including regular testing
- Support and improve CI/CD pipelines and deployment processes
- Collaborate with engineering teams to improve reliability and performance
Essential Skills
- Strong Linux systems administration (Ubuntu preferred)
- Experience running production Kubernetes environments
- Solid understanding of networking (TCP/IP, DNS, routing, firewalls)
- Experience with physical servers and virtualisation platforms
- Hands-on experience with Ansible or other IaC tools
- Scripting skills in Bash and Python
- Experience with monitoring and alerting platforms
- Knowledge of Linux storage technologies (NFS, iSCSI)
- Experience with backup and disaster recovery
- Exposure to Active Directory / Entra ID / endpoint management
- Strong troubleshooting and problem-solving skills
Desirable Experience
- Object storage, MariaDB or database administration
- CI/CD tools such as Jenkins
- Exposure to AWS (S3, Lambda, CloudFront)
- Terraform or additional IaC tooling
- Experience with Harvester or similar platforms
- Knowledge of security, compliance or GDPR
Why Apply?
- Work on complex, real-world infrastructure (not just cloud-native)
- High ownership in a small, collaborative team
- Exposure to a broad modern tech stack across infra, Kubernetes and automation
- Hybrid working with a competitive salary package