Site Reliability Engineer
Role Overview
We are seeking highly skilled Site Reliability Engineers (SREs) to join a fast-paced infrastructure team supporting enterprise-scale platforms. This role sits at the intersection of development and operations, focusing on building scalable, resilient, and automated infrastructure systems.
The ideal candidate takes an automation-first approach, is comfortable working in production environments, and has experience with container orchestration, CI/CD pipelines, and Infrastructure as Code.
Key Responsibilities
- Design, implement, and maintain scalable, highly available production systems
- Automate operational tasks using Shell scripting (Bash/Zsh)
- Contribute to and support Python-based application components
- Manage and optimise Kubernetes clusters and containerised deployments
- Build and maintain CI/CD pipelines using Spinnaker and GitHub Actions
- Implement Infrastructure as Code (IaC) using Pulumi
- Perform system monitoring, troubleshooting, and root cause analysis
- Participate in the on-call rotation and incident response
- Improve system reliability, performance, and observability
- Collaborate with development teams to enhance deployment and release processes
Required Skills & Experience
Programming & Scripting
- Strong experience with Shell scripting (Bash/Zsh)
- Solid Python programming experience
- Automation mindset with experience eliminating manual processes
Containerisation & Orchestration
- Strong hands-on experience with Kubernetes (K8s)
- Docker containerisation expertise
- Experience managing production-grade clusters
CI/CD & Deployment
- Experience with Spinnaker
- Hands-on experience with GitHub Actions
- Strong understanding of modern DevOps practices
Infrastructure & Cloud
- Infrastructure as Code using Pulumi
- Strong understanding of cloud-native architecture principles
- Experience managing scalable distributed systems
Version Control
- Experience with Git
- GitHub workflows and branching strategies
Preferred Experience
- Experience working in large-scale enterprise or high-availability environments
- Strong troubleshooting and production support experience
- Familiarity with monitoring and observability tooling
- Experience with high-traffic, performance-sensitive systems