Senior Site Reliability Engineer
High-growth infrastructure company focused on delivering large-scale compute, data centre capacity, and power solutions for advanced machine learning workloads. Platforms support leading research and industry teams requiring high-performance computing at significant scale. Fast-paced environment with emphasis on ownership, execution speed, and quality. Culture centred on pragmatic problem-solving, cross-functional collaboration, and full lifecycle responsibility.
Role Overview
- Position operating across software, infrastructure, and operations to ensure reliability, scalability, and performance of a globally distributed compute platform.
- Close collaboration with networking, platform engineering, and physical infrastructure teams to design and operate systems supporting high-demand computational workloads.
- Hands-on engineering role requiring strong systems expertise, with responsibility for resolving complex production issues, improving system resilience, and enhancing platform observability.
Responsibilities
- Deployment and management of large-scale compute clusters using automation tooling, with adaptation to customer requirements
- Validation and optimisation of compute, storage, and networking systems in coordination with internal teams and vendors
- Execution of large-scale data migrations between cloud and on-premise environments with focus on efficiency and cost
- Troubleshooting across the full stack, including hardware, networking, and distributed systems
- Development of internal tooling and automation to improve deployment speed, reliability, and operational efficiency
- Required participation in an on-call rotation (approximately one week per month)
Key Attributes
- Strong ownership mindset with focus on delivery and accountability
- Experience building maintainable, well-documented systems in complex environments
- Ability to operate effectively in ambiguous and rapidly evolving contexts
- Clear and effective communication skills with collaborative, low-ego approach
Minimum Requirements
- 5+ years of experience in site reliability engineering, DevOps, systems administration, or high-performance computing
- Strong written and verbal communication skills in English
- Experience deploying and operating container orchestration or workload scheduling systems (e.g. Kubernetes or similar)
- Programming or scripting experience in Go, Python, or Bash
- Familiarity with infrastructure automation and infrastructure-as-code tools
- Strong technical foundation in computing or a related discipline
Preferred Experience
- Experience operating large-scale machine learning or AI-compute workloads
- Background in multi-tenant distributed systems at scale
- Hands-on experience with data centre or bare-metal infrastructure
- Knowledge of high-performance networking technologies
- Experience managing large-scale storage systems (commercial or open-source)
Compensation & Benefits
- Competitive salary and equity package
- Retirement or pension contributions aligned with local standards
- Health coverage including medical, dental, and vision
- Generous paid time off policy