Site Reliability Engineer - Data Centres

Site Reliability Engineer (SRE) – GPU Infrastructure Data Centres

Fully Remote Role - Work from home

The Site Reliability Engineer (SRE) is responsible for the end-to-end validation, testing, and readiness of GPU compute clusters prior to production release. The role ensures that all hardware, networking, and system components meet operational and reliability standards before customer workloads are deployed.

Working closely with global infrastructure and engineering teams, the SRE plays a critical role in maintaining the quality, stability, and integrity of high-performance compute environments.

Key Responsibilities

Cluster Validation & Testing

Validate GPU clusters of varying sizes to ensure hardware and system integrity prior to production release
Perform functional and reliability testing of GPUs, servers, and associated components
Verify network connectivity and performance, including high-speed interconnects where applicable

Orchestration & Benchmarking

Provision and configure GPU clusters using automated workflows
Execute and analyse performance and stability benchmarks orchestrated via a workload scheduler
Validate results against expected performance and reliability thresholds

Test Framework & Automation

Maintain and extend the automated validation framework built using Python and Ansible
Integrate new test cases to support additional hardware platforms and GPU generations
Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity

Diagnose and remediate unhealthy nodes through configuration changes or software fixes
Coordinate with on-site support teams for hardware replacements when required
Ensure all issues are resolved and documented prior to handover to production operations

Documentation & Handover

Produce clear, accurate documentation of test results, hardware states, and remediation actions
Ensure smooth handovers to operations and engineering teams
Maintain up-to-date runbooks and validation procedures

Team Collaboration & Training

Work as part of a distributed, international infrastructure and engineering team
Participate in knowledge sharing, process improvement, and technical reviews
The working language is English; additional language skills are beneficial

Shift & Availability Requirements

Ability to work independently within a remote environment
Reliable internet connection and suitable home working setup
Role is fully remote; company hardware will be provided

Skills & Experience

Essential

Strong hands-on experience administering and troubleshooting Linux systems
Confident use of CLI tools for diagnostics, including analysis of kernel logs, drivers, and system services
Proven experience writing and maintaining Ansible playbooks
Proficiency in Python for automation, test execution, and parsing results
Strong analytical and problem-solving skills with attention to detail
Excellent written and verbal English communication skills
High standards for system reliability, consistency, and documentation

Preferred / Desirable

Experience working with GPU-based or high-performance compute environments
Familiarity with workload schedulers (e.g. Slurm or similar tools)
Understanding of data centre hardware lifecycle and server validation processes
Exposure to high-speed networking technologies
Experience working with distributed or remote infrastructure teams

Performance & Success Metrics

Accuracy and completeness of cluster validation prior to production release
Reduction in post-deployment hardware or configuration issues
Quality and clarity of validation documentation and handover materials
Effectiveness of remediation and coordination with on-site teams
Reliability and maintainability of automated test frameworks
Collaboration and communication quality with engineering and operations teams
