Site Reliability Engineer - Data Centers
- Hiring Organisation: TGS International Group
- Location: Portsmouth, England, United Kingdom
Site Reliability Engineer (SRE) – GPU Infrastructure Data Centres
Fully Remote Role - Work from home

The Site Reliability Engineer (SRE) is responsible for the end-to-end validation, testing, and readiness of GPU compute clusters prior to production release. The role ensures that all hardware, networking …

- Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity
- Diagnose and remediate unhealthy nodes through configuration changes or software fixes
- Coordinate with on-site support teams for hardware replacements when required
- Ensure all issues are resolved and documented prior to handover to production operations

Documentation & Handover
- Produce clear ...