Site Reliability Engineer - Data Centers
- Hiring Organisation: TGS International Group
- Location: Portsmouth, England, United Kingdom

- … clusters using automated workflows
- Execute and analyse performance and stability benchmarks orchestrated via a workload scheduler
- Validate results against expected performance and reliability thresholds

Test Framework & Automation
- Maintain and extend the automated validation framework built using Python and Ansible
- Integrate new test cases to support additional hardware … platforms and GPU generations
- Improve test reliability, coverage, and execution efficiency

Remediation & System Integrity
- Diagnose and remediate unhealthy nodes through configuration changes or software fixes
- Coordinate with on-site support teams for hardware replacements when required
- Ensure all issues are resolved and documented prior to handover to production operations

...