Lead Site Reliability Engineer Sunderland, UK
Sunderland, United Kingdom
Tombola
System uptime: Monitor and maintain the availability and reliability of critical systems and services, meeting all uptime SLAs (Service Level Agreements).
Incident management: Respond quickly to incidents, investigate root causes, and ensure effective postmortems and continuous improvement processes are in place.
Failure detection and response: Proactively identify potential failures or performance bottlenecks before they impact users, and respond … latency, request rates) to measure system health and performance.

Incident Response
Incident resolution: Work quickly to resolve incidents, minimize downtime, and restore service as fast as possible.
Post-incident analysis: After resolving incidents, perform root cause analysis (RCA), including a post-incident review, and document findings to prevent similar issues in the future.

Automation and Efficiency …
Employment Type: Permanent
Salary: GBP Annual