Lead Site Reliability Engineer
Sunderland, United Kingdom
Tombola
our critical systems and services are always reliable, available, and performing at their best.

What will you be doing?
As an SRE, you'll be instrumental in implementing automation, monitoring, and incident response strategies to minimize downtime and optimize our operations. You'll collaborate closely with our development, infrastructure, and security teams, balancing exciting new feature delivery … before they impact users, and respond to failures and outages effectively.

Monitoring and Alerting
Implement monitoring systems: Set up and maintain robust monitoring systems (e.g., Dynatrace) for application performance, infrastructure health, and system metrics.
Alerting: Create and manage alerting systems that notify us of issues or potential risks in a timely manner, minimizing impact on our players.
Metrics collection … fast as possible.

Post-incident analysis: After resolving incidents, perform root cause analysis (RCA), including a post-incident review, and document findings to prevent similar issues in the future.

Automation and Efficiency
Automate manual tasks: Automate repetitive operational tasks to boost efficiency, reduce human error, and accelerate delivery.
Infrastructure automation: Utilise Terraform, Git, and TeamCity to automate …
Employment Type: Permanent
Salary: GBP Annual