2 of 2 Incident Response Jobs in South London

Senior Azure SaaS Reliability & Support Engineer

Hiring Organisation
Reveal Media
Location
Kingston Upon Thames, England, United Kingdom
error budgets across all deployments. Designing automation and tooling to improve reliability and reduce manual work.

Your Responsibilities and Tasks

1. Environment Health & Incident Response
Monitor ST and MT environments for server performance, response times, error rates, and application health. Detect and resolve database issues, stalled file …

4. Monitoring & Reporting
Implement and maintain Azure Monitor/Application Insights/Log Analytics dashboards for: environment uptime & performance; SLA compliance & error budget tracking; incident trends and recurring issue analysis. Provide regular reliability reports and improvement recommendations to stakeholders.

5. Continuous Improvement & Knowledge Sharing
Feed recurring issues and systemic

Senior Infrastructure Support Engineer

Hiring Organisation
Nscale
Location
South London, UK
Employment Type
Full-time
innovation, and environmental responsibility. At Nscale, our Support and Operations team plays a critical role in maintaining service availability, driving service reliability, and responding rapidly to customer tickets. We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work … Practical experience with GPU drivers and GPU log investigation tools, e.g. nvidia-smi. Performance diagnostics using NCCL on large-scale clusters. Observability and incident response: build and use alerting stacks and dashboards, interpret metrics and alerts, and drive runbooks to resolution; contribute to SLOs and post‐incident