SRE Lead (Banking/Financial)
Job Description:
- Our client is transforming their production support function into a full Site Reliability Engineering (SRE) model. We are looking for a hands-on SRE Lead to establish and lead the SRE function, ensuring operational excellence across production systems.
Key Responsibilities:
- Lead the SRE function across the engineering organisation and drive operational excellence across production systems.
- Define and implement the observability and monitoring strategy, including dashboards, alerting, SLOs, SLAs, and error budgets.
- Establish comprehensive monitoring coverage to ensure visibility into system health, infrastructure, and business-critical workflows.
- Drive adoption of AI-driven tools and automation for proactive system troubleshooting, incident triage, and root cause analysis.
- Lead and mentor a team of Site Reliability Engineers embedded within engineering teams.
- Manage incident response processes, including on-call management and post-incident reviews.
- Collaborate with product and engineering teams to build reliability and observability into new systems.
- Monitor UI behaviour and end-to-end system performance, not just infrastructure metrics.
Essential Skills & Experience:
- Proven experience as an SRE Lead or Senior SRE in large-scale, high-availability production environments.
- Strong experience with observability, monitoring, and alerting tools such as Datadog, Grafana, Prometheus, PagerDuty, or similar.
- Experience managing incident response, on-call processes, and post-incident reviews.
- Strong understanding of operational tooling for data ingestion and calculation pipelines, with the ability to detect anomalies in system behaviour.
- Ability to provide technical leadership and influence engineering stakeholders.
Nice to Have:
- Experience with financial data pipelines, index calculation, or capital markets systems.
- Exposure to AI/ML-based tools for anomaly detection and automated troubleshooting.
- Experience monitoring application-layer and UI behaviour, beyond infrastructure metrics.
- Experience building SRE practices in a greenfield or transformation environment.