Site Reliability Engineer (SRE) - Azure migration - Outside of IR35
Cloud Consulting have an urgent requirement for an experienced Site Reliability Engineer to improve the reliability posture of a large-scale, predominantly on-premises platform with some Azure integration.
The role is hybrid, with 2 or 3 days on-site in Taunton and 2 or 3 days remote, and is outside of IR35.
The role requires SC Clearance.
The core system is a .NET Framework 4.8 application hosted primarily on Windows Server/IIS with SQL Server back-end components. Our current Azure Well-Architected Reliability maturity is low, and this role is accountable for driving measurable improvement through both technical uplift and operational process maturity.
This is not a tooling role. It is a practical, systems-level reliability engineering position focused on reducing operational risk across application, database, infrastructure, and process layers.
You will collaborate closely with development, database, architecture, QA, and DevOps teams to identify, prioritise, and deliver the changes that most effectively improve resilience and reduce incident frequency.
Key Responsibilities:
- Lead initiatives with a small, focused team to materially improve reliability maturity in a predominantly on-prem environment.
- Apply Azure Well-Architected Reliability principles pragmatically within hybrid and legacy constraints.
- Define and embed SLIs, SLOs, and reliability targets for critical services and paths.
- Identify systemic failure patterns in legacy .NET Framework code and prioritise remediation based on risk reduction.
- Improve monitoring, alert quality, and operational visibility across application, infrastructure, and database layers.
- Strengthen incident response processes, runbooks, and post-incident learning.
- Work with developers to improve resiliency patterns within legacy code (retry logic, error handling, graceful degradation).
- Reduce operational toil through targeted automation (PowerShell, scripting, pipeline improvements).
Required Skills & Experience:
- Proven experience in Site Reliability Engineering, Production Engineering, or reliability-focused platform roles.
- Strong experience operating and improving reliability in on-prem Windows Server/IIS environments.
- Experience supporting large .NET Framework applications.
- Strong understanding of SQL Server performance, availability patterns, and operational risk.
- Experience implementing structured incident management and root cause analysis.
- Familiarity with Azure monitoring and hybrid integrations.
- Experience applying Azure or AWS Well-Architected principles in real-world systems (cloud or hybrid).
- Strong cross-functional collaboration skills.
Desirable Skills:
- Experience modernising legacy monolithic systems incrementally.
- Knowledge of DORA metrics and their relationship to change stability and MTTR.
- Experience introducing SLO-based reliability management in traditional enterprise environments.
- Familiarity with high-availability configurations and disaster recovery planning.
- Experience in regulated or secure environments.
If you're interested, please send a copy of your CV in the first instance.