Site Reliability Engineer

Job Title: SRE

Location: London, UK (onsite)

Mode of Engagement: Permanent / Contract

JD:

SRE Role description

We need an experienced SRE to focus predominantly on automation, optimization, and process re-engineering using AI for the Market Risk Platform. Success is measured by capacity created (toil eliminated, fewer manual steps, faster recovery, safer/faster changes), not by being the primary BAU support resource. Strong Python and provable agentic AI delivery are required.

Primary Objectives:

Eliminate operational toil and recurring manual work through durable automation

Re-engineer support/change processes to reduce handoffs, approval friction, and rerun complexity

Industrialize reliability operations so existing SREs spend less time firefighting and more time engineering

Key Responsibilities (Automation & Process first)

Automation Engineering (Core)

Build production-grade automation in Python (tools, services, workflows) to remove repetitive work: environment checks, dependency validation, automated reruns/reprocessing, safe restarts, drift detection, remediation actions, and standardized operational tasks

Create self-service capabilities for common requests (guardrailed, auditable, repeatable)

Implement “automation with safety”: idempotency, dry-run modes, approval gates where needed, rollback/undo strategies, and clear audit trails
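As an illustration of what “automation with safety” can look like in practice, here is a minimal Python sketch of an idempotent action with a dry-run default and an audit trail. The class `SafeAction`, the `ensure` method, and the `risk-batch` service name are hypothetical and chosen only for this example; an in-memory dict stands in for the real system being changed.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class SafeAction:
    """Hypothetical wrapper for one remediation step (illustrative names only)."""
    name: str
    state: dict                      # stand-in for the real system being changed
    audit_log: list = field(default_factory=list)

    def _audit(self, outcome: str, dry_run: bool) -> None:
        # Every call leaves a record, including dry runs and no-ops.
        self.audit_log.append({
            "action": self.name,
            "outcome": outcome,
            "dry_run": dry_run,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def ensure(self, key: str, desired: str, dry_run: bool = True) -> str:
        """Bring state[key] to `desired`; safe to call repeatedly."""
        current = self.state.get(key)
        if current == desired:
            outcome = "noop"          # idempotent: already in the desired state
        elif dry_run:
            outcome = "would_change"  # report the change without applying it
        else:
            self.state[key] = desired
            outcome = "changed"
        self._audit(outcome, dry_run)
        return outcome


service = {"risk-batch": "stopped"}
action = SafeAction("safe_restart", service)
print(action.ensure("risk-batch", "running"))                 # dry run by default
print(action.ensure("risk-batch", "running", dry_run=False))  # applies the change
print(action.ensure("risk-batch", "running", dry_run=False))  # second call is a no-op
print(json.dumps(action.audit_log[-1], indent=2))
```

Defaulting `dry_run` to `True` means an operator has to opt in to a real change, and the audit log records what was attempted either way. Approval gates and rollback are left out of this sketch.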

Process Re-engineering (Core)

Map current operational processes (incident/problem/change, release readiness, rerun/recovery, access/entitlements, environment onboarding) and redesign them to remove waste and reduce cycle time.

Standardize runbooks/playbooks into executable workflows; reduce tribal knowledge via templates, checklists, and automated pre-flight controls

Define and track operational KPIs (toil hours removed, alert volume reduction, MTTR improvements, change failure rate reduction, rerun time reduction).
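For candidates wondering how KPIs like these might be computed, the following is a rough Python sketch. The function names, record shapes, and all sample figures are invented for illustration and do not reflect the platform's actual data or tooling.

```python
from datetime import datetime, timedelta
from statistics import mean


def mttr(incidents):
    """Mean time to recovery over (detected, resolved) datetime pairs."""
    durations = [(resolved - detected).total_seconds()
                 for detected, resolved in incidents]
    return timedelta(seconds=mean(durations))


def change_failure_rate(changes):
    """Fraction of changes whose 'failed' flag is set."""
    return sum(1 for c in changes if c["failed"]) / len(changes)


def reduction_pct(before, after):
    """Percentage reduction from a baseline value to a current value."""
    return 100.0 * (before - after) / before


# Made-up sample data, for illustration only.
t0 = datetime(2024, 1, 1, 9, 0)
incidents = [
    (t0, t0 + timedelta(minutes=30)),
    (t0, t0 + timedelta(minutes=90)),
]
changes = [{"failed": False}, {"failed": True},
           {"failed": False}, {"failed": False}]

print(mttr(incidents))               # average of 30 and 90 minutes
print(change_failure_rate(changes))  # 1 failed change out of 4
print(reduction_pct(40, 25))         # e.g. weekly toil hours, before vs after
```

Tracking each KPI as a before/after pair keeps the focus on the improvement delivered by automation rather than on the absolute number.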

Agentic AI

Design and implement agentic workflows that take action using tools/runbooks (e.g., diagnostics, evidence gathering, correlation, guided remediation, change-risk checks, automated rerun orchestration)

Put strong controls in place: scoped permissions, deterministic fallbacks, human-in-the-loop approvals for risky actions, evaluation harnesses, and measurable outcomes.

Productionize with monitoring, logging, and post-incident learnings feeding back into the agent/tooling
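The controls described above could be sketched as a small gate in front of every tool call an agent makes. This is a simplified Python illustration, not a prescribed design: the tool names, the `TOOLS` registry, the `execute` function, and the job name are all hypothetical, and no particular agent framework is assumed.

```python
from typing import Callable


def gather_evidence(job: str) -> str:
    # Read-only diagnostic step (hypothetical).
    return f"collected logs for {job}"


def rerun_job(job: str) -> str:
    # State-changing step, treated as risky (hypothetical).
    return f"rerun submitted for {job}"


# Hypothetical registry: tool name -> (callable, is_risky)
TOOLS = {
    "gather_evidence": (gather_evidence, False),
    "rerun_job": (rerun_job, True),
}


def execute(tool_name: str, arg: str, allowed: set,
            approve: Callable[[str, str], bool]) -> str:
    """Run a tool only if it is in the agent's scope and, if risky, approved."""
    if tool_name not in allowed or tool_name not in TOOLS:
        return "denied: tool outside agent scope"
    tool, risky = TOOLS[tool_name]
    if risky and not approve(tool_name, arg):
        # Deterministic fallback: hand over to a human instead of acting.
        return "escalated: awaiting human approval"
    return tool(arg)


read_only_scope = {"gather_evidence"}
full_scope = {"gather_evidence", "rerun_job"}

print(execute("gather_evidence", "var-batch", read_only_scope, lambda t, a: False))
print(execute("rerun_job", "var-batch", read_only_scope, lambda t, a: True))
print(execute("rerun_job", "var-batch", full_scope, lambda t, a: False))
print(execute("rerun_job", "var-batch", full_scope, lambda t, a: True))
```

In a real deployment the `approve` callback would be a ticket or chat approval rather than a lambda, and each decision would be logged for evaluation.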

Observability (enablement for automation)

Required Skills & Experience

Senior SRE experience on distributed systems and batch/intraday workloads in a production environment.

Strong Python

Provable agentic AI experience showing

o Tool integration, guardrails, evaluation approach

o Measurable impact (toil reduction, MTTR reduction, alert reduction, etc.)

Demonstrated process optimization ability (removing steps/handoffs, standardizing workflows, implementing lightweight controls with metrics)

Strong Linux and troubleshooting fundamentals across application/system/network layers

Experience working across mixed estates (on-prem VMs + cloud, with some Kubernetes exposure for operational monitoring/reruns)

Differentiators

Exposure to banking/finance market risk domains

Familiarity with the Athena ecosystem or similar (SecDB, Quartz)
