Site Reliability Engineer
Job Title: SRE
Location: London, UK (onsite)
Mode of Engagement: Permanent / Contract
JD:
SRE Role description
We need an experienced SRE to focus predominantly on automation, optimization,
and process re-engineering using AI for the Market Risk Platform. Success is
measured by capacity created (toil eliminated, fewer manual steps, faster recovery,
safer/faster changes), not by being the primary BAU support resource. Strong
Python and provable agentic AI delivery experience are required.
Primary Objectives:
Eliminate operational toil and recurring manual work through durable
automation
Re-engineer support/change processes to reduce handoffs, approval friction,
and rerun complexity
Industrialize reliability operations so existing SREs spend less time firefighting
and more time engineering
Key Responsibilities (Automation & Process first)
Automation Engineering (Core)
Build production-grade automation in Python (tools, services, workflows) to
remove repetitive work: environment checks, dependency validation,
automated reruns/reprocessing, safe restarts, drift detection, remediation
actions, and standardized operational tasks
Create self-service capabilities for common requests (guardrailed, auditable,
repeatable)
Implement “automation with safety”: idempotency, dry-run modes, approval
gates where needed, rollback/undo strategies, and clear audit trails
Process Re-engineering (Core)
Map current operational processes (incident/problem/change, release
readiness, rerun/recovery, access/entitlements, environment onboarding) and
redesign them to remove waste and reduce cycle time.
Standardize runbooks/playbooks into executable workflows; reduce tribal
knowledge via templates, checklists, and automated pre-flight controls.
Define and track operational KPIs (toil hours removed, alert volume reduction,
MTTR improvements, change failure rate reduction, rerun time reduction).
Agentic AI
Design and implement agentic workflows that take action using
tools/runbooks (e.g., diagnostics, evidence gathering, correlation, guided
remediation, change-risk checks, automated rerun orchestration)
Put strong controls in place: scoped permissions, deterministic fallbacks,
human-in-the-loop approvals for risky actions, evaluation harnesses, and
measurable outcomes.
Productionize with monitoring, logging, and post-incident learnings feeding
back into the agent/tooling
Observability (enablement for automation)
Required Skills & Experience
Senior SRE experience on distributed systems and batch/intraday workloads
in a production environment.
Strong Python
Provable agentic AI experience showing
o Tool integration, guardrails, evaluation approach
o Measurable impact (toil reduction, MTTR reduction, alert reduction, etc.)
Demonstrated process optimization ability (removing steps/handoffs,
standardizing workflows, implementing lightweight controls with metrics)
Strong Linux and troubleshooting fundamentals across
application/system/network layers
Experience working across mixed estates (on-prem VMs + cloud, with some
Kubernetes exposure for operational monitoring/reruns)
Differentiators
Exposure to Banking/Finance Market Risk domains
Experience with and knowledge of the Athena ecosystem or similar (SecDB,
Quartz)