Application Support SRE

The Real Time Payments International team is looking for a Site Reliability Engineer (SRE) to drive application deployment readiness, manage day-to-day operational stability, and support the reliability of critical payment platforms. You will implement automation, apply best practices, and work with a high-impact team responsible for production readiness, reliability, and DevOps automation across platforms.

This role plays a key part in incident management, change readiness, and platform operations, while contributing to continuous improvement initiatives.

Key Responsibilities

Platform Operations & Stability

  • Support end-to-end availability, monitoring, and performance of critical payment platforms.
  • Execute operational processes to ensure platform health and stability.
  • Participate in capacity checks, readiness validations, and environment monitoring.

Incident Management & Execution

  • Actively manage and coordinate incident triage and resolution.
  • Serve as incident commander for medium- to high-severity incidents.
  • Ensure timely updates, accurate impact assessment, and appropriate escalation.
  • Contribute to root cause analysis with clear identification of actions and ownership.

Change & Release Support

  • Help identify gaps and define the test cases a change requires in lower environments, and validate that lower-environment testing is complete.
  • Ensure adherence to change governance processes (test case reviews, checklists, approvals, rollback readiness).
  • Help create change plans and support the execution of production changes, deployments, and validations.

Technical Troubleshooting

  • Perform hands-on troubleshooting across:
      • Application behaviour and dependencies.
      • Infrastructure components (compute, network, storage).
      • Database and performance issues.
  • Collaborate with engineering, infrastructure, and other technical teams to isolate and resolve issues efficiently.

Monitoring & Observability

  • Improve system health monitoring using observability tools and alerts.
  • Identify gaps in alerting and help improve the quality of alerts and dashboards.
  • Ensure proactive detection of anomalies using observability tools.

Automation & Process Improvement

  • Contribute to automation initiatives to reduce toil and errors.
  • Identify repetitive operational tasks and drive improvements.
  • Support implementation of DevOps best practices.
  • Leverage AI-driven tools to improve monitoring, incident detection, and operational efficiency, enabling faster troubleshooting and reduced manual effort in day-to-day operations.

Stakeholder Coordination

  • Work closely with engineering, program teams, and external partners during incidents and changes.
  • Provide structured updates to stakeholders with clarity and consistency.
  • Ensure alignment during critical activities.

Risk Identification

  • Highlight operational and platform risks, including test coverage gaps, infrastructure constraints, and dependency risks.
  • Escalate issues proactively and support mitigation tracking.

Team Contribution & Mentorship

  • Support onboarding and guidance of junior team members.
  • Contribute to runbooks, documentation, and knowledge sharing.
  • Drive consistency in execution and adherence to operational standards.

Job Details

Company: KBC Technologies Group
Location: City of London, London, United Kingdom