Application Support SRE
The Real Time Payments International team is looking for a Site Reliability Engineer (SRE) to drive application deployment readiness, manage day-to-day operational stability, and support the reliability of critical payment platforms by implementing automation and applying best practices. You will work with a high-impact team responsible for driving production readiness, reliability, and DevOps automation across platforms.
This role plays a key part in incident management, change readiness, and platform operations, while contributing to continuous improvement initiatives.
Key Responsibilities
Platform Operations & Stability
- Support end-to-end availability, monitoring, and performance of critical payment platforms.
- Execute operational processes to ensure platform health and stability.
- Participate in capacity checks, readiness validations, and environment monitoring.
Incident Management & Execution
- Actively manage and coordinate incident triage and resolution.
- Serve as incident commander for medium- to high-severity incidents.
- Ensure timely updates, accurate impact assessment, and appropriate escalation.
- Contribute to root cause analysis with clear identification of actions and ownership.
Change & Release Support
- Highlight gaps and help define the test cases a change requires in lower environments, and validate that lower-environment testing is complete.
- Ensure adherence to change governance processes (test case reviews, checklists, approvals, rollback readiness).
- Help create change plans and support the execution of production changes, deployments, and validations.
Technical Troubleshooting
- Perform hands-on troubleshooting across:
  - Application behaviour and dependencies.
  - Infrastructure components (compute, network, storage).
  - Database and performance issues.
- Collaborate with engineering, infrastructure, and other technical teams to isolate and resolve issues efficiently.
Monitoring & Observability
- Improve system health monitoring using observability tools and alerts.
- Identify gaps in alerting and contribute to improving the quality of alerts and dashboards.
- Ensure proactive detection of anomalies using observability tools.
Automation & Process Improvement
- Contribute to automation initiatives to reduce toil and errors.
- Identify repetitive operational tasks and drive improvements.
- Support implementation of DevOps best practices.
- Leverage AI-driven tools to improve monitoring, incident detection, and operational efficiency, enabling faster troubleshooting and reduced manual effort in day-to-day operations.
Stakeholder Coordination
- Work closely with engineering, program teams, and external partners during incidents and changes.
- Provide structured updates to stakeholders with clarity and consistency.
- Ensure alignment during critical activities.
Risk Identification
- Highlight operational and platform risks, including test coverage gaps, infrastructure constraints, and dependency risks.
- Escalate issues proactively and support mitigation tracking.
Team Contribution & Mentorship
- Support onboarding and guidance of junior team members.
- Contribute to runbooks, documentation, and knowledge sharing.
- Drive consistency in execution and adherence to operational standards.