code reviews, sprint planning, and technical discussions. Identify performance bottlenecks and optimize application performance. Contribute to documentation and knowledge sharing within the team. Support production systems and participate in incident response as needed. Required Skills & Experience: 3–5 years of professional Java development experience. Solid understanding of core Java (Java 8+), object-oriented principles, and design patterns. Experience
optimal resilience and scalability. Collaborate with security, development, and product teams to ensure compliance and performance. Automate operational processes and reduce manual interventions through scripting and tooling. Contribute to incident response planning, disaster recovery testing, and platform health checks. Requirements/Experience: Strong experience in cloud-native environments (AWS, GCP or Azure) with deep understanding of networking, security
managing and supporting the enterprise messaging infrastructure built on Solace PubSub+, ensuring high availability, optimal performance, and reliability across production and non-production environments. You will be working on incident response, capacity planning, WAN optimization, and system observability, so you should have experience with tools such as Prometheus and Grafana. Key Responsibilities: Administer and maintain Solace PubSub+ appliances
environments. Strong understanding of network architecture, security best practices, and service segregation. Skilled in containerisation, cloud platforms, and modern CI/CD pipelines. Comfortable with on-call responsibilities and incident response. Collaborative and proactive in a fast-moving engineering environment. If you're passionate about modern infrastructure, security, and contributing to the future of digital finance, this is an
South East London, England, United Kingdom Hybrid / WFH Options
Unitary
our development teams to ensure our observability stack provides clear visibility into system health and performance. Optimise on-call processes, including creating and maintaining detailed runbooks that enable efficient incident response and knowledge sharing across teams. Build self-healing systems using AI tools that automatically resolve common issues before they require human intervention. Develop automation tools and diagnostic … fixing urgent issues and investing in proactive system improvements. Communication is crucial, as you'll be working closely with multiple engineers and may need to coordinate during high-stress incident situations. We would love to hear from you if you: Have worked with visualisation tools such as Grafana for creating and maintaining dashboards that provide meaningful insights into system performance … Are proficient with metrics platforms such as Prometheus, InfluxDB, or OpenTelemetry for collecting and analysing system data. Have experience with incident management tools such as Incident.io for coordinating response efforts and recording follow-up learnings and actions. Can demonstrate strong problem-solving skills and the ability to work autonomously. Are confident in writing production code in languages such
identify performance gaps and risks. Troubleshoot and resolve issues across systems such as UPS, BMS, HVAC, and generators. Create and execute MOPs, SOPs and EOPs for critical maintenance and incident response. Lead and support project execution with minimal supervision. Manage vendor relationships during specialist maintenance and inspections. Assist with Incident and Change Management processes. You’ll Bring
operators - run white-boarding sessions, turn ambiguous requirements into concrete specs, demo weekly, and iterate fast. Champion observability & reliability - instrument services with OpenTelemetry, define SLIs/SLOs, and automate incident response. Contribute across the stack - build lightweight front-ends when needed and pair with ML engineers on inference and evaluation pipelines. You Might Be a Great Fit if You
they drive us forward every day. Role Purpose: As a NOC Engineer based out of London, you will join a small, high-performing global team responsible for 24x7 monitoring, incident response, and operational support of Subco’s international optical and IP networks. Working within a rotating shift roster, you’ll monitor network performance, resolve incidents, manage stakeholder communications … like MCP, Observium, and New Relic. Triage and resolve network events and incidents in accordance with SLAs, escalating appropriately via Jira Service Management. Maintain accurate and timely updates within incident and change tickets, including clear communication to internal stakeholders and external customers during outages or service degradations. Participate in the change management process, including peer review and execution of … MOPs to support routine maintenance and network upgrades. Collaborate closely with Australian-based Engineers during shift handovers, ensuring full context and documentation are preserved. Support continuous improvement by identifying incident trends, participating in post-incident reviews (PIRs), and helping evolve monitoring rules and automation alerts. Produce and maintain clear technical documentation, including runbooks, fault handling procedures, and change
South East London, England, United Kingdom Hybrid / WFH Options
Ocean Red Partners
and disaster recovery strategy in one of the most interconnected financial services environments out there. You’ll be trusted to lead, not just manage, designing the playbooks, driving the response process, and uplifting dev experience through tooling and automation. All in a culture that values autonomy, technical leadership, and delivery over red tape. The broader platform team runs modern … engineers and embedding continuous learning across the squad. What you’ll be doing: Leading a platform ops squad to deliver site reliability across trading-critical systems. Defining DR and incident response strategy (think playbooks, post-mortems, mitigation). Building automation for config, monitoring, and CI/CD pipelines using Terraform, Ansible, Python. Championing observability and cost visibility through tools
specialist subcontractors, including training and development. Oversee PPM delivery, asset management systems, and emergency procedures. Act as a primary point of contact for client liaison and technical reporting. Support incident response, root cause analysis, and continuous improvement initiatives. Candidate Profile: Minimum of five years' experience in a critical environment or data centre operations. High-voltage authorised person (HVAP
and system changes. Collaborate with engineering teams to review design and code for security vulnerabilities. Monitor and report on application security threats, metrics, and KPIs. Participate in the security incident response team and work closely with the DevSecOps team. What you’ll need: At least 3 years of software engineering experience, with 2+ years focused on application security.
low latency trading and research platform. Core responsibilities: Engineering work across Routing, Switching, Security, Proxies and many other areas, with lots of greenfield project work. Designing scalable Network solutions. Network incident response (L1/L2 escalation) and hands-on troubleshooting. Adopting automation and identifying areas for improvement. Working to tight timelines in a fast-paced and dynamic environment. Core skills
Purpose of the role: To apply software engineering techniques, automation, and best practices in incident response, to ensure the reliability, availability, and scalability of the systems, platforms, and technology through them. Accountabilities: Availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning. Resolution, analysis and response to system outages and disruptions, and