Lead Site Reliability Engineer to bring innovation, leadership, and technical excellence to our growing team. What You'll Do: Design and implement scalable, efficient systems for maximum reliability. Lead incident response and implement monitoring solutions to maintain high system uptime. Optimize performance through in-depth analysis and continuous improvement. Develop preventive maintenance programs and carry out Root Cause …
London (City of London), South East England, United Kingdom
Hunter Bond
low latency trading and research platform. Core responsibilities: Engineering work across Routing, Switching, Security, Proxies and many other areas, with lots of greenfield project work; designing scalable Network solutions; Network incident response (L1/L2 escalation); hands-on troubleshooting; adopting automation and identifying areas for improvement; working to tight timelines in a fast-paced and dynamic environment. Core skills …
Infrastructure as Code). Work with virtualisation (VMware/vSphere, etc.). Configure/manage SAN/storage, Fibre Channel, zoning, and LUN provisioning. Participate in vulnerability assessments, patching, security hardening, and incident response. Required Skills & Experience: NPPV3 clearance, either current or active within the last 12 months (non-negotiable). Strong track record with Windows 11 deployment (imaging, upgrade, Autopilot, Intune …
remediation progress. Vulnerability Management: Investigate unauthorised access attempts and ensure compliance with relevant legislation. Collaborate with security teams to identify, assess, and remediate vulnerabilities. Support access control monitoring and incident response activities. Lifecycle & Infrastructure Operations: Assist in the operation and control of IT infrastructure across hardware, software, and networks. Participate in change management processes for new or modified …
for automation, cost savings, performance improvement, and scalability. Own capacity planning, infrastructure budgeting, and vendor management. Operational Excellence: Ensure high availability, performance, and security of all infrastructure services. Oversee incident response and root cause analysis for infrastructure-related issues. Monitor KPIs and SLAs, ensuring service delivery meets or exceeds expectations. Collaboration & Communication: Work closely with cross-functional teams …
monitoring, cost optimization, invoice reconciliation, and contract renewals. Monitor and remediate device compliance and security posture (encryption, passcode, OS version minimums, managed open-in, DLP); coordinate with Security for incident response and hardening. Maintain accurate asset and SIM inventory; track chain of custody and ensure audit readiness. Create and maintain documentation, runbooks, and end-user guides for enrollment …
is reliable, scalable, and secure. Ensure the reliability, availability, and scalability of the systems, platforms, and technology through the application of software engineering techniques, automation, and best practices in incident response. To be successful in this role as an Infrastructure Engineer - Production Network Engineering, you should possess the following skillsets: Extensive experience as an individual contributor in the design …
As the IT Operations Centre Team Leader, you'll be at the core of digital operations, leading a skilled team of analysts responsible for system monitoring, availability, and rapid incident response across one of the UK's largest and most complex university infrastructures. Within your role you will: Lead, coach and inspire your team to deliver reliable, responsive … IT monitoring and support. Embed ITIL best practices and ensure standards are met for incident, problem and change management. Act as a calm, confident escalation point during critical events, ensuring clear communication and quick resolution. Collaborate with experts and partners to optimise monitoring tools, drive automation, and improve service resilience. Champion continuous improvement, building a culture that values learning … experience leading IT operations or service monitoring teams (preferably in a 24/7 or mission-critical environment). Strong understanding of ITIL frameworks and operational processes such as incident, change and problem management. Hands-on experience with monitoring tools (e.g. SolarWinds, Zabbix, Nagios). Familiarity with CMDB management and configuration best practices. As a leader, you'll balance accountability …
performance and reliability standards. Automate operational tasks using tools such as Ansible, Terraform, or Python scripts. Build and maintain monitoring and alerting systems (e.g., Prometheus, Grafana). Participate in incident response and conduct root cause analysis for performance-related issues. Document performance benchmarks, testing procedures, and system configurations. If you are interested in this position and would like …
St. Albans, Hertfordshire, England, United Kingdom
Method Resourcing
you'll do: Lead the design, build, deployment, and operation of critical software systems. Architect and deliver the shift to an event-driven microservices environment. Improve automation, monitoring, and incident response capability. Partner with Product and stakeholders to define and execute the roadmap. Mentor and develop engineers, driving a culture of quality and accountability. What you'll bring …
St. Albans, Hertfordshire, South East, United Kingdom
Method-Resourcing
you'll do: Lead the design, build, deployment, and operation of critical software systems. Architect and deliver the shift to an event-driven microservices environment. Improve automation, monitoring, and incident response capability. Partner with Product and stakeholders to define and execute the roadmap. Mentor and develop engineers, driving a culture of quality and accountability. What you'll bring …
Hayes, South East England, United Kingdom Hybrid / WFH Options
The Electric Car Scheme
proactive monitoring, and identifying potential risks. Proven ability to lead technical initiatives from concept to completion, often involving multiple team members or complex integrations. Well versed in production operations, incident response, and performance optimisation. You proactively identify and mitigate risks to ensure system stability and scalability. Benefits: Hybrid working with 2 days in the office (Hayes, London …
Milton Keynes, Buckinghamshire, United Kingdom Hybrid / WFH Options
Rightmove PLC
metrics (CSAT, quality, speed, backlog health) to drive improvements. Analyse service data to identify trends, risks, and opportunities. Oversee resource planning and workload forecasting to maintain smooth operations. Manage incident response standards and escalation processes, reducing friction across CX teams. Leading Teams: Lead, coach, and support Team Leaders to build confident, high-performing teams. Take accountability for team …
London (City of London), South East England, United Kingdom
Duffel
software development and systems engineering. A high bar for code and configuration quality and readability. A good understanding of current observability and reliability practices. Experienced and comfortable in running incident response. Big-picture thinking: you can make trade-offs on technical workstreams against business impact. Fantastic communication skills. You're able to articulate what you're working on …
Days: As per business need. Special Working Conditions: Occasional client site travel. The Role: As SOC Manager, you will establish goals and priorities with your team, focusing on: improving incident response times; reducing false positives and extraneous alerts; enhancing threat de…
ownership of their customer support operations. This is a fantastic opportunity for a hands-on, process-driven leader. Key Responsibilities: Oversee customer support operations and shift coverage; manage SLAs, incident response, and escalations; maintain separate support flows for two brands; plan and resource staffing models and schedules; implement automation and AI to drive ticket deflection; own the knowledge …
Leeds, West Yorkshire, Yorkshire, United Kingdom Hybrid / WFH Options
Fruition Group
practices and ensure compliance with ISO 27001:2022 and internal governance standards. Performance Monitoring: Maintain logging, monitoring, and alerting tools (e.g., CloudWatch, Prometheus, Grafana) to ensure system reliability and improve incident response. Collaboration & Knowledge Sharing: Work with engineers, product managers, and QA teams to optimise deployments and continuously improve the platform. Incident Management: Troubleshoot platform issues, conduct root cause …