Site Reliability Engineering (SRE) Manager. Locations: London, UK. Time type: Full time. Posted on: Posted Today. Job requisition ID: R35765. As a leading financial services and healthcare technology company based on revenue, SS&C is headquartered in Windsor, Connecticut, and has 27,000+ employees in 35 countries. Some 20,000 financial … name a few. Site Reliability Manager. Locations: London, Surbiton, Essex (Hybrid). The Opportunity: We are seeking a highly motivated and experienced Site Reliability Engineering (SRE) Manager to lead a team of SREs responsible for the reliability, scalability, and performance of our production systems. This role is pivotal in bridging the gap between development and … direction of infrastructure and reliability initiatives. Advocate for best practices in observability, CI/CD, and infrastructure as code. What You Will Bring: Proven experience managing or leading SRE, DevOps, or infrastructure teams. Strong background in systems engineering, cloud platforms (AWS, Azure), and container orchestration (Kubernetes). Proficiency in monitoring, alerting, and incident management tools (Prometheus, Grafana, PagerDuty …
Preferred qualifications: Master's degree or PhD in Computer Science or a related technical field. Experience as a cloud customer. About the job: Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services, both our internally critical and our … externally-visible systems, have reliability, uptime appropriate to customers' needs and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to … manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate …
Job Title: Platform Engineer/SRE. Work Location: Bromley/Chester, UK (Hybrid, 3 days a week). Job Description: We are seeking a Platform Engineer/SRE with a strong and diverse technical background. The ideal candidate will possess hands-on development experience along with Site Reliability Engineering (SRE) expertise. This role requires a proactive individual … who can lead by example, address platform stability issues, and develop resilient and reliable systems. Key Responsibilities: Provide hands-on technical leadership in platform engineering initiatives. Ensure platform stability and resilience by identifying and resolving reliability issues. … Collaborate with cross-functional teams to deliver scalable and robust system solutions. Key Skills Required: Strong development experience in Java (primary skill). Site Reliability Engineering (SRE) experience. Proficiency with Kafka, Mule, and Oracle Database. Ability to work at a managerial level while remaining hands-on with technical tasks. Nice to Have: Knowledge of payment systems …
Junior Site Reliability … Engineer. We are currently working with a leading Financial Services company that is looking for a Junior Site Reliability Engineer to join its ever-expanding platform/SRE team at its Shoreditch, London office, where you will be expected to travel to the office 4 days a week. They are looking for you to have excellent cloud knowledge … ideally AWS, as well as experience of PowerShell/Python. As the Junior Site Reliability Engineer, you will be a self-starter with excellent stakeholder management experience who can show outcome-based work. You will ideally have 2 years of commercial experience, coming from an IT Operations/Cloud infrastructure background. Please note this is an …
to debug, optimize code, and automate routine tasks. Excellent problem-solving approach, with effective verbal and written communication skills. About the job: Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services, both our internally critical and our … externally-visible systems, have reliability, uptime appropriate to customers' needs and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to … manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate …
Role Overview: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our engineering team to support critical application deployments in a "follow-the-sun" environment. In this role, you will leverage your expertise in cloud provisioning, infrastructure as code, and container orchestration to ensure the reliability, scalability, and performance of our … and versioning. Containerization and Orchestration: Deploy, manage, and provide ongoing support for containerized applications using Kubernetes, including Amazon EKS (Elastic Kubernetes Service) and Azure Kubernetes Service (AKS), ensuring their reliability, availability, and performance. Monitoring and Alerting: Monitor application performance and system health through observability tools (e.g., Prometheus, Grafana, ELK stack), proactively identifying and resolving issues to ensure high availability … tasks and manage configurations. Load Balancing: Implement and maintain load balancing solutions to ensure optimal distribution of application traffic and high availability. Collaboration with Development Teams: Collaborate with software engineering teams to design, develop, and maintain robust systems and solutions, including RESTful APIs, ensuring seamless integration across platforms. Post-Mortem Analysis: Conduct comprehensive post-mortem analyses following incidents, identifying …
CDW. JOB TITLE: Senior Automation Engineer II. DEPARTMENT: DevOps Engineer. ROLE PURPOSE: This role is to design, build, and scale enterprise cloud platforms with a strong focus on automation, reliability, and developer experience. As part of the Cloud Infrastructure & DevOps team, you will build multi-cloud infrastructure that powers hundreds of production services, including critical Salesforce DevOps pipelines. You … environments. Drive infrastructure compliance, DevSecOps, and policy-as-code practices. KNOWLEDGE, SKILLS AND EXPERIENCE: Minimum 5 years of experience in Platform Engineering, Site Reliability Engineering (SRE), or DevOps roles supporting cloud-native enterprise environments. Proficient in Microsoft Azure and AWS platforms with hands-on experience in Kubernetes (AKS/EKS), Helm charts, and service mesh technologies … or HashiCorp Terraform Associate are advantageous. Strong interpersonal skills including clear communication, collaboration across teams, adaptability in fast-paced environments, and a proactive mindset with a focus on reliability, performance, and developer enablement. We make technology work so people can do great things. CDW is a leading multi-brand provider of information technology solutions to business, government, education and …
an essential role in supporting AWS public cloud infrastructure while championing automation through Infrastructure as Code solutions such as Terraform. Your day-to-day activities will involve collaborating with SRE and engineering teams to enhance system observability, proactively managing operational risks, maintaining high standards of security compliance, and ensuring robust disaster recovery capabilities. You will be responsible for documenting … Maintain the reliability and security of cloud environments by implementing robust monitoring tools and adhering to industry best practices. Enhance observability and telemetry within cloud-hosted environments using SRE methodologies to deliver on Service Level Agreements (SLAs), Objectives (SLOs), and Indicators (SLIs). Document and regularly review operational risks within the cloud environment, ensuring that identified issues are tracked … for all cloud-hosted services through effective backup strategies and disaster recovery processes, including planning and conducting quarterly DR tests. Collaborate closely with Site Reliability Engineering (SRE) and engineering teams to ensure optimal management of the cloud environment. Support asset management processes throughout their lifecycle, ensuring compliance with end-of-service (EOS) and end-of-life …
has helped build some of the world's largest companies. Our team in London is growing and we're looking for talented people to join us on our journey. Engineering at Duffel: We're building tools to simplify travel distribution, search and booking. What does this actually mean? It's one common and seamless API. This brings huge technical … experience to go with it. The tools used on the team include Elixir, Phoenix, Kubernetes and Google Cloud Platform. Site Reliability Engineering at Duffel: As an SRE at Duffel, you'll be part of a small team within engineering that is responsible for the reliability, performance, and resilience of our infrastructure and applications. You will … be working closely with engineering teams to understand their needs and help meet the demands of our product as we scale globally. What we're looking for: An infrastructure and systems engineering generalist who is comfortable diving deep into the weeds on different issues. Some recent examples include: A configuration issue between Google's Load Balancer and the …
JOB TITLE: Google Product Site Reliability Engineer. LOCATION: London. HOURS: Full-time, 35 hours per week. WORKING PATTERN: Our work style is hybrid, which involves spending at least two days per week, or 40% of our time, at our London office. About this opportunity: We're modernising with cloud, a platform that is quick, secure and resilient for … customers and easy, modern and green for developers. We're looking for a Google Product Site Reliability Engineer to join our Public Cloud Platform. You'll have a unique opportunity to be part of an ambitious team with the purpose of driving our tech modernisation agenda and enabling us to become the biggest Fintech in the UK. The … learn and develop your engineering skills. It would be great if you also had: Direct experience in cloud engineering, with understanding and demonstrable experience of SRE principles. Strong Linux background, including working with filesystems and processes. Good understanding of the SDLC. Experience building fault-tolerant systems and strong DR policies. You'll be able to demonstrate …
Core, BCG X, and CT worldwide. This role is also accountable for embedding security within DevSecOps practices, enforcing automation at scale, and applying Site Reliability Engineering (SRE) principles across all security services. The role requires strong partnership with ISRM, with a focus on balancing and prioritizing security requirements, automation opportunities, user experience needs, and broader business outcomes. … that support modern work scenarios, remote access, zero-trust networking, and AI/ML workloads. Leverage automation frameworks and IaC to improve scalability and reduce manual intervention. Operational Security, SRE & Assurance: Ensure security platforms are resilient, continuously monitored, and designed for 24x7 support and incident response readiness. Embed security telemetry and observability to enable proactive threat detection and automated response. … Apply SRE principles to improve reliability, performance, and maintainability of security services. Lead platform health, patching automation, and vulnerability remediation workflows. Define service level objectives (SLOs) and key performance indicators (KPIs) for all security services. Compliance, Governance & Risk Management: Ensure alignment with global compliance requirements such as ISO 27001, NIST, SOC 2, GDPR, and others. Partner with governance, legal …
enabling innovation and agility across BCG Core, BCG X, and CT worldwide. This role is accountable for embedding security within DevSecOps practices, applying Site Reliability Engineering (SRE) principles across all security services, and aligning with privacy, compliance, and business leaders to maintain trust and regulatory compliance. Key Responsibilities: Strategic Leadership & Transformation: Define and execute a unified security … remote access, zero-trust networking, and protection of sensitive data in AI/ML workloads. Leverage automation frameworks and IaC to improve scalability and reduce manual intervention. Operational Security, SRE & Assurance: Ensure security platforms are resilient, continuously monitored, and designed for 24x7 support and incident response readiness. Embed security telemetry and observability to enable proactive threat detection and automated response. … Apply SRE principles to improve reliability, performance, and maintainability of security services. Define service level objectives (SLOs) and key performance indicators (KPIs) for all security services. Compliance, Governance & Risk Management: Ensure alignment with global compliance requirements such as ISO 27001, NIST, SOC 2, GDPR, and others. Partner with governance, legal, and ISRM teams to implement enforceable policies and standards …
software-defined networking principles. Embed zero-trust principles and user-centric design into all remote connectivity services. Align remote connectivity architecture with broader enterprise network, security, and cloud strategies. Engineering & Operations: Lead the engineering, deployment, and lifecycle management of remote access solutions such as Cisco AnyConnect, Zscaler, and other mainstream VPN … platforms. Drive automation of remote access provisioning, policy enforcement, and configuration management through Infrastructure as Code (IaC) and zero-touch deployment practices. Apply Site Reliability Engineering (SRE) principles to improve performance, availability, and troubleshooting. Establish observability practices across all access points with real-time metrics, logs, and telemetry. Security, Compliance & Governance: Ensure compliance with corporate security and … segmentation, and endpoint-based access control. Proven ability to scale remote connectivity solutions to tens of thousands of users and devices. Experience with IaC, network automation, observability tooling, and SRE methodologies. Preferred Qualifications: Certifications such as CCNP, CCIE, PCNSE, Zscaler Certified, or equivalent. Familiarity with secure hybrid work and cloud networking models. Background in network performance optimization, user-centric design …
become the UK's most loved retirement expert. Purpose: As a Senior Application Support Engineer, you will play a crucial role in powering our Retail applications by partnering with engineering and business teams to build deep technical and business expertise. You'll be the go-to expert across a diverse, modern, and complex technology landscape, ensuring seamless support and … with a broad range of technologies, including: Practical experience with performance monitoring tools such as Dynatrace or equivalent. Skills & Knowledge: Solid understanding of Site Reliability Engineering (SRE) principles, including incident management, monitoring, alerting, and performance tuning. Strong knowledge of Software Development Lifecycle (SDLC) processes. Familiarity with incident management platforms like ServiceNow, PagerDuty, or similar tools. Excellent analytical … e.g., annuities, equity release) is advantageous. Experience with automation and scripting to improve manual processes (e.g., PowerShell, Bash). Familiarity with agile methodologies and experience working in DevOps/SRE-driven environments. Company Benefits: A Competitive Salary, Pension Scheme and Life Assurance. 25 Days Annual Leave plus an Additional Day on us for your Birthday. Private Medical Cover …
London, South East, England, United Kingdom Hybrid / WFH Options
Holland & Barrett International Limited
want to hear from you! Key Responsibilities: Security Strategy: Help define and execute the Holland & Barrett cloud security strategy, partnering with platform and Site Reliability Engineering (SRE) teams to build robust infrastructure that supports our business. Perimeter Security: Establish platform perimeter security by implementing controls at ingress and egress points, including creating and maintaining an edge network …
Support teams to input into business reviews. Be a visionary Ops champion for our internal teams. Skills: Bachelor's or Master's degree in a STEM field (Computer Science, Engineering, Mathematics, etc.) or equivalent experience. Demonstrable experience in product management or product operations. Strong product and technical background with proven ability to communicate effectively with engineers and technical team … management best practices: user research, market insights, goal setting, prioritisation, execution, and leadership. Familiarity with monitoring tools, incident management protocols, and collaboration with Site Reliability Engineering (SRE) teams. Proven ability to develop relationships and align teams across product, engineering, and leadership to ensure the effective execution of strategic priorities. Hands-on experience analysing workflows and implementing … of improvement, develop solutions, and inspire change with autonomy. The Interview Process: Online interview with the Senior Talent Partner. In-person interview with the Director of Product Operations and Engineering team member. Online interview with Director of Product Operations and CPO. At Reward Gateway Edenred, we are committed to ensuring an inclusive and accessible recruitment process for all candidates. …
Staff Site Reliability Engineer/DevOps. London or Remote. About you: An SRE or DevOps engineer with hands-on experience in high-traffic production systems. Strong in Linux, databases (MySQL, Postgres, MongoDB, Redis), and networking fundamentals. Comfortable with Kubernetes, CI/CD pipelines, and observability tools like Datadog. A self-starter who thrives in scaling environments and can … work independently without PMs. Pragmatic, able to balance prevention, maintenance, and firefighting when needed. Your mission is to: Take ownership of uptime and reliability for a platform serving 50M+ users. Build robust monitoring, alerting, and incident response practices. Improve CI/CD pipelines and enable safe deployments (blue-green, canary). Partner with engineers across teams to fix pain points … CD best practices. Observability tools like Datadog, OpenTelemetry, or ELK stack. Nice-to-haves: RabbitMQ, Kafka, Terraform, Ansible, GCP, Datadog. What makes this role exciting: Be the first senior SRE hire with ownership of reliability across the entire platform. Shape infrastructure and processes for a scale-up growing beyond 100 FTE. Work on a product serving millions of users …
Overview: Blip is a leading tech company focused on software engineering solutions for sports entertainment. We operate at scale. As part of Flutter Entertainment, we play an essential role in the Group's goal of becoming the global leader in online sports betting and iGaming, developing innovative products and platforms for over 14 million monthly customers worldwide. We are … e.g., deciding which technology or pattern to create or leverage). Experience being "on-call" for a service, and familiarity with incident notification tooling (e.g., PagerDuty, Opsgenie). Comprehensive understanding of SRE principles (e.g., working knowledge of the Google SRE book). Demonstrated strength in leading a project in an agile/scrum environment. Thrives in a diverse work environment. We'd Like … distributed dev environments). Built and maintained a system and culture that supported and implemented SLOs. Has shown to be a thought leader contributing to the broader industry conversation about SRE principles and topics (e.g., speaking at conferences). Perks and Benefits: This is what you should have. What do we have, you ask? Well, you can check our amazing perks & benefits …
City of London, London, United Kingdom Hybrid / WFH Options
Digital Realty (UK) Limited
Position Title: Site Reliability Engineer, Interconnection Service and Network Delivery. Location: Hybrid: Austin, Dallas, Boston, Ashburn, Atlanta, London, or Amsterdam. Your role: In this role, you will be responsible for deploying and maintaining all Digital Realty interconnection fabric network infrastructure. The ideal candidate can demonstrate a unique blend of network engineering, network operations, and software understanding through … the application of engineering principles. You will focus on delivering operational discipline and embrace key operational principles including automation, agile development, and scripting. What you'll do: You will be part of the global Fabric Engineering organization and work in tandem with other teams to build and maintain a global network infrastructure. Ideal candidates for this role will bring … an understanding of carrier-class network infrastructure as well as experience working in a fast-paced development environment. What you'll need: 5+ years of operations and engineering experience. Bachelor's degree in Computer Science (or equivalent) preferred. Strong experience with automation tools (Ansible, Terraform, etc.). Strong experience working with Linux systems and tools. Experience with Python (or equivalent high-level language …
We offer an exciting opportunity to join a world-class network team in a dynamic environment that feels like a start-up. As a Site Reliability Engineer (SRE), you will deploy, manage, troubleshoot, and innovate the tools, services, and components that enable our network engineers to automate and maintain network operations. Your internal customers are your network engineering …
the globe. What you'll do: As a Site Reliability Engineer at Zefr, you'll apply your expertise in cloud infrastructure, CI/CD, Observability, and core SRE concepts to deliver high-quality, reliable, and scalable solutions. A significant aspect of this role involves working closely with Zefr's Engineering and Data Science teams, ensuring the infrastructure … secure, resilient, scalable, and cost-efficient applications and systems/pipelines in AWS and GCP. Foster and push our DevOps culture and philosophy by encouraging continuous improvement across all engineering teams. Proactively maintain the health of production environments, including monitoring application performance and resource utilization. Participate in 24/7 on-call rotation, respond to system performance issues and … at the application and infrastructure level. Mature our CI/CD workflows and release process. Maintain a forward-thinking approach, actively researching and proposing new solutions. Propose and review Engineering Requests for Comments (RFCs) to drive Engineering architecture and practices. Technology Stack at Zefr: Core Infrastructure & Cloud Platforms: Cloud Providers: Google Cloud Platform (GCP), Amazon Web Services (AWS …
applications, delivering scalable, secure, and data-driven solutions to global clients. Role Overview: We are looking for a highly motivated Mid/Senior DevOps Engineer to join our Platform Engineering team. This role plays a critical part in shaping and supporting the infrastructure that powers our data and AI-driven platforms. You will work closely with engineers, data scientists … cloud-native solutions, and enabling the deployment of complex applications, including AI/ML models. Key Responsibilities: Maintain and optimize our cloud infrastructure (primarily AWS) with a focus on reliability, scalability, and cost efficiency. Automate infrastructure provisioning using Infrastructure-as-Code (IaC) tools such as Terraform. Build and maintain CI/CD pipelines for application, data, and model deployment … workflows. Collaborate with engineering and data science teams to deploy and monitor machine learning models and analytical services. Implement and enforce security best practices across cloud and network environments. Troubleshoot deployment and performance issues across multiple environments. Set up and maintain observability tools for logging, monitoring, and alerting (e.g., Prometheus, Grafana, Loki). Contribute to internal tooling to streamline …
edge software, platforms, and infrastructure. The Role: Join us as a Site Reliability Engineer and help us build the future of data sovereignty! We're seeking an SRE passionate about creating high-performance, scalable, and reliable services for our production infrastructure. You'll have a direct impact, improving existing systems and developing innovative solutions to complex challenges. Our … small, collaborative engineering teams own the full lifecycle of their services, from development to production operations. We champion automation and empower you to choose the best tools for the job. If you thrive in a fast-paced environment where you can make a real difference, we want to hear from you! Required skills/expertise: Develop and implement a … and applications to support large concurrent user bases and sustained daily usage. This will involve performance tuning, capacity planning, and optimization of resource utilization. Collaborate closely with the product engineering team to influence the design and implementation of new products and features, ensuring they meet our reliability and scalability standards from the outset. Preferred Qualifications: Bachelor's degree …
Cloud Airgapped solutions. You will build expertise in deploying and operating these solutions at customer sites as well as internal reference implementations. Your expertise in Google Cloud architecture and engineering, combined with your leadership experience in guiding small teams, will ensure the successful delivery of robust and scalable cloud solutions for our enterprise clients. Minimum of 5 years of … Expertise in a wide range of Google Cloud products and services (Engine, App Engine, Cloud Storage, GKE, etc.) and broader IaaS solutions (Kubernetes, systems virtualization, etc.). Experience architecting and engineering technical cloud-based solutions to meet business and non-functional requirements. Hands-on experience creating comprehensive technical documentation, including architecture diagrams, design specifications, and operational runbooks. Experience implementing foundational … mentorship to junior team members. Strong communication skills with the ability to articulate complex technical concepts to both internal and client technical, non-technical, and management stakeholders. Experience in site reliability engineering or IT production systems operations, including troubleshooting and debugging live incidents. Excellent problem-solving abilities with demonstrable examples of implementing technical innovation or process improvements …