We have an exciting and rewarding opportunity for you to take your software engineering career to the next level. As a Lead Site Reliability Engineer at JPMorgan Chase within CCB, you are an integral part of an agile team that works to enhance, build, and deliver trusted market-leading technology …
on our PSL and will contact one of those if we need to. About the role: You will be responsible for managing a small team of experienced DevOps/SRE Engineers, to support, maintain, and consistently improve the development estate so the development teams can focus and innovate constantly to drive Zen's business and the teams' priorities. Reporting to … the engineering manager, you must have an excellent track record in a DevOps or, better, SRE role and have a genuine interest in developing your Line Management experience. Zen are committed to investing in and developing you - we believe that leaders have a huge impact on our people and play a vital part in our success. As a service provider … work with exceptional experienced Lead DevOps engineers and developers using frameworks, standards, and defined operational process to make the unfamiliar comfortable and familiar. Over time we'll increasingly leverage SRE principles to further optimize and enhance capabilities - it's a great opportunity to get in at the start and shape this journey. You'll own the relationship with the operations …
London, England, United Kingdom Hybrid / WFH Options
Natobotics
Site Reliability Engineer (SRE) role at Natobotics. Role: SRE Lead. Location: Birmingham, UK (Hybrid, 2-3 days WFO). Contract: 3 months (Possible extension). Are you a skilled Site Reliability Engineer (SRE) with experience in maintaining scalable and reliable infrastructure? We're looking for a proactive leader with a … passion for automation, incident management, and system optimization. Key Skills Required: 5+ years of SRE or similar experience; Expertise in Cloud Platforms (SIEM technologies preferred); Proficiency in Python or Bash scripting; Hands-on experience with Infrastructure as Code (e.g., Terraform, Ansible); Familiarity with Docker and Kubernetes; Strong problem-solving and collaboration skills. Responsibilities: Design, implement, and manage scalable infrastructure; Monitor …
London, England, United Kingdom Hybrid / WFH Options
Bydgoszcz
Are you a passionate Site Reliability Engineer? We’re hiring for a company specialized in distributed systems, content delivery, and video streaming at scale. This fast-growing tech company is transforming in-transit entertainment with an intelligent caching platform that enables airlines and cruise lines to deliver personalized, high-quality video content, even without internet access. Join … and networking (HTTP/S, DNS, QUIC); Contribute to post-incident reviews, root cause analyses, and long-term stability initiatives; Participate in on-call rotations for incident response and site uptime. Our requirements: 3+ years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles; Strong hands-on experience with: Kubernetes (Helm, Operators, workload and networking … equivalent professional experience. Optional: Experience with performance tuning in Go, Rust, or C/C++; Background in content delivery, video/media platforms, or caching technologies; Proven contributions to reliability engineering or developer platform improvements.
MCS Group is working with one of their closest clients as they seek to appoint a Site Reliability Engineer to their growing team. The client is an award-winning business which has seen exponential growth over the last 2 years off the back of their transformative technology being utilised by organisations across the UK, Ireland, and beyond. They've … required. Strong knowledge of Linux, Windows, and IP networking, covering routing, DNS, firewalls, and load balancing. Commercial experience with Docker, Kubernetes, and container orchestration. Familiarity with Elasticsearch. Understanding of SRE principles, DevOps, and DevSecOps methodologies. Strong problem-solving skills, attention to detail, and the ability to work autonomously. Full right to work in Ireland or UK. The client is unable …
firm is redefining how financial institutions handle compliance and data, leveraging automation, AI, and modern cloud infrastructure. They’re now looking for a Site Reliability Engineer (SRE) to join their London-based team. This is a hybrid role with 2 days per week in the office, offering the best of both collaboration and flexibility. Role: Site Reliability Engineer (SRE). Location: London (Hybrid, 2 days/week in-office). Term: Permanent. Salary: £40,000–£55,000. Key Responsibilities: Design and manage scalable, cloud-based infrastructure (AWS); Drive automation, monitoring, and CI/CD best practices; Collaborate with engineering to ensure system reliability and performance; Lead incident response and implement proactive improvements. Key Requirements … Hands-on experience in SRE, DevOps, or Infrastructure roles; 1-3 years of commercial experience; Strong AWS cloud expertise; Familiarity with Terraform, Kubernetes, CI/CD pipelines; Experience with monitoring tools like Prometheus, Grafana, ELK; A proactive mindset and strong problem-solving skills. If you’re interested, or know someone who might be, please get in touch or send your …
Job Title: Site Reliability Engineer (SRE) – High-Frequency Trading Infrastructure. Location: Onsite – New York City, London, or Singapore. Salary: £140,000 - £240,000 (depending on level of experience and interview performance). Our client, a leading high-frequency trading firm, is seeking a Site Reliability Engineer (SRE) to architect and build next-generation production tools and … critical role focused on reliability, scalability, and performance in one of the most competitive and technologically advanced industries. About the Role: This opportunity is ideal for an experienced SRE who thrives in production-critical environments. The successful candidate will join a high-caliber team of engineers and work on automating, scaling, and securing systems that drive global trading operations. … Key Responsibilities: Design and develop scalable production tools for deployment, monitoring, and infrastructure automation. Ensure the reliability and efficiency of trading systems through proactive automation and tooling. Collaborate with developers and traders to support the live trading environment. Manage and optimize configuration and deployment pipelines across AWS and on-premise infrastructure. Implement observability and monitoring systems to enable rapid …
Site Reliability Engineer role at SS&C Technologies … Ensure critical applications are effectively monitored using tools like Prometheus and Grafana. Create and maintain dashboards and alerts to enhance visibility into application health. Define, implement, and track key SRE metrics (SLOs, SLIs, error budgets). Partner with development teams to improve application reliability and resilience. Analyse incident trends and recommend improvements to reduce recurrence. Automate repetitive support tasks … complex systems. Skilled in debugging, code optimisation, and automation. Experience with relational databases and data analysis. Highly Preferred: Experience working in Site Reliability Engineer (SRE) roles or incident response environments. Hands-on experience with cloud infrastructure, preferably AWS. Familiarity with observability tools such as Grafana, ELK Stack, or similar. Experience deploying and managing applications on …
Luupli has been in internal testing since June 2024 and is getting ready for commercial beta testing from December 2024, with the hope of launching fully in summer 2025. Job Title: Site Reliability Platform Engineer. About Luupli: Luupli is a social media app that has equity, diversity, and equality at its heart. We believe that social media can be … up of passionate and dedicated individuals who are committed to making Luupli a success. Role Description: We are seeking a talented and experienced Site Reliability Engineer (SRE) to join our team. As an SRE, you will play a crucial role in ensuring the reliability, scalability, and performance of our cloud-based infrastructure and services, primarily hosted … to proactively identify performance bottlenecks, system outages, and other potential issues. - Participate in incident response and root cause analysis efforts to drive continuous improvement and prevent future incidents. 3. Reliability and Performance Optimization: - Optimise system performance, reliability, and cost efficiency through continuous monitoring, performance tuning, and capacity planning. - Identify opportunities to automate manual processes and improve system resilience.
London, England, United Kingdom Hybrid / WFH Options
Orgvue
future states of the organisation and make faster, more informed decisions. The company is headquartered in London, with offices in Philadelphia, The Hague, Toronto, and Sydney. As a Principal Site Reliability Engineer, you will be a senior technical leader focused on scaling and hardening our AWS- and Kubernetes-based infrastructure. You will work across product, platform, and … you will: Define and enforce SLOs, SLIs, and error budgets across critical services; Craft and implement a cloud infrastructure and tooling strategy; Work across our organization to level up SRE practices; Help implement robust observability metrics, logs & traces using our observability tools; Guide the team in building automated, self-healing systems; Own and evolve our incident response processes, including on … compliance, scalability, and operational excellence; Evaluate and introduce tools, patterns, and practices that improve the performance and reliability of our SaaS platform. Desired Skills & Experience: Demonstrable experience leading SRE transformations; Deep hands-on expertise with Kubernetes (EKS preferred) in production environments; Strong experience with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.); Expertise in …
000+ active communities and approximately 101M+ daily active unique visitors, Reddit is one of the internet’s largest sources of information. For more information, visit redditinc.com. Reddit SRE is rapidly innovating and our teams are working to meet the needs of infrastructure and development teams as they evolve our product faster than ever before. This is a unique opportunity … to leave your mark on one of the most influential and trafficked corners of the internet. As a Senior Site Reliability Engineer on Reddit’s Infrastructure SRE team, you’ll use your knowledge of distributed systems and architecture to improve the reliability and performance of Reddit’s engineering platforms and services. We are looking for someone … at scale. We’re active users of and contributors to Prometheus, Thanos, Grafana, Vector and more. In this role, you will also take ownership of risk management, ensuring the reliability and performance of our systems. You will collaborate with cross-functional teams to identify, assess, and mitigate risks, implementing best practices to enhance system resilience. Your expertise will drive …
and future states of the organisation and make faster, more informed decisions. The company is headquartered in London, with offices in Philadelphia, The Hague, Toronto, and Sydney. Role: Principal Site Reliability Engineer. You will be a senior technical leader focused on scaling and hardening our AWS- and Kubernetes-based infrastructure. You will collaborate across product, platform, and … expertise, excellent communication skills, and a collaborative spirit. Responsibilities: Define and enforce SLOs, SLIs, and error budgets across critical services; Develop and implement cloud infrastructure and tooling strategies; Enhance SRE practices across the organization; Implement robust observability metrics, logs, and traces using our observability tools; Guide the team in building automated, self-healing systems; Own and evolve incident response processes … security, DevOps, and software teams to ensure compliance and operational excellence; Evaluate and adopt tools and practices to improve platform performance and reliability. Desired Skills & Experience: Experience leading SRE transformations; Hands-on expertise with Kubernetes (EKS preferred) in production; Strong experience with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.); Proficiency in Infrastructure as …
Senior Site Reliability Engineer, Production Engineering. Please note that we have a hybrid approach to work and would like to find someone who can come into our offices in London at least one day a week. Who We Are: Cisco ThousandEyes is a leading Digital Experience Assurance platform that empowers organizations to deliver seamless digital experiences across … offering AI-powered assurance insights within Cisco’s Networking, Security, Collaboration, and Observability portfolios. About The Role: We are seeking a skilled Senior Site Reliability Engineer (SRE) in Production Engineering with a strong background in SaaS and operations. You will design and manage large-scale, highly available distributed systems in the cloud, collaborating directly with application development … teams to enhance the reliability, performance, and security of our platform. What You’ll Do: Collaborate with software engineers to optimize architecture and services for availability, latency, performance, and reliability using cloud-native tools. Design and implement scalable operations tooling to support platform growth and scaling across multiple regions. Design, deploy, and maintain AWS cloud-native services that …
Cheltenham, England, United Kingdom Hybrid / WFH Options
Northrop Grumman
Senior Site Reliability Engineer, Cheltenham. Client: Northrop Grumman. Location: Cheltenham, Gloucestershire. Job Category: Other. EU work permit required: Yes. Job Reference: c18f6c71a5b6. Job Views: 4. Posted: 02.06.2025. Expiry Date: 17.07.2025. Job Description: Requisition ID: R10139774. Your Opportunity to Define Possible. Our Opportunity to … Deliver the Nation’s Security. Together. Role clearance type: You must be able to gain and maintain the highest level of UK Government clearance. About Your Opportunity: As an SRE Engineer, you will be forward thinking, taking issues and finding repeatable, scalable & automated solutions with a mindset to continuously improve your workflow guided by metrics. You will be able …
Azure DevOps Engineers/SREs with an interest in the latest technology and best practices are wanted to become a big part of the systems team within a well-known investment FinTech company. Building a suite of SaaS-based products which are used by a large volume of companies … to give responsibility out, so you will be involved in decision making on tech/projects and have a real influence on the product. Technical Overview: Experience in DevOps, SRE, or Cloud Engineering roles. Hands-on experience with Azure services, including Azure Batch, Azure Functions, App Service Plans, Cosmos DB, and Storage Accounts. Experience in writing Infrastructure as Code (IaC … environments. Additional Experience: Experience with Python, SQL, and NoSQL database performance tuning, as well as progressive deployment techniques, would be highly desirable. The successful Azure DevOps Engineer/SRE will earn up to £80,000, and in addition there are exceptional benefits which come as part of the overall package, including: 15% bonus, 9% pension, training budget, weekly team …
London, England, United Kingdom Hybrid / WFH Options
Valarian Technologies Limited
software, platforms, and infrastructure. The Role: Join us as a Site Reliability Engineer and help us build the future of data sovereignty! We're seeking an SRE passionate about creating high-performance, scalable, and reliable services for our production infrastructure. You'll have a direct impact, improving existing systems and developing innovative solutions to complex challenges. Our … implement a comprehensive observability strategy for self-hosted deployments, including infrastructure and tooling for monitoring, alerting, and troubleshooting. This will involve designing and implementing robust metrics and logging systems. Engineer the Acra platform for high availability and fault tolerance. This includes ensuring resilience against Cloud Availability Zone outages and the ability to gracefully handle node failures. Guarantee 99.9% uptime … capacity planning, and optimization of resource utilization. Collaborate closely with the product engineering team to influence the design and implementation of new products and features, ensuring they meet our reliability and scalability standards from the outset. Preferred Qualifications: Bachelor’s degree (or foreign equivalent) in Computer Science or a related field is desired; relevant practical experience will also be …
London, England, United Kingdom Hybrid / WFH Options
Sporty
the opportunity to implement them with the team. Our stack: AWS, Kubernetes, Docker, Prometheus, Grafana, Security, Python etc. You should apply if you have 4+ years' experience in an SRE or DevOps position; if you're a Software Engineer looking to transition, that's also great! You're a veteran in AWS technologies; Experience deploying and releasing … for English speakers. Our interview process involves 3 main stages: HackerRank test (maximum of 90 minutes); A call with a member of our Talent team; Final interview with our SRE team (you’ll meet with 3 engineers on separate occasions for 45 minutes each). We can confidently say that we have a feedback loop of 24-48 hours on all …
Stoke-on-Trent, England, United Kingdom Hybrid / WFH Options
ZipRecruiter
Job Description. Who we are looking for: A Site Reliability Engineer, who will enhance system reliability, observability and performance through a strong engineering approach and assist with incident resolution and best practices. You will have software engineering skills, focusing on system reliability and observability. You will monitor the health, performance and availability of critical systems … directly impacting operational efficiency. Using your engineering expertise, you will implement solutions that enhance reliability, including service instrumentation with tools such as OpenTelemetry, improve logging practices and develop features for maintainability. You will also help engineer tools and automation for effective service management. Collaboration is key, working across multiple functions to integrate reliability and observability best … systems meet user demands and enhance overall service performance. This role is eligible for the Company’s hybrid working from home policy. Skills and experience: Excellent knowledge of Site Reliability Engineering principles, including the creation and management of effective Service Level Indicators (SLI) and Service Level Objectives (SLO) for reliability and customer satisfaction. Knowledge of contemporary …
developer experience to go with it. The tools used on the team include Elixir, Phoenix, Kubernetes and Google Cloud Platform. Site Reliability Engineering at Duffel: As an SRE at Duffel, you'll be part of a small team within engineering that is responsible for the reliability, performance, and resilience of our infrastructure and applications. You will be … silently drop spans. - An enthusiasm for both software development and systems engineering. - A high bar for code and configuration quality and readability. - A good understanding of current observability and reliability practices. - Experienced and comfortable in running incident response. - Big picture thinking - you can make trade-offs on technical work streams against business impact. - Fantastic communication skills. You're able … We manage a data pipeline using Pub/Sub, Airbyte, and dbt. Our Current Focus: We're currently driving a big shift in how we think about and monitor reliability across the engineering organisation, with a focus on early detection of customer-impacting issues. We're extending and standardising our use of OpenTelemetry, and introducing Honeycomb as the single …
team of passionate thinkers, innovators, and dreamers - and help us connect people and build communities to create economic opportunity for all. About the team and the role: As a Site Reliability Engineer at eBay, you'll play a key role in managing major incidents and the overall health of our services, making sure they are both resilient … and high-performing. You'll create strategies for availability and reliability, enhance domain ecosystem observability, and support a shift toward a more engineering-focused culture. Your contributions will ensure that eBay's technology remains cutting-edge and reliable for our global community. What you will accomplish: Proactive Monitoring: Continuously monitor the health of eBay's critical services to identify … and address potential issues before they escalate. Solution Development: Collaborate with Architecture, Engineering, and Operations teams to develop solutions that ensure high site availability, reliability and performance. Collaborative Problem Solving: Work closely with partner teams to resolve recurring technical issues, onboard new alerts, and develop high-quality Standard Operating Procedures (SOPs). Automation and Process Enhancement: Identify and …
Site Reliability Engineer, United Kingdom. Client: Intapp. Location: Job Category: Other - EU work permit required: Yes. Job Reference: Job Views: 21. Posted: 22.06.2025. Expiry Date: 06.08.2025. Job Description: The Intapp Cloud Platform is a rapidly growing collection of cloud services. As part of a global team, the ideal candidate will be able to quickly move between architecture … etc. What you will do: You will work with Development and Product Management to design and deliver new functionality. You will perform deep dives into both systemic and latent reliability issues; partner with software engineers across the organization to produce and roll out fixes. You will drive standardization efforts across multiple disciplines and services in conjunction with SREs throughout … JVM-based languages. You have a solid understanding of continuous integration, deployment and operations concepts. You have production experience of managing Windows Infrastructure running IIS workloads. A passion for resolving reliability issues and identifying strategies to mitigate them going forward. Automation mindset - if you can automate it, do it. What you'll gain at Intapp: Our culture at Intapp emphasizes accountability …
etc. What you'll do: You will work with Development and Product Management to design and deliver new functionality. You will perform deep dives into both systemic and latent reliability issues; partner with software engineers across the organization to produce and roll out fixes. You will drive standardization efforts across multiple disciplines and services in conjunction with SREs throughout … JVM-based languages. You have a solid understanding of continuous integration, deployment and operations concepts. You have production experience of managing Windows Infrastructure running IIS workloads. A passion for resolving reliability issues and identifying strategies to mitigate them going forward. Automation mindset - if you can automate it, do it. Fluency in English. What you'll gain at Intapp: Our culture at …