Has anyone actually ever given you a good description of what SRE is? Recently I've met dozens of companies implementing an SRE function. Half are just rebranding an ops team (because Ops ain't cool), some don't want to call the additional silo they have created 'DevOps' (because apparently that's the wrong thing to do) so they … re calling it SRE and the rest actually don't really know how to describe what they're doing. And if you can't describe it simply, you don't know what it is, chief (because 'Google do it' isn't the right answer). That was until today, when I met a company who actually whiteboarded their vision … process rather than the build. We discussed Kubernetes, Prometheus and API Gateways. Most importantly, they spoke like they knew what the hell they were on about. Not just about SRE, but on the whole Engineering process. This is a company at the top of their game, who are about to introduce a brand new monetisation model to a web …
are We are a London tech startup on the lookout for bright, motivated and self-driven individuals to join the team. Who you are You are a DevOps/Site Reliability Engineer with experience managing complex infrastructure and deploying scalable, reliable systems. You are passionate about automation, cloud technologies, and continuous improvement. Must have: Proven track record …
software, platforms, and infrastructure. The Role Join us as a Site Reliability Engineer and help us build the future of data sovereignty! We're seeking an SRE passionate about creating high-performance, scalable, and reliable services for our production infrastructure. You'll have a direct impact, improving existing systems and developing innovative solutions to complex challenges. Our … implement a comprehensive observability strategy for self-hosted deployments, including infrastructure and tooling for monitoring, alerting, and troubleshooting. This will involve designing and implementing robust metrics and logging systems. Engineer the ACRA platform for high availability and fault tolerance. This includes ensuring resilience against Cloud Availability Zone outages and the ability to gracefully handle node failures. Guarantee 99.9% uptime … capacity planning, and optimization of resource utilization. Collaborate closely with the product engineering team to influence the design and implementation of new products and features, ensuring they meet our reliability and scalability standards from the outset. Preferred Qualifications Bachelor's degree (or equivalent) in Computer Science or a related field; relevant practical experience will also be considered Proficiency with …
Wokingham, Berkshire, United Kingdom Hybrid / WFH Options
Nordcloud
the European cloud revolution. We supercharge our customers to innovate in hyperscaler cloud, enabling seamless migration, advanced security, and data-driven success. Currently, we are looking for an Azure Site Reliability Engineer to join our team in the UK. Your daily responsibilities: Architect, implement, and improve existing monitoring and alerting systems Proactively investigate and identify performance anomalies … solving We encourage you to apply, even if you don't meet all of the requirements. We value your growth potential and enthusiasm! This role is required to be on site in Wokingham twice a week; please do not apply if this is not possible for you. What we offer: Individual training budget and exam fees for certifications Flexible working …
flexible remote working locations within UK/Europe) Employment type: Permanent Working Hours: Full time (9-6 UK) Salary: Up to £110K + Shares + Benefits TransFICC is hiring a Site Reliability Engineer to provide high-performance services to our customers. We develop an integration service … product that enables our clients to have a flexible, hosted service without requiring their internal resources to respond to connectivity challenges across trading venues. You will be joining our SRE team and contributing to TransFICC's automation culture. We are a multi-disciplinary team covering everything from desktop and laptop support to data centre provisioning of servers and vendor network … automated, so having experience with a software automation tool like Ansible and coding ability is a must. We are looking for someone experienced as a sys admin or network engineer; however, you must have a reasonable understanding of both. Constructive, open-minded and self-motivated. A belief in life learning, and an awareness of how much there still is …
deployments as well as accurate health monitoring through all our clients, both new and old. The person in this role will join the Site Reliability Engineering team (SRE). The main role of the SRE team is to facilitate the scalability of Dayshape and allow us to meet the demands of an increasing client base. What you'll … do Lead initiatives to enhance Dayshape's ability to scale our cloud platform Maintain and improve our cloud estate in Azure Improve SRE and other teams' working lives through automation of manual tasks Lead in making the deployment of Dayshape more scalable Increase our knowledge sharing of SRE across the organisation Improve the observability of Dayshape through reporting and tool …
Cambourne, Cambridgeshire, United Kingdom Hybrid / WFH Options
Remotestar
to gemstone supplies. They have a presence in London, Hong Kong, Amsterdam and Mumbai, and now in New York in 2001. About the role: As the SRE Manager, you will play a critical role in ensuring the reliability, scalability, and performance of our infrastructure and services through both direct technical contribution and team building and … tooling. Drive automation initiatives to streamline operational workflows and improve efficiency. Develop and maintain tools, scripts, and dashboards to monitor system health, performance, and reliability. Build a first-class SRE team. Through a combination of leading by example, coaching and mentoring, mould the team you would want to have around you. Provide leadership and guidance to the SRE team, fostering a … culture of collaboration, innovation, and continuous improvement. RESPONSIBILITIES: Proven experience in a senior or lead SRE role, with a strong track record of building and maintaining highly reliable infrastructure and services. Expertise in incident management, including incident response, resolution, and post-mortem analysis. Proficiency in monitoring, alerting, and observability tools such as Prometheus, Grafana, ELK stack or Datadog. Experience with …
developer experience to go with it. The tools used on the team include Elixir, Phoenix, Kubernetes and Google Cloud Platform. Site Reliability Engineering at Duffel As an SRE at Duffel, you'll be part of a small team within engineering that is responsible for the reliability, performance, and resilience of our infrastructure and applications. You will be … silently drop spans. - An enthusiasm for both software development and systems engineering. - A high bar for code and configuration quality and readability. - A good understanding of current observability and reliability practices. - Experienced and comfortable in running incident response. - Big picture thinking - you can make trade-offs on technical work streams against business impact. - Fantastic communication skills. You're able … We manage a data pipeline using Pub/Sub, Airbyte, and dbt. Our Current Focus We're currently driving a big shift in how we think about and monitor reliability across the engineering organisation, with a focus on early detection of customer-impacting issues. We're extending and standardising our use of OpenTelemetry, and introducing Honeycomb as the single …
Out in Science, Technology, Engineering, and Mathematics
and drive real change. Constantly grow as you work hard for a mission that matters at a company where you matter. Your Impact As a contributor in the APX SRE organization, you are passionate about delivering solutions to the real-time problems our mission-critical cloud native services encounter. You are also obsessed with achieving the high quality and reliability our customers demand. You will work closely not only with the APX SRE organization, but your technical deliverables will reach the entire engineering organization to enable product teams to continuously deliver features on the vanguard of innovation. What You'll Do Work Location: This role is based out of our London office and follows a hybrid schedule. We rely … meaningful teamwork, mentorship, and shared success. Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, and securely. Exemplify cloud-native site reliability best practices. Write code that is performant, maintainable, clear, and concise. Employ strong problem-solving skills, with the ability to debug problems in cloud-native distributed systems. …
Are you a passionate Software Engineer looking for an exciting new challenge? Join this team and transition into maintaining and enhancing the reliability of one of the world's largest platforms. In this role, you will utilise your expertise in Golang coding to develop robust applications, ensuring the systems remain resilient, scalable, and efficient. If you thrive in … presence and commitment to innovation, you will have the opportunity to work on projects that reach millions of users, making a real difference in the tech world. As a Site Reliability Engineer, you will be responsible for designing, developing, and maintaining systems and applications using Golang. You will monitor and optimise system performance with tools such as … Grafana, Prometheus, New Relic, and Splunk. Your role will involve identifying and resolving reliability issues, automating processes, and ensuring the seamless operation of the platform. If you have a passion for technology and a drive to ensure excellence, we would love to hear from you …
Location: London, England, United Kingdom Join Axon and be a Force for Good. As an SRE contributor in Axon's Real Time Operations organization, you are passionate about delivering solutions to the real-time problems our mission-critical cloud native services encounter. You are also obsessed with achieving the high quality and reliability our customers demand. You will work … You'll Do Location: London UK Build robust, easy-to-use foundational platforms and tools that enable engineering teams to provision services rapidly, consistently, and securely. Exemplify cloud-native site reliability best practices. Write code that is performant, maintainable, clear, and concise. Employ strong problem-solving skills, with the ability to debug problems in cloud-native distributed systems. …
along the way! Job Summary We have built Curve Dental into an industry-leading provider of beautiful cloud software for the dental industry. Who We're Looking For Our Site Reliability Engineers (SREs) are passionate about automation and its power to streamline the deployment and operation of software. They collaborate closely with developers to support a wide range …
About xAI xAI's mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals …
hybrid environment. This opportunity is ideal for candidates who thrive in fast-paced environments and are eager to contribute to a growing organisation. If you have a passion for Site Reliability Engineering and the desire to make a meaningful impact, we encourage you to apply. The role is set to commence immediately, and while core benefits are not …
Site Reliability Engineering Manager, Storage - Apple Cloud Services London, England, United Kingdom Software and Services Description The Storage SRE organization is seeking a strong engineering leader to manage Storage-focused SRE teams, working closely with peer SRE teams and development partners. You'll help build and optimize the Storage stack from the bare metal to the top of the … improvements. Together with the team, you'll help run the storage used by some of Apple's largest teams. Minimum Qualifications Proven experience in a leadership role within an SRE or DevOps team, specifically focused on distributed storage. Strong background in distributed systems, storage architectures, and data management. Deep knowledge of SRE principles, including monitoring, alerting, error budgets, fault analysis … and other common reliability engineering concepts Lead initiatives to enhance the scalability and performance of distributed storage systems. Collaborate with engineering teams to design and implement robust and scalable storage solutions. Bachelor's or Master's degree in Computer Science, Engineering, or a related field. Preferred Qualifications Experience with Kubernetes, Docker, and containerization Proficient in at least one of …
top AI computing platform. We equip engineers with the tools to deploy AI that is fast, secure, affordable, and built to scale. Whether they need powerhouse GPU hardware on-site or the flexibility of cloud-based solutions, we've got the horsepower to make it happen. Lambda's AI Cloud has been adopted by the world's leading companies … performance through the use of network engineering and other applicable technologies Help with deploying and maintaining network monitoring and management tools You Have 5+ years of experience as an SWE, SRE or Network Reliability Engineer Been part of the implementation of production-scale networking projects Experience being on-call and incident response management Have experience building and maintaining Software Defined …
Senior Software Engineer/SRE - Application Middleware Location London Business Area Engineering and CTO Ref # Description & Requirements Are you passionate about building high-performance systems that are fast, resilient, and operate at global scale? Join Bloomberg's Application Middleware SRE team, where you'll combine software engineering and systems expertise to keep the backbone of the Bloomberg Terminal … running smoothly for hundreds of thousands of users around the world. We're not your typical SRE team. We're embedded in a group that powers real-time connectivity, and we own systems where uptime isn't just important-it's essential to the global financial system. This is your opportunity to engineer resilience at scale, automate critical infrastructure … and shape reliability practices across one of the world's most powerful tech platforms. The Team We're the Site Reliability Engineering team within Bloomberg's Application Middleware group. Our mission: ensure that Bloomberg's core connectivity and messaging layers are resilient, scalable, and fully observable. We own systems that operate at high throughput and low latency …
Saffron Walden, Essex, South East, United Kingdom Hybrid / WFH Options
EMBL-EBI
ground-breaking research that improves human and planetary health. As part of our small but highly skilled IT Operations team, you'll play a critical role in ensuring the availability, reliability and efficiency of services that support scientists and collaborators worldwide. This is a hands-on, varied position where you'll combine deep technical expertise with a service-oriented mindset. If … migration to O365. Core Services Jointly develop and maintain services such as transfer services, software-defined object storage, authentication/authorisation tools, and our Request Tracker ticketing system. Monitoring & Reliability Maintain and evolve distributed Check_mk monitoring while helping shape a long-term monitoring strategy. Automation & Orchestration Work with Gerrit, Foreman, RPM repositories and Puppet to deploy, update and … days annual leave per year, in addition to eight bank holidays Relocation package including installation grant (as applicable) Campus life: Free shuttle bus to and from work, on-site library, subsidised on-site gym and cafeteria, casual dress code, extensive sports and social club activities (on campus and remotely) Family benefits: On-site nursery, child sick leave …
The role As a Site Reliability Engineer you will focus on improving stability and security aspects of the technical stack of Quorso by: Owning monitoring and logging integrations, as well as alerting capabilities by improving and automating currently manual processes Identifying and logging discovered performance and security related issues Working on remediation for the discovered issues related to backend and infrastructural layers, as well as … Stores technology simplifies retailers' data into daily Next Best Actions ("Missions") for every store, guaranteed to engage teams and drive sales. We're an Enterprise platform, targeting large multi-site retailers. We're growing fast with some of the largest retailers in the world already using Quorso to react faster and become more Agile in the face of a … our investors include CEOs and Chairpersons of a number of the 100 largest companies in the world. Requirements Experience of working in the role or as backend/devops engineer for at least 4 years on projects using Ruby, SQL and Kubernetes Ability to quickly learn platform stack and staying up to date with ongoing development Experience in proactive implementation …
shaping the future of AI. Together, we can make a meaningful impact. See more about our culture on . About The Job Mistral AI is seeking an Applied AI Engineer focused on DevOps to facilitate the adoption of its products among customers and collaborate with them to address complex technical challenges. Applied AI Engineers, ML Infra at Mistral AI … in English • You hold a Bachelor's or Master's degree in Computer Science, Engineering, or a related field • You have 2+ years of experience in a DevOps or Site Reliability Engineering role • You're experienced with deploying and managing AI-based products in production environments • You are fluent in Python • You have experience with containerization technologies such … You hold strong communication skills with an ability to explain complex technical concepts in simple terms to technical and non-technical audiences Ideally you have: • Experience as a Customer Engineer, Forward Deployed Engineer, Sales Engineer, Solutions Architect, or Technical Product Manager • Familiarity with AI frameworks such as PyTorch or TensorFlow • Contributions to open-source projects, particularly in …
Description Summary: We're looking for an experienced Platform/Infrastructure Engineer with a strong Microsoft Azure background and deep knowledge of Kubernetes. You'll play a key role in designing, deploying, and maintaining infrastructure and services that power our products. This role requires hands-on experience with automation, modern IaC practices, CI/CD, and maintaining production-grade … and maintain Infrastructure as Code using Terraform or OpenTofu Develop scripts and automation to support infrastructure and deployment workflows - PowerShell is preferred Collaborate with engineering teams to support platform reliability and enable delivery Maintain visibility and awareness through monitoring and logging tools such as Datadog, Azure Monitor, App Insights etc. Support incident resolution and participate in an on-call … such as Azure Monitor, App Insights, or similar Clear communicator with the ability to collaborate across cross-functional teams Nice to Have: Azure certifications (e.g. Azure Administrator, Azure DevOps Engineer) Experience with GitOps and tools such as ArgoCD or Flux Familiarity with Configuration as Code tools like Ansible or Puppet Exposure to large-scale distributed systems or high-volume …
Apple's Silicon Engineering Group is looking for a high-energy, highly motivated engineer with a focus on development and operations to support a variety of key internal projects by improving and streamlining the design and development process. Description In this role, you will support our team by: • Writing code to maintain and provision production/testing/dev …
About sales-i UK Ltd., a SugarCRM Company From the very beginning, SugarCRM had a unique vision: to offer a different kind of customer relationship management (CRM) software. We pioneered a solution that easily adapts to customer needs, and now …
Services Description Apple Services Engineering infrastructure is BIG. Operating at our scale, across multiple geographically dispersed data centers and servicing hundreds of millions of users presents unique challenges. As an SRE at Apple, you'll need to solve these problems using data, teamwork, and your own expertise. SREs at Apple own the full infrastructure stack; from device driver performance debugging to …
recently raised $16M in Series A funding to accelerate our vision of helping 1 billion people learn. Role Overview Reporting to the founders, you will own capacity, performance and reliability for … Gizmo's full-stack platform as daily traffic climbs from hundreds of thousands to millions of users. You'll write code across the stack, but your charter is classic SRE: defend SLOs, eliminate toil, and raise the ceiling on scale before it becomes a hard limit. Key Responsibilities Define SLIs/SLOs for latency, availability and error rate; codify error … on Kubernetes and CI/CD; keep "toil" Coach full-stack engineers on query optimisation, schema design and back-pressure techniques; document patterns and anti-patterns by creating an SRE playbook Hands-on scale experience: you have run relational stores at 100k+ TPS or 1M+ concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL). Strong backend fundamentals around …