Junior Site Reliability Engineer … We are currently working with a leading Financial Services company that is looking for a Junior Site Reliability Engineer to join its ever-expanding platform/SRE team at its Shoreditch, London office, where you will be expected to be in the office 4 days a week. They are looking for you to have excellent cloud knowledge … ideally AWS, as well as experience with PowerShell/Python. As the Junior Site Reliability Engineer, you will be a self-starter with excellent stakeholder management experience who can demonstrate outcome-based work. You will ideally have 2 years of commercial experience from an IT Operations/Cloud infrastructure background. Please note this is …
Role Overview: We are seeking a highly skilled and motivated Site Reliability Engineer (SRE) to join our engineering team to support critical application deployments in a "follow-the-sun" environment. In this role, you will leverage your expertise in cloud provisioning, infrastructure as code, and container orchestration to ensure the reliability, scalability, and performance of our … and versioning. Containerization and Orchestration: Deploy, manage, and provide ongoing support for containerized applications using Kubernetes, including Amazon EKS (Elastic Kubernetes Service) and Azure Kubernetes Service (AKS), ensuring their reliability, availability, and performance. Monitoring and Alerting: Monitor application performance and system health through observability tools (e.g., Prometheus, Grafana, ELK stack), proactively identifying and resolving issues to ensure high availability … and solutions, including RESTful APIs, ensuring seamless integration across platforms. Post-Mortem Analysis: Conduct comprehensive post-mortem analyses following incidents, identifying root causes and recommending improvements to enhance system reliability and performance. Mentorship: Mentor and guide junior engineers, fostering a culture of knowledge sharing and continuous improvement within the engineering team. Skills and Experience: Bachelor's degree in computer …
Are you a seasoned Site Reliability Engineer looking for an exciting new challenge? Join this team and transition into maintaining and enhancing the reliability of one of the world's largest platforms. In this role, you will utilise your expertise in Golang coding to develop robust applications, ensuring the systems remain resilient, scalable, and efficient. If … presence and commitment to innovation, you will have the opportunity to work on projects that reach millions of users, making a real difference in the tech world. As a Site Reliability Engineer, you will be responsible for designing, developing, and maintaining systems and applications using Golang. You will monitor and optimise system performance with tools such as … Grafana, Prometheus, New Relic, and Splunk. Your role will involve identifying and resolving reliability issues, automating processes, and ensuring the seamless operation of the platform. If you have a passion for technology and a drive to ensure excellence, we would love to hear from you.
We unleash the potential of organisations through the science of board effectiveness, building better businesses and benefiting society. The Opportunity As a Senior Site Reliability Engineer (SRE), you'll be joining a team whose mission is to ensure the availability, performance, security and reliability of our platform and core services, ensuring that they meet the needs … be responsible for visibility and monitoring of those systems, for building tooling and automation to reduce toil and for responding to incidents as part of our 24/7 SRE on-call team. The SRE team: Strives to provide the highest standards of Availability, Scalability, Performance and Security for our Software as a Service environments across multiple cloud vendors and … work Proactively monitors our platform and responds to incidents as part of a 24/7 rota Key responsibilities of the role We're looking for a great Senior SRE to be a hands-on individual contributor to key technical projects and to help us build a first-class SRE function. This role will involve: Hands-on work with technical …
Job Description Would you like to be an Engineer that builds the Cloud, rather than just uses it? At AWS, our Engineers manage the behind-the-scenes software and tools that support the world's largest cloud computing infrastructure. We … offer an exciting opportunity to join a world-class network team in a dynamic environment that feels like a start-up. As a Site Reliability Engineer (SRE), you will deploy, manage, troubleshoot, and innovate the tools, services, and components that enable our network engineers to automate and maintain network operations. Your internal customers are your network engineering …
Job Title: Platform Engineer/SRE Work Location: Bromley/Chester, UK (Hybrid, 3 days a week) Job Description: We are seeking a Platform Engineer/SRE with a strong and diverse technical background. The ideal candidate will possess hands-on development experience along with Site Reliability Engineering (SRE) expertise. This role requires a proactive … platform stability issues, and develop resilient and reliable systems. Key Responsibilities: Provide hands-on technical leadership in platform engineering initiatives. Ensure platform stability and resilience by identifying and resolving reliability … issues. Collaborate with cross-functional teams to deliver scalable and robust system solutions. Key Skills Required: Strong development experience in Java (primary skill). Site Reliability Engineering (SRE) experience. Proficiency with Kafka, Mule, and Oracle Database. Ability to work at a managerial level while remaining hands-on with technical tasks. Nice to Have: Knowledge of payment systems …
Overview Site Reliability Engineer, Region Services Job ID: AWS EMEA SARL (UK Branch) Would you like to help implement innovative cloud computing solutions and solve the most complex technical problems? Are you excited by the prospect of building and running the world's largest cloud computing infrastructure to provide a better world for future generations? AWS builds … you'll be part of a world-class team in a dynamic environment that has the entrepreneurial feel of a start-up. This is an opportunity to operate and engineer systems on a massive scale, and to gain world-class experience in cloud computing. You'll be surrounded by people who are passionate about cloud computing, believe that first … Build and operate distributed systems Design and build the tools and utilities that are part of the AWS fleet running our internal services Key job responsibilities The Systems Development Engineer will be a key member of a new team pioneering automated build and deployment of Windows-based services. The team is adopting a code-first and hands-off CI …
response. Preferred qualifications: Master's degree or PhD in Computer Science, or a related technical field. Experience as a cloud customer. About the job Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services (both our internally critical and our externally … visible systems) have reliability, uptime appropriate to customers' needs, and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance. Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to manage … the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think …
developer experience to go with it. The tools used on the team include Elixir, Phoenix, Kubernetes and Google Cloud Platform. Site Reliability Engineering at Duffel As an SRE at Duffel, you'll be part of a small team within engineering that is responsible for the reliability, performance, and resilience of our infrastructure and applications. You will be … silently drop spans. - An enthusiasm for both software development and systems engineering. - A high bar for code and configuration quality and readability. - A good understanding of current observability and reliability practices. - Experienced and comfortable in running incident response. - Big-picture thinking: you can make trade-offs on technical work streams against business impact. - Fantastic communication skills. You're able … We manage a data pipeline using Pub/Sub, Airbyte, and dbt. Our Current Focus We're currently driving a big shift in how we think about and monitor reliability across the engineering organisation, with a focus on early detection of customer-impacting issues. We're extending and standardising our use of OpenTelemetry, and introducing Honeycomb as the single …
Are you a passionate Software Engineer looking for an exciting new challenge? Join this team and transition into maintaining and enhancing the reliability of one of the world's largest platforms. In this role, you will utilise your expertise in Golang coding to develop robust applications, ensuring the systems remain resilient, scalable, and efficient. If you thrive in … presence and commitment to innovation, you will have the opportunity to work on projects that reach millions of users, making a real difference in the tech world. As a Site Reliability Engineer, you will be responsible for designing, developing, and maintaining systems and applications using Golang. You will monitor and optimise system performance with tools such as … Grafana, Prometheus, New Relic, and Splunk. Your role will involve identifying and resolving reliability issues, automating processes, and ensuring the seamless operation of the platform. If you have a passion for technology and a drive to ensure excellence, we would love to hear from you.
Experience with C++, Python, or Golang (optional) About the Company The company itself provides a suite of products and services to help people improve …
with a modern, full-stack platform that delivers logs, metrics, traces, and security monitoring, cutting costs by up to 70% while boosting efficiency. They are looking for a Lead SRE to own and elevate their Alerting & Incident Management platform. You’ll be the driving force behind reliability, customer satisfaction, and product excellence, ensuring smooth alert management, fewer engineering … experience by speeding up alert resolution and reducing interruptions for engineers. Build solutions to common pain points, shaping roadmaps, documentation, and technical knowledge. Develop benchmarking tools to improve performance, reliability, and scalability. Stay ahead of incident management trends to drive new workflows and product improvements. Mentor teams and lead with clear, impactful communication. What We’re Looking For 5+ … platform experience (PagerDuty, OpsGenie, etc. a plus). Solid technical foundation with cloud/distributed systems. Excellent communicator, comfortable working across US/IL time zones. Bonus: leadership experience, SRE/DevOps background, knowledge of SLO/SLA practices.
recently raised $22M in Series A funding to accelerate our vision of helping 1 billion people learn. Role Overview Reporting to the CTO, you will own capacity, performance and reliability for … Gizmo's full-stack platform as daily traffic climbs from hundreds of thousands to millions of users. You'll write code across the stack, but your charter is classic SRE: defend SLOs, eliminate toil, and raise the ceiling on scale before it becomes a hard limit. Key Responsibilities Define SLIs/SLOs for latency, availability and error rate; codify error … on Kubernetes and CI/CD; keep "toil" Coach full-stack engineers on query optimisation, schema design and back-pressure techniques; document patterns and anti-patterns by creating an SRE playbook Hands-on scale experience: you have run relational stores at 100k+ TPS or 1M+ concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL). Strong backend fundamentals around …
Job Requirements Designing, building, and operating large-scale production systems Deep knowledge of Python is preferred, though other languages like Java, Go, Rust, or similar will also be heavily considered Experience using source control (Git, GitHub) and feature branching strategies
Below are the details of the position: Job Title: Platform Engineer/SRE Work Location: Bromley, UK (Hybrid – 3 days a week) Job Description: 15+ years’ experience in delivering large-scale applications with a focus on performance, scalability, security, and reliability. Experience in a highly Agile continuous integration and continuous deployment environment, preferably within a financial domain. Strong experience in …