Site Reliability Engineer, ML Infrastructure, Large Models SRE. Google, London, UK. Mid: Experience driving progress, solving problems, and mentoring more junior team members; deeper expertise and applied knowledge within relevant area. Bachelor's degree in Computer Science or a related technical field or equivalent practical experience. … Language Models/Machine Learning tooling and infrastructure. Experience in automation, monitoring, and incident response. Experience in C++, Java, Python, or Go. Understanding of Site Reliability Engineering (SRE) principles and best practices. Excellent communication, project and stakeholder management skills. About the job: Site Reliability Engineering (SRE) combines software and systems engineering to build and run large … scale, massively distributed, fault-tolerant systems. SRE ensures that Google's services (both our internally critical and our externally visible systems) have reliability, uptime appropriate to users' needs and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance. Much of our software development focuses on optimizing existing systems …
the reliability of all cloud systems while keeping levels of manual work low. SREs are expected to be experienced in software engineering principles, operational discipline, and automation. The SRE team work on a fully remote basis and work in conjunction with their US and Australian teams as well. This company are a market leader in Student community management software … to ensure high availability and performance. Collaborate with product engineering teams to design/build fit-for-purpose and observable software. Required Skills and Experience: Proven experience in an SRE/DevOps/Platform Engineering role and having previously worked in a Software Engineering role in .Net and C# or Java or similar OO development language. Proficiency in C# or … and this job is part of a large program of change and improvement in their Cloud SaaS products over the coming years. If you are looking for an interesting SRE role with a forward-thinking global organisation, then this would be a tremendous career opportunity to consider. Please apply with your CV to find out more.
Engineer - Site Reliability Engineering. Locations: USA-St. Louis-795 Office Pkwy. Time type: Full time. Posted 11 Days Ago. Job requisition id: R. Our Team: We are evolving our Reliability Engineering team to move beyond support and operations. As a Senior Engineer in Site Reliability, you will be part of a diverse and inclusive organization that has full ownership of the availability, performance, and scalability of one of the most critical shared services at LSEG. Main responsibilities: We are looking for people with a passion to learn, and who bring a continuous improvement mentality to our team! SREs maintain … core of our team's purpose. Write automation to scale systems sustainably, prevent service issues, or when they occur, quickly recover service. Partner with development teams to improve system reliability, observability, and release velocity. Participate in on-call rotations, incident response, postmortems, and root cause analysis and resolution. Be a vocal advocate of strong/sound engineering practices that …
Senior Site Reliability Engineer. At UnlikelyAI, we are building the future of AI: one that is reliable, accurate and transparent. Our neurosymbolic technology harnesses the power of LLMs and generative AI, and combines it with classical symbolic technology to produce hallucination-resistant artificial intelligence for high-trust applications. To support our rapidly increasing commercial momentum, we're … looking for an experienced and pragmatic site reliability engineer to join our exceptional team. This role is ideal for someone who has successfully scaled systems from prototype to production and enjoys working in cross-functional teams to champion cloud-native engineering. We are looking for someone with the experience and expertise to define, and own, our approach … to building for reliability and security as first-class citizens. This is a strategically important role for our technology team, as we rapidly approach entering full production in multiple projects. You'll work on a range of customer-facing and internal infrastructure projects, applying your engineering skills to solve complex reliability and scalability challenges. Your ability to build …
Senior Site Reliability Engineer. London - Hybrid. £80,000 - £90,000 + 38 Days Holiday + Private Healthcare + Life Assurance + Flexible Working + Pension. Excellent opportunity for a Site Reliability Engineer to join a forward-thinking and high-growth technology company offering a Hybrid work environment, great benefits, and opportunities for further progression … With a strong culture rooted in integrity, creativity, and technical excellence, they've become a trusted partner across global industries. In this role you'll take ownership of platform reliability, resilience engineering, and incident management across cutting-edge cloud infrastructure. You'll play a key role in ensuring uptime, performance, and continuous improvement of core systems. The ideal candidate … strategies and conduct chaos engineering experiments. Monitor and maintain Kafka clusters for performance and reliability. Respond to and resolve application-level production incidents. The Person: 5+ years in SRE, DevOps, or infrastructure engineering. Strong experience with AWS, EKS/Kubernetes, and Terraform. Familiar with Kafka and observability tools like Datadog or Grafana. Able to troubleshoot issues across infrastructure and …
City of London, London, United Kingdom Hybrid / WFH Options
Rise Technical Recruitment
Senior Site Reliability Engineer. London - Hybrid. £80,000 - £90,000 + 38 Days Holiday + Private Healthcare + Life Assurance + Flexible Working + Pension. Excellent opportunity for a Site Reliability Engineer to join a forward-thinking and high-growth technology company offering a Hybrid work environment, great benefits, and opportunities for further progression … With a strong culture rooted in integrity, creativity, and technical excellence, they've become a trusted partner across global industries. In this role you'll take ownership of platform reliability, resilience engineering, and incident management across cutting-edge cloud infrastructure. You'll play a key role in ensuring uptime, performance, and continuous improvement of core systems. The ideal candidate … strategies and conduct chaos engineering experiments * Monitor and maintain Kafka clusters for performance and reliability * Respond to and resolve application-level production incidents. The Person: * 5+ years in SRE, DevOps, or infrastructure engineering * Strong experience with AWS, EKS/Kubernetes, and Terraform * Familiar with Kafka and observability tools like Datadog or Grafana * Able to troubleshoot issues across infrastructure and …
Employment Type: Permanent
Salary: £80000 - £90000/annum 38 Days Holiday, Healthcare, Pension
London, South East, England, United Kingdom Hybrid / WFH Options
Rise Technical Recruitment Limited
Senior Site Reliability Engineer. London - Hybrid. £80,000 - £90,000 + 38 Days Holiday + Private Healthcare + Life Assurance + Flexible Working + Pension. Excellent opportunity for a Site Reliability Engineer to join a forward-thinking and high-growth technology company offering a Hybrid work environment, great benefits, and opportunities for further progression! This company … performance. With a strong culture rooted in integrity, creativity, and technical excellence, they've become a trusted partner across global industries. In this role you'll take ownership of platform reliability, resilience engineering, and incident management across cutting-edge cloud infrastructure. You'll play a key role in ensuring uptime, performance, and continuous improvement of core systems. The ideal candidate will … strategies and conduct chaos engineering experiments * Monitor and maintain Kafka clusters for performance and reliability * Respond to and resolve application-level production incidents. The Person: * 5+ years in SRE, DevOps, or infrastructure engineering * Strong experience with AWS, EKS/Kubernetes, and Terraform * Familiar with Kafka and observability tools like Datadog or Grafana * Able to troubleshoot issues across infrastructure and …
Site Reliability Engineer. Locations: IND-BLR-Divyasree Technopolis. Time type: Full time. Posted Yesterday. Job requisition id: R. About LSEG: The London Stock Exchange Group (LSEG) is a global financial markets infrastructure and data provider headquartered in London, UK. Established in 2007, though its core … Exchange - dates to 1801, LSEG plays a meaningful role in capital markets worldwide. It operates several trading venues, including the London Stock Exchange, Borsa Italiana, and Turquoise. Application Support Engineer role that does the ingestion and delivery of subsets of licensed data to each of our client bases using in-house software. This role also involves taking database backup … and creative culture where we encourage new ideas and are committed to sustainability across our global business. You will experience the critical role we have in helping to re-engineer the financial ecosystem to support and drive sustainable economic growth. Together, we are aiming to achieve this growth by accelerating the just transition to net zero, enabling growth of …
Site Reliability Engineer (SRE) Manager - Apple Services Engineering. London, England, United Kingdom. Software and Services. Description: Apple Service Engineering (ASE)'s Compute team is seeking a highly motivated individual with strong technical and communication skills to join us on our quest to build and enhance massive clusters hosting Virtual Machines, Containers and associated infrastructure that can scale … engage with the upstream community to drive Apple's requirements. Ultimately, you will help build the platform that delivers our applications at scale to our end users. As a Compute Site Reliability Engineering manager, you will be leading a team responsible for providing the platform for mission-critical cloud systems to maintain constant uptime, scale seamlessly, and allow for …
development. The right candidate will have excellent communication skills, solid coding skills, expertise in building scalable, reliable, highly available and fault-tolerant systems, broad knowledge of software engineering and site reliability engineering in areas such as Large-Scale Data and Compute Infrastructure, Stream Processing, Kubernetes, High-Performance Networking, Observability and Infrastructure Automation. RESPONSIBILITIES: Set the technology strategy for … maintain, optimize and support large scale, multi-region, multi-cloud compute and storage infrastructure powering our data platform and mission critical services. Work with fellow Data Infrastructure engineers and Site Reliability engineers to ensure our systems are scalable, reliable, fault-tolerant, highly available, highly performant, and observable. Manage incidents, triage product or system issues and debug/track …/resolve by analyzing the root cause of these issues and the impact on users & operations. Work closely with other Data Infrastructure engineers, Site Reliability engineers, ML Platform engineers, Computer Vision and ML engineers on high-impact projects to create innovative solutions to problems in the self-drive space. Mentor junior engineers in their day-to-day work …
are passionate about building unified IT solutions that simplify the way IT organizations work. We are currently looking for a Site Reliability Engineer to join our SRE team in the Platform Engineering organization and help us scale our products to millions of end-users. We are looking for individuals with a passion for automation and observability, ensuring … and SOPs. Develop software, scripts, or tooling to improve efficiency and reduce delivery time of applications and infrastructure. Other duties as needed. About You: 5+ years' experience in Site Reliability Engineer roles. Expert+ level Linux administration, scripting, and troubleshooting. Demonstrable knowledge of Observability tools (Prometheus/Grafana, New Relic, Splunk, DataDog). Comprehensive experience with AWS (Amazon …
Site Reliability Engineer (SRE), Apple Pay. London, England, United Kingdom. Software and Services. Description: Engineering Operations engineers take part in every aspect of the software development lifecycle. We work in a fast-paced environment, and are responsible for everything from the underlying AWS infrastructure services, to control plane considerations, as well as high-availability and robust application … AWS Services, including but not limited to: EC2, S3, EKS, DynamoDB, EBS, CloudFormation, Lambda, VPC, Route 53. Experience operating in core SDLC CI/CD processes, along with SRE concepts - Monitoring, Alerting, Incident management. Worked within DevOps operating model, data analytics, various models and application of AI/ML in this space. BS degree in computer science or equivalent …
AIML - Site Reliability Engineer (SRE), Siri Knowledge Platforms. London, England, United Kingdom. Machine Learning and AI. Description: As an SRE in the AI/ML organisation within Apple, you will be directly responsible for the infrastructure that powers Siri, search, and other high-impact user-facing solutions running on millions of Apple devices worldwide. We strive to improve … the stability, security, efficiency, and scalability of a 24/7 global service. We have on-call rotations, working in geographically distributed SRE teams for follow-the-sun support. Your strong troubleshooting ability will be used daily to isolate issues and resolve the root cause through investigative analysis. The role also requires building and maintaining accurate, up-to-date …
We are seeking a foundational member for the Cloud Infrastructure team at Writer. This role involves contributing to the development and implementation of our Site Reliability Engineering (SRE) program. The ideal candidate will ensure the reliability, scalability, performance, and security of Writer's critical systems, proactively guaranteeing that our high-ROI products reach customers seamlessly. Your responsibilities … ensure cost efficiency. Ensure the security and compliance of our systems, adhering to industry standards and regulations. Provide mentorship and technical guidance to junior engineers, fostering a culture of reliability and continuous improvement. Stay current with emerging technologies and industry trends to improve our site reliability practices. Is this you? Proven expertise in Site Reliability … Kubernetes) and orchestration tools. Knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, ELK Stack) for maintaining system health and performance. Ability to lead and mentor junior engineers in reliability and system optimization best practices. Excellent communication skills for effective collaboration with cross-functional teams and stakeholders. Proactive in identifying and mitigating potential system failures and performance issues. Preferred …
Senior Site Reliability Engineer. Central London (Hybrid). Up to £100k + Car Allowance & Bonus. TRIA are working with a leading hospitality client to hire a Senior SRE; the client is investing heavily in the performance, stability, and reliability of its digital platforms. This is a hands-on leadership role - you won't just guide others, you … Improving alerting, monitoring, and system-level metrics. Driving better SLOs, SLIs, and overall uptime. What you'll bring: Experience in high-traffic digital or eCommerce platforms. 5+ years in SRE/DevOps roles; strong background in incident response. Observability, automation, and infrastructure as code expertise. Leadership skills - mentoring others or leading from the front. The stack includes Kubernetes, Terraform, AWS … Python, and modern CI/CD tools, and it's evolving. If you understand what a good SRE practice looks like, and want to leave systems in a better place than you found them, please apply to be considered and learn more.
We unleash the potential of organisations through the science of board effectiveness, building better businesses and benefiting society. The Opportunity: As a Senior Site Reliability Engineer (SRE), you'll be joining a team whose mission is to ensure the availability, performance, security and reliability of our platform and core services, ensuring that they meet the needs … be responsible for visibility and monitoring of those systems, for building tooling and automation to reduce toil and for responding to incidents as part of our 24/7 SRE on-call team. The SRE team: Strives to provide the highest standards of Availability, Scalability, Performance and Security for our Software as a Service environments across multiple cloud vendors and … work. Proactively monitors our platform and responds to incidents as part of a 24/7 rota. Key responsibilities of the role: We're looking for a great Senior SRE to be a hands-on individual contributor to key technical projects and to help us build a first-class SRE function. This role will involve: Hands-on work with technical …
our customer's systems are built and maintained. This role blends operational product support with software engineering to create applications to understand the overall health of our systems. The SRE team sits within a wider programme at the core of the customer mission. The role holder: As an SRE, fundamentally you will be doing work that has historically been done … engineering expertise to substitute automation for human labour, with the objective of limiting traditional manual operations work (incident tickets, on-call etc.) to no more than half of the SRE team's time (and aiming for considerably less). You will have an enthusiasm to learn and experiment, to develop tools to understand application health and improve their reliability … enable them to be scalable and resilient to failure, and how to get the best out of the infrastructure they are deployed to. Participating in the wider DevOps/SRE community within the organisation. Competencies: It is desirable for you to have experience in the areas below. However, more valued for this role is that you have excitement and enthusiasm …
Hybrid position with on-calls. We are seeking a highly motivated and skilled Site Reliability Engineer (SRE) to ensure the reliability, performance, and scalability of the client's critical Data Platform solutions. In this role, you will provide dedicated support and maintain the health of the data infrastructure. This position involves on-call responsibilities to address …
Site Reliability Engineer. Remote - APAC/Engineering. The Tyk API Management platform is helping to drive the connected world and power new products and services. We're changing the way that organisations connect any number of their systems and services. Whether internal, external, public or highly encrypted systems, Tyk helps businesses drive value across the retail, finance, telecoms, healthcare, or … radical responsibility. If this sounds like an environment that you believe could work for you then read on to find out more. The role: We're looking for a Site Reliability Engineer to manage, maintain, improve and provide support on our platform. You will be curious by nature, always looking for ways to improve, as we will … we expect this role to be an advocate of continuous improvement. Reliability of our new global Tyk Cloud platform. Automation of operations and support. Writing and maintaining documentation on SRE processes and policies. Recommending and implementing ways of driving operational efficiency and driving down our cost to run, without impacting service. Assisting in penetration testing for Cloud through liaising with …
innovate in the human digital space. The Role: Join a challenger bank building the future of financial reliability. We're hiring a Senior Site Reliability Engineer (SRE) to help ensure our systems stay secure, scalable, and fast, so customers never miss a beat. You'll gain hands-on experience with modern cloud tech while contributing to a … You will contribute to strategic projects/initiatives and influence the outcome, as well as being accountable for Infrastructure delivery. You will operate in a hybrid role (DevOps/SRE and Apps support) working directly with internal users to resolve issues and improve/shift-left and introduce automation. You will represent the support function on Change Advisory Boards and …
team of passionate thinkers, innovators, and dreamers - and help us connect people and build communities to create economic opportunity for all. About the team and the role: As a Site Reliability Engineer at eBay, you'll play a key role in managing major incidents and the overall health of our services, making sure they are both resilient … and high-performing. You'll create strategies for availability and reliability, enhance domain ecosystem observability, and support a shift toward a more engineering-focused culture. Your contributions will ensure that eBay's technology remains cutting-edge and reliable for our global community. What you will accomplish: Proactive Monitoring: Continuously monitor the health of eBay's critical services to identify … and address potential issues before they escalate. Solution Development: Collaborate with Architecture, Engineering, and Operations teams to develop solutions that ensure high site availability, reliability and performance. Collaborative Problem Solving: Work closely with partner teams to resolve recurring technical issues, onboard new alerts, and develop high-quality Standard Operating Procedures (SOPs). Automation and Process Enhancement: Identify and …
globe. What you'll do: As a Site Reliability Engineer at Zefr, you'll apply your expertise in cloud infrastructure, CI/CD, Observability, and core SRE concepts to deliver high-quality, reliable, and scalable solutions. A significant aspect of this role involves working closely with Zefr's Engineering and Data Science teams ensuring the infrastructure required …
Are you an experienced Senior DevOps/Site Reliability Engineer looking for your next contract role? Join one of the world's leading IT services, consulting, and business solutions organizations. Founded in 1968, the company consistently ranks among the top global IT service providers. With a presence in over 50 countries, the company has built a reputation … across industries including banking, healthcare, telecommunications, and retail. The leading consultancy firm has partnered with a global technology leader and they are currently seeking an experienced Senior DevOps/Site Reliability Engineer to join the team. Additionally, this role provides a hybrid working arrangement based in London. Ready to make a move? Get in touch and apply …
City of London, London, United Kingdom Hybrid / WFH Options
ECS
Site Reliability Engineer - Dynatrace. Initial 6-month contract. Remote Working. £500 - £600, Inside IR35. We're recruiting on behalf of a world-leading technology organisation who are looking for an SRE who has hands-on expertise with Dynatrace to play a pivotal role in growing their observability program. As an SRE, you will be responsible for: Act as a technical advisor …