automation. Experience with CI/CD pipeline management and DevOps practices. Strong understanding of disaster recovery and business continuity planning. Experience with performance tuning and capacity planning. Understanding of chaos engineering principles and practices. Skills in cost optimization for cloud infrastructure. Specific Tools and Techniques: Experience in using cloud-native monitoring tools like AWS CloudWatch, Azure Monitor, and …
availability, and performance of the entire infrastructure stack including compute, storage, network and cloud components. Lead incident response efforts across the infrastructure stack, coordinating with Application Support, SRE, and Engineering teams to minimize MTTD and MTTR. Perform root cause analysis for infrastructure-related incidents and implement corrective actions. Develop and maintain automation tools for managing infrastructure resources. Collaborate with … Engineering teams to plan and execute system upgrades and maintenance. Conduct capacity planning and resource management for all infrastructure components. Participate in on-call rotations to provide 24x7 support for all critical infrastructure issues. Design and implement disaster recovery plans and business continuity strategies. Implement best practices for monitoring, logging, and alerting across the infrastructure. Foster a culture of … automation. Experience with CI/CD pipeline management and DevOps practices. Strong understanding of disaster recovery and business continuity planning. Experience with performance tuning and capacity planning. Understanding of chaos engineering principles and practices. Skills in cost optimization for cloud infrastructure. Preferred qualifications: Experience in using cloud-native monitoring tools like AWS CloudWatch, Azure Monitor, and Google Cloud
reliability, availability, and performance of the entire infrastructure stack including storage, network and cloud components. Lead incident response efforts across the infrastructure stack, coordinating with Application Support, SRE, and Engineering teams to minimize MTTD and MTTR. Perform root cause analysis for infrastructure-related incidents and implement corrective actions. Develop and maintain automation tools for managing infrastructure resources. Collaborate with … Engineering teams to plan and execute system upgrades and maintenance. Conduct capacity planning and resource management for all infrastructure components. Participate in on-call rotations to provide 24x7 support for all critical infrastructure issues. Design and implement disaster recovery plans and business continuity strategies. Implement best practices for monitoring, logging, and alerting across the infrastructure. Foster a culture of … automation. Experience with CI/CD pipeline management and DevOps practices. Strong understanding of disaster recovery and business continuity planning. Experience with performance tuning and capacity planning. Understanding of chaos engineering principles and practices. Skills in cost optimization for cloud infrastructure. Experience & qualifications: Experience in using cloud monitoring tools like AWS CloudWatch, Azure Monitor, and Google Cloud Operations
new Spectra architecture team, we're looking for talented software engineers at various grades to join us and build out the frameworks and services for Health Monitoring and Chaos Engineering. As you can imagine, this service will be a critical part of hundreds of other services, helping to improve the resiliency of the services and help service owners … not be limited to: Design and develop software in Java, Python, and other languages. Participate in the entire software lifecycle – development, testing, CI/CD and production operations. Apply engineering principles for defining robust and maintainable architectures and designs. Build cloud services on top of modern Infrastructure as a Service (IaaS) building blocks. Design and build distributed, scalable, fault … tolerant software systems. Identify requirements, scope solutions, estimate work, schedule deliverables. Help establish and drive the adoption of outstanding coding standards and patterns and help enhance our inclusive engineering culture. Balance between product feature development and production operational concerns like ops automation, structured logging, instrumentation for metrics and participating in on-call. Analyzing and debugging issues, including bugs, customer
London, England, United Kingdom Hybrid / WFH Options
Hargreaves Lansdown
and load test strategies to ensure the system can handle high volumes of traffic. Implement security testing practices to identify and mitigate vulnerabilities. Develop functional resilience strategies such as chaos engineering. Quality Assurance: Support the team in conducting thorough testing of software applications, including unit, integration, system, and acceptance tests. Collaborate with developers to debug and resolve complex issues … quality of code. Continuous Improvement: Identify areas for improvement in the testing process. Stay updated with industry trends and technologies in test automation. Mentor and train SDET and Software Engineering team members in best practices for test automation and software quality. Documentation: Contribute to test plans and strategies documentation. Maintain clear and comprehensive documentation for test automation frameworks and … tools. About You: We are looking for a talented individual who is passionate about quality and testing. You should have: Education: Bachelor's degree in computer science, engineering, or a related field, or equivalent experience. Experience: Advanced experience in test automation development using tools like Selenium, JUnit, TestNG, Cypress, etc. Familiarity with performance testing tools such as Apache Bench
development. Market-leading salary and annual discretionary bonus. Pension contributions, in addition to Health Insurance, Life Assurance. What You’ll Be Doing: This is a hands-on and strategic engineering role where you’ll be responsible for ensuring production stability across a highly dynamic microservices architecture hosted in Azure. You’ll have end-to-end ownership over reliability … reliability. Automating recovery, scaling, and monitoring across distributed systems. Collaborating with cross-functional teams to align platform strategy and reliability goals. What You’ll Bring: 8+ years in software engineering or SRE/production infrastructure roles. Strong experience with Java (Spring) and cloud platforms (ideally Azure). Proven track record in building and maintaining mission-critical systems. Deep understanding … of Kubernetes, observability tooling (Grafana, Prometheus, ELK, etc.), and Infrastructure as Code (Terraform, Bicep). Ability to lead technical conversations across Engineering and Product. Experience in fintech, crypto, or regulated digital infrastructure. RDBMS performance tuning (MS SQL). Knowledge of SLAs/SLOs/chaos engineering and platform risk management. Seniority level: Mid-Senior level.
salary and annual discretionary bonus. Pension contributions, in addition to Health Insurance, Life Assurance. 25 Annual Leave. What You’ll Be Doing: This is a hands-on and strategic engineering role where you’ll be responsible for ensuring production stability across a highly dynamic microservices architecture hosted in Azure. You’ll have end-to-end ownership over reliability … . Automating recovery, scaling, and monitoring across distributed systems. Collaborating with cross-functional teams to align platform strategy and reliability goals. What You’ll Bring: 5+ years in software engineering or SRE/production infrastructure roles. Strong experience with Java (Spring) and cloud platforms (ideally Azure). Proven track record in building and maintaining mission-critical systems. Deep understanding … of Kubernetes, observability tooling (Grafana, Prometheus, ELK, etc.), and Infrastructure as Code (Terraform, Bicep). Ability to lead technical conversations across Engineering and Product. Bonus points if you bring: Experience in fintech, crypto, or regulated digital infrastructure. RDBMS performance tuning (MS SQL). Knowledge of SLAs/SLOs/chaos engineering and platform risk management.
London, England, United Kingdom Hybrid / WFH Options
Elliptic
Senior DevOps Engineer. Department: Engineering. Employment Type: Full Time. Location: London, UK. Description: The impact you will have: You will have a transformative impact across Elliptic by evangelising DevOps, security, and reliability principles and fostering a culture of efficiency and autonomy. You will join a growing team of experienced and passionate engineers who are not afraid to fail and … enjoy tackling difficult problems head-on. Openness is one of our core values at Elliptic, and nowhere is this more evident than in our engineering teams. We strongly encourage engineers to challenge convention and find unique and innovative solutions to our customers' problems. Key Responsibilities: What you will do: Provide senior DevOps expertise and leadership across Engineering at … all layers of the stack. Evangelise DevOps, security and reliability engineering across the Engineering team at large. Provision resilient infrastructure across multiple regions and AZs. Build compliant, reliable and featureful developer platforms centered on container orchestration. Enable Continuous Delivery and Deployment capabilities using CI/CD pipelines and GitOps tooling. Enable shifting left on security and testing, and facilitate progressive
development. Market-leading salary and annual discretionary bonus. Pension contributions, in addition to Health Insurance, Life Assurance. What You’ll Be Doing: This is a hands-on and strategic engineering role where you’ll be responsible for ensuring production stability across a highly dynamic microservices architecture hosted in Azure. You’ll have end-to-end ownership over reliability … reliability. Automating recovery, scaling, and monitoring across distributed systems. Collaborating with cross-functional teams to align platform strategy and reliability goals. What You’ll Bring: 5+ years in software engineering or SRE/production infrastructure roles. Strong experience with Java (Spring) and cloud platforms (ideally Azure). Proven track record in building and maintaining mission-critical systems. Deep understanding … of Kubernetes, observability tooling (Grafana, Prometheus, ELK, etc.), and Infrastructure as Code (Terraform, Bicep). Ability to lead technical conversations across Engineering and Product. Experience in fintech, crypto, or regulated digital infrastructure. RDBMS performance tuning (MS SQL). Knowledge of SLAs/SLOs/chaos engineering and platform risk management.
London, England, United Kingdom Hybrid / WFH Options
Bjak
pipelines. Ensure observability and logging for test executions, e.g. OpenTelemetry, ELK. Collaborate with Software Engineers to enforce quality in system refactoring efforts. Bachelor's Degree in Computer Science, Software Engineering, or related fields. 3+ years of experience in QA Automation for backend services and cloud infrastructure. Strong expertise in API and microservices testing, e.g. Postman, RestAssured, Supertest. Experience in … testing. Good to have: Familiarity with service mesh testing, e.g. Istio, Linkerd. Experience in gRPC and event-driven architecture testing, e.g. Kafka, RabbitMQ. Exposure to fault injection testing, e.g. Chaos Engineering, Gremlin. Fully Remote Role: Work from anywhere and enjoy the freedom of a fully remote position. Innovative Challenges: Work on fast-moving, challenging, and unique business problems.
shape our SRE strategy, establish best practices, and set the standard for service reliability and performance. What You’ll Do: Define strategies for Application Performance Monitoring, Unit Cost, and Chaos Engineering. Continuously optimize production environments to enhance reliability and efficiency. Implement and apply MTTR, SLO, and SLI principles to ensure high service standards. Respond to incidents, analyze root causes … layers that drive our platform’s success. What You Need: Proven experience implementing SRE principles at scale, including deep knowledge of SLI/SLO/SLA differences. A product engineering background with strong coding skills in Python, C#, or similar. Experience with incident management frameworks and evolving them for efficiency. Expertise in cloud platforms (AWS preferred) and container orchestration … real impact, apply now and help us build the future of Thredd! Seniority level: Mid-Senior level. Employment type: Full-time. Job function: Engineering and Information Technology.
who will shape the future of our AI-powered automation platform, with a particular focus on modernizing our application testing and deployment pipelines. The ideal candidate will combine deep engineering expertise with strategic thinking to create intelligent, scalable solutions that transform how we approach automation, dramatically reducing the time and complexity of application validation and delivery. This role requires … LLM-based approaches to test script generation, automated debugging, and intelligent test maintenance across our distributed systems. Pioneer innovative quality practices that leverage AI for automated performance analysis, intelligent chaos engineering scenarios, and predictive system reliability testing. Design self-healing test systems that use machine learning to adapt to application changes, automatically maintain test suites, and provide AI … focused on building solutions that scale across teams, accelerate our testing cycles, and ultimately enable us to deliver higher quality products faster than ever before. About the team: Our Engineering Environment: At Blink, you'll work within a fully integrated engineering ecosystem where you can test across multiple layers - from algorithms and ASICs to hardware, firmware, AWS infrastructure
Software Development Engineer in Test (SDET) to join our journey of technical excellence and customer obsession. As an SDET on our team, you'll blend your software development expertise with quality engineering to create robust test automation frameworks that ensure our subscription services remain reliable, scalable, and performant. You'll work in a collaborative environment where quality is everyone's responsibility … and your contributions will directly impact the viewing experience of customers around the globe. We're at an exciting inflection point in our quality engineering evolution. With numerous opportunities to innovate in our testing approach, you'll have the chance to architect new automation solutions, implement comprehensive test strategies, and drive efficiency improvements that help us deliver features with … Key job responsibilities - Lead the design and evolution of our automated test framework, ensuring it can handle the complex requirements of our expanding product portfolio - Collaborate closely with software engineering and quality assurance teams to identify opportunities to improve test coverage and optimize testing workflows - Develop advanced test automation tools and integrations to accelerate software development velocity and enhance
generation omni-commerce Gateway. We are currently hiring a Principal/Distinguished Engineer to support teams within this domain. In this role you will lead highly technical and strategic engineering initiatives on mission-critical platforms across our team, enabling every engineer to do their best work. Your role will be tasked with solving the most complex, challenging technical problems across … this team to meet our demanding needs. You will play an influential role in partnership with the engineering leadership group and other cross-divisional VPs of Engineering, owning technical vision and direction as well as Developer Experience. In order to excel in this role you will possess: Great communication skills. Ability to influence across teams and with senior stakeholders. … to speed on the latest and greatest happenings within technology. Strong appreciation of Event Storming and DDD, having applied these methodologies in shaping microservices architectures. Experience in creating/engineering Cloud Native Architectures. Additional Experience (nice to have): Some experience with Model Context Protocol/AI, having had some experience in how this can shape the future of
London, England, United Kingdom Hybrid / WFH Options
Goodstack
Responsibilities: Identify issues early, document test cases, automate testing, improve automation tools, and monitor quality gates in CI/CD pipelines. Team Collaboration: Work closely with Grants, Platform, and Engineering teams. Maintain performance, adhere to our pillars, and meet business objectives. Success in this role: At 3 months: understand systems and contribute. At 6 months: execute independently and impact … a continuous flow environment and active participation in Agile practices. Bonus skills include: Designing scalable test frameworks. Security testing fundamentals (OWASP Top 10, SonarCloud). Writing custom GitHub Actions. Chaos Engineering knowledge. What you can expect: Salary reviews, share options, office perks, wellness and learning budgets, conference attendance, volunteer days, generous leave, flexible hours, parental leave, WFH budget
and manage reliability, feature flags and cloud costs. The Harness Software Delivery Platform includes modules for CI, CD, Cloud Cost Management, Feature Flags, Service Reliability Management, Security Testing Orchestration, Chaos Engineering, Software Engineering Insights and continues to expand at an incredibly fast pace. Harness is led by technologist and entrepreneur Jyoti Bansal, who founded AppDynamics and sold
Staff Software Engineer, AI Reliability Engineering. London, UK. About Anthropic: Anthropic's mission is to create reliable, interpretable, and steerable AI systems. We want AI to be safe and beneficial for our users and for society as a whole. Our team is a quickly growing group of committed researchers, engineers, policy experts, and business leaders working together to build … maintaining SLO/SLA frameworks for business-critical services. Are comfortable working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence). Have experience with chaos engineering and systematic resilience testing. Can effectively bridge the gap between ML engineers and infrastructure teams. Have excellent communication skills. Strong candidates may also: Have experience operating large
AWS Fault Injection Service is a fully managed service for running fault injection experiments to improve an application's performance, observability, and resilience. Fault injection experiments are used in chaos engineering, which is the practice of stressing an application by creating disruptive events in testing or production environments. Examples of these events are a sudden increase in CPU or … naturally customer centric and thrive in a fast-paced environment that requires strong technical and business judgment and solid written and verbal communication skills. You are experienced in leading engineering teams, helping individuals grow and making the team effective, while remaining humble and fun! If this sounds like the right challenge for you, then please apply today! Key job … s why you'll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional. BASIC QUALIFICATIONS - 2+ years of engineering team management experience - Knowledge of engineering practices and patterns for the full software/hardware/networks development life cycle, including coding standards, code reviews, source control management
and manage reliability, feature flags and cloud costs. The Harness Software Delivery Platform includes modules for CI, CD, Cloud Cost Management, Feature Flags, Service Reliability Management, Security Testing Orchestration, Chaos Engineering, Software Engineering Insights and continues to expand at an incredibly fast pace. Harness is led by technologist and entrepreneur Jyoti Bansal, who founded AppDynamics and sold … afraid of being data driven - including using Salesforce and other tools to track your progress. Managing the full sales cycle from prospect to close. Collaborating with other teams, including sales engineering and sales development. About You: A proven track record of driving and closing enterprise deals. Account planning and execution skills. Ability to sell to C-Level and across both IT
and manage reliability, feature flags and cloud costs. The Harness Software Delivery Platform includes modules for CI, CD, Cloud Cost Management, Feature Flags, Service Reliability Management, Security Testing Orchestration, Chaos Engineering, Software Engineering Insights and continues to expand at an incredibly fast pace. Harness is led by technologist and entrepreneur Jyoti Bansal, who founded AppDynamics and sold … afraid of being data driven - including using Salesforce and other tools to track your progress. Managing the full sales cycle from prospect to close. Collaborating with other teams, including sales engineering and sales development. About You: A proven track record of driving and closing deals. Account planning and execution skills. Ability to sell to C-Level and across both IT and
business units, focusing on business readiness and practice growth in the region. Deep understanding of AI, GenAI or Agentic AI related solutions and related network management & automation. Experience in chaos engineering to enhance underlying processes, information security gaps and perform internal audits and MIS reporting. Adeptness in IoT VAPT, Zero Trust Security, and orchestration, ensuring robust controls for
in technology operations, who is looking to broaden their skillset. After developing your specialist skills you are now looking for opportunities to grow and learn more about wider resilience, chaos engineering and cloud services - we will support, provide guidance and mentor you. Nevertheless, we are open to other experiences as we are creating a new diverse and dynamic