Cambourne, Cambridgeshire, United Kingdom Hybrid / WFH Options
Remotestar
strong track record of building and maintaining highly reliable infrastructure and services. Expertise in incident management, including incident response, resolution, and post-mortem analysis. Proficiency in monitoring, alerting, and observability tools such as Prometheus, Grafana, ELK stack or Datadog. Experience with cloud platforms such as AWS, Azure, or GCP, including infrastructure as code tools like Terraform or CloudFormation. Strong scripting More ❯
environments like GCP, Azure, AWS as well as Private Cloud solutions Soft skills & Emotional Intelligence - Staying calm under pressure and effectively handling stressful situations whilst continuing to communicate effectively Observability - Familiarity with monitoring tools to diagnose issues and ensure system performance and reliability About working for us Our focus is to ensure we're inclusive every day, building an organisation More ❯
collaboration skills, with the ability to influence and align diverse teams on a shared vision. Knowledge of DevOps practices and tools CI/CD pipelines. Knowledge of Monitoring and Observability tooling. In addition, any experience of these would be useful: Familiarity with data mesh concepts (such as ownership based on specific areas, and thinking about data products). Expertise in More ❯
end automation to eliminate toil, improve efficiency, and enhance operational resilience. Lead the transition from traditional IT operations to a proactive, AI-driven, self-healing infrastructure. Establish a global observability, telemetry, and predictive analytics framework for real-time insights. Align operational strategies with business goals, ensuring IT supports digital transformation initiatives across BCG Core, BCG X, and CT. Infrastructure & Cloud More ❯
Manchester, Lancashire, United Kingdom Hybrid / WFH Options
BAE Systems (New)
or DevOps Expertise in microservices and API design Docker, and container runtime platforms such as Kubernetes, EKS, ECS etc Strong understanding of operational concepts on AWS, particularly monitoring and observability, FinOps Utilising CI/CD tools, such as Bamboo, Jenkins, TeamCity, Bitbucket, in order to streamline delivery of new features and fixes Continual testing of code using Automated Testing Frameworks A More ❯
Ansible, Fluentd) Build and optimize CI/CD pipeline templates in GitLab to streamline application deployment and test workflows across various environments Deploy and maintain robust monitoring, alerting, and observability tools (e.g. Prometheus, Grafana, ELK) to enhance performance, reliability, and visibility Automate incident management processes, including root cause analysis and self-healing mechanisms, to improve platform stability Ensure compliance with More ❯
Site Reliability/DevOps Engineer London - 5 Days Onsite Up to £550 per day (Umbrella, Inside IR35) 12-Month Contract Must hold live and transferable DV Clearance Are you passionate about reliability, automation, and supporting mission-critical systems? Join this More ❯
working, or the ability to flex your start and finish times. Where possible, we support a working pattern that suits your lifestyle and helps you reach your ambitions. Title: Observability Engineer Base location: Belfast/Remote UK About the company: Imperva, a Thales company, is an analyst-recognized cybersecurity leader-championing the fight to secure data and applications wherever they … pops and core infrastructure with new modern technologies, embracing Infrastructure as code at all levels with automation as a core requirement for all projects. We are looking for an Observability Engineer to work within our SRE teams to design, build and iterate on our O11Y platform. This engineer will have to work both hands on and strategically with our architects … global service delivery and product teams to plan an observability road map and then execute on those plans. Responsibilities: Assess & Enhance Observability: Review the current observability platform, identify areas for improvement, and guide the team in enhancing monitoring, logging, tracing, and alerting capabilities. Design & Implement Solutions: Develop and optimize observability solutions that provide deep insights into system and service health. More ❯
our production systems. Key Responsibilities Design, implement, and manage AWS cloud infrastructure. Develop and maintain automation scripts and tooling. Support production systems and ensure high availability and performance. Implement observability and monitoring solutions. Collaborate closely with the PBS (Platform/Backend Services) team. Contribute to infrastructure as code (IaC) and DevOps best practices. Requirements Hands-on experience with AWS. Automation … experience (e.g., Terraform, Ansible, CI/CD tools). Strong understanding of infrastructure and cloud architecture. Experience supporting production environments. Familiarity with observability tools (e.g., Prometheus, Grafana, CloudWatch). Excellent problem-solving and communication skills. Desirable Experience working in a fast-paced or agile development environment. Familiarity with container technologies (e.g., Docker, Kubernetes). Previous experience in a similar role More ❯
We are looking for an enthusiastic and inquisitive Platform Engineer to help strengthen our platform engineering capabilities and improve the supportability, observability, and documentation of our infrastructure. You will ensure that infrastructure, connectivity, and shared service dependencies are fully mapped, documented, and efficiently supported. You will play a central role in enabling engineering teams by developing and maintaining cloud-native …/CD workflows. Maintain clear documentation of infrastructure, connectivity, and platform dependencies across environments. Champion the creation and maintenance of support documentation, diagrams, and knowledge bases for platform components. Observability & Troubleshooting Ensure observability is embedded across all layers of the infrastructure stack, enabling proactive alerting, monitoring, and root cause analysis. Evaluate and implement tooling to enhance visibility, debugging, and response … with GitOps tools (e.g. ArgoCD). Experience with Service Mesh technologies (e.g. Anthos). Scripting skills (e.g. Bash, Python) to build automation and tooling. Exposure to monitoring, logging, and observability tooling (e.g. Prometheus, Grafana, GCP Operations Suite). Understanding of shared service architecture, security, and access control patterns. Behavioural Competencies Strategic Documentation: Ensures infrastructure and connectivity are mapped, maintained, and More ❯
Hereford, Herefordshire, West Midlands, United Kingdom Hybrid / WFH Options
Twinstream Limited
Work Scheme Key Responsibilities of the Site Reliability Engineer: Partner with developers to improve performance and reliability across systems Automate toil and reduce unnecessary alerts with smart tooling Evolve observability so we can prevent issues before they become incidents Improve CI/CD pipelines and support development teams in delivering quality faster Explore new technologies, tools, and services that improve … plus) Experience with Terraform and modern IaC practices Hands-on with Docker and orchestration tools (Kubernetes, OpenShift, or Docker Swarm) CI/CD experience (Jenkins or equivalent) Monitoring/observability tools: Grafana, Prometheus, or InfluxDB Event-driven messaging: RabbitMQ or similar Strong Linux skills, scripting, and understanding of network security protocols Experience with AWS: EC2, S3, RDS, Lambda Desirable: Experience … coding in Python, Java, or Go Exposure to cross-domain solutions Experience in a service management environment Observability best practices and metric-driven reliability improvement Security Requirements Due to the sensitive nature of our work, candidates must be eligible for Developed Vetting (DV) clearance. All offers are subject to security screening. Ready to Engineer Systems That Matter? If you're a More ❯
Manchester, Lancashire, United Kingdom Hybrid / WFH Options
Embarcaderomediagroup
ll sit at the heart of our engineering operations, bringing together SRE principles and modern platform engineering practices. This includes combining principles of SRE - such as service-level reliability, observability, incident response - with platform engineering practices like GitOps, Infrastructure as Code, DevSecOps automation, and self-service enablement, to help development teams ship faster, safer, and more cost-efficiently. What you … ll be doing: Designing and operating highly reliable, scalable, and secure Azure-based platforms Applying SRE principles like SLOs, observability, and incident management to drive service reliability Building Infrastructure as Code using Terraform (v1.7+) and GitOps workflows Enabling teams through platform tools, reusable Terraform modules, and self-service infrastructure Enhancing CI/CD pipelines (Azure DevOps, YAML-based) with security … knowledge (AKS, Functions, SQL, Cosmos DB, etc.) Strong Infrastructure as Code skills with Terraform (v1.7+) Experience with CI/CD pipelines, GitOps, and automation tools (PowerShell, Bash) Familiarity with observability and incident tools like Datadog, ELK, and synthetic monitoring Solid understanding of networking (TCP/IP, Load Balancing, DNS, Routing) Good knowledge of DevSecOps practices - including security scanning, IAM, and More ❯
Sunnyvale, California, United States Hybrid / WFH Options
TalentDetect
reliability efforts Architect, build, and manage scalable infrastructure using Terraform, AWS, and Kubernetes Design and maintain CI/CD pipelines (Jenkins, Maven, Git-based flows) Set up and optimize observability tools such as Prometheus, Grafana, and Datadog Write automation scripts and backend tooling in Python (preferred), Golang, or Rust Perform advanced Linux server debugging, log analysis, and incident response Ensure … flexibility required) Preference for Bay Area-based candidates for smoother collaboration Technical Skills: Terraform, Kubernetes, AWS (across both core and advanced services) CI/CD tools: Jenkins, Maven, Git Observability: Prometheus, Grafana, Datadog Scripting/Backend: Strong Python (preferred), with knowledge of Golang or Rust Operating Systems: Linux (with advanced-level debugging) Personal Traits: Highly trustworthy, reliable, and committed Comfortable More ❯
both strategic vision and the ability to dive deep into technical challenges. Responsibilities Lead and Manage the Platform Engineering Initiatives Define and execute the technical roadmap for platform infrastructure, observability, and developer experience Drive DevOps, SRE, and Infrastructure initiatives to ensure platform reliability and performance Foster a culture of automation, observability, and continuous improvement Architect and Implement Scalable Solutions Design … optimal performance and scalability across all regions Own Platform Reliability and Operations Define and maintain SLOs/SLIs/SLAs for critical platform services Implement comprehensive monitoring, alerting, and observability solutions Design and maintain disaster recovery and business continuity plans Lead incident response and post-mortem processes Optimize Platform Performance and Costs Implement strategies to optimize infrastructure costs without compromising … in solving complex technical issues Contribute to codebases as needed to drive projects forward Requirements Technical Expertise Proven experience managing Kubernetes clusters and expertise in container orchestration. Experience with observability tools (e.g., DataDog, Prometheus, Grafana) Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation Experience in Database optimization and management (especially for multi-tenant architectures) Extensive knowledge of More ❯
workflows. Build infrastructure automation using tools like Terraform to ensure consistent provisioning of cloud and on-prem resources. Manage and evolve CI/CD systems, focusing on deployment standardization, observability, and integration with platform tools. Implement and maintain secrets management practices using tools like Vault across environments. Develop self-service tooling to enable development teams to manage application deployments and … configurations. Partner with teams to implement platform observability and system monitoring using tools such as ELK and Prometheus. Contribute to platform documentation, knowledge sharing, and developer onboarding for platform tooling. Required Qualifications: Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience. 10+ years of experience in platform engineering, DevOps, or infrastructure roles. Expert with Kubernetes … Experience with Vault for secure secrets management. Proficiency with scripting or programming languages (e.g., Python, Go, Bash). Experience with Terraform or other Infrastructure as Code tools. Familiarity with observability tools (e.g., ELK stack, Prometheus). Elastic experience is a plus. Strong collaboration and communication skills. Must have or be able to obtain SEC+ certification within three months of hire. More ❯
Crewe, Cheshire, United Kingdom Hybrid / WFH Options
Manchester Digital
platform security, reliability, and performance across systems deployed in Canada, the UK, and AWS cloud environments Contribute to key projects, platform optimizations, and ongoing maintenance initiatives Help drive scalability, observability, and operational excellence If you're passionate about infrastructure, cloud, and systems engineering, and want to help shape the future of mobility, we want to hear from you! Requirements We … configurations (Azure AD, Ory, Cognito, Firebase) - Understanding of Site Reliability Engineering and key concepts - Proficient in Infrastructure as Code pipeline deployments and pipeline version control within Terraform or CloudFormation. - Observability Systems, e.g., Nagios, New Relic - Able to troubleshoot/work under pressure, meet deadlines. - Previous experience in a cloud engineering role. - AWS certified as SysOps Administrator/Solutions Architect/… understanding of Infrastructure as Code principles and related tech such as Terraform or CloudFormation - Enhanced experience of AWS cloud technologies, e.g., ECS, EC2, VPC, Lambda, CFS. Ideally AWS certified. - Observability Systems, e.g., New Relic, CloudWatch, SquadCast - ITIL Qualified or awareness of the framework. Bonus Qualifications: - Experience with Linux system administration and troubleshooting. - Basic knowledge of AWS cloud technologies such as More ❯
infrastructure and system issues, as well as log ingestion and communication issues. Design and develop scalable, robust, and high-performance data pipelines and data storage solutions. Develop and maintain observability frameworks using tools like Kibana, Grafana, or similar Work with cross-functional teams to define observability and search requirements. Scale, script and maintain our development and production platform foundation with More ❯
contributing to CI/CD improvements and IaC refinement Partner with developers to resolve bottlenecks in the delivery pipeline Driving Excellence Lead improvements in platform reliability, cost optimisation, and observability Establish DevOps standards, documentation, and training for others Be a core contributor to platform and infrastructure roadmap planning Key Goals & Objectives: Ensure infrastructure is secure, scalable, and cost-efficient in … Azure Improve system reliability through automation, observability, and alerting Increase engineering velocity by optimising CI/CD pipelines Enable and mentor engineers across the org on DevOps best practices Key Responsibilities Design, implement, and maintain Azure infrastructure using IaC tools Build and refine CI/CD pipelines using GitHub Actions Troubleshoot production systems with Elastic Cloud tooling Collaborate with developers More ❯
strategy, execution, tooling and best practices Collaborate with multiple product teams and respective owners to design infrastructure as we scale Building custom metrics and features to enhance Primer's observability Infrastructure as code (IaC) development Writing processes and documentation for system design, troubleshooting and maintenance What are we looking for? Strong experience with a cloud provider (AWS preferred but we … Kubernetes clusters Knowledge of security best practices and the ability to implement security controls at the infrastructure level Experience with monitoring and logging tools like DataDog or Grafana's observability stack (Prometheus, Tempo, Loki, Grafana) Familiarity with the open standard OpenTelemetry Excellent written and verbal communication skills, we're a collaborative team! PLEASE NOTE: Our engineering teams work fully remotely More ❯
results that matter. By taking advantage of all structured and unstructured data - securing and protecting private information more effectively - Elastic's complete, cloud-based solutions for search, security, and observability help organizations deliver on the promise of AI. What Is The Role: You will have the opportunity to work with a tremendous services, engineering, product, and sales team and wear … consultant will be focused on excellence, taking the initiative for self-improvement and possess great communication skills. Our customers' use cases extend across all the Elastic Solutions: Enterprise Search, Observability and Security, and beyond, and the scale of data in their environments ranges from gigabytes to petabytes. This diverse mix of a customer base means the challenges they face that More ❯
Bristol, Avon, South West, United Kingdom Hybrid / WFH Options
Twinstream Limited
Socials & Events Cycle to Work Scheme & Life Assurance Key Responsibilities of the Site Reliability Engineer: Work closely with engineers and sysadmins to increase performance and reduce toil Advance system observability, monitoring and alerting Automate, troubleshoot, and proactively resolve issues before they escalate Improve development environments to meet delivery and quality targets Research and evaluate tools and platforms to support scale More ❯
Terraform). Experience in software development in general, with skills in a high-level language (e.g., Python, JavaScript, TypeScript, Java) and familiarity with modern development practices Understanding of Cloud Observability, Monitoring, and Tracing tools (Datadog, CloudWatch, Jaeger, ELK) and how best to leverage them to support effective MTTR and mitigate high CFR Our UK benefits: Stock Options Annual Performance Bonus or More ❯
or GCP): Migration and operation of cloud environments, including compute and storage scalability Containerisation & Virtualisation: Familiarity with virtual and physical server provisioning, especially in strategic data centres Platform Resilience & Observability: Designing for uptime, performance, and root cause analysis. Web Services & APIs: Used for Integration with 24+ LBGI systems Batch Processing: Understanding of batch suite performance and scheduling constraints RPA & Automation More ❯