regular 1:1s, career development planning, and technical coaching. Support Agile processes and help the team improve velocity and product quality through well-managed sprints. Advocate for operational excellence, observability, and automation in software development practices. Ensure services are built with security, reliability, and performance in mind - especially in cloud-native environments. Work cross-functionally to resolve dependencies, align with More ❯
results that matter. By taking advantage of all structured and unstructured data - securing and protecting private information more effectively - Elastic's complete, cloud-based solutions for search, security, and observability help organizations deliver on the promise of AI. What is The Role The Search Inference team is responsible for bringing performant, ergonomic, and cost effective machine learning (ML) model inference More ❯
vs non-relational (NoSQL) databases in modern digital platforms. Hands-on experience with GitHub and GitHub Actions for implementing CI/CD pipelines and supporting DevOps workflows. Familiarity with observability tools and practices (e.g., logging, tracing, metrics) to support performance, reliability, and incident response. Understanding of content management systems (CMS), digital asset management (DAM), and video streaming platforms or protocols More ❯
Leeds, Yorkshire, United Kingdom Hybrid / WFH Options
ASDA
Engineers to understand problems, analyse requirements & deliver solutions that enhance engineering productivity. Write code for low-latency, highly available and scalable solutions. Contribute to delivering initiatives to improve system observability, incident response processes and operational efficiency. Continually update technical knowledge and skills using internal training as well as taking time to self-develop utilising external sources. Champion a culture of continuous …
AWS Cloud Engineer: We're seeking a Cloud Engineer to own and scale our AWS-based infrastructure, powering a platform used by millions of cybersecurity individuals. You'll ensure performance, security, scalability, and cost-efficiency, while enabling fast, reliable deployments …
Leeds, Yorkshire, United Kingdom Hybrid / WFH Options
William Hill PLC
Our team is building the next-generation Sports Betting platform that optimizes flexibility, performance, responsiveness and resiliency. The technologies we like to use include Java, Spring Boot, Kafka, Cassandra, Postgres, Kubernetes, AWS, etc. We are looking for an experienced Java …
Salary: Competitive Plus Benefits. Location: London Store Support Centre and Home, London, EC1M 6HA. Contract type: Permanent. Business area: Sainsbury's Tech. Closing date: 30 June 2025. Requisition ID: 306085. We'd all like amazing work to do, and real …
Manchester, England, United Kingdom Hybrid / WFH Options
bet365
Who we are looking for: A Site Reliability Engineer who will enhance system reliability, observability and performance through a strong engineering approach and assist with incident resolution and best practices. You will have software engineering skills, focusing on system reliability and observability. You will monitor the health, performance and availability of critical systems, directly impacting operational efficiency. Using your engineering … practices and develop features for maintainability. You will also help engineer tools and automation for effective service management. Collaboration is key, working across multiple functions to integrate reliability and observability best practices into the software development life cycle. By supporting governance standards set by the central teams, you will foster a culture where these principles are integral to development. Your … of Site Reliability Engineering principles, including the creation and management of effective Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for reliability and customer satisfaction. Knowledge of contemporary observability tools, techniques and best practice including Splunk, New Relic, Grafana and PagerDuty. Excellent knowledge of programming languages including Python, Golang and JavaScript. Knowledge and experience of modern software development …
Stoke-on-Trent, England, United Kingdom Hybrid / WFH Options
bet365
Who we are looking for: A Site Reliability Engineer who will enhance system reliability, observability and performance through a strong engineering approach and assist with incident resolution and best practices. You will have software engineering skills, focusing on system reliability and observability. You will monitor the health, performance and availability of critical systems, directly impacting operational efficiency. Using your engineering … practices and develop features for maintainability. You will also help engineer tools and automation for effective service management. Collaboration is key, working across multiple functions to integrate reliability and observability best practices into the software development life cycle. By supporting governance standards set by the central teams, you will foster a culture where these principles are integral to development. Your … of Site Reliability Engineering principles, including the creation and management of effective Service Level Indicators (SLIs) and Service Level Objectives (SLOs) for reliability and customer satisfaction. Knowledge of contemporary observability tools, techniques and best practice including Splunk, New Relic, Grafana and PagerDuty. Excellent knowledge of programming languages including Python, Golang and JavaScript. Knowledge and experience of modern software development …
infrastructure components including AKS , managed identities, network controls, and secure storage. Manage infrastructure state with Terraform , integrating into our GitOps workflows. API and Automation Development (Java) Security & Compliance Engineering Observability and Incident Management Technical Environment Cloud: Microsoft Azure (AKS, Azure DevOps, storage, networking) Languages: Java (API/tooling), Bash, YAML, Go (optional) CI/CD: GitHub Actions, Argo CD, Terraform … Concourse (legacy) Observability: Datadog, custom metrics ingestion Source Control: GitHub Enterprise Orchestration: Kubernetes (Helm-based), Argo CD, Terraform What You’ll Bring: Deep experience designing and maintaining Azure-based infrastructure at scale. Solid engineering skills in Java , particularly for backend systems and automation tools. Proven ability to re-architect CI/CD systems, with hands-on experience in GitHub Actions … Background in financial services, trading infrastructure, or regulated environments. Experience with GitHub Enterprise, Argo CD patterns, or Kubernetes policy enforcement. Contributions to open-source tooling in CI/CD, observability, or platform engineering. This is a hands-on, high-leverage role for engineers who want to build resilient systems and own the tooling that powers real-time trading infrastructure. Apply More ❯
spirit. Responsibilities: Define and enforce SLOs, SLIs, and error budgets across critical services Develop and implement cloud infrastructure and tooling strategies Enhance SRE practices across the organization Implement robust observability metrics, logs, and traces using our observability tools Guide the team in building automated, self-healing systems Own and evolve incident response processes, including on-call practices and post-mortem … with AWS core services (EC2, EKS, RDS, S3, ALB/NLB, IAM, CloudWatch, etc.) Proficiency in Infrastructure as Code using Terraform and knowledge of GitOps workflows Strong background in observability: metrics, visualization, logging, tracing Understanding of automation, CI/CD pipelines, deployment automation, and release strategies Experience with incident management, disaster recovery, root cause analysis, and post-incident reviews Additional More ❯
and hybrid retrieval mechanisms Implement evaluation frameworks (BLEU, ROUGE, hallucination checks) to monitor answer quality Deploy production systems on GCP (Cloud Run, Vertex AI, BigQuery, Pub/Sub) Own observability, IaC (Terraform), and CI/CD (GitHub Actions) pipelines Collaborate with product, mobile, and clinical experts to ship weekly improvements Ensure compliance with data privacy standards (GDPR, NHS DSPT) Who … or recommender systems at scale Deep knowledge of embeddings , LLM-based retrieval , and vector similarity search Hands-on with GCP (or AWS/Azure), Terraform, CI/CD, and observability Strong communicator, product-minded, and thrives in fast-paced startup environments UK-based and available to work 2–3 days per week in-office (London) Bonus Points Experience in healthcare More ❯
tools to manage a large-scale, multi-vendor network with an emphasis on automation, telemetry, and model-driven infrastructure as code. Automate the full network lifecycle, including provisioning, configuration, observability, testing, troubleshooting, and capacity planning. Collaborate with architecture and design teams and the CTO office to implement new technologies that ensure scalability, efficiency, and operational resilience. Develop tools and platforms … that enhance the observability, reliability, and performance of the production network. Enhance existing monitoring and observability frameworks, integrating intelligent alerting and self-remediation capabilities to reduce manual intervention and improve incident response. Define and measure service-level objectives (SLOs) to track infrastructure performance and reliability. Write software utilizing orchestration systems to automate tasks and interact with other systems. Provide mentorship …
This new role will establish a multidisciplinary team supporting platforms and federated engineering teams, ensuring that digital products are robust, scalable, and cost-effective. By building and maintaining automation, observability, and CI/CD pipelines, and by championing best practices in reliability engineering, the team enables faster, safer software delivery and rapid incident response. KEY RESPONSIBILITIES AND IMPACT Lead and … develop high-performing DevOps Engineers: recruit and manage a multidisciplinary team responsible for automation, observability/monitoring, security & compliance automation, CI/CD pipelines, reliability/resilience, FinOps, root cause/incident response, dashboarding/reporting, and 24/7 runbook & on-call coordination. Drive platform automation and operational excellence: own and evolve automation strategies, tooling, and processes that … engineering and incident response: embed best practices in site reliability engineering, including proactive monitoring, incident detection, root cause analysis, and continuous improvement to minimize downtime and user impact. Enhance observability and operational visibility: oversee the design, implementation, and evolution of monitoring, alerting, dashboarding, and reporting capabilities that provide actionable insights and enable rapid response to issues. Embed security, compliance, and …
operate autonomously within production environments, integrating LLMs, multi-agent workflows, cloud-native infrastructure and real-time API interfaces. Experience translating ML models into resilient cloud applications, optimizing for performance, observability and secure operations at scale. Core Responsibilities Architect distributed agentic systems using LLMs and tool-using AI components across enterprise cloud environments Design and implement modular, event-driven architectures (e.g. … SageMaker , Bedrock or OpenAI APIs Build support systems for autonomous agents including memory storage, vector search (e.g., Pinecone, Weaviate) and tool registries Enforce system-level requirements for security, compliance, observability and CI/CD Drive PoCs and reference architectures for multi-agent coordination , intelligent routing and goal-directed AI behavior Contribute to internal standards for scalable AI deployments, model governance More ❯
City of London, London, United Kingdom Hybrid / WFH Options
Staffworx
operate autonomously within production environments, integrating LLMs, multi-agent workflows, cloud-native infrastructure and real-time API interfaces. Experience translating ML models into resilient cloud applications, optimizing for performance, observability and secure operations at scale. Core Responsibilities Architect distributed agentic systems using LLMs and tool-using AI components across enterprise cloud environments Design and implement modular, event-driven architectures (e.g. … SageMaker , Bedrock or OpenAI APIs Build support systems for autonomous agents including memory storage, vector search (e.g., Pinecone, Weaviate) and tool registries Enforce system-level requirements for security, compliance, observability and CI/CD Drive PoCs and reference architectures for multi-agent coordination , intelligent routing and goal-directed AI behavior Contribute to internal standards for scalable AI deployments, model governance More ❯
automation, and operation of the enterprise data platform, ensuring its capabilities align with business needs. The platform is built on Azure, Snowflake, Data Lakes, and Kafka, requiring expertise across security, data governance, integrations, observability, DevOps, and automation. This is a London-based role, as regular on-site client engagement is required. The position will be hired by Marionete, a leader in delivering cutting-edge data … ingestion, storage, processing, and consumption). Drive the implementation of modern platform architectures such as Lakehouse, Kappa, and Lambda, ensuring alignment with industry best practices. Oversee end-to-end platform capabilities, including security, monitoring, observability, automation, and governance. Implement DevOps and automation practices, ensuring a highly available, resilient, and self-service platform for data engineers and consumers. Ensure seamless data ingestion and processing pipelines, leveraging Kafka, Fivetran, Snowflake …
with React & Material UI, Postgres, Hasura and AWS Serverless Technologies such as Lambda, DynamoDB and EventBridge - all managed via AWS CDK & SST. We use Sentry, Lumigo and LogRocket for observability and GitHub Actions for automated testing and deployment. End-to-end Ownership. You will be entrusted with end-to-end ownership of your projects. From product, design and architectural decisions … TypeScript and AWS). You focus on having a high impact. You've spearheaded the engineering of critical systems before, working with best-in-class tooling in AWS, IaC, observability and quality assessments. You want to discover the best ways to bring this to an early-stage startup. You know what good can look like. You understand what it takes … to build highly reliable & well-architected products. You build with quality, observability & redundancy at the forefront. You're ready to get a lot done. You enjoy all aspects of building a product and are comfortable moving across the stack when necessary. You enjoy problem solving and thinking from first principles. You're ready to pick up new skills and build …
event streaming. Our infrastructure is cloud-native, leveraging AWS services and following infrastructure-as-code principles. We practice DevOps and GitOps methodologies with a strong focus on automation and observability . Working at scale presents exciting challenges, and we're looking for someone passionate about solving complex problems and building robust, scalable solutions. What makes a great Platform Engineer at … ArgoCD) and Infrastructure as Code (Terraform/Terragrunt) Kubernetes expertise in container orchestration and cluster management Network engineering skills including load balancers, CDN, Istio, and security patterns Experience with observability platforms (OpenTelemetry) and distributed systems Nice-to-have skills: Python programming and Linux system debugging Database administration (SQL, MongoDB, Redis) Message broker and event streaming experience (Kafka) Database performance optimisation More ❯