City of London, London, United Kingdom Hybrid / WFH Options
Huxley Associates
availability, secure deployments, and efficient agent orchestration using AKS. You will create and maintain CI/CD pipelines for Azure services and Semantic Kernel agents, manage Kubernetes clusters, and integrate observability tools to monitor system health and performance. You'll also ensure alignment with enterprise-grade security practices, including zero trust principles, identity-aware routing, and integration with Azure API Management More ❯
London, South East, England, United Kingdom Hybrid / WFH Options
Salt Search
production. Deploy, maintain, and optimise machine learning services within a cloud environment (AWS). Recommend and implement prompt management tools and provide expertise in prompt engineering. Introduce and manage observability, monitoring, and evaluation frameworks for ML and AI services. Enable auto-evaluation of prompts and models against domain-specific requirements. Build Python-based microservices, data pipelines, and serverless functions. Collaborate More ❯
Leeds, West Yorkshire, United Kingdom Hybrid / WFH Options
Tria
within enterprise systems. Strong understanding of cloud platforms (Azure preferred). Knowledge of Infrastructure-as-Code (IaC), APIs, and automation tools. Familiarity with CI/CD pipelines, monitoring, and observability tools. Knowledge of ITSM, Agile, DevOps, and service-level objectives (SLOs) and indicators (SLIs). Excellent problem-solving skills and ability to work in complex, multi-supplier environments. Desirable: Bachelor More ❯
a focus on security, data protection, and performance optimization. Experience managing transport and change governance, incident triage, and root cause analysis. Skilled in monitoring tools like SAP Cloud ALM, observability platforms, and incident management platforms such as Jira or Azure DevOps. Adept at documentation using Confluence and following agile methodologies like Scrum and Kanban. Exceptional stakeholder management and communication skills More ❯
designed infrastructure that scales without slowing anyone down. Tame complex LLM infrastructure (real-time usage, flaky providers, token routing - the lot). Raise the quality bar across the board: observability, auth, reliability, and more. This isn't a role for passengers. It's for engineers who love ambiguity, thrive under pressure, and see infrastructure as a multiplier. What We're More ❯
team leadership. Expert-level knowledge of AWS and deep hands-on experience with AWS CDK in production environments. Strong background in DevSecOps, infrastructure-as-code, CI/CD, and observability practices. Proven ability to scale cloud platforms in a high-growth, high-regulatory tech environment. Experience building and leading high-performing technical teams across multiple cloud disciplines. Strong understanding of More ❯
Wetherby, West Yorkshire, Yorkshire, United Kingdom
Equals One Ltd
Architecture: Evolve a modular, scalable platform (ECS on AWS), with clear boundaries between ingestion, retrieval, reasoning and delivery. Quality & reliability: Testing (unit/integration/evals), CI/CD, observability (tracing/metrics for LLM and retrieval paths), and performance tuning. Collaboration: Work closely with Product and ELT; mentor engineers; contribute to technical strategy and research. Innovation: Research and recommend More ❯
LS22, Wetherby, City and Borough of Leeds, West Yorkshire, United Kingdom
Handshaik
Architecture: Evolve a modular, scalable platform (ECS on AWS), with clear boundaries between ingestion, retrieval, reasoning and delivery. Quality & reliability: Testing (unit/integration/evals), CI/CD, observability (tracing/metrics for LLM and retrieval paths), and performance tuning. Collaboration: Work closely with Product and ELT; mentor engineers; contribute to technical strategy and research. Innovation: Research and recommend More ❯
will: Design and evolve the architecture of highly scalable, reliable, and secure distributed systems. Drive technical excellence across the engineering organization by setting standards for code quality, system design, observability, and operational best practices. Collaborate closely with Product, UX, and Application Engineering teams to deliver impactful features while ensuring architectural soundness and scalability. Mentor and guide senior and mid-level More ❯
concerns and driving service excellence. Communicate effectively with internal and external stakeholders, providing insights and updates on service health and operational performance. Continuous Improvement: Lead initiatives to increase automation, observability, and operational resilience. Stay abreast of industry trends, emerging technologies, and best practices, fostering a culture of continuous learning within the team. Requirements: Proven experience in IT Service Operations, ideally More ❯
you thrive in a fast-paced environment where you can make a real difference, we want to hear from you! Required skills/expertise: Develop and implement a comprehensive observability strategy for self-hosted deployments, including infrastructure and tooling for monitoring, alerting, and troubleshooting. This will involve designing and implementing robust metrics and logging systems. Engineer the ACRA platform for More ❯
with stakeholders across the company to shape roadmaps, scope ambitious projects, and balance technical innovation with delivery. Champion engineering excellence - from coding standards and CI/CD pipelines to observability and incident management - creating a culture of technical rigor and continuous improvement. Be a visible technical leader within 9fin, pushing forward the adoption of AI/ML across teams and More ❯
Observability Lead - Corporate Bank Birmingham | Hybrid | Up to £86,500 + benefits We're working with a global bank that's looking for an Observability Lead to make its systems more reliable, resilient, and easier to support. This role is a mix of hands-on problem solving and leading change, with plenty of scope to drive transformation across monitoring, automation … and production support practices. What you'll be doing Leading improvements to critical banking applications, making them more reliable and resilient. Driving the bank's observability and monitoring transformation, introducing smarter tools, automation, and modern practices. Bringing an SRE mindset to the team and embedding change across processes and culture. Coaching and mentoring the team, helping them adopt new ways … Strong technical background with Java, UNIX, Linux, and Middleware platforms (eg, WebLogic). Cloud experience - ideally Google Cloud, but AWS or Azure are also relevant. Hands-on experience with observability tools like New Relic, Splunk, Grafana, or Dynatrace. Proven experience in driving transformation, modernising production support, and improving system resilience. Leadership experience - able to guide teams, influence culture, and embed More ❯
innovation cycles. You will have the opportunity to take ambiguity and refine it into valuable outcomes, taking risks where justified by the reward. You will understand how CI/CD, observability, and SLOs form part of a mature product offering and push for best practices. Use your insight to prevent production issues before they happen. When issues do occur, you will More ❯
technologies: Logical reasoning, scripting ability, security concepts (light); Infrastructure as Code (Terraform); AWS infrastructure (VPC, EC2, IAM); Linux tooling and system admin; CI/CD pipelines from infra perspective; Observability, logging, monitoring; GitOps, container orchestration (K8s). Benefits: As well as a competitive pension scheme, BAE Systems also offers employee share plans, an extensive range of flexible discounted health, wellbeing & lifestyle More ❯
London, South East, England, United Kingdom Hybrid / WFH Options
Method Resourcing
teams to operationalize models and ship ML-powered features into production. Continuously assess and iterate on production models, balancing long-term ML strategy with tactical improvements. Champion code quality, observability, and resilience within their ML systems through reviews and hands-on contributions. Help shape their internal ML standards and practices, ensuring they stay ahead of industry advancements. Offer technical mentorship More ❯
colleagues and clients across the Snowflake ecosystem. Experience in designing and delivering business solutions on other modern data platforms (e.g. Databricks, Azure, AWS or GCP native stacks). Experience with platform observability and CI/CD for data platforms. Hands-on experience with modern data engineering tools such as dbt, Fivetran, Matillion or Airflow. History of supporting pre-sales activities in a product or More ❯
optimization, anomaly detection, and predictive analytics. Understanding of AI frameworks and libraries (e.g., TensorFlow, PyTorch, Scikit-learn) and their application in network automation and monitoring. Experience with telemetry and observability frameworks (e.g., Prometheus, Grafana) for real-time network monitoring and troubleshooting. Experience: Minimum of 7 years of experience in network engineering, operations, and support. Proven ability to work hands-on More ❯
companies counter the new style of attacks on the ever-changing landscape of cybersecurity. Wallarm gives developers, Security Ops and DevSecOps teams the ability to secure their APIs via observability, and ensure Protection and Analytics to manage risk, protect the business, and enable speed of development with safety. As a Solution Architect aligned with Customer Engineering, you will be an More ❯
to support diverse analytics and operational workloads. Ensure data quality and consistency by incorporating automated testing, data validation, and monitoring mechanisms. Drive best practices around data engineering, including testing, observability, security, and documentation. Troubleshoot and resolve issues in production environments to ensure data integrity and platform reliability. What you bring: 5+ years of professional experience in data engineering roles, preferably More ❯
others in the team. You have a bias to simplicity, where you care most about achieving impact. Bonus: Experience with evaluation harnesses and frameworks for Generative AI. Experience with observability, monitoring, and safety techniques for deployed GenAI systems. Experience in strongly typed languages such as Go. The Company: Our mission is to be the definitive food company. We are transforming More ❯
AI SRE assistant. Kubernetes promises agility, elasticity, reliability and high availability, but it also introduces complexity, high operational overhead, and cost overruns due to over provisioning of workloads. Traditional observability only surfaces the "what" - Komodor goes further by delivering the "why", "where" and the "how"; providing a full platform to detect, investigate and remediate while optimizing workloads. By combining our More ❯
functional teams delivering and maintaining large-scale digital platforms, ensuring high availability, scalability, and resilience. The role requires a blend of technical depth and leadership capability particularly in automation, observability, and mentoring team members. Key Skills & Experience: DevOps/SRE experience (5+ years) – ownership of projects, strong automation and Infrastructure-as-Code approach, incident management, and leadership of initiatives. Terraform … state management, and AWS integration. Kafka – experience with production clusters, scaling, tuning, troubleshooting, and event-driven systems. MongoDB – strong admin experience including replication, sharding, tuning, and backups. Monitoring/Observability – Prometheus, Grafana, ELK, Datadog, with strong alerting/SLO design. AWS – expertise across EC2, VPC, S3, RDS, IAM, ALB/NLB, and cost optimisation. Linux – advanced administration, performance debugging, and More ❯
provide solutions. Mentor junior team members, providing guidance on standard methodologies for DevOps. Infrastructure Automation & Management: Use Terraform/OpenTofu and automation frameworks to provision and manage infrastructure. Monitoring & Observability: Configure and utilise observability tools like Datadog for performance monitoring, alerting, and visualisation, ensuring system reliability and quick identification of issues. Performance Optimisation: Continuously monitor the performance of the tools More ❯