DevOps: you build it, you run it. Tech Stack M&S uses a variety of technologies including: Java, Spring, Spring Boot, Micronaut; React, Next.js, TypeScript, Angular; Azure Cloud, Kubernetes, Dynatrace (observability); SQL Server, MongoDB; Ignite, Redis. Everyone's Welcome We are ambitious about the future of retail. We're disrupting, innovating and leading the industry into a more conscientious, inspiring digital era. We're More ❯
Kubernetes) at scale Experience working with a cloud provider (AWS/Azure/GCE), or sysadmin/SRE experience in data centers Experience designing, building, and operating high-scale observability or infrastructure systems Working knowledge of networking fundamentals, experience with CNIs or cloud networking infrastructure preferred What We Require 4+ years of professional software development experience on core infrastructure with More ❯
in Python Proficiency in designing and executing complex prompt strategies and input/output data validation models to achieve desired outputs from LLMs Experience monitoring AI applications using popular observability tools (e.g. Langfuse, Langsmith) to ensure seamless performance and monitoring Strong skills in data transformations for both structured and unstructured data; ability to integrate these processes into scalable pipelines Experience More ❯
AI team. I’m Responsible For... Delivering robust, fully tested, maintainable software that impacts end users Designing and implementing production-ready scalable NLP applications and APIs Developing monitoring and observability solutions and integration testing frameworks Conducting code reviews and providing constructive feedback to team members Ensuring the scalability, performance, and reliability of AI applications Staying up-to-date with the More ❯
distributed service architectures, including how best to test and release them, and how to ensure system stability when making changes independent of other services. You are able to use observability tooling to understand, diagnose, improve, debug, measure and visualise platform health. You are up-to-date with the latest technologies, including AI, for example machine learning for personalisation or automation More ❯
enablement teams, to promote these through regular knowledge sharing sessions. Accountable for operational efficiency - drive improvements in efficiency, reliability, and scalability supported by logging, monitoring and observability as a foundational capability. Responsible for adoption - promote the platform capabilities through technical communities of practice leadership, high internal standards for documented processes and internal guides, and take steps More ❯
to join our Client Impact Team. The Client Impact Team was established to provide fast turnaround for client requests, small features, and defect resolution. The team also owns the observability and health of our operational platform. The team has made enormous improvements in these areas by building tooling. The vision for the coming year is to build on this foundation More ❯
This job is brought to you by Jobs/Redefined, the UK's leading over-50s age inclusive jobs board. Orgvue is an organisational design and planning platform that empowers your business to transform its workforce by understanding the work More ❯
a long-term contract role based in Canary Wharf, offering a hybrid work model (3 days onsite). What You'll Be Doing: Lead the design and implementation of observability frameworks using Splunk for end-to-end monitoring, logging, and tracing. Drive automation of infrastructure provisioning and configuration using DevOps best practices. Provide technical authority and mentorship to engineering … Wallstreet FX environments. Lead incident response efforts and conduct post-mortem analysis to improve system resilience. What We're Looking For: Strong hands-on experience with Splunk architecture and observability tooling Expertise in containerization (Docker/Kubernetes) and cloud-based infrastructure Proficient in ETL/data engineering workflows Background in Energy Trading or Financial Systems is a big plus Excellent More ❯
systems that power real-time and batch predictions at scale • Design production pipelines for training, deployment, and monitoring using modern MLOps tooling • Take ownership of technical quality, resilience, and observability of critical ML services • Build reusable tools and frameworks to enable fast, safe experimentation and deployment Platform, Standards & MLOps Foundations • Define and build the core MLOps capabilities for the organisation … including training pipelines, deployment frameworks, and observability tooling • Establish standardised patterns and best practices to accelerate model development, testing, and deployment • Lead the evolution of our ML platform, working with engineering partners to improve scalability, governance, and developer experience • Contribute to responsible ML practices—supporting auditability, explainability, and model health monitoring Technical Leadership & Collaboration • Partner with data scientists to take … GitHub Actions, ArgoCD) • Proven ability to build reusable tooling, scalable services, and resilient pipelines for real-time and batch inference • Strong understanding of ML system lifecycle: testing, monitoring, governance, observability • Excellent collaboration and communication skills; able to influence cross-functional teams and lead complex technical work • A background in software engineering, computer science, or a quantitative field—or equivalent experience More ❯
at scale, leveraging AWS Organizations, Landing Zones, and multi-account best practices. Develop and maintain Infrastructure as Code solutions using Terraform, CloudFormation, and AWS CDK. Champion security, compliance, and observability by integrating services like AWS Security Hub, GuardDuty, and Inspector. Design CI/CD pipelines to enable seamless deployments and self-service models for customers. Innovate with AWS Networking, KMS … architectures and multi-account AWS setups. Extensive experience with AWS Organizations Expert-level knowledge of AWS Networking, TLS, and security best practices. Experience with container orchestration (Kubernetes, EKS) and observability tools (Grafana, ELK). A passion for innovation, problem-solving, and delivering high-impact solutions. Working with Control Tower and Landing Zones Why Work For Us? Competitive base salary up More ❯
Step into a company that is reimagining how observability works by building a platform that delivers immediate insights without the overhead of traditional indexing. By eliminating complexity and reducing operational costs by up to 70 percent, this solution offers a unified view across logs, metrics, traces, and security events, all in real time. Now they are looking for a … be the key link. They are ideally looking for someone with: Strong experience supporting technical products in a customer-facing capacity Deep understanding of cloud-native technologies and modern observability stacks such as Grafana, Datadog, Splunk or similar A hands-on mindset and the ability to work comfortably across Kubernetes, microservices, and comparable environments Beyond technical skills, they value clear More ❯
City of London, London, England, United Kingdom Hybrid / WFH Options
QA
is seeking a dedicated DevOps Engineer Apprentice to bolster their NHS project team. In this role, the chosen candidate will be instrumental in enhancing the incident management protocols, advancing observability and monitoring strategies, and refining CI/CD practices within the AWS ecosystem. Responsibilities: Collaborating with cross-functional teams to ensure smooth and reliable incident management using Jira and ServiceNow. Developing … and implementing observability and monitoring solutions to ensure high system availability and performance. Contributing to maintaining and improving CI/CD pipelines, ensuring efficient code integration and deployment on AWS. Supporting the design and execution of automated test strategies to enhance the quality and security of cloud-based applications. The successful candidate must have: Experience with AWS cloud services and management tools. Familiarity with More ❯
agentic AI systems. Here are some examples of areas you will be working in: Develop our bespoke agentic assistant execution platform (TypeScript) Build and integrate systems for retrieval, search, observability, multi-modal generation, and more Maintain model evaluation suites to optimise for quality, cost and reliability Deploy and operate stateful workloads on our cloud infrastructure (AWS) Contribute to our real … and operationalising language models in production, especially in multi-agentic systems Experience in deploying and scaling stateful compute workloads using Docker, Kubernetes, Pulumi or other IaC approaches Experience in observability tooling and DevOps practices Experience with collaborative real-time technologies (CRDTs, Y.js, sync engines, etc.) Interest in writing (fiction, non-fiction, essays) However, you are NOT expected to: Be an More ❯
Experience using AWS (Serverless) and/or GCP Understand the importance of driving quality into code through test automation Have supported applications in production, with demonstrable experience of good observability practices within a full-stack environment (e.g. RUM, tracing) Have worked in a collaborative environment with strong engineering practices and know what good engineering looks like Care about the product … mindset, you will have intimate knowledge of our products from code commit through to production operation Supporting production systems with monitoring tools such as Datadog Strive for stable systems observability You will champion our principles, fuel a growth mindset by getting involved in communities and help improve our engineering culture Pushing the boundaries, questioning the status quo, ensuring what we More ❯
services at scale in the cloud Collaborating with engineers, designers, product managers within and outside of the team Implementing maintainable and well-tested solutions iteratively, keeping business impact and observability as a primary focus Understanding the wider context of the business and designing system architectures that meet short and long term business goals Your skillset: Algorithms, data structures Strong technical … knowledge in observability Web services, REST, Containers, cloud Testing, reliability, monitoring Strong knowledge in building and owning application end-to-end, from inception to maintenance, to retirement Professional communication skills in architecture design Expertise in one of the programming languages and paradigms - our systems are written in TypeScript, Java, Golang, Rust, Python and others Desirable Experience with Kubernetes, Prometheus, Terraform More ❯
Services : Enables thousands of servers and BPIPE endpoints to "call home" and receive correct settings. Peer Discovery Infrastructure : Groups servers into discoverable clusters and provides tools to manage them. Observability and Monitoring Frameworks : Ensures we have high visibility across a vast estate of global infrastructure. Data Quality Tooling : UI and backend systems for diagnosing distribution issues across the real-time … SRE pillars: Latency Monitoring & Management - Define SLIs/SLOs, track latency, and build tools to diagnose issues. Capacity Management - Maintain disaster readiness and scalability through monitoring and forecasting. System Observability - Proactively detect issues, build alerting systems, and centralize health dashboards. Production Risk Management - Ensure safe software releases, drive infrastructure improvements. Incident Response - Lead or support fast, effective remediation during live More ❯
major incidents and the overall health of our services, making sure they are both resilient and high-performing. You'll create strategies for availability and reliability, enhance domain ecosystem observability, and support a shift toward a more engineering-focused culture. Your contributions will ensure that eBay's technology remains cutting-edge and reliable for our global community. What you will … JVM configurations, and a deep understanding of UNIX, Linux, networking (TCP/IP), and databases (both relational and NoSQL). Experience in Android and iOS application debugging. Experience with observability tools such as Grafana and Prometheus, and skills in documenting procedures for knowledge management. Strong interpersonal and communication skills to thrive in fast-paced, dynamic environments. NOTE: As part of More ❯
/CD pipelines for modern web applications Familiarity with infrastructure-as-code tools such as Terraform Understanding of security best practices in web infrastructure and application delivery Exposure to observability tooling and techniques (e.g., Prometheus, Grafana, structured logging) Confident in debugging and resolving issues in complex distributed or web-based systems A product mindset and collaborative approach to improving how More ❯
Infrastructure as Code (IaC) principles. This senior position offers the opportunity to influence a broad range of infrastructure domains, including: Networking & Exchange Connectivity Linux Systems & Kubernetes Administration Microservice Orchestration & Observability Disaster Recovery & Security Hardening SKILLS & EXPERIENCE REQUIRED: Containers & Orchestration: Deep knowledge of Kubernetes and container security. Experience in managing global or multi-cluster deployments is highly valued. Distributed Systems & Messaging More ❯
optimization, anomaly detection, and predictive analytics. Understanding of AI frameworks and libraries (e.g., TensorFlow, PyTorch, Scikit-learn) and their application in network automation and monitoring. Experience with telemetry and observability frameworks (e.g., Prometheus, Grafana) for real-time network monitoring and troubleshooting. Experience: Minimum of 7 years of experience in network engineering, operations, and support. Proven ability to work hands-on More ❯
to ensure high standards. Our platforms are highly stable, with a focus on building new features and exploring innovative technologies. We have significantly reduced critical incidents through investments in Observability, with teams responsible for their applications in production (Run What You Wrote). Supporting a global customer base, we tailor online experiences for various regions, overcoming technical restrictions and optimizing More ❯