incident tooling (e.g., PagerDuty, Datadog). Technical Expertise required for this engagement: Guide operational practices across services built using Java (Spring Boot), Kafka, MongoDB and related technologies. Oversee monitoring, observability, and performance tuning using Datadog, ELK, Prometheus, or similar tooling. Problem Management & Root Cause Elimination required: Lead proactive and reactive problem management efforts. Identify recurring production issues and collaborate with … rapid change practices including canary releases, feature flags, and progressive delivery. Continuous Improvement & DevOps Practices: Drive automation and self-service initiatives to reduce manual intervention and operational burden. Champion observability best practices (metrics, traces, logs) and error budget tracking. Promote DevOps culture and continuous feedback loops between engineering and operations. Governance, Risk & Compliance: Ensure operational processes comply with security, privacy …
Research Lab. The role Ensuring resilience, uptime and operational efficiency is mission-critical to its success. As a Production Software Engineer, you will play a key role in driving observability, reliability, change safety and runtime optimisation across a complex, federated engineering environment. You will design and implement the systems, tooling and workflows that ensure the distributed platform is robust, observable … with infrastructure teams to own and evolve domain specific metrics, alerting and diagnostics infrastructure used to operate and monitor the platform Building and maintaining core systems for deployment automation, observability, runtime environment management and release readiness Promoting runtime engineering best practices, working with federated teams to align on standards, service ownership and fault tolerance Participating in a shared production support … are we looking for? Strong background in software engineering, ideally in distributed, real-time systems Experience with containerisation and orchestration technologies, such as Kubernetes, in production environments Familiarity with observability tooling and practices, such as Victoria Metrics, Prometheus, Grafana, OpenTelemetry and SLOs Well-developed debugging skills with the ability to navigate unfamiliar systems, identify root causes and deliver effective solutions More ❯
for those with strong Engineering and DevOps capabilities and a deep interest in operationalising AI solutions. We are looking for someone with complementary skills that extend into infrastructure and observability, preferably with experience in E-Commerce. The AI team owns all ML-related research, implementation and maintenance. In practice, this means keeping up to date with best practices in production … ownership of the deployment and monitoring pipelines within your expertise Contribute to the ongoing innovation R&D projects by enabling production readiness Maintain and implement CI/CD pipelines, observability, and infrastructure for ML services Requirements Degree in relevant field with 3+ years of industry experience Strong Technical Skills: Python, AWS, Docker, Terraform Experience deploying and maintaining machine learning models More ❯
at scale, leveraging AWS Organizations, Landing Zones, and multi-account best practices. Develop and maintain Infrastructure as Code solutions using Terraform, CloudFormation, and AWS CDK. Champion security, compliance, and observability by integrating services like AWS Security Hub, GuardDuty, and Inspector. Design CI/CD pipelines to enable seamless deployments and self-service models for customers. Innovate with AWS Networking, KMS … Proficiency in Python, Go, or similar languages for automation and scripting. Expert-level knowledge of AWS Networking, TLS, and security best practices. Experience with container orchestration (Kubernetes, EKS) and observability tools (Grafana, ELK). A passion for innovation, problem-solving, and delivering high-impact solutions. Experience leading/managing junior engineers. Significant experience with Control Tower and deploying landing zones. …
implementation, testing, and rollout - ensuring operational stability of delivered solutions. Provide hands-on technical guidance, support team planning, and facilitate delivery ceremonies. Champion engineering best practices across development, testing, observability, and operational support. Raise the team's maturity and drive progress towards or maintain Elite DORA standards. Build, mentor, and manage a high-performing software engineering team. Foster a culture … technical proficiency in: Languages: Java 17+ (Java 21 preferred) Frameworks: Micronaut (preferred), Spring Boot Testing: JUnit, Mockito Build Tools: Gradle Data & Messaging: Kafka, MongoDB APIs: GraphQL Federation, REST Infrastructure & Observability: Terraform, OpenTelemetry, Dynatrace Soft Skills & Leadership Exceptional communication skills - able to distill and present engineering decisions to executives and business teams. Experienced in managing relationships with third-party vendors and More ❯
Collaborate with People/HR and engineering leadership on career pathing, training, and coaching for engineering staff. Technology Enablement: Evaluate and deploy tools - especially AI - that support engineering productivity, observability, and collaboration. Work closely with DevOps, QA, and SRE teams to align infrastructure and operational excellence with engineering needs. Own key vendor relationships, evaluation of partnerships and represent technology on … scaling engineering orgs across multiple geographies or domains (e.g., front-end, back-end, infrastructure). Familiarity with tools like Linear, Asana, GitHub, Datadog, DORA metrics, or similar performance/observability platforms. Background in organisational change management or engineering program management. What you can expect from us Competitive salary with substantial incentive schemes Generous long-term incentive plan (LTIP) tez token More ❯
platform that epitomises the best of modern cloud practices - leveraging immutable infrastructure, zero trust and effective pipelines to allow our teams to quickly operationalise workloads that have compliance, security, observability, logging, and alerting baked in, and we're growing a team passionate to deliver it. What Experience You'll Bring to the Team Building out an effective, modular, platform leveraging … existing infrastructure to the new platform. Helping us track, understand, and optimize cloud spend Enabling product teams to own and run their own infrastructure, backed by solid provision around observability, alerting, and operability Building out an effective HA/DR approach so that teams are serviced well around these needs Supporting 'secure by default' approaches on the platform, so that More ❯
Manchester, Lancashire, United Kingdom Hybrid / WFH Options
Starling Bank
problems and challenges, who can work across teams do great things here at Starling, to continue changing banking for good. Responsibilities: As a Data Scientist in the Machine Learning Observability & Governance team, you will play a crucial role in enabling Starling Bank to maximally exploit AI in line with its risk appetite, while ensuring ethical and responsible AI practices. Your … responsible. Stakeholder Communication & Visibility: Ensure clear communication and good visibility with stakeholders such as risk teams, regarding how data scientists at Starling observe and manage ML and AI models. Observability Centre of Excellence: Support colleagues in enhancing their observability work by maintaining existing observability tooling, assisting in identifying key metrics to monitor, and providing expert advice on internally-developed model More ❯
platform engineering team as we scale LangSmith and LangGraph Platform products. You'll work in Europe (remotely) and architect and operate the critical systems that power our customers' AI observability and LangGraph app deployments, working directly with cutting-edge technologies at the intersection of AI and distributed systems. Scale critical systems: Design and implement high-throughput, data-intensive systems supporting … building and operating production systems at scale. Infrastructure expertise: Deep knowledge of Kubernetes, containerized infrastructure, cloud platforms (e.g. GCP). Database expertise: Production experience with OSS datastores (PostgreSQL, Redis, Kafka). Observability mastery: Hands-on experience with observability stacks (Datadog, Prometheus/Grafana, OpenTelemetry or similar). Programming proficiency: Strong hands-on software engineering skills (Python, Go, Rust). Operational mindset: "You build it …
Technical Account Manager (Observability) Join a high-growth SaaS company at the forefront of modern observability, helping clients manage data challenges with scalable solutions. As a Technical Account Manager, you will be a trusted advisor to tech teams, assisting with onboarding, data source integration, troubleshooting, and executive reviews to ensure customer success. This role involves working with cloud infrastructure and … observability stacks, influencing customer outcomes and company growth from day one. Ideal candidates are technically sharp, curious, and customer-focused, with experience in cloud, DevOps, or monitoring tools, and excellent communication skills. You will join a collaborative environment that values humility, learning, and innovation. We offer competitive packages, including stock options, with a clear path to IPO. To learn more, apply below or …
City of London, London, United Kingdom Hybrid / WFH Options
ECS
behalf of one of the world's leading technology organisations, which is looking for an SRE with hands-on Dynatrace expertise to play a pivotal role in growing its observability program. As an SRE, you will: Act as a technical advisor to help teams maximize their use of Dynatrace. Collaborate with engineering groups to design dashboards, alerts … the transition from tools such as Prometheus, Grafana, and CloudWatch to Dynatrace. Recommend RBAC structures and data access models that align with organizational and security requirements. Assist in shaping observability strategies for Kubernetes workloads in hybrid (cloud/on-prem) environments. Promote observability-as-code approaches using Terraform, GitLab, or other infrastructure-as-code solutions. Develop reusable implementation patterns, clear …
City of London, London, United Kingdom Hybrid / WFH Options
Stealth iT Consulting
defining and tracking OKRs/KPIs to measure engineering performance and drive continuous improvement. In-depth understanding of hybrid and multi-cloud environments, CI/CD pipelines, DevSecOps, SRE, observability, and ITIL practices. Experience working with developers to implement and evolve monitoring and observability strategies is a plus. Why Join Us? Client Variety: Work with high-profile clients across industries More ❯
SR2 | Socially Responsible Recruitment | Certified B Corporation™
roadmaps for platform, infrastructure, and AI/ML tooling. Act as primary product partner to engineering, SRE, and data science teams. Lead initiatives that boost developer experience, system observability, and engineering efficiency. Foster a culture of enablement, documentation, and adoption for internal tooling and platforms. Required Skills and Experience: Strong technical foundation (e.g. CS degree, former engineering experience … success in building and scaling core platforms, cloud infrastructure, developer tooling, and API services. Deep, practical knowledge of AWS, distributed systems, CI/CD, IaC, DevOps principles, and observability tooling. Direct collaboration with engineering/SRE teams on internal tooling and shared services. Track record of delivering internal products that boost developer workflows, reliability, or deployment velocity. … with AI/ML platforms, MLOps, and deploying machine learning models into production environments. Core Focus Areas: Core Platform Engineering, Developer Experience (DevEx/DX), Engineering Productivity, DevOps Tooling, Observability Solutions, Platform as a Service (PaaS), Infrastructure as a Service (IaaS), Cloud-Native Architecture. If you're passionate about building foundational platforms and tools that empower world-class engineering teams …
a sharp troubleshooter who can solve problems independently, especially within complex, distributed systems. You'll also need a solid grasp of modern containerized environments and how to effectively use observability tools to keep things running smoothly. You'll have the chance to work both independently and collaboratively with a global team, always striving to ensure our applications are stable, performant … Kubernetes. Investigate logs and system behavior, pulling data from pods and containers using tools like kubectl, Event Viewer, and central logging platforms. Monitor application health, performance, and availability, leveraging observability platforms like Grafana and Kibana. Test and interact with API endpoints, documenting and validating their functionality using tools like Swagger/OpenAPI and Postman. Respond to support tickets promptly within … and log retrieval. APIs: Skilled in interacting with and testing RESTful APIs using Swagger/OpenAPI and Postman. Elastic Stack: Familiarity with Elasticsearch for log aggregation, indexing, and querying. Observability: Good understanding of monitoring and visualization concepts; experience with Grafana. Scripting & Automation: Automation-first mindset; can streamline tasks. PowerShell experience is a plus. Diagnostics: Skilled with Event Viewer, IIS Manager …
delivering fast access from any geography Playlist Services: Dynamic path configuration systems optimizing user connectivity in real-time PGM Relays: Infrastructure for reliable multicast data delivery We use automation, observability, and software engineering to detect issues before they impact customers and reduce manual toil wherever we can. What You'll Do Build production-grade software that powers Bloomberg's global … infrastructure Design and implement scalable, fault-tolerant systems with a focus on observability, performance, and automation Collaborate across engineering teams to introduce automated, self-service operational workflows Conduct deep systems analysis and root cause investigations for complex, distributed systems Propose and prototype innovative approaches to reliability and risk mitigation Contribute to design docs, runbooks, and post-incident reviews-clear communication …/or Kubernetes or other Pipeline Management Platforms is a significant advantage. Machine Management at Scale: Experience with capacity planning and automating the lifecycle of large machine fleets. System Observability and Monitoring: Deep understanding of SLIs/SLOs/SLAs, alerting, and building dashboards for complex systems. Reliability in Distributed Systems: Knowledge of fault tolerance and the unique challenges of More ❯
at scale, leveraging AWS Organizations, Landing Zones, and multi-account best practices. Develop and maintain Infrastructure as Code solutions using Terraform, CloudFormation, and AWS CDK. Champion security, compliance, and observability by integrating services like AWS Security Hub, GuardDuty, and Inspector. Design CI/CD pipelines to enable seamless deployments and self-service models for customers. Innovate with AWS Networking, KMS … architectures and multi-account AWS setups. Extensive experience with AWS Organizations. Expert-level knowledge of AWS Networking, TLS, and security best practices. Experience with container orchestration (Kubernetes, EKS) and observability tools (Grafana, ELK). A passion for innovation, problem-solving, and delivering high-impact solutions. Working with Control Tower and Landing Zones. Why Work For Us? Competitive base salary up …
Step into a company that is reimagining how observability works by building a platform that delivers immediate insights without the overhead of traditional indexing. By eliminating complexity and reducing operational costs by up to 70 percent, this solution offers a unified view across logs, metrics, traces, and security events, all in real time. Now they are looking for a … be the key link. They are ideally looking for someone with: Strong experience supporting technical products in a customer-facing capacity. Deep understanding of cloud-native technologies and modern observability stacks such as Grafana, Datadog, Splunk or similar. A hands-on mindset and the ability to work comfortably across Kubernetes, microservices, and comparable environments. Beyond technical skills, they value clear …
Services : Enables thousands of servers and BPIPE endpoints to "call home" and receive correct settings. Peer Discovery Infrastructure : Groups servers into discoverable clusters and provides tools to manage them. Observability and Monitoring Frameworks : Ensures we have high visibility across a vast estate of global infrastructure. Data Quality Tooling : UI and backend systems for diagnosing distribution issues across the real-time … SRE pillars: Latency Monitoring & Management - Define SLIs/SLOs, track latency, and build tools to diagnose issues. Capacity Management - Maintain disaster readiness and scalability through monitoring and forecasting. System Observability - Proactively detect issues, build alerting systems, and centralize health dashboards. Production Risk Management - Ensure safe software releases, drive infrastructure improvements. Incident Response - Lead or support fast, effective remediation during live More ❯
portals, dashboards, internal tools, and web applications. Collaborate closely with DevOps on CI/CD pipelines, deployment workflows, infrastructure, and SecOps compliance. Uphold high standards for code quality, system observability, and technical documentation. Act as the technical lead, setting direction and best practices for the full-stack engineering team. Mentor engineers, providing guidance on architecture, design patterns, and career growth. … cross-functional teams. Deep experience with React, TypeScript, .NET Core, SOAP/REST APIs, MySQL/PostgreSQL, Red Hat OpenShift, and Kubernetes. Understanding of DevOps, cloud deployments, and service observability. Bonus: Interest/experience in AI, digital twins, Nvidia Omniverse SDK & APIs, Universal Scene Description. What We Offer: Reimbursement for tuition and professional dues. Three weeks of vacation and five …
Bristol, Avon, South West, United Kingdom Hybrid / WFH Options
Hargreaves Lansdown
documentation practices, including Architectural Decision Records, Solution Memos, and C4 diagrams. Guide cloud architecture choices, particularly around container orchestration and the use of AWS services. Champion best practices for observability, logging, security, and networking. Identify opportunities to enhance Developer Experience and efficiency through smarter tooling and frameworks. Support engineering teams with mentoring, pairing, and skills development. Lead conversations around Event … every level of the organisation. Proven ability to balance trade-offs, costs, and technical constraints. Experience coaching teams towards engineering and architecture best practices. Deep understanding of security, networking, observability, and system flows. Adept at producing clear, concise architectural documentation. Desirable: Previous experience as a Solution or Enterprise Architect. Background in enterprise systems and legacy-to-modern transitions. Familiarity with More ❯
Employment Type: Permanent, Part Time, Work From Home
bristol, south west england, united kingdom Hybrid / WFH Options
Hargreaves Lansdown
documentation practices, including Architectural Decision Records, Solution Memos, and C4 diagrams. Guide cloud architecture choices, particularly around container orchestration and the use of AWS services. Champion best practices for observability, logging, security, and networking. Identify opportunities to enhance Developer Experience and efficiency through smarter tooling and frameworks. Support engineering teams with mentoring, pairing, and skills development. Lead conversations around Event … every level of the organisation. Proven ability to balance trade-offs, costs, and technical constraints. Experience coaching teams towards engineering and architecture best practices. Deep understanding of security, networking, observability, and system flows. Adept at producing clear, concise architectural documentation. Desirable: Previous experience as a Solution or Enterprise Architect. Background in enterprise systems and legacy-to-modern transitions. Familiarity with More ❯
bath, south west england, united kingdom Hybrid / WFH Options
Hargreaves Lansdown
documentation practices, including Architectural Decision Records, Solution Memos, and C4 diagrams. Guide cloud architecture choices, particularly around container orchestration and the use of AWS services. Champion best practices for observability, logging, security, and networking. Identify opportunities to enhance Developer Experience and efficiency through smarter tooling and frameworks. Support engineering teams with mentoring, pairing, and skills development. Lead conversations around Event … every level of the organisation. Proven ability to balance trade-offs, costs, and technical constraints. Experience coaching teams towards engineering and architecture best practices. Deep understanding of security, networking, observability, and system flows. Adept at producing clear, concise architectural documentation. Desirable: Previous experience as a Solution or Enterprise Architect. Background in enterprise systems and legacy-to-modern transitions. Familiarity with More ❯
bradley stoke, south west england, united kingdom Hybrid / WFH Options
Hargreaves Lansdown
documentation practices, including Architectural Decision Records, Solution Memos, and C4 diagrams. Guide cloud architecture choices, particularly around container orchestration and the use of AWS services. Champion best practices for observability, logging, security, and networking. Identify opportunities to enhance Developer Experience and efficiency through smarter tooling and frameworks. Support engineering teams with mentoring, pairing, and skills development. Lead conversations around Event … every level of the organisation. Proven ability to balance trade-offs, costs, and technical constraints. Experience coaching teams towards engineering and architecture best practices. Deep understanding of security, networking, observability, and system flows. Adept at producing clear, concise architectural documentation. Desirable: Previous experience as a Solution or Enterprise Architect. Background in enterprise systems and legacy-to-modern transitions. Familiarity with More ❯