1,051 to 1,075 of 1,268 Observability Jobs

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Providence, Rhode Island, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Fort Wayne, Indiana, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Las Vegas, Nevada, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
El Paso, Texas, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Salt Lake City, Utah, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Sioux Falls, South Dakota, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Technology Head of AI

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
adherence to security and compliance requirements. Drive rapid experimentation with clear exit criteria; scale successful pilots into reliable, maintainable services using automated testing, observability and release practices. Develop and manage strategic vendor and partner relationships, balancing build vs. buy decisions and negotiating commercial and risk terms that protect value. Provide … including model risk management, privacy, data residency, human oversight and auditability. Hands on understanding of AI/MLOps practices and platforms, including model lifecycle, observability, cost control, CI/CD, feature stores and data integration. Experience defining and governing enterprise standards and architectures for AI platforms, APIs and integration, aligned ...

Staff Machine Learning Engineer, ML Infrastructure

Hiring Organisation
SimpliSafe
Location
Cambridge, Massachusetts, United States
Employment Type
Permanent
Salary
USD Annual
highest-stakes ML systems at SimpliSafe. Identify and remove the systemic bottlenecks in our ML deployment infrastructure - whether that's serving reliability, deployment friction, observability gaps, scaling, or cost. Build and operate real-time CV inference at scale: own the design and evolution of cloud-side inference systems that process … durable. Own reliability and operational excellence: lead incident response and postmortems for critical ML systems; turn lessons learned into platform-level improvements. Define SLOs, observability standards, and on-call practices for ML services in production. Qualifications: 8+ years of software/ML engineering experience, with a clear track record ...

London-Based Observability TAM - Drive Real-Time Data Value

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
leading tech company in Greater London is seeking a seasoned Technical Account Manager (TAM) to redefine the observability landscape. The role involves leading post-sales journeys, engaging with stakeholders from software engineers to executives, and troubleshooting complex integrations. Candidates should have hands-on experience with observability tools like Grafana, DataDog ...

Manager Applications

Hiring Organisation
Medline Industries
Location
Northbrook, Illinois, United States
Employment Type
Permanent
Salary
USD 201,000 Annual
Job Summary The Application Manager will be responsible for the organization's portfolio of business applications across various departments. This will include development, implementation, upgrades, daily management, and maintenance, Stakeholder management, application availability, system and ...

Site Reliability Engineer — Observability & Automation

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
infrastructure. The role involves designing and implementing monitoring solutions, analyzing system performance, and optimizing operational processes. The ideal candidate will have strong monitoring, observability, and alerting skills, experience with cloud platforms, and excellent problem-solving abilities. This position offers a hybrid work environment, unlimited PTO, and a comprehensive benefits program. ...

Staff SRE: Observability, Automation & Global Reliability

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
London. This role focuses on the reliability, scalability, and performance of Replit's infrastructure serving millions of users worldwide. You will work on designing observability solutions, leading incident response, and automating operational tasks while mentoring other engineers. The ideal candidate has extensive experience in Site Reliability Engineering, strong programming skills ...

RVP, EMEA Sales - Observability

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
just to execute a function, but to help redefine the future of how work gets done. Observe by Snowflake brings AI-native observability to the Snowflake AI Data Cloud, helping engineering and data teams debug, optimize, and understand systems operating at massive scale. Traditional observability tools were not built … strong judgment, and the ability to align people, strategy, and execution across functions. WHAT WE LOOK FOR 10+ years of experience selling cloud, infrastructure, observability, data platforms, or enterprise software. 2+ years of experience managing high-performing enterprise sales teams. Experience selling to senior technical and business stakeholders, including CIOs ...

Software Engineer (Prometheus / Grafana)

Hiring Organisation
SRT Marine Systems PLC
Location
Bristol, United Kingdom
Employment Type
Permanent
Salary
£50000 - £75000/annum
Software Engineer (Prometheus/Grafana) here at SRT, you will be part of a small team tasked with implementing an end-user observability visualisation. Currently, we have observability dashboards in place for our engineers, utilising Prometheus for metrics collection and Grafana for visualisation. This initiative aims to deliver a more … across multiple sites. We are fortunate to have a team of highly experienced engineers, including UX designers, who can provide support and guidance. Our lead observability engineer will oversee and assist with your work throughout the project in the role of Software Engineer (Prometheus/Grafana). Key Responsibilities - Software Engineer ...

Software Engineer (Prometheus / Grafana)

Hiring Organisation
SRT Marine Systems PLC
Location
Birmingham, West Midlands (County), United Kingdom
Employment Type
Permanent
Salary
£50000 - £75000/annum
Software Engineer (Prometheus/Grafana) here at SRT, you will be part of a small team tasked with implementing an end-user observability visualisation. Currently, we have observability dashboards in place for our engineers, utilising Prometheus for metrics collection and Grafana for visualisation. This initiative aims to deliver a more … across multiple sites. We are fortunate to have a team of highly experienced engineers, including UX designers, who can provide support and guidance. Our lead observability engineer will oversee and assist with your work throughout the project in the role of Software Engineer (Prometheus/Grafana). Key Responsibilities - Software Engineer ...

Principal Engineer - Customer Engagement Platform

Hiring Organisation
Jobleads-UK
Location
Skipton, England, United Kingdom
Apps, Power Automate and the CRM/engagement ecosystem. You define and embed cross‐cutting standards such as API/event contracts, workflow architecture, observability, resilience patterns, and dependency baselines, and drive adoption of the Golden Path: policy‐as‐code CI/CD, progressive delivery, automated rollback/forward … building on Dynamics 365, Power Platform and workflow automation to move with speed *and* confidence. Through Golden Path pipelines, policy‐as‐code, release‐linked observability, on‐demand environments and shift‐left quality, you turn high‐performance delivery into a normal, repeatable capability that compounds over time. This empowers colleagues, reduces ...

Senior Sales

Hiring Organisation
Harrington Starr
Location
London Area, United Kingdom
Senior Enterprise Sales Financial Markets | Observability & Infrastructure Software London/Hybrid £120k–£140k base x2 OTE uncapped This is a high-growth, PE-backed technology platform used by Tier-1 financial institutions to monitor, analyse and optimise mission-critical trading and infrastructure environments. The software sits deep within low-latency … Infrastructure & SRE teams Production Engineering Trading Technology Capacity & Performance Engineering Enterprise Architecture You will be positioning a platform that sits at the heart of observability, operational resilience and infrastructure intelligence across complex financial ecosystems. Commercial Scope Deals are large, strategic and multi-year: £100k+ minimum entry point £500k typical deal ...

Performance Engineer

Hiring Organisation
Morson Edge
Location
United Kingdom
Employment Type
Contract
combines deep technical capability with the ability to enable and coach multiple engineering teams. You'll own performance strategy, enhance internal performance tooling, improve observability, and help shape the next generation of AI-assisted performance analysis. The Opportunity You'll act as the central performance engineering specialist across multiple product … internal performance platform. The environment is mature in performance testing, so the focus is on taking things to the next level through automation, innovation, observability improvements, and AI-driven tooling. You'll work closely with engineering teams, tech leads, and platform teams to improve system performance, reduce manual effort ...

Principal Cloud Engineer - Azure - Hybrid - Manchester

Hiring Organisation
Jobleads-UK
Location
Manchester, England, United Kingdom
Implement governance, policy, and identity standards (Entra ID) Develop core platform capabilities, including: API Management (APIM) and Web Application Firewall (WAF) Logging, monitoring, and observability Introduce and scale Infrastructure as Code (Terraform) across the environment Contribute to the design and implementation of business continuity and disaster recovery strategies Support … working with: Azure landing zones and governance frameworks Infrastructure as Code (Terraform preferred) Identity and access management (Entra ID/Azure AD) Monitoring and observability tooling Experience working in environments undergoing cloud transformation Ability to operate across engineering and architecture, with a focus on practical implementation Strong communication skills ...

Principal Cloud Engineer - Azure - Hybrid - Manchester

Hiring Organisation
Experis
Location
Manchester, United Kingdom
Employment Type
Permanent
Salary
£78000/annum + Excellent Bens
Implement governance, policy, and identity standards (Entra ID) Develop core platform capabilities, including: API Management (APIM) and Web Application Firewall (WAF) Logging, monitoring, and observability Introduce and scale Infrastructure as Code (Terraform) across the environment Contribute to the design and implementation of business continuity and disaster recovery strategies Support … working with: Azure landing zones and governance frameworks Infrastructure as Code (Terraform preferred) Identity and access management (Entra ID/Azure AD) Monitoring and observability tooling Experience working in environments undergoing cloud transformation Ability to operate across engineering and architecture, with a focus on practical implementation Strong communication skills ...

Solution Train Engineer - Payments - Contract - URGENT!

Hiring Organisation
Mark Loucas Payments
Location
South East, United Kingdom
Employment Type
Contract
Contract Rate
GBP Annual
dynamic environment Excellent coordination and communication skills across technical and non-technical teams Hands-on experience with tools like JIRA, Confluence, GitLab, and observability/monitoring platforms Familiarity with production readiness checklists and Live proving processes Experience that would be great to have: SAFe certification Exposure to scaled agile frameworks … SAFe) or large enterprise delivery models Knowledge of Release process, service-level objectives (SLOs), observability, and incident response workflows Certification in Agile, Scrum, or Release Train Engineering Minimum of 5 years of strong delivery experience ...

Senior Backend / Full-Stack Engineer (E5/E6 Level) – AI-Native Startup – Strong Comp + Equity

Hiring Organisation
Mondrian Alpha
Location
London Area, United Kingdom
preferred • Experience with modern frontend frameworks (React/Next.js) is a plus for full-stack candidates • Strong understanding of system design, reliability, scalability, and observability • Experience in startups or fast-paced product environments is highly desirable • AI-native mindset — comfortable leveraging AI tooling and rapid iteration workflows • Strong communication skills … React, Next.js • AI-native workflows and internal LLM tooling • Distributed systems and real-time infrastructure • OpenSearch, SingleStore, Trigger.dev, Axiom • Modern cloud-native infrastructure and observability stack What they offer: • Excellent compensation + meaningful equity • High-ownership environment with direct impact on product and architecture • Small, elite engineering team • Direct collaboration ...

AI Engineering Director

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
Build APIs, integrations, MCP Servers, and reusable platform capabilities to connect AI systems with enterprise platforms, tools, and workflows. Establish evaluation, experimentation, regression, and observability frameworks to continuously improve AI system quality, reliability, and agent behavior. Mentor senior engineers and influence engineering direction through code reviews, architecture discussions, technical standards … makers with compelling technical arguments. Preferred Qualifications, Capabilities, and Skills Experience with enterprise-scale AI platform development. Knowledge of industry-standard AI evaluation and observability frameworks. Expertise in cloud-native architectures and container orchestration. Proven track record of cross-functional collaboration and leadership. Familiarity with MCP protocols and enterprise integration ...

Lead Product Engineer

Hiring Organisation
Albert Bow
Location
London Area, United Kingdom
production LLM systems, including evals, retrieval, agent orchestration, prompts and tool use • Setting the standard for engineering quality, testing, code review, deployment and observability • Working directly with customers to understand problems and shape solutions • Mentoring engineers through code reviews, technical discussions and hands-on collaboration • Making pragmatic architectural decisions around … Fluency in Python or a closely related backend language • Experience designing systems with genuine complexity, including queues, async workflows, durable state and production-grade observability • Hands-on experience building and operating LLM-based products in production • Strong experience with evals, agentic systems, RAG, prompt design and tool use • A proven ...

Full Stack Engineer

Hiring Organisation
Techmunity | AI Startup Recruitment
Location
London Area, United Kingdom
pattern (RAG, agents, structured outputs, classification) balancing accuracy, latency, cost and data privacy. Making AI features production-grade. Own the eval harnesses, prompt versioning, observability, cost controls and guardrails that separate a demo from a product. Pull the context features need. Query the data warehouse for what each feature depends … practical LLM toolkit: RAG, tool use and agents, structured outputs, prompt engineering Know how to make AI features production-grade: eval harnesses, prompt versioning, observability, cost and latency controls, guardrails Be comfortable with SQL to pull what you need from the data warehouse Come from a maths, physics, computer science ...