926 to 950 of 1,203 Permanent Observability Jobs

Senior DevOps Engineer

Hiring Organisation
Understanding Recruitment
Location
Oxford, England, United Kingdom
hire a Senior DevOps Engineer. This is a great opportunity to join a highly technical, cross-functional engineering team working across cloud infrastructure, DevOps, observability, and business-critical platform integrations. The environment is heavily AWS-focused, with strong investment in automation, security, and scalable architecture. This role would suit someone … support scalable AWS infrastructure using Terraform and GitLab CI/CD. Drive DevOps and DevSecOps best practices across cloud platforms and deployment processes. Improve observability, monitoring, and platform reliability across customer-facing systems. Perform troubleshooting, root-cause analysis, and production support within complex cloud environments. Bring strong AWS experience across ...

Director, AI Platform Owner

Hiring Organisation
Jobleads-UK
Location
City Of London, England, United Kingdom
alignment, and audit readiness in partnership with enterprise risk and compliance teams. Oversee AI gateway and control‐plane capabilities including usage tracking, rate limiting, observability, logging, and chargeback mechanisms. Establish clear RACI models for platform ownership versus domain ownership. Enable safe self‐service AI development across low‐code … gateway patterns, zero‐trust architecture, and identity-centric access controls. Understanding of Responsible AI, data classification, and AI risk management. Experience operating platform observability and usage analytics. Background in multi-vendor or hybrid platform environments. Experience within financial services, asset management, or other regulated industries is strongly preferred. Familiarity with ...

Senior AI Engineer - MCP / AI Tooling

Hiring Organisation
Adepta Partners Ltd
Location
Belfast, Northern Ireland, United Kingdom
agents and internal tooling Implement authentication, session handling, streaming, and stateful interactions Help define standards and reusable patterns for MCP development Contribute to observability, reliability, security, and platform scalability Work closely with senior engineers on architecture and AI-native engineering practices Build systems that support safe, governed AI-assisted workflows … product mindset Comfortable working in fast-moving, evolving technical environments Nice to Have Experience with Cloudflare Workers, serverless platforms, or edge environments Knowledge of observability, CI/CD, and platform engineering Understanding of AI security considerations such as prompt injection and permission scoping Experience building internal AI tooling or developer ...

Software Engineer

Hiring Organisation
Metric
Location
City of London, London, United Kingdom
software for advanced computing platforms. Build and optimise low-latency software interfaces and hardware integrations. Contribute to DevOps, CI/CD pipelines, monitoring, and observability tooling. Lead technical projects from design through deployment. Collaborate with product, engineering, and research teams to deliver new capabilities. Improve system performance, reliability, and scalability. … embedded systems, hardware integration, FPGAs, or scientific instrumentation. Background in quantum computing, HPC, telecoms, robotics, defence, semiconductors, or other deep-tech environments. Experience with observability tools such as Grafana, Prometheus, or InfluxDB. Knowledge of digital signal processing, RF systems, or data acquisition. About You: You are a hands-on engineer ...

Site Reliability Engineer

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
moment when the work is genuinely changing shape. Over the last year we've hardened the platform, reduced cost, and built serious observability into our highest-volume systems. The next year is about scaling that work, absorbing infrastructure from a recent acquisition, and being thoughtful about how AI shows … thrive. Here's what that looks like in practice: Month 1: You're onboarded across our AWS estate, Terraform, and observability stack. You've completed your first on-call shift with support from the team, landed your first PR in the DevOps repo, and started working Claude Enterprise into your ...

Senior Security Engineer

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
strategy, conduct technical evaluations, and manage escalation relationships with strategic partners. Ensure compliance with regulatory frameworks and risk management practices in financial services. Drive observability and detection strategies using logging and SIEM solutions, enhancing our security posture. Collaborate with internal stakeholders, architecture teams, and third-party vendors to maintain … Proficiency in cloud networking and security in AWS, GCP, or Azure, including Transit Gateway, VPN/Direct Connect, and Azure Virtual WAN. Experience with observability and detection strategies using logging and SIEM solutions, such as Splunk. Availability for senior technical escalation during critical incidents and after-hours work for high ...

Platform Engineer - DevOps Specialist - 6 months contract £650/d inside IR35

Hiring Organisation
Tenth Revolution Group
Location
United Kingdom
PLEASE NOTE that you must be SC Eligible and open to travel to Telford - 2 days min per month. Are you passionate about SRE, observability, and driving operational excellence? We’re looking for a talented Observability Engineer to help us build and scale world-class monitoring solutions across complex technology … landscapes. 🔍 About the Role: As an Observability Engineer, you’ll play a critical role in ensuring system reliability, performance, and proactive incident management. You’ll work across teams to embed observability into the heart of our solutions, leveraging cutting-edge tooling such as Dynatrace. 💡 Key Responsibilities: Translate ...

Senior Software Engineer

Hiring Organisation
Vermillion Analytics
Location
London, South East, England, United Kingdom
Employment Type
Full-Time
Salary
£65,000 - £80,000 per annum
junior engineers on implementation, design patterns, and engineering practice Identifying and resolving systemic technical issues, not just isolated bugs Improving deployment reliability, monitoring, and observability Communicating trade-offs and risks clearly — to engineers and non-engineers alike Participating in and leading production incident response where needed WHAT THEY'RE LOOKING … experience with Svelte and/or jQuery AWS cloud infrastructure OpenAI APIs or LLM-powered application development Browser or email extension development CRM integrations Observability and operational tooling WHAT GOOD LOOKS LIKE HERE Over time, senior engineers at this company become trusted owners of complex systems, lead technical initiatives involving ...

Principal Engineer - Digital Experience Platform

Hiring Organisation
Jobleads-UK
Location
Skipton, England, United Kingdom
focus of your role is Value–Flow–Quality (VFQ), improve delivery flows, strengthen test automation and embed quality inside CI, and build release‐linked observability that ties every deployment to metrics, logs, traces and golden signals. You will also own the engineering strategy that enables the platform to evolve, to maximise … where lead time drops, deployment frequency rises, and change‐failure rate stays low, using automation‐first pipelines, trunk‐based development, progressive delivery, release‐linked observability and data‐ready environments to make fast, safe flow the norm – with daily/weekly delivery of value the norm. 2. Deep craft in modern ...

Principal ML Platform Engineer

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
GPUs and cloud infrastructure. Develop internal tools and abstractions and agentic systems that reduce operational overhead for researchers and engineers. Drive improvements across observability, automation, reliability, and developer experience. Collaborate closely with researchers and product engineers to understand pain points and turn them into robust platform capabilities. Contribute to technical … model serving systems in production. Supporting research or data‐intensive workloads. Working with GPU‐based systems or other performance‐sensitive infrastructure. Experience with observability and debugging in distributed systems. Familiarity with Terraform, Datadog, GitHub Actions, or similar tools. Bonus points for Experience building agentic or LLM‐powered internal tools. Experience ...

Data Architect

Hiring Organisation
Jobleads-UK
Location
Houghton-le-Spring, England, United Kingdom
technologies and tools that enhance Arriva's data and BI capabilities. Stay current with emerging trends including open table formats (Apache Iceberg, Delta Lake), data observability, real-time/streaming architectures, and cloud-native solutions. Conduct proofs-of-concept and technical assessments to validate new technologies before adoption. Monitor industry best practices … patterns including ETL/ELT, data lake architectures, and lakehouse approaches. Understanding of emerging technologies such as open table formats (Apache Iceberg, Delta Lake), data observability tools, and real-time/streaming architectures. Experience with data governance frameworks, data quality tools, metadata management, and ensuring compliance with security and regulatory requirements. Personal Attributes: Strong ...

Principal 5G Network Core Architect

Hiring Organisation
Jobleads-UK
Location
United Kingdom
security, key management, and trust boundaries Leading the cloud-native design and deployment of 5GC, OAM, and supporting control components (Kubernetes, CNFs/VNFs, observability, resilience) Defining and maintaining OAM data models and workflows aligned with standard management frameworks (O-RAN SMO/O1/O2, NETCONF/YANG) Contributing … security (3GPP security, IPsec/TLS, PKI, RBAC, logging/audit) Experience with cloud-native telco platforms (Kubernetes, CNFs/VNFs, Helm/Operators, observability stacks) Hands-on lab experience integrating 5GC + gNB + OAM using COTS components and standard interfaces (NG, F1/E1, O1/O2, NETCONF ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Tucson, Arizona, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Portland, Oregon, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Denver, Colorado, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Milwaukee, Wisconsin, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Oklahoma City, Oklahoma, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Seattle, Washington, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Jacksonville, Florida, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Albuquerque, New Mexico, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Indianapolis, Indiana, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Philadelphia, Pennsylvania, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Boston, Massachusetts, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Hartford, Connecticut, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...

Staff AI Machine Learning Engineer

Hiring Organisation
Medeloop
Location
Atlanta, Georgia, United States
Employment Type
Permanent
Salary
USD Annual
decommissioning agents dynamically for complex healthcare workflows). Develop rigorous evaluation and safety frameworks - automated testing, benchmarking, regression testing, adversarial testing, safety guardrails, observability (tracing, logging, metrics), and human-in-the-loop mechanisms to ensure reliable, compliant performance in production. Drive LLM and ML model development - train, fine-tune … tools: LangChain/LangGraph, Model Context Protocol (MCP), Agent-to-Agent (A2A) protocols, Hugging Face, PyTorch, vector databases/semantic search, prompt engineering, and observability platforms (e.g., LangSmith, Phoenix). Experience designing fully automated evaluation and testing pipelines for autonomous agents and their orchestration, including metrics for reliability, safety, factuality ...