876 to 900 of 1,213 Permanent Observability Jobs

AI Engineering Product Manager

Hiring Organisation
Jobleads-UK
Location
Waterside, Scotland, United Kingdom
grade AI agents integrated with complex airline systems. Establish best practices for OpenAI, Anthropic, Azure OpenAI, LangGraph, AutoGen and other frameworks. Implement engineering discipline: observability, safety, automated evaluation, behavioural testing and continuous improvement. Matrix and Partner Leadership Operate effectively across Group, OpCos, cloud, data and security teams. Coordinate delivery streams … direct authority. Demonstrated integration of LLM‐based agents with enterprise systems, APIs, RPA, orchestration platforms and internal tools. Grounding in DevSecOps, cloud‐native architecture, observability and CI/CD. Strong communication skills; able to translate complex technical concepts to senior executives. Experience with high‐stakes, fast‐paced environments and ambiguous ...

Platform Principal Engineer

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
self-service capabilities. Upskill and Mentor: Transition the in-house engineering team into a high-performing internal platform team throughout the platform build process. Observability: Design and implement enterprise-grade logging, metrics, and tracing for Kubernetes at scale. IaC Leadership: Implement and manage Infrastructure as Code to a senior standard … Terraform/OpenTofu module design. (MUST) Kubernetes Engineering: GitOps (Argo CD/Flux), secrets management, ingress/mesh, and OPA/Gatekeeper. (MUST) Observability: OpenTelemetry (MUST) Tooling: Spacelift, Atlantis, or Terraform Cloud (Desired) Governance: EPAC (Enterprise Policy as Code) (Desired) What You'll Bring To Us Recent, hands ...

Senior Software Engineer

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
flexibility, simplicity and delivery speed Build and maintain backend services and integrations that support our insurance journeys Work with infrastructure, CI/CD and observability to help the team ship safely and often Partner with product, design and data to turn ambiguous opportunities into concrete, measurable improvements Raise the technical … similar Testing: integration and end-to-end testing, component story testing, and visual regression testing CI/CD: Automated testing and deployment pipelines Observability: Analytics platforms, error monitoring and performance tracking Cloudflare experience, including Workers, CDN or load balancing Builder.io or other visual/content tooling experience ...

Director of Software Engineering

Hiring Organisation
Spire
Location
Glasgow, Scotland, United Kingdom
hands-on: review code, prototype solutions, and get into the details when it matters Establish engineering standards across code quality, system design, testing, and observability, and hold the team to them Be the person engineers come to when the problem is genuinely hard Team Building & Culture Recruit, develop, and retain … Experience writing performance software in Rust Background in space systems, aerospace, or highly constrained real-time environments Experience building data lakes, telemetry platforms, or observability infrastructure at scale A history of leading teams through technical transformations and not just maintaining the status quo Spire operates a hybrid work model ...

Principal Machine Learning Infrastructure Engineer London, United Kingdom

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
training pipelines for throughput, fault tolerance, and cost efficiency, including checkpointing strategies, gradient accumulation, and multi-node synchronization. Build and maintain experiment tracking and observability systems that give researchers clear visibility into training runs, hyperparameter sweeps, and model performance. Data I/O and Performance Solve data loading bottlenecks … workflows generate and consume data Experience building model serving infrastructure with latency and throughput requirements Familiarity with experiment tracking tools (Weights & Biases, MLflow) and observability stacks (Prometheus, Grafana) What we offer Equity options – share in our success and growth. 10% employer pension contribution – invest in your future. Free office lunches ...

Cloud Architect

Hiring Organisation
Tata Consultancy Services
Location
Luton, England, United Kingdom
least privilege, KMS encryption, secrets management, data classification, PII redaction, prompt/response filtering, and model governance. Drive non-functional requirements: reliability, scalability, latency, observability, DR, and cost controls (FinOps) for GenAI workloads. Guide build teams through solution design, reviews, and implementation; produce architecture artefacts (HLD/LLD), patterns … more languages (Python/Node.js preferred) and infrastructure-as-code (CDK/CloudFormation/Terraform) for repeatable deployments. Experience setting up observability for GenAI: tracing, logging, metrics, and model/application performance dashboards. Excellent communication skills for architecture storytelling, stakeholder management, and client-facing workshops. Rewards & Benefits TCS is consistently ...

Senior Front-End Engineer

Hiring Organisation
Mochi Health
Location
San Francisco, California, United States
Employment Type
Permanent
Salary
USD Annual
will own your applications end-to-end: architecting the client-side state, building flawlessly responsive UI, optimizing rendering performance, and owning the frontend observability in production. If you are drawn to product problems where the UX complexity is real, the autonomy is absolute, and the impact on patient outcomes … responsible for what happens after the code ships. You will own the frontend deployment pipelines, establish strict performance budgets, and manage client-side observability and error tracking (e.g., Sentry, Datadog) to catch regressions before our patients do. Build Agentic Workflows: Mochi is an AI-first engineering org. You will ...

Cloud SRE - Global Observability Lead (Remote UK)

Hiring Organisation
Jobleads-UK
Location
Newcastle upon Tyne, England, United Kingdom
leading technology company is seeking a Staff Site Reliability Engineer - Cloud to architect the Observability Centre of Excellence, ensuring reliability and uptime of global platforms. This role involves implementing OpenTelemetry, developing automation scripts, and optimizing platform performance while collaborating with engineering teams. Required skills include experience with observability tools like ...

Senior SRE & Observability Engineer – Trade Tech

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
Bloomberg L.P. is seeking a Senior Software Engineer/SRE for the TRAX Observability team in London. This role involves enhancing systems for performance metrics, improving telemetry reliability, and collaborating with various teams across global offices. Candidates should have experience with high-level programming languages, Unix/Linux basics … observability concepts like distributed tracing and logging. Strong communication skills are essential. The position emphasizes technical growth, stakeholder influence, and a commitment to diversity and inclusion within the workplace. ...

Lead Machine Learning Engineer - REMOTE

Hiring Organisation
Lennar Homes
Location
Boston, Massachusetts, United States
Employment Type
Permanent
Salary
USD 190,700 Annual
Lead ML Engineer - REMOTE We are Lennar. Lennar is one of the nation's leading homebuilders, dedicated to making an impact and creating an extraordinary experience for their Homeowners, Communities, and Associates by building quality ...

Senior Software Engineer, Chem-Bio

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
The AI Security Institute is the world's largest and best-funded team dedicated to understanding advanced AI risks and translating that knowledge into action. We’re in the heart of the UK government with ...

Data Architect

Hiring Organisation
Jobleads-UK
Location
Manchester, England, United Kingdom
We believe in the power of ingenuity to build a positive human future. We challenge where it matters and own the outcome. As strategies, technologies, and innovation collide, we create opportunity from complexity. Our teams of interdisciplinary ...

Hybrid Cloud SRE: Kubernetes, GitOps & Observability

Hiring Organisation
Jobleads-UK
Location
United Kingdom
while embracing modern working methods, and collaborating with fellow engineers. The ideal candidate will have hands-on experience with Kubernetes and OpenShift, along with observability tools like Prometheus and Grafana. A supportive environment awaits where you can thrive and develop your skills. ...

Database Reliability Engineer

Hiring Organisation
Jobleads-UK
Location
Southampton, England, United Kingdom
tune performance across hundreds of instances. Architect Cross‐Cloud Portability: use CNPG and cloud‐native patterns to keep our database layer provider‐agnostic. Evolve Observability & Monitoring: build proactive monitoring and alerting to detect regressions before they affect customers. Support Replication & Mobility: enable data streaming and zero‐downtime migration strategies … provision infrastructure, avoiding manual implementations. Distributed Systems enthusiast: enjoy the challenge of multi‐tenant, multi‐region, multi‐cloud scenarios with rigorous data integrity. Security & Observability mindset: build deep observability (Prometheus/Grafana/OpenTelemetry/Humio) and guardrails for secure operation. Engineering via code: deliver backend services in Java with ...

Database Reliability Engineer

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
tune performance across hundreds of instances. Architect Cross‐Cloud Portability: use CNPG and cloud‐native patterns to keep our database layer provider‐agnostic. Evolve Observability & Monitoring: build proactive monitoring and alerting to detect regressions before they affect customers. Support Replication & Mobility: enable data streaming and zero‐downtime migration strategies … provision infrastructure, avoiding manual implementations. Distributed Systems enthusiast: enjoy the challenge of multi‐tenant, multi‐region, multi‐cloud scenarios with rigorous data integrity. Security & Observability mindset: build deep observability (Prometheus/Grafana/OpenTelemetry/Humio) and guardrails for secure operation. Engineering via code: deliver backend services in Java with ...

Principal Machine Learning Engineer

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
data engineering teams to implement scalable data lakehouse oriented feature architectures and enterprise‐grade ML governance. Champion engineering standards for model quality, documentation, observability, and platform resilience. Feature Engineering & Data Architecture Architect highly scalable, production‐ready feature pipelines within Lakehouse environments. Set the technical direction for fallback and resilience strategies … including scoring metrics, latency, error analytics, and SLOs. Partner with platform teams to optimise cost, scale, and reliability of inference endpoints. Monitoring, Drift Detection & Observability Define observability standards for feature drift, concept drift, performance degradation, and data integrity. Lead the creation of dashboards, benchmarks, and automated alerting across ...

Site Reliability Engineer

Hiring Organisation
Jobleads-UK
Location
Manchester, England, United Kingdom
Site Reliability Engineer, you will enhance system reliability, observability and performance through a strong engineering approach and assist with incident resolution and best practices. You will have software engineering skills, focusing on system reliability and observability. You will monitor the health, performance and availability of critical systems, directly impacting operational … maintainability. You will also help engineer tools and automation for effective service management. Collaboration is key, working across multiple functions to integrate reliability and observability best practices into the software development life cycle. By supporting governance standards set by the central teams, you will foster a culture where these principles ...

Manager – Site Reliability Engineering

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
minimise business disruption and improve operational efficiency. Risk Management & Compliance: Ensure compliance with regulatory standards and internal governance. Proactively identify and mitigate operational risks. Metrics & Observability: Establish and maintain robust observability practices, employing metrics, logging, and tracing to drive data-driven decisions and improve system health. Out of hours support/… toil, and eliminating manual operational tasks. Excellent communication and stakeholder management skills, particularly under pressure. Expertise in automation (Python, Shell, PowerShell etc.) Familiarity with observability tools and practices (metrics, logging, tracing). Ability to lead capacity planning and scalability strategies to support growth. Knowledge of clearing and settlement processes ...

Principal AI Architect

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
experimenting with cutting‐edge technologies. Preferred Requirements Advanced Integration - Experience integrating Salesforce with external agents via APIs and open standards (MCP, A2A). Governance & Observability - Familiarity with prompt governance, observability, monitoring frameworks, responsible AI and compliance best practices Cross‐Platform Background - Background in cross‐platform integrations (e.g., Hyperscaler SDKs ...

Senior SRE: AI-Driven Observability & Automation

Hiring Organisation
Jobleads-UK
Location
Greater London, England, United Kingdom
leading consulting firm in London is looking for an experienced Site Reliability Engineer to enhance IT operations through observability and automation. Key responsibilities include architecting observability platforms, implementing SRE best practices, and driving AI-based automation initiatives. Ideal candidates will have over 12 years of experience in IT operations, strong ...

Senior Software Engineer (Node.js / TypeScript / AWS)

Hiring Organisation
Adria Solutions
Location
Manchester, North West, United Kingdom
Employment Type
Permanent
Salary
£80,000
build scalable backend services and cloud infrastructure Architect event-driven and distributed systems on AWS Develop APIs, microservices and internal tooling Improve reliability, observability and developer workflows Conduct load testing and performance optimisation Contribute to frontend applications where required About You You are a senior engineer with deep backend … driven architectures and high-concurrency systems Infrastructure as Code experience (Pulumi, Terraform or similar) Strong understanding of databases, caching and performance optimisation Experience with observability, monitoring and alerting Comfortable working across the stack when required Strong Linux, Docker and Git knowledge Not the Right Fit If Your experience is primarily ...

Senior Java Engineer

Hiring Organisation
Burns Sheehan
Location
Manchester, England, United Kingdom
engineering teams whilst owning the end-to-end delivery of projects and you’ll be involved across the requirements and architecture through to deployment, observability and long-term reliability. This position plays a key role in driving engineering excellence, modernising architecture, and ensuring the integration of secure, resilient practices across … performing engineering teams while owning the end-to-end delivery of mission-critical software. This includes everything from requirements and architecture through to deployment, observability, and long-term reliability. You’ll also play a key role in driving engineering excellence, modernising architecture, and embedding secure, resilient practices across systems. What ...

IT Service Performance & Reliability Manager

Hiring Organisation
Spectrum It Recruitment Limited
Location
New Milton, Hampshire, South East, United Kingdom
Employment Type
Permanent
Salary
£60,000
across critical IT services. This role focuses on keeping customer-facing services fast, reliable, and fully observable, while driving continuous improvement. You will lead observability across services, ensuring effective monitoring and actionable insights. You'll manage capacity and performance through forecasting and trend analysis, identifying risks early and driving improvements. … performance in IT environments Hands-on experience with AWS and Azure Strong knowledge of ITIL v3/v4 (certification required) Experience with monitoring/observability tools (e.g. Zabbix, Grafana, Kibana, OpenSearch) Knowledge of Windows and Linux server environments Scripting skills (e.g. Python, PowerShell, Node.js) Experience integrating data via APIs, webhooks ...

AI Platform Engineer

Hiring Organisation
Wave Group
Location
London Area, United Kingdom
handles thousands of concurrent users Agentic orchestration on serverless AWS — Building Lambda + Step Functions infrastructure for autonomous workflows; managing state, cost, and reliability Observability and guardrails — Implementing monitoring that catches when autonomous agents fail; tracking decision quality, tool-use patterns, and failure modes Platform reliability — Designing systems with … developer experience Nice to have: Experience with multi-modal systems (video, audio, or image processing) AWS Bedrock, SageMaker, or similar managed AI services Observability frameworks for LLM systems (Langfuse, Arize, LangSmith) RAG pipelines, vector databases, or retrieval-augmented systems CI/CD automation; infrastructure-as-code (Terraform, CloudFormation) Why This ...

Lead Software Engineer

Hiring Organisation
5V Video
Location
City of London, London, United Kingdom
+ AWS (Lambda, API Gateway, S3, DynamoDB) Handling event-driven architectures (Kafka, SNS/SQS, etc.) Driving system design decisions across distributed systems Improving observability, reliability, and performance in production Debugging complex issues and leading resolution across teams Staying hands-on while setting technical direction and standards Tech Stack Python … Lambda, API Gateway, S3, DynamoDB, IAM) Event-driven systems (Kafka, SNS/SQS) CI/CD (Concourse, Git workflows) Databases (Postgres, DynamoDB, Couchbase) Observability (Prometheus, Grafana, CloudWatch) What You’ll Bring Strong backend engineering experience (Python preferred) Proven experience building distributed systems at scale Deep understanding of microservices + event ...