Actions, MLflow, ZenML, or similar). Deep understanding of containerisation and orchestration tools (Docker, Kubernetes). Desirable: Experience deploying AI inference engines (vLLM, Ray Serve, Triton). Familiarity with observability tools for LLMs (TruLens, Helicone, LangSmith). Understanding of AI safety and reliability frameworks (Guardrails AI). This is an exciting opportunity to help define the infrastructure powering the next …
Strong experience with AWS (VPCs, EC2, ECS/EKS, RDS, S3, etc.). Solid understanding of database systems (Postgres, SQL Server). IaC mastery (Terraform, CloudFormation, Ansible). Passion for monitoring and observability (Grafana, Elastic, PagerDuty, etc.). Familiarity with configuration management tools (Puppet, etc.). Git, Docker, and scripting skills (Bash or similar). A collaborative mindset and the ability to communicate technical concepts clearly …
London (City of London), South East England, United Kingdom
Amber Labs
Actions, MLflow, ZenML, or similar). Deep understanding of containerisation and orchestration tools (Docker, Kubernetes). Desirable: Experience deploying AI inference engines (vLLM, Ray Serve, Triton). Familiarity with observability tools for LLMs (TruLens, Helicone, LangSmith). Understanding of AI safety and reliability frameworks (Guardrails AI). This is an exciting opportunity to help define the infrastructure powering the next …
level test solutions and automation frameworks using Python, Terraform, and modern cloud-native practices. Contribute to the platform's CI/CD pipeline by integrating automated testing, resilience checks, and observability hooks at every stage. Lead initiatives that drive testability, platform resilience, and validation as code across all layers of the ML platform stack. Collaborate with engineering, MLOps, and infrastructure teams …
London, South East, England, United Kingdom Hybrid / WFH Options
Salt Search
non-functional requirements. Deep understanding of microservices architecture, cloud-native applications, and API development. Experience with distributed systems: managing workloads at scale using modern practices for availability, performance, and observability. Knowledge and experience with Test-Driven Development (TDD) and automated testing frameworks. Excellent collaboration and communication skills, with a track record of working effectively in Agile environments. Familiarity with public …
tools. Strong hands-on technical skills in automation (Jenkins, GitLab, Docker, Kubernetes, shell scripting, etc.). Strong hands-on technical skills in automation, infrastructure as code, logging, monitoring and observability, infrastructure configuration, scripting languages and applications. Experience working in a cloud ecosystem: building and deploying workloads on cloud, preferably AWS. Associate Certification (preferable) in AWS Development and/or Architect or …
City of London, London, United Kingdom Hybrid / WFH Options
TreasurySpring
queuing technologies, e.g. RabbitMQ. Experience of REST and/or GraphQL APIs. Knowledge of the core AWS services, e.g. EC2/ECS, RDS, S3. Experience using Datadog or similar observability tools. Knowledge of containerisation: Docker, Kubernetes, AWS Fargate, etc. Any experience of front-end or full-stack development using TypeScript & React. Experience building software for financial services and/or investment …
and budgeting. 3) Platform Development & Process Management: Co-develop with the systematic team and IT a scalable data/quant architecture (data pipelines, model services, APIs) and embed SRE practices (observability, resilience, cost efficiency). Lead automation of portfolio alignment and sustainability reporting; maintain production health, troubleshoot incidents, and drive continuous improvement. 4) Client & Stakeholder Delivery: Address day-to-day queries …
City of London, London, United Kingdom Hybrid / WFH Options
Hargreaves Lansdown
Figma and Vercel is a big plus. Capable of writing clean, maintainable and well-tested code. Comfortable working in on-prem and cloud-native environments with an interest in observability, using tools like Prometheus and Grafana to keep services healthy and maintainable. Familiarity with AWS services and how to integrate them into modern applications. A keen focus on quality and …
services. About You: Solid experience building and deploying services with Java and Spring Boot. Comfort working in a cloud-native environment: Kubernetes (EKS), containers, scaling, etc. An interest in observability, using tools like Prometheus and Grafana to keep services healthy and understand usage patterns. Familiarity with AWS services and how to integrate them into modern applications. A keen focus on …
scalability and reduce manual intervention. Operational Security, SRE & Assurance: Ensure security platforms are resilient, continuously monitored, and designed for 24x7 support and incident response readiness. Embed security telemetry and observability to enable proactive threat detection and automated response. Apply SRE principles to improve reliability, performance, and maintainability of security services. Lead platform health, patching automation, and vulnerability remediation workflows. Define …
innovation, LSEG is a place where everyone can grow, develop and fulfil their potential with meaningful careers. ROLE SUMMARY: LSEG is seeking a skilled and forward-thinking Telemetry and Observability Engineer to join our growing Network Engineering function. This role is critical to defining and implementing our observability strategy, ensuring our systems are measurable, reliable, and continuously improving. You will … work at the intersection of software engineering, SRE, and platform operations, helping teams across LSEG gain actionable insights from telemetry data and maximise the value of our observability tooling. WHAT YOU'LL BE DOING: Design and implement scalable telemetry pipelines for metrics, logs, traces, and events across distributed systems. Develop and maintain observability standards, NMS tooling, dashboards, alerting frameworks, and … SLOs in collaboration with product and platform teams. Champion best practices in instrumentation, monitoring, and incident response across engineering teams. Integrate and optimise observability tools (e.g., OpenTelemetry, Prometheus, Grafana, Splunk, Elastic) within the NPS ecosystem. Collaborate cross-functionally to ensure observability is embedded into the SDLC and CI/CD pipelines. Drive adoption of observability platforms through enablement, documentation …
London, England, United Kingdom Hybrid / WFH Options
Client Server
running production workloads on Kubernetes (Amazon EKS preferred). You have a good knowledge of DevOps practices including CI/CD, IaC (Terraform) and container orchestration. You have experience with observability tooling. You have a solid understanding of secure coding and deployment practices. You're collaborative and pragmatic with great communication skills. What's in it for you: Salary to £100k …
City of London, London, United Kingdom Hybrid / WFH Options
Digital Skills ltd
using AWS Organisations/Control Tower/Landing Zones. Strong knowledge of Linux/Unix systems administration. Expert-level experience in AWS Networking/TCP/Firewalls/Certs. Observability champion with experience of designing and building monitoring and logging tools such as CloudWatch, ELK, and Grafana. Strong scripting skills in Bash, JavaScript or similar. Knowledge of SecDevOps security best practices …
City of London, London, United Kingdom Hybrid / WFH Options
Hunter Bond
Are excited to learn more about financial markets and trading systems. Bonus experience: Ruby, Spark, Trino, Kafka; financial markets exposure; SQL (Postgres, Oracle); cloud-native deployments (AWS, Docker, Kubernetes); observability tools (Splunk, Prometheus, Grafana). Why Apply? This is a fantastic opportunity to join a high-performance engineering team in a business that invests heavily in technology and talent. You’ll …