SRE), you will play a key role in ensuring the reliability, scalability, and efficiency of our clients' platforms. Your focus will include building strong observability practices, aligning with the SRE mindset & principles, and driving continuous improvement. This will involve: Defining and implementing Service Level Indicators (SLIs) and Service Level Objectives … support, depending on client requirements. Key expectations from this role include: As a Consultant: Lead site reliability engineering initiatives with a strong emphasis on observability, ensuring high performance and reliability of applications & infrastructure. Provide strategic insights to shape the overall SRE strategy while collaborating on the design and implementation of … solutions. Establish effective monitoring, alerting and incident response strategies to maintain system availability and promote continuous improvement by collaborating with team members to deliver observability best practices and SRE methodologies. As part of your role you will also have the opportunity to contribute to the business and your own personal …
implement reusable cloud-uptime components as code. Regularly review and optimise SRE practices, tools, and methodologies to enhance overall system reliability and team efficiency. Observability and Automation: Contribute to the design, implementation, and maintenance of observability and monitoring solutions to track the platform health, its cost-effectiveness, the reliability, and …
Manchester, Lancashire, United Kingdom Hybrid / WFH Options
Embarcaderomediagroup
of our engineering operations, bringing together SRE principles and modern platform engineering practices. This includes combining principles of SRE - such as service-level reliability, observability, incident response - with platform engineering practices like GitOps, Infrastructure as Code, DevSecOps automation, and self-service enablement, to help development teams ship faster, safer, and … more cost-efficiently. What you'll be doing: Designing and operating highly reliable, scalable, and secure Azure-based platforms Applying SRE principles like SLOs, observability, and incident management to drive service reliability Building Infrastructure as Code using Terraform (v1.7+) and GitOps workflows Enabling teams through platform tools, reusable Terraform modules … DB, etc.) Strong Infrastructure as Code skills with Terraform (v1.7+) Experience with CI/CD pipelines, GitOps, and automation tools (PowerShell, Bash) Familiarity with observability and incident tools like Datadog, ELK, and synthetic monitoring Solid understanding of networking (TCP/IP, Load Balancing, DNS, Routing) Good knowledge of DevSecOps practices …
Platform for high-throughput, low-latency workloads. Implement infrastructure-as-code (Terraform/Bicep) and automated release workflows that enable true continuous delivery. Drive observability: log aggregation, metrics, distributed tracing and on-call runbooks. Champion security, cost-efficiency and performance tuning across our services. Collaborate with product and platform teams … Excellent communication skills and a track record of cross-team collaboration. Nice to have: Kubernetes expertise (GKE/AKS/EKS) and container-native observability stacks (Prometheus/Grafana). NoSQL experience (Firestore, Cosmos DB, DynamoDB, MongoDB). Experience with game-backend scales, real-time services or hybrid cloud/… PostgreSQL, MS SQL Server, Redis. Messaging: Pub/Sub, RabbitMQ, Azure Service Bus. Infra & Ops: Docker, Kubernetes, Terraform/Bicep, GitHub Actions, Cloud Build. Observability: OpenTelemetry, Grafana, Elastic. …
practices Collaborate with multiple product teams and respective owners to design infrastructure as we scale Building custom metrics and features to enhance Primer's observability Infrastructure as code (IaC) development Writing processes and documentation for system design, troubleshooting and maintenance What are we looking for? Strong experience with a cloud … best practices and the ability to implement security controls at the infrastructure level Experience with monitoring and logging tools like Datadog or Grafana's observability stack (Prometheus, Tempo, Loki, Grafana) Familiarity with the open standard OpenTelemetry Excellent written and verbal communication skills, we're a collaborative team! PLEASE NOTE: Our …
Batch, and Spring Cloud Gateway. Champion cloud-native solutions using AWS, Kubernetes (EKS), Terraform, and GitLab CI/CD pipelines. Embed security, resilience, and observability into all parts of the platform architecture. Oversee the platform's end-to-end performance, ensuring high availability and robust monitoring. Design and implement auto … and GraphQL is a plus. Proven delivery of secure, scalable web apps with backend-for-frontend architecture and CDN integration. Proficiency with monitoring and observability tools such as Prometheus, Grafana, and OpenSearch. Deep understanding of CI/CD practices, GitLab pipelines, infrastructure as code, and centralized monitoring. Track record of …
CI/CD-driven platform represents and enables the entire application and analysis lifecycle including interactive development and explorations (notebooks), large-scale batch processing, observability and production application deployments. The optimization team's focus is on maximizing scale and performance of all aspects of the platforms. A Director of AIML … one interpreted and one compiled common industry programming language: e.g., Python, C/C++, Scala, Java, including toolchains for documentation, testing, and operations/observability Hands-on experience with application performance tuning and optimization, including in parallel and distributed computing paradigms and communication libraries such as MPI, OpenMP, Gloo, including …
Locations: Canary Wharf Boston Who We Are Boston Consulting Group partners with leaders in business and society to tackle their most important challenges and capture their greatest opportunities. BCG was the pioneer in business strategy when it was founded in …
Salary banding: £90,000 - £110,000 dependent on experience Working pattern: 1-2 days per week in office Location: London About our Engineering Team As a business which has AI at its core, we need to have a reliable, scalable …
Platform Engineer Observability A leading trading platform provider in London is looking for a Principal Platform Engineer with an excellent understanding of Python and architectural design principles to help reorganise their SRE and Observability function. Consuming huge amounts of data each day, this Fintech company allows traders to monitor their … years in Python (or Golang) in a DevOps or SRE capacity. Strong Linux experience Understanding of Kubernetes, Public Cloud, Prometheus, Grafana, Telemetry and general Observability Experience with GitLab, Bitbucket and CI (GitHub/CI/Bamboo) Willingness to engage in technical discussion and commit to producing high-quality code Enthusiasm …
Senior Product Manager - Agentic AI & Observability Solutions ITRS Group is a private-equity backed, global leader in real-time monitoring & observability solutions for financial services, ensuring mission-critical systems remain resilient, secure, and optimized. Our clients are the most important and sophisticated investment banks, securities exchanges, hedge funds, and fintechs … in the world. As we expand into AI-driven observability and automation, we are looking for a Senior Product Manager to lead the development of agentic AI capabilities that integrate IT telemetry data, AIOps, and self-healing capabilities into our core platform - Geneos and ITRS Analytics. Scope of Role We … are looking for a Senior Product Manager with deep expertise in AI-driven observability, IT automation, and financial services operations to drive the development and adoption of our next-generation AI-powered monitoring and automation platform. This role will focus on defining the product roadmap, collaborating with engineering teams to …
customer journeys, helping to provide technical oversight, ongoing knowledge transfer and enablement. Use cases extend across all the Elastic Solutions such as Enterprise Search, Observability and Security. We would be open to candidates that specialise in one area or all three. Technical Skills: Hands-on experience and an understanding of …
or C++. Endpoint security, network, and other system extensions. Systems Analytics; dynamic tracing and performance analysis tools such as Instruments, VTune, DTrace, and eBPF. Observability technologies, logging, and metrics. Security principles including PKI, certificates, and cryptography. Proficiency in English. Right to work in the UK preferable. …
and device driver development for Windows, Linux, or Mac. Systems Analytics; Dynamic tracing and performance analysis tools such as Instruments, VTune, DTrace, and eBPF. Observability technologies, logging, and metrics. Security principles including PKI, certificates, and cryptography. …
preferable Strong background in building reliable, scalable batch and streaming pipelines using Spark (ideally Scala), Python and SQL Hands-on experience implementing data quality, observability, and lineage systems across distributed environments Proven leadership in technical design, platform adoption, and mentoring engineers on best practices The role can be worked remotely …
City Of London, England, United Kingdom Hybrid / WFH Options
Harrington Starr
using Git and Python. Implement Infrastructure as Code practices using Terraform. Manage containerised environments with Docker or Kubernetes. Collaborate with dev teams to improve observability, deployment processes, and platform reliability. Build observability and monitoring solutions using Grafana, integrating key metrics to support proactive platform operations. Create and enforce internal standards …
types of large-scale, high reliability systems. Building web UIs, CLIs or APIs for use by other engineers. Building infrastructure or platform automation, or observability or release tools. Demonstrable skill in effective software testing. Strong commitment to code quality, and the value of giving and receiving feedback through code reviews. …
improve these skills: Generic Skills: Technical communication, cross-functional collaboration, performance reviews, managing up. Engineering Skills: Python, Data Structures, Machine Learning, LLM fine-tuning, observability, large-scale ML deployments, code quality. Data Skills: Experimentation, measurement framework and metric design, data analysis and data manipulation. Services Offered Career coaching, Interview coaching …
AI interactions Balance sophisticated technical approaches with practical business requirements Infrastructure & Scalability Design and architect our AI/ML platform Implement AI/ML observability systems that ensure reliability and performance Develop reproducible training and deployment pipelines that scale with our business Establish data-centric AI practices that leverage our …
Cloud and Data Center, Security, Security and Observability Job Id: Meet the Team The Cloud Security Product Management team within SBG is responsible for the Security Service Edge/Secure Access Service Edge (SSE/SASE) portfolio across all market segments. This is an outstanding opportunity to work with an …
ML infrastructure: model deployment, training pipelines, inference tooling. Diagnose and optimise performance of large-scale ML models. Build and maintain experiment tracking, monitoring, and observability systems. Collaborate with SWE and infra colleagues to build tooling for data access, cleaning, and delivery. Contribute to the internal “toolbox” enabling repeatable, scalable ML …