complex data ecosystem Design flexible data ingestion and transformation pipelines for financial market data and trading systems Build and maintain AI/ML infrastructure, including model serving, evaluation, and observability frameworks Collaborate directly with clients to ensure the platform meets real-world enterprise requirements Contribute to both strategic technical direction and hands-on implementation as part of a small, high More ❯
Integrate and extend React Native functionality using native modules in Swift or Kotlin when needed. Take ownership of app store releases, ensuring smooth submission, update and maintenance processes. Manage observability and performance in production through tools like crash reporting, logging and analytics. Contribute to the architecture and tooling decisions, helping to shape the direction of our mobile stack from the
London (Chessington), South East England, United Kingdom Hybrid/Remote Options
SoTalent
Resource Management: Oversee project budgets, allocate resources, and lead the monitoring engineering team to deliver on time and within scope. What You'll Bring Strong expertise in system reliability, observability, and monitoring strategy. Deep understanding of end-to-end video processing and broadcast workflows. Proven leadership skills with experience managing engineering teams. Strategic mindset with the ability to align monitoring More ❯
team meetings and performance reviews. Motivate, guide and coach team members to hit agreed targets via formal objectives and supporting development plans. Incident & Problem Management Proactively partner with the Observability Management function to establish trending and opportunities for Customer infrastructure optimisation. Be a process manager and advocate for Problem Management, ensuring root cause analysis takes place on major Incidents and More ❯
TM Forum (eTOM/ODA) and ITIL 4. Strong delivery leadership in Agile/SAFe environments, combined with governance and risk management. Demonstrated success implementing Operational Readiness Reviews (ORR), observability, and support models. Excellent stakeholder management and ability to produce clear written artefacts for both technical and executive audiences. Education Bachelor’s or Master’s degree in Computer Science, Engineering More ❯
end Scale ingestion and indexing for 30+ blockchains, including high-throughput chains Operate a secure fleet of full nodes and indexers with clear SLAs and cost controls Set SLOs, observability, incident management, and make on call boring Build and lead six plus squads. Org design, hiring, mentoring, standards, and SDLC Partner with product, compliance, and customers to turn outcomes into More ❯
cable management, hardware lifecycle planning, and environmental monitoring. Participate in capacity planning and performance tuning to support business growth and infrastructure scalability. Reliability & Monitoring Ensure high availability, security, and observability of systems through best practices in reliability and recoverability. Develop and maintain monitoring systems to ensure compliance with service level objectives. Lead and contribute to incident response, root cause analysis More ❯
business problems at scale. What you’ll bring: Expertise in the deployment of enterprise-grade AI solutions to cloud and on-premise customer environments with a focus on availability, observability and security. Proven track record with at least one of the major cloud providers and an understanding of DevOps best practices. Hands-on experience building production-grade solutions using LLMs More ❯
best practices. It will also help you to have: Experience establishing and enforcing data governance standards through technical architecture (not just documentation). Familiarity with data cataloging, metadata management, and observability tools. A systems-thinking mindset: you understand the full data lifecycle and how to maintain integrity from source to dashboard. At Booksy, we believe in the power of well-structured
training jobs to identify their bottlenecks, e.g. using NVIDIA Nsight Systems Design and implement efficiency improvements to maximise MFU, e.g. tensor parallelism, model compilation, mixed precision Design and implement observability tools, e.g. to track MFU Collaborate closely with Research teams to integrate training efficiency improvements and create a culture of performance optimization About you In order to set you up More ❯
ensure accuracy and quality are achieved collaboratively with our third-party suppliers. Assess and manage risks associated with services and recurring problems. Work across the ecosystem to continuously improve observability capabilities such as reporting, dashboarding and alerting, which will drive robust, proactive problem management. Ensure these are communicated on a weekly basis and available for all of the team to
Ringway, Altrincham, Cheshire, England, United Kingdom
The Hut Group
potential technical risks and develop strategies to mitigate them, ensuring that the application is secure, robust and reliable Champion performance optimisation across the frontend stack while ensuring accessibility and observability are baked into all solutions Deeply committed to crafting intuitive, impactful, and optimised user experiences that turn complex workflows into seamless, engaging journeys Share your knowledge within a democratic team More ❯
platform. Define the data models, technical architecture, and platform interfaces that power intelligent, context-aware product experiences. Partner with engineering to design and deliver scalable APIs, system components, and observability layers that enable extensibility and reuse. Collaborate with AI/ML teams to integrate capabilities such as semantic search, LLM-powered assistants, personalization, and classification systems. Write PRFAQs and technical More ❯
Chelmsford, Essex, United Kingdom Hybrid/Remote Options
Brooks Automation, Inc
infrastructure and security services, ensuring operational excellence and incident response readiness. Partner with the CISO to shape long-term strategy and roadmap for secure, resilient IT services. Drive automation, observability, and scalability across the infrastructure and security stack. Serve as a key escalation point for technical troubleshooting and security event resolution. Guide vendor selection, contract negotiations, and service-level adherence More ❯
next-generation AI products. You’ll join a small, experienced team developing an internal Kubernetes-based platform that enables AI innovation across the organisation automating everything from deployments to observability, and helping developers build smarter applications with confidence. What you’ll be doing: Designing, deploying, and maintaining Azure Kubernetes (AKS) environments Managing Infrastructure as Code with Terraform and improving GitOps … workflows (ArgoCD/GitHub Actions) Building observability and monitoring stacks using Prometheus, Grafana, and Loki Supporting AI workloads (LLMs, RAG, and document processing applications) running on Kubernetes Automating platform operations with Python, Go, and shell scripting Implementing security guardrails, PII compliance tooling, and best practices for production AI systems What you’ll need: 3+ years’ experience in DevOps or Platform … Engineering Strong background in Azure and Kubernetes Hands-on experience with Terraform, CI/CD, and container orchestration Familiarity with observability tools (Prometheus, Grafana, Loki) Scripting or programming skills in Python or Go Interest in AI infrastructure, LLMOps, or large language model deployment More ❯
Wigan, Lancashire, England, United Kingdom Hybrid/Remote Options
Searchability
As part of their continued investment in reliability and platform performance, they are now seeking an experienced Site Reliability Engineer to strengthen their engineering function and help evolve their observability and automation capabilities. THE BENEFITS Hybrid working model (office and remote) Opportunity to define and lead SRE strategy within a collaborative culture Exposure to modern cloud-native and containerised environments … and performance of complex online platforms supporting high-volume transactions. Working closely with operations and product teams, you'll monitor production systems, develop automation to improve uptime, and refine observability to provide real-time insight into platform health. You'll also play a key role in performance testing, system tuning and incident management to ensure smooth operation during critical events. … SITE RELIABILITY ENGINEER ESSENTIAL SKILLS At least 2 years' experience working as an SRE Deep understanding of system reliability, scalability and performance tuning Experience with observability tools (Grafana, Prometheus, OpenTelemetry) Proficiency in a programming language such as Go or .NET for automation and debugging Hands-on experience with AWS or another major cloud platform Knowledge of Kubernetes, Terraform, and Infrastructure More ❯
Wigan, Greater Manchester, United Kingdom Hybrid/Remote Options
Searchability (UK) Ltd
As part of their continued investment in reliability and platform performance, they are now seeking an experienced Site Reliability Engineer to strengthen their engineering function and help evolve their observability and automation capabilities. THE BENEFITS Hybrid working model (office and remote) Opportunity to define and lead SRE strategy within a collaborative culture Exposure to modern cloud-native and containerised environments … and performance of complex online platforms supporting high-volume transactions. Working closely with operations and product teams, you'll monitor production systems, develop automation to improve uptime, and refine observability to provide real-time insight into platform health. You'll also play a key role in performance testing, system tuning and incident management to ensure smooth operation during critical events. … SITE RELIABILITY ENGINEER ESSENTIAL SKILLS At least 2 years' experience working as an SRE Deep understanding of system reliability, scalability and performance tuning Experience with observability tools (Grafana, Prometheus, OpenTelemetry) Proficiency in a programming language such as Go or .NET for automation and debugging Hands-on experience with AWS or another major cloud platform Knowledge of Kubernetes, Terraform, and Infrastructure More ❯
Hereford, Herefordshire, England, United Kingdom Hybrid/Remote Options
Hays Specialist Recruitment Limited
role focused on ensuring service availability, performance, and cost-efficiency across both cloud and on-prem infrastructure. You'll work closely with development and support teams to evolve infrastructure, enhance observability, and proactively mitigate reliability risks. Key Responsibilities: Collaborate with software engineers to improve reliability and performance. Automate operational tasks and reduce alert fatigue. Enhance monitoring and observability to pre-empt issues. Support development environments … protocols. Experience with cloud platforms, ideally AWS (EC2, RDS, S3, Lambda). Desirable: Coding experience in Java, Go, Python or similar. Knowledge of cross-domain technologies. Experience in service management environments. Practical application of observability patterns. Experience with Azure. Additional Information: Due to the nature of the work, successful candidates will be required to undergo security vetting. We welcome applications from all backgrounds and are committed to creating
ll design and implement database services that can be consumed on demand — secure, compliant, and self-service. Working closely with Platform, SRE, and DevOps teams, you’ll bring automation, observability, and scalability to their database layer, enabling hundreds of developers to ship faster with confidence. What You’ll Do 💾 Design, build, and operate PostgreSQL and ElasticSearch clusters for production. ⚙️ Automate … provisioning, upgrades, and HA/DR with Terraform, Ansible, Helm, and Kubernetes Operators. 🌐 Embed databases into the Internal Developer Platform through APIs, GitOps workflows, and self-service tools. 📊 Implement observability with Prometheus, Grafana, and centralized logging. 🧠 Define and maintain SLOs for uptime and performance, embedding compliance and security controls. 🤝 Collaborate with development and platform teams to refine database automation standards … of Kubernetes and stateful workloads . ✅ Proficiency with Infrastructure as Code (Terraform, Ansible, Helm). ✅ Some development experience (Python, Go, or similar) for automation and API integration. ✅ Knowledge of observability tooling – Prometheus, Grafana, ELK, or Datadog. 🎁 Bonus: experience with ElasticSearch , MySQL , or SQL Server , plus exposure to AWS , GCP , or Azure . Why This Role ✨ Greenfield impact – build database-as More ❯
in London. Working alongside software and cybersecurity engineers, you’ll help design, build, and automate a hybrid multi-cloud estate across AWS and Azure—enhancing CI/CD pipelines, observability, and developer experience. You’ll take ownership of business-critical infrastructure, shaping cloud strategy end-to-end and collaborating with global teams across the US and Europe to drive efficiency … CI/CD pipelines through tools such as Azure DevOps, GitHub Actions, or Octopus. You’ll also be adept at automating workflows in Python or PowerShell and implementing modern observability solutions including DataDog, OpenSearch, and LogicMonitor. This is a rare opportunity to join a high-performing, global hedge fund where technology and engineering directly drive investment performance and operational scale. More ❯