evaluate and implement new technologies, and oversee their integration. Collaborate with external vendors and partners to ensure high-quality service delivery. Utilise and develop monitoring systems (e.g., Zabbix, Grafana, Prometheus) and oversee client reporting systems. Skills and Qualifications 5-7+ years' experience in a technical leadership role within a 24/7 broadcast, network operations centre (NOC), or Master
concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL). Strong backend fundamentals around concurrency, caching, indexing and distributed systems trade-offs. Proven track record of setting SLOs, building dashboards (Prometheus/Grafana, OpenTelemetry, etc.) and tuning alerts. Comfort with Kubernetes, IaC and cloud-native patterns; can debug from network to application layer. Start-up bias for action: you prioritise high
such as IBM Netcool, Moogsoft, BigPanda, PagerDuty, ServiceNow AIOps. Proficiency in Python, and hands-on knowledge of Ansible Automation Platform. Other highly valued skills include: Knowledge of Observability Platforms: Prometheus, Grafana, ELK, Splunk. Experience with integration into ITSM platforms such as ServiceNow. Experience with Kafka. You may be assessed on the key critical skills relevant for success in role, such
through previous experience within the financial services sector. Desirable Skills: Experience with the .NET ecosystem. Scripting skills: Unix, RegEx, PowerShell. Prior experience of working with: Nagios, Splunk, ELK stack, Grafana, Prometheus; Bitbucket, Git, Octopus; MSMQ, Kafka, IBM MQ; Automate Enterprise (Help Systems); Salerio (COR Financials); SWIFT Message Types. Personal Attributes: Strong analytical and problem-solving skills with ability to assess risk
and instruments, with broad asset class understanding, through previous experience within the financial services sector. Experience with the .NET ecosystem. Prior experience of working with: Nagios, Splunk, ELK stack, Grafana, Prometheus; Bitbucket, Git, Octopus; MSMQ, Kafka, IBM MQ; Automate Enterprise (Help Systems); Salerio (COR Financials); SWIFT Message Types. Personal Attributes: Strong analytical and problem-solving skills with ability to assess risk
GitHub Actions) Define and enforce platform standards across environments (dev, staging, prod) Collaborate with developers and DevOps on deployment tooling and security Enable platform observability using tools like Datadog, Prometheus, and CloudWatch Maintain Helm charts and Terraform modules for shared infrastructure Contribute to onboarding documentation and platform adoption practices Participate in incident response and postmortem analysis, where applicable Essential Skills … and secure image management Scripting or programming experience in Bash, Python, or TypeScript Strong understanding of GitOps practices and infrastructure lifecycle management Desirable Skills Experience with observability tooling (Datadog, Prometheus, Fluent Bit) Knowledge of admission controllers, OPA/Gatekeeper (optional for governance) Familiarity with cloud cost optimisation and Kubernetes scaling strategies Exposure to security scanning tools (tfsec, Trivy, Snyk) Interest
such as Python, Bash or Shell Develop and implement CI/CD pipelines for application deployment on Kubernetes Monitor the health of the platform and applications using tools like Prometheus, Grafana or ELK stack Assist with capacity planning and load testing of the platform and applications Develop and enforce best practices for building container-based applications Troubleshoot issues within the … Experience with Azure cloud platform Experience with Infrastructure as Code (IaC) tools like Terraform Familiarity with CI/CD tools like Argo CD, Jenkins, etc. Experience with monitoring tools like Prometheus, Grafana, ELK stack, etc. Strong scripting skills (Python, Bash, etc.) Ability to troubleshoot complex networking issues BS degree in Computer Science, Engineering or a related field Additional requirements Work experience
when solving complex problems Good systems design Ability to learn new tech quickly Enjoyable to work with TECHNOLOGY STACK Python, PostgreSQL, FastAPI, Redis, TypeScript, React, Next.js, Tailwind, AWS, Kubernetes, Prometheus, Pinecone, GPT-4 EXAMPLE PROJECTS Use an LLM to identify references to other sections in the text of the law Improve and migrate our data model for the content we
Manchester, Lancashire, England, United Kingdom Hybrid / WFH Options
Lorien
modern technologies, with clear progression routes available. Key Requirements: Strong troubleshooting and fault-resolution experience across infrastructure and applications Hands-on experience with monitoring tools such as Instana, Splunk, Prometheus, Grafana, or SolarWinds Confident supporting both Windows and Linux operating systems Experience working in ITIL-aligned support environments Understanding of web hosting technologies (DNS, HTTP/S, SSL Certs, and
on our Azure DevOps journey! Experience of CI/CD pipelines such as Jenkins, Git, Nexus, Maven, Terraform, Docker, Kubernetes, Harness. Experience of continuous monitoring such as Dynatrace, Splunk, Prometheus, Kibana. Solid coaching expertise, working alongside our feature teams to assist them in understanding a DevOps approach to enable them to contribute themselves. Common programming and scripting languages and frameworks … Advanced Kubernetes skills, including managing ingress controllers, implementing service mesh solutions (like Istio or Linkerd), and handling stateful applications. Experience implementing observability patterns across distributed systems using tools like Prometheus, Grafana, and distributed tracing solutions. Hands-on experience with infrastructure automation using Terraform specifically for Azure resources. Ability to drive architectural decisions related to CI/CD and infrastructure Knowledge
compatible streaming platform for all tax-relevant financial events. Infrastructure: Cloud platforms (AWS), containerization (Docker, Kubernetes), and Infrastructure as Code (Terraform). Observability: Modern monitoring and observability tools include Prometheus, Grafana, and Datadog. Must-Haves: 5+ years of professional software engineering experience, with a proven track record of shipping and operating complex, large-scale systems in production. Deep, hands-on … and Protocol Buffers (Protobuf). Proficiency with AWS, containerization (Docker, Kubernetes), Infrastructure as Code (Terraform), and CI/CD pipelines (e.g., Jenkins). Experience with modern observability tools like Prometheus, Grafana, and distributed tracing systems. Prior experience in FinTech, RegTech, or another highly regulated industry with familiarity with financial data or compliance systems. How We Take Care of You: Competitive
years of experience with containerization and orchestration (Docker + Kubernetes) and confidence operating cloud infrastructures Front-end development experience a plus DevOps skills, especially leveraging open source tools (Kibana, Prometheus, Grafana) a plus Sound understanding of agile software development best practices including CI/CD, testing, monitoring, alerting and documentation Being Cloud agnostic means not being able to use any … managed Kubernetes service and therefore building our own Kubernetes; experience with only managed Kubernetes would not be applicable for the role. Kubernetes experience on at least one cloud. Prometheus stack (Grafana, Prometheus, Alertmanager). Kubernetes upgrade and maintenance experience. Any logging infrastructure experience. Terraform. Ansible. Shell/Python scripting. GitLab pipelines (or any other CI/CD). Desirable experience: Kubernetes security Kubernetes
Site Reliability Engineer, you will be responsible for designing, developing, and maintaining systems and applications using Golang. You will monitor and optimise system performance with tools such as Grafana, Prometheus, New Relic, and Splunk. Your role will involve identifying and resolving reliability issues, automating processes, and ensuring the seamless operation of the platform. If you have a passion for technology
make a move? Get in touch and apply today! Responsibilities: Respond rapidly to critical AWS incidents, identify root causes, and deploy automated hotfixes. Lead the setup and integration of a Prometheus-Grafana observability stack. Refactor and modernize deployment pipelines using GitHub Actions and Kubernetes. Maintain robust monitoring, alerting, and CI/CD systems. Skills/Must have: Strong hands-on experience … with AWS (e.g. EC2, EKS, CloudWatch, Lambda). Background in incident, change, and problem management; comfortable with on-call rotations. Expertise in Prometheus, Grafana, and Splunk; solid knowledge of PromQL. Proficient in scripting/programming (Python, Go, Bash, SQL). Salary: £500 per day
highly available systems within a technologically diverse stack used for global research and trading of FICCO and Cryptoassets. Leveraging technologies such as Terraform, Docker, Kubernetes, CI/CD, Python, Prometheus and Grafana, you will develop repeatable and supportable infrastructure to meet the demanding needs of our business. What you'll do in this role: Collaborate closely with the US Platform … Skills, Experience & Abilities: Proven experience in supporting mission critical, high performance trading infrastructure across various technology stacks. Experience deploying and supporting applications in Kubernetes Previous infrastructure monitoring experience using Prometheus and Grafana Previous experience maintaining and optimizing cloud infrastructure in AWS environments Experience performing database and database infrastructure support for highly available systems Working knowledge of TLS Demonstrated knowledge of
operating infrastructure on AWS and other providers Operating MongoDB (or other document database) clusters Operating Redis (or other key-value storage) clusters Administering Linux servers Maintaining distributed software Operating Prometheus and Grafana Operating logging collection and analysis systems Participating in the on-call rotation (04:00 - 16:00 UTC) Skills: Kubernetes & containers (advanced) AWS/EKS (advanced) Linux (advanced) Terraform … and IaC in general (proficient) Helm (proficient) Go and/or Python (familiar) MongoDB (or similar) Redis (or similar) Monitoring - Prometheus, Grafana, Thanos (familiar) Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.) Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP) Proactive, energetic, innovative and change oriented Nice to have: GCP or Azure Bare metal infrastructure
but is not limited to: Architecting, building, and operating the core cloud-native infrastructure for WunderGraph Cosmo, primarily using Go and Kubernetes. Owning and evolving our observability stack (OpenTelemetry, Prometheus, ClickHouse) and the infrastructure supporting our AI-driven features to ensure deep, actionable insights into our systems. Building and optimizing CI/CD pipelines to improve build times, automate quality … architecture, distributed systems, and the challenges of running high-performance API gateways. Familiarity with GraphQL Federation is a significant plus. Experience building or managing modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, ClickHouse). A self-starter attitude and a leader's mindset: you are comfortable with ambiguity, can identify and solve ill-defined problems, and don't need hand-holding.
new functionality Maintaining and evolving our cloud infrastructure (GCP, Kubernetes) to ensure high availability, security, and performance Managing service observability and reliability, including logging, metrics and alerting (we use Prometheus and Grafana) Handling database and service upgrades (e.g. MySQL, Kubernetes), secrets management and security best practices Taking ownership of platform-level concerns such as deployment pipelines, configuration management, and cost … best practices across infrastructure and applications, including secrets management and credential rotation. Familiarity with infrastructure-as-code or automation tools is a plus Experience with observability tools (such as Prometheus and Grafana), service monitoring, and debugging in production environments A demonstrated interest in staying up-to-date with new technology, new frameworks, new languages and other developments like AI. A
orchestration. Support multi-tenancy and environment rationalization to reduce duplication and inefficiency. Define and implement observability standards, including logging, metrics, tracing, and alerting. Use tools like New Relic, Prometheus, and Grafana, alongside building custom instrumentation for key platform services. Drive incident readiness and operational resilience by enabling actionable monitoring and alerting. Drive cloud cost visibility and optimization efforts across … and operating developer platforms and enablement frameworks. Experience with cloud-native technologies, Kubernetes, and Infrastructure as Code (Terraform, Helm, etc.). Strong understanding of observability tooling (especially New Relic, Prometheus, Grafana) and incident response best practices. Familiarity with FinOps, platform cost tracking, and infrastructure efficiency techniques. Excellent communication, leadership, and stakeholder management skills. Attract, hire, and develop talented platform engineers
This is an office-based role; you must be able to commute to and work in the City of London as a norm. About Us: Archax is an FCA-regulated exchange, broker and custodian for digital assets, targeted at professional
Position Summary We are looking for an experienced Systems Engineer with strong Linux and Kubernetes experience to join our Group Engineering - Systems team. You will help design, build and operate modern infrastructure platforms that support continually evolving applications and services.
About us: We are Orbital, an AI company on a mission to automate the legal segment of every property transaction in the world. We iterate rapidly to build products that utilise bleeding-edge AI models. Products that are powered