Real Time data, designing systems that can elastically scale to handle surges in throughput and demand. Hands-on experience with modern technologies such as Kubernetes, Kafka, RocksDB, MongoDB, MemSQL, Prometheus, Tempo, and Snowflake is highly desirable. Exposure to cloud-native tooling and practices, with an emphasis on DevOps, cloud computing, Kubernetes, and stream processing is a strong advantage. Comfortable working
projects and other activities as required. Experience and Skills Essential Experience and demonstrable knowledge of SRE best practices Expert in Git and GitOps Expert in logging and monitoring solutions (Prometheus, Grafana, etc.) Demonstrable knowledge of Cloud Expert knowledge of Kubernetes Proficient ability to communicate in English (Written and Verbal) Understanding of non-functional testing Significant DevOps experience Desirable Proven ability
concurrent users (e.g., multi-tenant PostgreSQL, sharded MySQL). Strong backend fundamentals around concurrency, caching, indexing and distributed systems trade-offs. Proven track record of setting SLOs, building dashboards (Prometheus/Grafana, OpenTelemetry, etc.) and tuning alerts. Comfort with Kubernetes, IaC and cloud-native patterns; can debug from network to application layer. Start-up bias for action: you prioritise high
to work effectively with internal teams and customer-facing stakeholders. Technologies we use Golang AWS, CDK (TypeScript), Lambda, SQS, EventBridge, RDS, DynamoDB, OpenSearch GitHub, GitHub Actions Loki, Tempo, Grafana, Prometheus Event-driven architecture and domain-driven design How we reward our team Dynamic working environment with a diverse and driven team Huge opportunity for learning in a high growth environment
such as Python, Bash or Shell Develop and implement CI/CD pipelines for application deployment on Kubernetes Monitor the health of the platform and applications using tools like Prometheus, Grafana or ELK stack Assist with capacity planning and load testing of the platform and applications Develop and enforce best practices for building container-based applications Troubleshoot issues within the … Experience with Azure cloud platform Experience with Infrastructure as Code (IaC) tools like Terraform Familiarity with CI/CD tools like Argo CD, Jenkins, etc. Experience with monitoring tools like Prometheus, Grafana, ELK stack, etc. Strong scripting skills (Python, Bash, etc.) Ability to troubleshoot complex networking issues BS degree in Computer Science, Engineering or a related field Additional requirements Work experience
reporting. Develop and implement TOC strategy, staffing models, and documentation standards. Participate in systems architecture, new tech evaluation, and vendor selection. Manage operational workflows, reporting systems (e.g., Zabbix, Grafana, Prometheus), and support international broadcast teams. Collaborate with leadership on technical direction and TOC transformation. Skills/Must Have: 5-7+ years in a technical leadership role within a TOC
evaluate and implement new technologies, and oversee their integration. Collaborate with external vendors and partners to ensure high-quality service delivery. Utilise and develop monitoring systems (e.g., Zabbix, Grafana, Prometheus) and oversee client reporting systems. Skills and Qualifications 5-7+ years' experience in a technical leadership role within a 24/7 broadcast, network operations centre (NOC), or Master
data. • Technically sound experience of Unix environments. • Good understanding of Networking principles. Desirable • Proficient in scripting and automation (preferably Shell and Python). • Familiarity with monitoring tools (e.g. Grafana, Prometheus, Elastic). Diversity & Inclusion Nomura is an equal opportunity employer. We value diversity and are committed to creating an inclusive environment for all our employees. We do not discriminate on
identified and progressed to resolution. Responsible for generating, developing and curating high quality Networks focused reports and KPIs using the local reporting systems of the platforms including Grafana and Prometheus Who we are The UK's fastest broadband network. The nation's best-loved mobile brand. And, one of the UK's biggest companies too. Diverse, high performing teams - jam
on autonomy, embrace rapid iteration, and see feedback as fuel rather than friction What will make us extra happy? Practical experience with logging and metrics systems like InfluxDB, ClickHouse, Prometheus, or Vector Familiarity with Deno, V8 isolates, or other edge/serverless runtimes. Experience building DX-focused tools, CLIs, or SDKs Background in multi-tenant SaaS, usage-based billing, or data privacy across
years of experience with containerization and orchestration (Docker + Kubernetes) and confidence operating cloud infrastructures Front-end development experience a plus DevOps skills, especially leveraging open source tools (Kibana, Prometheus, Grafana) a plus Sound understanding of agile software development best practices including CI/CD, testing, monitoring, alerting and documentation Being Cloud agnostic means not being able to use any … managed Kubernetes service, so you build your own Kubernetes - experience with only managed Kubernetes would not be applicable for the role Kubernetes experience on at least one cloud Prometheus stack (Grafana, Prometheus, Alertmanager) Kubernetes upgrade and maintenance experience Any logging infrastructure experience Terraform Ansible Shell/Python Scripting GitLab pipelines (or any other CI/CD) Desirable experience: Kubernetes security Kubernetes
Site Reliability Engineer, you will be responsible for designing, developing, and maintaining systems and applications using Golang. You will monitor and optimise system performance with tools such as Grafana, Prometheus, New Relic, and Splunk. Your role will involve identifying and resolving reliability issues, automating processes, and ensuring the seamless operation of the platform. If you have a passion for technology
a strong sense of ownership, and determination. Openness to constructive feedback and value the ideas and opinions of others. Our technologies Cloud Provider: Amazon AWS Monitoring & Logging: ELK (EFK), Prometheus, Grafana Why joining Smartcat might be your best move so far Fully remote team We are a global team of 200+ enthusiastic people spread across 30+ countries. We have been
make a move? Get in touch and apply today! Responsibilities: Respond rapidly to critical AWS incidents, identify root causes, and deploy automated hotfixes. Lead the setup and integration of Prometheus-Grafana observability stack. Refactor and modernize deployment pipelines using GitHub Actions and Kubernetes. Maintain robust monitoring, alerting, and CI/CD systems. Skills/Must have: Strong hands-on experience … with AWS (e.g., EC2, EKS, CloudWatch, Lambda). Background in incident, change, and problem management; comfortable with on-call rotations. Expertise in Prometheus, Grafana, and Splunk; solid knowledge of PromQL. Proficient in scripting/programming (Python, Go, Bash, SQL). Salary: £500 per day
highly available systems within a technologically diverse stack used for global research and trading of FICCO and Cryptoassets. Leveraging technologies such as Terraform, Docker, Kubernetes, CI/CD, Python, Prometheus and Grafana, you will develop repeatable and supportable infrastructure to meet the demanding needs of our business. What you'll do in this role: Collaborate closely with the US Platform … Skills, Experience & Abilities: Proven experience in supporting mission critical, high performance trading infrastructure across various technology stacks. Experience deploying and supporting applications in Kubernetes Previous infrastructure monitoring experience using Prometheus and Grafana Previous experience maintaining and optimizing cloud infrastructure in AWS environments Experience performing database and database infrastructure support for highly available systems Working knowledge of TLS Demonstrated knowledge of
operating infrastructure on AWS and other providers Operating MongoDB (or other document database) clusters Operating Redis (or other key-value storage) clusters Administering Linux servers Maintaining distributed software Operating Prometheus and Grafana Operating logging collection and analysis systems Participating in the on-call rotation (04:00 - 16:00 UTC) Skills: Kubernetes & containers (advanced) AWS/EKS (advanced) Linux (advanced) Terraform … and IaC in general (proficient) Helm (proficient) Go and/or Python (familiar) MongoDB (or similar) Redis (or similar) Monitoring - Prometheus, Grafana, Thanos (familiar) Grasp of networking concepts (subnets, routing, peering, load balancing, NAT, etc.) Common networking protocols (DNS, TCP/IP, HTTP, TLS, UDP) Proactive, energetic, innovative and change oriented Nice to have: GCP or Azure Bare metal infrastructure
but is not limited to: Architecting, building, and operating the core cloud-native infrastructure for WunderGraph Cosmo, primarily using Go and Kubernetes. Owning and evolving our observability stack (OpenTelemetry, Prometheus, ClickHouse) and the infrastructure supporting our AI-driven features to ensure deep, actionable insights into our systems. Building and optimizing CI/CD pipelines to improve build times, automate quality … architecture, distributed systems, and the challenges of running high-performance API gateways. Familiarity with GraphQL Federation is a significant plus. Experience building or managing modern observability stacks (e.g., OpenTelemetry, Prometheus, Grafana, ClickHouse). A self-starter attitude and a leader's mindset: you are comfortable with ambiguity, can identify and solve ill-defined problems, and don't need hand-holding.
This is an office-based role; you must be able to commute to and work in the City of London as a norm. About Us Archax is an FCA-regulated exchange, broker and custodian for digital assets, targeted at professional
LSEG (London Stock Exchange Group) is more than a diversified global financial markets infrastructure and data business. We are dedicated, open-access partners with a dedication to excellence in delivering the services our customers expect from us. With extensive experience
Create the future of travel with us Whether it's to visit the people closest to us, starting an exciting adventure, or a career-defining business trip, travel is an essential part of our lives. Yet we've all experienced
Infra Automation/DevOps functions, adhering to a strict Infrastructure as Code (IaC) mindset. Additionally, a strong understanding of infrastructure metric collection and visualisation tooling, such as Kibana, Splunk, Prometheus, and Grafana, is highly desirable. If you possess the following skills and experience, we encourage you to apply for this role: Experience in monitoring and automating low-latency network infrastructure … CI/CD, and GitOps to minimise manual interventions. Expertise in defining SLOs, conducting incident reviews, and enhancing observability and alerting for improved reliability. Familiarity with tools like Corvil, Prometheus, Grafana, Kibana, and Splunk to analyse trends and forecast capacity. Collaborative mindset, able to work closely with engineering, DevOps, and research teams to deliver highly resilient and high-performance infrastructure.
About us We are Orbital, an AI company on a mission to automate the legal segment of every property transaction in the world. We iterate rapidly to build products that utilise bleeding-edge AI models. Products that are powered
We are Orbital, an AI company on a mission to automate the legal segment of every property transaction in the world. We iterate rapidly to build products that utilise bleeding-edge AI models. Products that are powered by the