systems design and share responsibility with them in diagnosing, resolving, and preventing production issues. What We Value Experience with monitoring systems using tools like Prometheus and writing health checks Interest in learning and managing technologies like Spark, Hadoop, Elasticsearch, and Cassandra Familiarity with deploying GPUs Moderate experience with TCP/…
Search, Discovery & Insights, Company Profiles, Workflow & Efficiency, and many more. Our stack Java 17/21, Spring Boot (MVC, JDBC, Security). Postgres, Docker, Prometheus, K8s, Elastic. Team Stream Development Lead, 2 BE, 1 FE, 1 SDET As a qualified expert, You will Help architect, design, and develop complex, large…
Solid understanding of DevOps practices, such as monitoring, logging, advanced security practices, and incident response procedures. Experience with metrics and log management tools like Prometheus, Grafana, Elasticsearch, and CloudWatch, along with configuring proactive alerting systems for effective monitoring and incident response. Excellent collaboration skills to work effectively with cross-functional…
Knowledge of CI/CD best practices and tools (e.g. GitHub Actions, Jenkins, CodePipeline) Exposure to monitoring and observability tools for ML systems (e.g. Prometheus, Grafana, DataDog, WhyLabs, Evidently, etc.) Experience in building parallelised or distributed model inference pipelines Nice-to-Have Skills Familiarity with feature stores and model registries…
Greater London, England, United Kingdom Hybrid / WFH Options
Focus on SAP
/ELB, Network ACLs, Security Groups, KMS and S3, to meet performance, security and compliance requirements. Monitoring & Observability: Implement application and infrastructure monitoring with Prometheus & Grafana; manage centralized logging with the ELK stack. Web & Reverse Proxy: Configure and tune Nginx for traffic management, load‑balancing and SSL termination. Performance & Reliability … Security Scanning: Skilled in GitLab CI/CD or Jenkins pipelines, integrating tools such as Blackduck, Checkmarx and SonarQube. Monitoring & Logging: Hands‑on with Prometheus, Grafana and ELK for telemetry, alerting and log management. Scripting & Development: Development and testing experience in Python and/or JavaScript for automation tasks and…
London, South East England, United Kingdom Hybrid / WFH Options
Focus on SAP
/ELB, Network ACLs, Security Groups, KMS and S3, to meet performance, security and compliance requirements. Monitoring & Observability: Implement application and infrastructure monitoring with Prometheus & Grafana; manage centralized logging with the ELK stack. Web & Reverse Proxy: Configure and tune Nginx for traffic management, load‑balancing and SSL termination. Performance & Reliability … Security Scanning: Skilled in GitLab CI/CD or Jenkins pipelines, integrating tools such as Blackduck, Checkmarx and SonarQube. Monitoring & Logging: Hands‑on with Prometheus, Grafana and ELK for telemetry, alerting and log management. Scripting & Development: Development and testing experience in Python and/or JavaScript for automation tasks and…
City of London, London, United Kingdom Hybrid / WFH Options
Sanderson Recruitment
root cause analysis programming experience Kubernetes and Docker Deploy and release services experience Experience with Greenfield projects ideally 6+ years relevant experience Grafana/Prometheus ideal Strong communication skills with the ability to proactively engage with a wide range of stakeholders If this sounds of interest to you, please ring…
as a software engineer. Over 5 years in data engineering and pipeline development in high-volume production environments. Experience with monitoring systems such as Prometheus, Grafana, Zabbix, or Datadog. Experience in fintech or trading industries. Strong object-oriented development skills and software engineering fundamentals. Hands-on experience with cloud data…
verbal communication skills Ability to work well on a team as well as independently What will make you stand out: Experience using Splunk, Grafana, Prometheus and other observability tools Experience using Kubernetes to deploy and maintain systems Experience using Jsonnet or other templating tools to render complex YAML/JSON…
driven architectures. Deep understanding of data processing, analytics, and real-time event streaming. Expertise in PostgreSQL, AWS and Kubernetes. Proficiency in monitoring tools like Prometheus, Grafana, and Kibana. Knowledge of security best practices, including OAuth, JWT, and data encryption. Fluent in English with strong communication and collaboration skills. Preferred Qualifications…
in one of the programming languages and paradigms - our systems are written in TypeScript, Java, Golang, Rust, Python and others Desirable Experience with Kubernetes, Prometheus, Terraform, NoSQL or GCP Perks of joining us: Company pension contributions at 5%. Individualised training budget for you to learn on the job and…
internal developer experience and CI/CD pipelines. Monitor and optimize: Ensure the monitoring, observability, and performance of deployed AI features using tools like Prometheus, OpenTelemetry, or DataDog. Best practices advocacy: Promote software engineering best practices and actively participate in architectural decisions. Collaboration features: Collaborate with the design and product…
CDP/LLDP) and network engineering, management, and operations. Experience with search and analytics engines/big data tools (OpenSearch, Kafka, Kibana, Telegraf, InfluxDB, Prometheus). Our Preferred Qualifications for this role: Basic understanding of AI and ML algorithms, including model training, testing, and deployment. Hands-on project experience in…
Terraform or CloudFormation, and manage resources for optimal performance. Monitor, troubleshoot, and resolve incidents, optimizing systems to ensure reliability and minimize downtime. Implement monitoring (Prometheus, Grafana, Datadog) and set up alerting systems to proactively address issues and ensure scalability. Work with DevOps, engineering, and security teams to improve application deployment … networking services. Proficiency in using Terraform, CloudFormation, Ansible, or similar tools for automating infrastructure. Strong experience in monitoring and incident response using tools like Prometheus, Grafana, and ELK Stack. Strong scripting skills in Python, Bash, Go, or Ruby for automating tasks and building custom tools. Experience with CI/CD…
At SAP, we enable you to bring out your best. Our company culture is focused on collaboration and a shared passion to help the world run better. How? We focus every day on building the foundation for tomorrow and creating…
Solace PubSub+ messaging Strong knowledge of production support Good understanding of WAN, networking and latency etc. Solid knowledge of tools such as Grafana and Prometheus etc. DevOps tooling experience would be ideal Proficiency in troubleshooting message delivery, persistence, and topic routing etc. Good Linux/Unix knowledge Excellent communication skills…
engineering team. Nice to Have Experience with Agile development methodologies (Scrum, Kanban). Knowledge of observability practices (logging, metrics, tracing) and monitoring tools (e.g. Prometheus, Grafana). Understanding of cloud security best practices, including IAM policies and secret management. Why You'll Love Working With Us We know that when…
on building repeatable and cost-efficient infrastructure Experience building solutions for problems with no answers on Google Experience working with monitoring solutions in the Prometheus ecosystem: Grafana, Loki, Tempo, VictoriaMetrics Experience managing multi-cluster, multi-cloud Kubernetes deployments Familiarity with incident management Nice to have: Familiarity with GitOps, e.g. Flux…
and reliability across production and non-production environments. This includes working on incident response, capacity planning, WAN optimization, and system observability using tools like Prometheus and Grafana. Key Responsibilities: Administer and maintain Solace PubSub+ appliances and software brokers across environments (on-prem and cloud). Provide production support for … messaging-related incidents, including root cause analysis and resolution. Monitor system performance and health using Prometheus and Grafana; proactively identify and address anomalies. Configure and optimize Solace across WAN environments, ensuring low-latency, secure, and reliable messaging. Collaborate with development and application support teams to troubleshoot message flow issues and … in a 24x7 enterprise environment. Experience working with distributed systems over WAN, with an understanding of networking, latency, and failover strategies. Solid experience with Prometheus and Grafana for system monitoring and alerting. Proficiency in troubleshooting message delivery, persistence, and topic routing. Experience with capacity management, performance tuning, and system scaling.…
Observability Tools Some examples of observability tools include Prometheus and Grafana. Cloud Environments Familiarity with cloud environments and platforms is essential, with a preference for GCP. Other platforms include Azure, AWS, and Kubernetes. Software Engineering It is important to be familiar with software engineering ways of working and the engagement…
infrastructure. Recruit and lead a growing team of data engineers. Tech Stack Python (3.10+), Pandas, NumPy PostgreSQL (TimescaleDB), SQL optimization RabbitMQ, ZeroMQ, Linux servers Prometheus, Grafana, Zabbix Requirements 5+ years of Data Engineering experience with expertise in Python and SQL. Proven leadership experience guiding teams and projects. Strong background in…