position. Hands-on experience with incident response, including designing and improving incident management processes. Expertise in observability practices, including metrics, logs, and traces, and an understanding of distributed tracing tools (e.g., OpenTelemetry). Strong problem-solving skills with a focus on building resilient, fault-tolerant systems. Excellent communication skills and a collaborative mindset. Must have SEC+ or higher certification or ability …
Hawthorne, California, United States Hybrid / WFH Options
GCR Professional Services
in-the-loop (HIL) testing environments. Improve monitoring, logging, and debugging capabilities for embedded applications. Manage containerization and virtualization of embedded development environments using tools like Kubernetes, Grafana, and OpenTelemetry. Research and implement best practices for security, performance, and scalability. Automate software releases and version control strategies for embedded firmware. Skills and/or Experience Needed: MS or BS in …
Bash, etc.) Strong problem-solving and analytical abilities. Excellent communication and teamwork skills. Eagerness to learn and adapt in a fast-paced trading environment. Desirable: Experience with metrics & monitoring, OpenTelemetry, Splunk, Prometheus, Grafana, etc. Experience and knowledge of working with distributed systems. Experience with Kubernetes. Knowledge of networking (HTTP/TCP/UDP/IP). Experience in financial markets. …
/sub messaging frameworks (e.g., ActiveMQ, ZeroMQ). Familiarity with integrating software applications as a suite of independent, small, and modular services (microservices, OSGi). Experience with system monitoring frameworks (Prometheus, OTel, InfluxDB/Telegraf, etc.). 3 years minimum experience with IP network protocols and development of distributed or networked applications. 3rd party and subcontract staffing agencies are not eligible for partnership …
The world can't wait. You Have: 7+ years of experience measuring service SLIs using custom metrics, logs, and traces and tools such as Prometheus, Grafana, or OpenTelemetry. 7+ years of experience developing Infrastructure as Code (IaC) in Terraform. 7+ years of experience scripting or coding in Python, Go, or Bash. 7+ years of experience designing SLIs, SLOs …
experience in technical integrations and POCs. Comfortable coding in any high-level programming language (Java, Go, Python). Strong hands-on knowledge of Kubernetes, AWS, Azure, GCP, Docker, Prometheus, and OpenTelemetry. Industry knowledge and opinions on Monitoring, Observability, Log Management, SIEM. Engineering/DevOps Background - advantage. Experience in Technical Sales of Log Analytics/Monitoring/APM/SIEM - advantage. Cultural …
capability. Preferred Education, Experience, & Skills: Splunk Certified Cloud Architect, Splunk Certified Admin, or Splunk Certified Power User. Cloud certifications (AWS/Azure/GCP). Experience with observability frameworks (OpenTelemetry), metrics pipelines, and metric-to-log correlation. Prior experience operating at enterprise scale (multi-TB ingestion/day, global deployments). Proficiency with Terraform, Ansible, or similar IaC tools and …
London, South East, England, United Kingdom Hybrid / WFH Options
Morela
of ITSM/incident management processes and tools (Halo ITSM, ServiceNow, Jira Service Management). Cloud experience (AWS, Azure, GCP) and deploying observability tools in cloud-native environments. Understanding of OpenTelemetry and modern observability standards. Strong problem-solving skills and ability to work in a fast-paced start-up or consulting environment. Why Join: Work with our exclusive client, a high …
Handsworth, Yorkshire and the Humber, United Kingdom
Wipro
Preferred Qualifications: OpenShift certifications (e.g., Red Hat Certified Specialist in OpenShift Administration). Experience with multi-cluster and hybrid cloud OpenShift deployments. Familiarity with monitoring and logging tools (e.g., OTel, Grafana, Splunk stack). Knowledge of OpenShift Operators and Helm charts. Experience with large-scale migration projects. About WIPRO: Wipro is an exciting organization to work for. We ranked as …
LangSmith. Experience in prompt engineering and generative AI integration. Skills & Tools. Programming: Java, Python, Spring Boot, REST APIs. Cloud: AWS (Glue, Kinesis, EMR, Route 53), containerization, serverless architecture. Monitoring: OpenTelemetry, Dynatrace, LoadRunner, Splunk. Collaboration: Jira, Confluence. Testing: Unit, integration, functional, performance. Agile: SAFe and Agile methodologies. Qualifications. Education: Bachelor's Level Degree (Required). The future is what you make it …
serverless architectures. Deep understanding of CI/CD (GitHub Actions, Jenkins, or AWS CodePipeline). Proven ability to secure and scale production systems. Monitoring and observability tools (CloudWatch, Grafana, OpenTelemetry). Familiar with data exchange formats (JSON, YAML, Parquet) and API design. Leadership & Delivery: 4-8 years in software development and/or DevOps, including 2+ in a management or …
technical experience in Cloud DevOps, SaaS, or observability, with 5+ years in leadership roles. Strong hands-on experience with AWS, GCP, Azure, K8S, Terraform, and observability tools: Prometheus, Grafana, OpenTelemetry, ELK, Splunk, Datadog, and similar. Proficiency with metrics, logs, traces, and APM. Leadership & Global Operations: Proven success leading multi-regional or global technical teams with direct management of managers. Demonstrated …
architectures. Deep understanding of CI/CD (GitHub Actions, Jenkins, or AWS CodePipeline). Proven ability to secure and scale production systems. Monitoring and observability tools (CloudWatch, Grafana, OpenTelemetry). Familiar with data exchange formats (JSON, YAML, Parquet) and API design. Leadership & Delivery: 4-8 years in software development and/or DevOps, including 2+ in a management or team …
Go-based, making it the most effective language for this role. Experience with, or strong interest in, observability tools (Prometheus, Grafana, Loki, Tempo, ELK/OpenSearch, ClickHouse) and standards (OpenTelemetry, OpenTracing, OpenMetrics). Deep understanding of distributed systems and data models. Hands-on experience with Kubernetes and cloud platforms (AWS, GCP, Azure). Benefits: Roku is committed to offering a …
that's building something exceptional. Tech Snapshot (don't worry if you don't know it all): Kotlin, TypeScript, Terraform, Azure/AWS/GCP, Temporal, Postgres, graph databases, OpenTelemetry, Grafana, containerised dev environments, CI/CD pipelines. Perks & Culture: Competitive salary + EMI share options. Breakfast and dinner on tap, plus snacks that raise the bar. Regular socials + …
release validation, and production monitoring. Strong communication skills; can adapt output to technical and non-technical audiences. Bonus Points: Background in QA, test automation, or release engineering. Experience with OpenTelemetry, distributed tracing, or event-driven logs. Experience in continuous delivery environments with real-time observability needs. Prior involvement in incident reviews or quality postmortems. Relevant certifications (e.g., Data Analytics, SQL …
tools like Evidently AI or Alibi Detect to identify shifts in data distributions and trigger alerts or retraining. Logging and Tracing: Set up centralized logging with ELK Stack or OpenTelemetry to capture AI inference events, errors, and audit trails for debugging and compliance. Pipeline Automation: Develop CI/CD pipelines with GitHub Actions or Jenkins to automate model updates, testing … plus. Expertise in containerization (Docker, Kubernetes) and CI/CD tools (GitHub Actions, Jenkins). Knowledge of time-series databases (e.g., InfluxDB, TimescaleDB) and logging frameworks (e.g., ELK Stack, OpenTelemetry). Experience with drift detection tools (e.g., Evidently AI, Alibi Detect) and visualization libraries (e.g., Plotly, Seaborn). AI-Specific Skills: Understanding of model performance metrics (e.g., precision, recall, AUC …
We are seeking a skilled Artificial Intelligence Integration Engineer to join our team and ensure the seamless deployment, monitoring, and optimization of AI models in production. The Artificial Intelligence Integration Engineer will design, implement, and maintain end-to-end machine …
AppDynamics. Provide training and coaching to IT teams on effective use of AppDynamics. Monitoring Strategy & Continuous Improvement: Maintain working knowledge of competing APM/observability solutions (e.g., Splunk, Datadog, LogicMonitor, OpenTelemetry). Evaluate tools and industry trends to recommend improvements to Medline's monitoring strategy. Support integrations with Twilio and Power Automate for alerting workflows. Basic Qualifications: 3+ years managing a …
know as soon as possible. What you'll be doing: We are looking for a Site Reliability Engineer to join the team and play a key role in scaling OpenTelemetry, driving service health, deep observability, and high availability across our entire technology infrastructure. You will have strong software engineering skills (ideally in TypeScript and Rust) and a deep understanding of … working across infrastructure and application layers, and you will lead by example in everything from SLOs and SLIs to post-incident reviews. What You Will Be Doing: Observability and OpenTelemetry: Own and evolve our observability strategy across services. Lead how we collect, process, sample, and surface trace and metrics data using OpenTelemetry. Focus on high-signal telemetry that enables fast … test/deploy automation. Proven ability to lead incident response and post-incident review processes. Strong problem-solving mindset and attention to detail. Desirable skills: Knowledge and understanding of OpenTelemetry tools, specification, APIs, etc. Some experience in Rust or a similar compiled language, e.g. Go. Experience instrumenting and running OpenTelemetry in production at scale. Knowledge of distributed tracing and trace sampling …