Platform Engineer - Observability

Key Responsibilities:

Observability Platform Implementation:

Deliver the implementation of the observability platform based on Grafana Mimir, Loki, Tempo, Grafana Alloy, and Grafana Enterprise tooling.
Design and implement highly available observability services across multiple co-location and production sites.
Configure telemetry ingestion pipelines for metrics, logs, and future distributed tracing workloads.
Develop and maintain observability architecture documentation, high-level designs, low-level designs, and operational runbooks.
Define platform standards for telemetry collection, labelling, metadata enrichment, retention policies, and data governance.
Implement multi-tenant observability controls and tenant isolation strategies.
Configure and maintain object-storage-backed telemetry platforms for long-term retention and scalability.

Telemetry Collection & Integration:

Deploy and manage Grafana Alloy collectors across Kubernetes clusters, Linux hosts, network infrastructure, storage platforms, and hardware management systems.
Integrate telemetry from Kubernetes, GPU infrastructure, HPE hardware, storage platforms, network devices, and cloud-native services.
Develop and maintain observability integrations using OpenTelemetry standards and protocols.
Establish onboarding processes for new platforms, applications, and infrastructure services.
Collaborate with application teams to define observability requirements and future tracing adoption strategies.

Alerting & Operational Insights:

Design and implement alerting frameworks using recording rules, Alertmanager, and operational best practices.
Develop operational dashboards and service health views for infrastructure, platform, and application services.
Support integration of observability events with ITSM and incident-management platforms.
Define SLIs, SLOs, alert thresholds, and operational KPIs.
Continuously improve platform observability, incident detection, and root-cause analysis capabilities.

Reliability & Automation:

Implement Infrastructure-as-Code and GitOps practices for observability platform deployment and configuration management.
Develop automation for dashboard provisioning, alert deployment, tenant onboarding, and telemetry configuration.
Design and validate disaster recovery, resilience, and failover capabilities across observability services.
Contribute to platform security, compliance, and operational governance initiatives.
Work with operational teams to ensure observability services remain reliable, scalable, and maintainable.

Required Experience & Skills:

Significant experience implementing and operating enterprise observability or monitoring platforms.
Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles.
Experience with Grafana ecosystem technologies including Grafana, Prometheus, Grafana Mimir, Grafana Loki, Grafana Tempo, and Grafana Alloy.
Experience designing Kubernetes-native solutions and operating distributed platforms at scale.
Knowledge of Linux systems administration and cloud-native infrastructure.
Experience implementing Infrastructure-as-Code and GitOps approaches (preferably including Ansible).
Skilled in developing automation and operational tooling using Python and/or Go.
Previous exposure to creating technical architecture, operational documentation, and deployment designs.
Experience with object storage technologies and distributed data platforms.
Strong understanding of monitoring, alerting, and operational event management.
