Platform Engineer - Observability

Key Responsibilities:

Observability Platform Implementation:

  • Deliver the implementation of the observability platform based on Grafana Mimir, Loki, Tempo, Grafana Alloy, and Grafana Enterprise tooling.
  • Design and implement highly available observability services across multiple co-location and production sites.
  • Configure telemetry ingestion pipelines for metrics, logs, and future distributed tracing workloads.
  • Develop and maintain observability architecture documentation, high-level designs, low-level designs, and operational runbooks.
  • Define platform standards for telemetry collection, labelling, metadata enrichment, retention policies, and data governance.
  • Implement multi-tenant observability controls and tenant isolation strategies.
  • Configure and maintain object-storage-backed telemetry platforms for long-term retention and scalability.

Telemetry Collection & Integration:

  • Deploy and manage Grafana Alloy collectors across Kubernetes clusters, Linux hosts, network infrastructure, storage platforms, and hardware management systems.
  • Integrate telemetry from Kubernetes, GPU infrastructure, HPE hardware, storage platforms, network devices, and cloud-native services.
  • Develop and maintain observability integrations using OpenTelemetry standards and protocols.
  • Establish onboarding processes for new platforms, applications, and infrastructure services.
  • Collaborate with application teams to define observability requirements and future tracing adoption strategies.

Alerting & Operational Insights:

  • Design and implement alerting frameworks using recording rules, Alertmanager, and operational best practices.
  • Develop operational dashboards and service health views for infrastructure, platform, and application services.
  • Support integration of observability events with ITSM and incident-management platforms.
  • Define SLIs, SLOs, alert thresholds, and operational KPIs.
  • Continuously improve platform observability, incident detection, and root-cause analysis capabilities.

Reliability & Automation:

  • Implement Infrastructure-as-Code and GitOps practices for observability platform deployment and configuration management.
  • Develop automation for dashboard provisioning, alert deployment, tenant onboarding, and telemetry configuration.
  • Design and validate disaster recovery, resilience, and failover capabilities across observability services.
  • Contribute to platform security, compliance, and operational governance initiatives.
  • Work with operational teams to ensure observability services remain reliable, scalable, and maintainable.

Required Experience & Skills:

  • Significant experience implementing and operating enterprise observability or monitoring platforms.
  • Strong understanding of metrics, logs, traces, OpenTelemetry, and modern observability principles.
  • Experience with Grafana ecosystem technologies including Grafana, Prometheus, Grafana Mimir, Grafana Loki, Grafana Tempo, and Grafana Alloy.
  • Experience designing Kubernetes-native solutions and operating distributed platforms at scale.
  • Knowledge of Linux systems administration and cloud-native infrastructure.
  • Experience implementing Infrastructure-as-Code and GitOps approaches (preferably including Ansible).
  • Skilled in developing automation and operational tooling using Python and/or Go.
  • Previous experience producing technical architecture documents, operational documentation, and deployment designs.
  • Experience with object storage technologies and distributed data platforms.
  • Strong understanding of monitoring, alerting, and operational event management.

Job Details

Company
Swisstech Recruitment
Location
United Kingdom
Employment Type
Contract
Salary
GBP 500 per day
Posted