Datadog SME (Developer Experience)
Role Summary:
We are seeking a highly skilled Datadog Subject Matter Expert (SME) with a strong developer mindset to join our client engagement onsite in the UK. The role focuses on designing, implementing, and optimizing observability within modern application and cloud-native ecosystems, working closely with development, SRE, and platform teams.
The ideal candidate will bring deep hands-on Datadog expertise, strong coding/scripting skills, and experience embedding observability into the software development life cycle (SDLC) for large-scale, mission-critical systems.
Your responsibilities:
1. Datadog Platform Engineering:
Design and implement end-to-end observability solutions using Datadog across metrics, logs, traces, RUM, synthetics, and events.
Develop and maintain custom dashboards, service maps, and business journey views aligned to application and business KPIs.
Configure and optimize alerts, monitors, and SLOs/SLIs to ensure actionable and noise-free alerting.
Drive adoption of Datadog best practices across application and platform teams.
2. Developer Enablement & Instrumentation:
Embed Datadog APM, OpenTelemetry, and custom instrumentation into applications (Java, .NET, Python, Node.js, etc.).
Partner with developers to ensure observability-by-design during development, testing, and release cycles.
Develop reusable instrumentation libraries, tagging standards, and data models.
Support shift-left observability across CI/CD pipelines.
3. Cloud & Container Observability:
Implement Datadog monitoring for cloud platforms (AWS/Azure/GCP) including services, networking, and security signals.
Enable deep observability for containerized and Kubernetes environments (EKS/AKS/GKE, OpenShift).
Optimize telemetry ingestion, sampling, and cost management.
4. Automation & Integration:
Build automation using Python, REST APIs, Terraform, or CI/CD tools for Datadog onboarding and configuration.
Integrate Datadog with ITSM, incident management, and collaboration tools (e.g., ServiceNow, PagerDuty).
Support automated incident detection, enrichment, and diagnostics.
5. Performance & Reliability Engineering:
Perform performance analysis and root cause investigations using Datadog telemetry.
Support capacity planning, resilience testing, and reliability improvements.
Participate in post-incident reviews and drive continuous improvement actions.
6. Stakeholder & Onsite Responsibilities:
Act as the primary Datadog technical point of contact for UK-based stakeholders and client teams.
Collaborate with application owners, architects, and SRE teams in onsite workshops and design sessions.
Provide guidance, documentation, and knowledge transfer to client teams.
Essential skills/knowledge/experience:
Strong hands-on experience with Datadog (APM, Infra, Logs, RUM, Synthetics, Dashboards, Monitors).
Solid programming experience in Java, Python, .NET, Node.js, or similar languages.
Experience with OpenTelemetry and distributed tracing.
Strong understanding of cloud-native architectures and microservices.
Hands-on experience with Kubernetes and container observability.
Working knowledge of CI/CD pipelines, Git, and DevOps practices.
Experience:
6-10+ years of experience in software engineering, SRE, or observability roles.
Proven experience delivering observability solutions in large enterprise or BFSI environments.
Experience working in onsite/client-facing roles, preferably in the UK or Europe.
Soft Skills:
Strong problem-solving and analytical skills.
Ability to communicate complex technical concepts clearly to diverse stakeholders.
Comfortable working across development, operations, and business teams.
Desirable skills/knowledge/experience:
Datadog certifications.
Experience replacing or consolidating tools like AppDynamics, Dynatrace, ELK, or New Relic with Datadog.
Exposure to AIOps, event correlation, or SRE practices.
Prior BFSI client experience in the UK.