Site Reliability Engineer
Technical Skills:
- Strong experience as a Senior Site Reliability Engineer, Reliability Engineer, or Platform Engineer operating at the L7 level.
- Deep expertise in application monitoring, observability, alerting, incident management, and production reliability.
- Hands-on experience assessing, selecting, and implementing monitoring and observability tools, frameworks, and integration approaches.
- Strong understanding of SRE principles including SLIs, SLOs, error budgets, and resilience engineering.
- Experience designing and operating highly available, fault-tolerant, multi-region systems.
- Advanced skills in capacity planning, load modeling, and traffic forecasting.
- Deep expertise in metrics, logs, traces, and event-based telemetry.
Process Skills:
- Assess current monitoring, alerting, and incident management mechanisms to identify gaps and improvement opportunities.
- Define and implement an end-to-end application monitoring and observability model that is consistent across all stages of the SDLC.
- Identify risks related to reliability, performance, availability, and operational readiness and recommend mitigation strategies.
- Establish SRE best practices including proactive alerting, error budgets, operational runbooks, and reliability metrics.
- Articulate expected operational benefits such as greater system stability, faster incident resolution, reduced operational risk, and a better customer experience.