SRE – Data Platforms
- Own production on-call responsibilities including incident response, mitigation, and post-mortem analysis.
- Troubleshoot complex system failures across distributed Linux/Unix environments.
- Design, deploy, and operate containerized applications on production infrastructure.
- Build and maintain highly available, scalable distributed services.
- Write, test, and release production-quality code in Python, Go, or similar languages.
- Improve observability using monitoring, logging, and alerting practices.
- Automate operational workflows to reduce manual intervention and MTTR.
- Collaborate with engineering teams to improve reliability, performance, and release readiness.
- Perform capacity planning, performance tuning, and resilience testing.
- Drive continuous improvements in reliability, operational excellence, and system stability.