SRE – Data Platforms

Own production on-call responsibilities, including incident response, mitigation, and post-mortem analysis.
Troubleshoot complex system failures across distributed Linux/Unix environments.
Design, deploy, and operate containerized applications in production infrastructure.
Build and maintain highly available, scalable distributed services.
Write, test, and release production-quality code in Python, Go, or similar languages.
Improve observability using monitoring, logging, and alerting practices.
Automate operational workflows to reduce manual intervention and MTTR.
Collaborate with engineering teams to improve reliability, performance, and release readiness.
Perform capacity planning, performance tuning, and resilience testing.
Drive continuous improvements in reliability, operational excellence, and system stability.

Job Details