Site Reliability Engineer (SRE)
We are seeking a Site Reliability Engineer (SRE) to design, build, and maintain highly available, resilient, and scalable systems. You will collaborate closely with engineering, product, and operations teams to ensure our Java/Spring Boot applications run smoothly 24/7 in a cloud environment. Additionally, you will drive the adoption of analytics and data-driven insights to optimize system performance and extract value from operational data.
Key Responsibilities
- Reliability & Scalability: Design, implement, and maintain systems that are robust, scalable, and highly available, supporting millions of daily transactions.
- Cloud Migration: Lead and support the migration of applications and infrastructure to public cloud platforms, following best practices for security, reliability, and cost management.
- Automation & Infrastructure as Code: Develop and maintain automation scripts and infrastructure as code, using Terraform to provision infrastructure and Kubernetes to orchestrate containerized workloads.
- Monitoring & Incident Response: Build and enhance monitoring, alerting, and observability solutions. Respond to incidents, perform root cause analysis, and drive continuous improvement.
- Collaboration: Partner with software engineers, product managers, and business stakeholders to deliver solutions that meet business needs and operational requirements.
- Analytics & Data Insights: Use cloud-based analytics tools to monitor system health, optimize performance, and extract actionable insights from operational data.
- Continuous Improvement: Identify opportunities to improve the reliability, efficiency, and scalability of the platform, and implement the resulting changes.
Required Qualifications
- Proven experience as a Site Reliability Engineer or DevOps Engineer, or in a similar role, supporting large-scale, mission-critical systems.
- Strong hands-on experience with Kubernetes and Terraform.
- Experience deploying and operating applications in public cloud environments (AWS, Azure, or GCP).
- Solid understanding of Java and Spring Boot applications.
- Experience with monitoring, logging, and observability tools (such as Prometheus, Grafana, ELK, or Splunk).
- Strong troubleshooting and problem-solving skills.
- Excellent communication and collaboration skills.
Preferred Qualifications
- Experience in financial services or payments/transaction processing environments.
- Familiarity with cloud-based analytics platforms and data engineering concepts.
- Experience with CI/CD pipelines and automation tools (such as Jenkins or GitHub Actions).
- Knowledge of security best practices in cloud environments.