Site Reliability Engineer - SC Cleared
Excellent opportunity for a Site Reliability Engineer to join our Cloud Infrastructure & Security services practice. Cognizant Infrastructure Services provides IT infrastructure and cloud services for clients across industry verticals, spanning both Consulting/Professional and Managed Services. Its offerings cover Enterprise Computing, Cloud Services, Security Services, DevOps, Data Centres, End User Computing, Service Desk, Network Services and Environment Management Services.
Candidates should be SC cleared.
Key Responsibilities:
- Build CI/CD you can trust: Design, implement, and operate pipelines in GitHub Actions and Jenkins that deliver zero‑touch, repeatable releases with quality gates, automated tests, and policy‑as‑code controls. Containerise services with Docker and standardise build images.
- Provision everything as code: Model cloud resources using Terraform (workspaces, modules, registries, drift detection), enabling composable, reviewed changes across environments.
- Run scalable compute: Stand up and operate container platforms, including Kubernetes (EKS, AKS, GKE), ECS, and Azure Container Instances (ACI). This covers cluster lifecycle, node pools, autoscaling, ingress, service mesh, secrets, and backup/restore.
- Observability: Instrument services and infra with New Relic, Grafana (incl. Loki/Tempo where applicable) and cloud‑native telemetry. Define SLIs/SLOs, and build actionable dashboards, alerts, and runbooks that reduce MTTR.
- Engineer for reliability & cost: Apply SRE practices (error budgets, change management, resilience testing), right‑size resources, and use cloud provider tooling for security/cost posture.
- Incident response & on‑call: Participate in a fair, documented on‑call rota; lead and/or contribute to incident handling, comms, post‑incident reviews, and corrective actions.
- Security & compliance by design: Embed IAM least‑privilege, secrets management, image/provenance scanning, and guardrails into pipelines and Terraform modules.
Key Skills and Experience:
- Proven experience operating production systems on a major cloud (AWS/Azure/GCP) with solid cloud fundamentals (networking, IAM, storage, compute, HA/DR).
- Hands‑on IaC with Terraform (modules, state, CI validation, policy checks).
- Strong CI/CD skills in GitHub Actions and/or Jenkins (runners/agents, reusable workflows, secrets, matrix builds, artefact management).
- Containers & orchestration: Kubernetes administration knowledge (controllers, scheduling, ingress, autoscaling, troubleshooting) and experience with EKS/AKS/GKE and/or ECS/ACI.
- Observability: Practical use of New Relic and Grafana to define metrics/traces/logs, tune alerts, and drive SLOs.
- Scripting & automation: Proficiency in Python and Bash; experience with boto3 or equivalent SDKs.
- Incident management: Exposure to production incidents, on‑call participation, and post‑incident review practices.
- Clear communication, stakeholder partnership, and a bias to automate, document, and simplify.