Site Reliability Engineer (SRE) – Job Description
Job Title: Site Reliability Engineer (SRE)
Experience: 4–8 Years
Location: Delhi NCR
Employment Type: Full-Time
Job Summary
We are looking for a Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and availability of our cloud infrastructure and applications. The ideal candidate will have strong experience in cloud platforms, Kubernetes, automation, monitoring, incident management, and DevOps practices.
Key Responsibilities
Maintain and improve the reliability, availability, and performance of production systems.
Design, implement, and manage monitoring, alerting, and observability solutions.
Manage and support Kubernetes clusters and containerized workloads.
Automate operational tasks using Infrastructure as Code (Terraform, ARM templates, Bicep, etc.).
Collaborate with development teams to improve application resilience and deployment processes.
Perform root cause analysis (RCA) for incidents and implement preventive measures.
Define and monitor SLIs, SLOs, and error budgets.
Manage CI/CD pipelines and deployment automation.
Support disaster recovery (DR), backup, and business continuity planning.
Participate in on-call support and incident response activities.
Optimize cloud infrastructure for performance, security, and cost efficiency.
Required Skills
Strong experience with Azure, AWS, or GCP.
Hands-on experience with Kubernetes (AKS/EKS/GKE).
Experience with Terraform, Infrastructure as Code, and automation.
Strong Linux and networking fundamentals.
Experience with GitLab CI/CD, Azure DevOps, or Jenkins.
Experience with monitoring and observability tools such as Prometheus, Grafana, ELK, Datadog, and Azure Monitor.
Scripting experience in Python, Bash, or PowerShell.
Knowledge of incident management, problem management, and change management processes.
Experience with databases, caching solutions, and messaging platforms is desirable.
Preferred Qualifications
Azure Administrator, Azure DevOps, Kubernetes (CKA/CKAD), or similar certifications.
Experience with microservices architecture and cloud-native technologies.
Understanding of security best practices and compliance requirements.
Nice to Have
Service Mesh (Istio/Kiali)
Kafka, Redis, MongoDB, PostgreSQL
Azure APIM, Application Gateway, WAF
Disaster Recovery and High Availability architecture
Key Metrics
Platform Availability (99.9%+)
MTTR (Mean Time to Recovery)
Incident Reduction
Deployment Success Rate
Infrastructure Automation Coverage