About the Role:
- We are seeking a highly skilled Senior DevOps Engineer to lead the design, implementation, and continuous improvement of our cloud infrastructure, Kubernetes platform, CI/CD pipelines, observability systems, and reliability practices. This role is critical to ensuring platform stability, scalability, security, and operational excellence across production and non-production environments. You will work closely with Engineering, Security, and Product teams to build resilient, automated, and high-performing infrastructure systems.
Key Responsibilities:
- Infrastructure & Cloud Engineering: Design, implement, and manage scalable cloud infrastructure (AWS preferred)
- Lead infrastructure-as-code initiatives (Terraform / CloudFormation)
- Improve high availability, disaster recovery, and multi-region resilience
- Optimize cloud cost and resource utilization
- Kubernetes & Container Platform: Architect and manage production-grade Kubernetes clusters
- Improve cluster reliability, auto-scaling, and performance
- Implement workload monitoring, alerting, and SLO-based reliability standards
- Enforce namespace isolation and resource governance
- CI/CD & Automation: Design and optimize CI/CD pipelines (Jenkins, ArgoCD)
- Implement zero-downtime deployment strategies
- Automate environment provisioning (fully touchless builds with seed data)
- Improve deployment reliability and rollback mechanisms
- Observability & Reliability: Own monitoring, alerting, and logging strategy (Prometheus, Grafana, Datadog, etc.)
- Ensure 100% monitoring coverage for critical services
- Reduce Sev1/Sev2 incidents caused by infrastructure
- Create and maintain runbooks (COPs) for incident response
- Define SLOs, SLIs, and error budgets
- Security & Compliance: Implement IAM best practices and least-privilege access
- Improve secrets management and credential rotation
- Partner with the Security team on audits and compliance controls
- Incident Management: Lead root cause analysis for major incidents
- Drive postmortems and preventive improvements
- Improve MTTR and overall operational maturity
Required Skills & Experience:
- 6+ years in DevOps / SRE / Cloud Engineering
- Strong experience with AWS (VPC, IAM, EC2, S3, RDS, EKS, etc.)
- Deep Kubernetes experience (production clusters)
- Strong understanding of networking and Linux systems
- Experience with Infrastructure as Code (Terraform preferred)
- Experience implementing monitoring & alerting systems (Datadog, Prometheus, Grafana)
- Strong scripting skills (Python / Bash)
- Experience managing production systems with high availability requirements
- Good understanding of databases such as Postgres and MySQL
- Strong written and verbal communication skills
- Ability to follow structured processes while being proactive in identifying improvements
- Analytical and problem-solving mindset
- Willingness to work night shifts on a long-term basis