This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Site Reliability Engineer based in India.

In this role, you will help design, operate, and scale highly available distributed systems that power mission-critical cloud and data platforms. You will work at the intersection of infrastructure, automation, and reliability engineering, ensuring systems remain resilient, observable, and performant under real-world production demands. The environment is fast-paced, cloud-native, and deeply technical, with strong emphasis on Kubernetes-based architectures and modern DevOps practices. You will collaborate closely with engineering, data, and AI/ML teams to support complex workloads across global infrastructure. This role offers the opportunity to solve challenging scalability and performance problems at enterprise scale. It is ideal for engineers who enjoy building automation-first systems and improving reliability through engineering rigor and continuous improvement.

Accountabilities:

Operate and optimize containerized environments using Kubernetes and service mesh technologies such as Istio, ensuring high availability and performance across distributed systems.

Build automation and operational tooling using Go, Python, and Shell scripting to reduce manual intervention and improve system efficiency.

Design and maintain observability stacks using Prometheus, Grafana, and Loki for proactive incident detection and resolution.

Troubleshoot and resolve complex issues across networking, storage, and system performance layers in large-scale distributed environments.

Participate in on-call rotations, incident response, and postmortem analysis to continuously improve reliability and operational maturity.

Collaborate with AI/ML and data engineering teams to ensure infrastructure readiness for model training, inference workloads, and data pipelines.

Requirements

Strong hands-on experience with cloud platforms, particularly Google Cloud, and infrastructure-as-code tools such as Terraform.

Solid understanding of microservices architectures, containerization, and distributed systems, including production use of Kubernetes and Docker.

Strong SRE mindset focused on automation, scalability, observability, and reliability engineering principles.

Practical experience in Linux system administration, networking fundamentals, and security concepts such as PKI and secure service-to-service communication.

Strong problem-solving skills, ability to work in high-pressure environments, and comfort with incident management and operational ownership.

Benefits

Competitive total rewards package aligned with industry standards.

Fully remote work flexibility with no mandatory office presence.

Generous training and certification support to accelerate technical growth.

Dedicated equipment and home-office setup support, including OS choice for your workstation.

Annual wellness budget supporting fitness, health, and personal well-being.

Paid vacation, sick leave, and dedicated volunteer time off.

Exposure to cutting-edge cloud, data, and AI infrastructure environments.

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!

Why Apply Through Jobgether?

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.
