Staff Site Reliability Operations Engineer

Company: Calix

Job Type: Full Time

Bengaluru, India

Job Description - Staff Site Reliability Operations Engineer

The Calix platform enables Communication Service Providers (CSPs) of all sizes to transform and future-proof their businesses. Through real-time data, automation, and actionable insights delivered via Calix One — our cloud-first, AI-powered platform — CSPs can simplify operations, collapse cost, and accelerate innovation. Calix One brings together the automation of everything and the experience of one, empowering customers to deliver differentiated subscriber experiences while driving acquisition, loyalty, and revenue growth. This is the Calix mission: to enable CSPs of all sizes to simplify, innovate, and grow, strengthening both their businesses and the communities they serve.

We’re at the forefront of a once-in-a-generation change in the broadband industry. Join us as we innovate, help our customers reach their potential, and connect underserved communities with unrivaled digital experiences.

Staff Site Reliability Operations Engineer (AIOps, Grafana, & Google Cloud Platform)

Role Overview

We are seeking a Staff Site Reliability Engineer (SRE) to lead our global platform reliability and drive our next-generation observability strategy on Google Cloud Platform (GCP). In this role, you will leverage Grafana Labs' complete telemetry stack and AIOps methodologies to build intelligent, self-healing infrastructure. You will bring deep expertise in scaling enterprise-grade Google Kubernetes Engine (GKE) topologies, managing high-throughput Kafka event streams, and maintaining high-performance PostgreSQL, AlloyDB, and BigQuery ecosystems at massive scale. Crucially, you will provide deep technical leadership across the entire networking stack, diagnosing complex issues from physical-layer transport up to application-layer protocols.

This position is mostly remote, with some in-office time.

Key Responsibilities

Full-Stack Network Architecture: Architect, optimize, and troubleshoot complex networking infrastructure spanning Layer 1 through Layer 7, ensuring low-latency data transport, secure edge routing, and seamless service mesh integration.

Grafana Stack Architecture: Design, scale, and optimize our unified observability platform using the Grafana Labs suite (Grafana, Mimir, Loki, Tempo, and Beyla).

AIOps & Intelligent Alerting: Deploy machine learning models and automated anomaly detection to cut through telemetry noise, reduce alert fatigue, and predict network or data pipeline bottlenecks.

GKE Platform Engineering: Drive the architecture, scaling, security, and networking of production Google Kubernetes Engine (GKE) clusters.

Data & Event Streaming Reliability: Tune and maintain high-throughput Apache Kafka clusters to guarantee low-latency event delivery and high availability.

Large-Scale Database Management: Ensure the performance, scalability, and disaster recovery readiness of our transactional and analytical data tiers across PostgreSQL, AlloyDB, and BigQuery.

Automated Incident Response: Integrate AIOps insights with Grafana workflows to automate triage, accelerate root-cause analysis, and trigger auto-remediation scripts.

Technical Leadership: Champion the long-term technical roadmap for distributed infrastructure engineering and GCP cloud-native observability standards.

Mentorship: Coach senior and junior engineers on advanced debugging techniques, distributed systems thinking, and intelligent operations across a distributed workforce.

Required Qualifications

Location/Work Style: Proven track record of high autonomy and successful delivery in a 100% remote engineering environment.

Experience: 8+ years in SRE, Production Engineering, or Distributed Systems infrastructure roles.

Networking Expertise (L1-L7): Deep technical knowledge and debugging mastery across all OSI layers, including:

L1-L3: Physical/fiber infrastructure awareness, switching, and advanced routing protocols (BGP, OSPF).

L4: Transport layer tuning (TCP congestion control algorithms, UDP, QUIC).

L5-L7: Session management, TLS termination, DNS architecture, and advanced application protocols (HTTP/3, gRPC).

Orchestration & Containerization: Expert-level mastery of Google Kubernetes Engine (GKE) internals, custom controllers, multi-cluster networking, and GitOps workflows.

Data Infrastructure: Proven track record managing high-throughput Apache Kafka pipelines and large-scale data environments across PostgreSQL, AlloyDB, and BigQuery.

Grafana Ecosystem: Deep, hands-on experience deploying and managing Grafana Enterprise/Cloud, Prometheus/Mimir, Loki, and Tempo at scale.

AIOps Implementation: Track record applying AI/ML techniques for time-series anomaly detection, log clustering, and correlation (e.g., Grafana Adaptive Metrics, BigPanda).

Infrastructure as Code: Advanced, production-scale expertise utilizing HashiCorp Terraform exclusively to provision and manage multi-region GCP cloud architectures.

Programming: High proficiency in Go and Python for building custom infrastructure tooling, Kubernetes operators, and data integration scripts.

Preferred Attributes

Remote Communicator: Exceptional written and verbal communication skills, with an emphasis on creating clear documentation for asynchronous alignment.

GCP Expert: Deep knowledge of Google Cloud architectural best practices, Cloud SDN, Cloud Armor, Interconnect, Identity and Access Management (IAM), and cost optimization.

Systems Thinker: Deep understanding of Linux internals, eBPF-based monitoring, kernel-level networking, and packet analysis tools (Wireshark, tcpdump).

Original job posting "Staff Site Reliability Operations Engineer" posted on GrabJobs.