Staff Site Reliability Operations Engineer (AIOps, Grafana, & Google Cloud Platform)
Role Overview
We are seeking a Staff Site Reliability Engineer (SRE) to lead our global platform reliability and drive our next-generation observability strategy on Google Cloud Platform (GCP). In this role, you will leverage Grafana Labs' complete telemetry stack and AIOps methodologies to build intelligent, self-healing infrastructure. You will bring deep expertise in scaling enterprise-grade Google Kubernetes Engine (GKE) topologies, managing high-throughput Kafka event streams, and maintaining high-performance PostgreSQL, AlloyDB, and BigQuery ecosystems at massive scale. Crucially, you will provide deep technical leadership across the entire networking stack, diagnosing complex issues from physical-layer transport up to application-layer protocols.
This position is mostly remote, with some in-office time.
Key Responsibilities
Full-Stack Network Architecture: Architect, optimize, and troubleshoot complex networking infrastructure spanning Layer 1 through Layer 7, ensuring low-latency data transport, secure edge routing, and seamless service mesh integration.
Grafana Stack Architecture: Design, scale, and optimize our unified observability platform using the Grafana Labs suite (Grafana, Mimir, Loki, Tempo, and Beyla).
AIOps & Intelligent Alerting: Deploy machine learning models and automated anomaly detection to cut through telemetry noise, reduce alert fatigue, and predict network or data pipeline bottlenecks.
GKE Platform Engineering: Drive the architecture, scaling, security, and networking of production Google Kubernetes Engine (GKE) clusters.
Data & Event Streaming Reliability: Tune and maintain high-throughput Apache Kafka clusters to guarantee low-latency event delivery and high availability.
Large-Scale Database Management: Ensure the performance, scalability, and disaster recovery readiness of our transactional and analytical data tiers across PostgreSQL, AlloyDB, and BigQuery.
Automated Incident Response: Integrate AIOps insights with Grafana workflows to automate triage, accelerate root-cause analysis, and trigger auto-remediation scripts.
Technical Leadership: Champion the long-term technical roadmap for distributed infrastructure engineering and GCP cloud-native observability standards.
Mentorship: Coach senior and junior engineers on advanced debugging techniques, distributed systems thinking, and intelligent operations across a distributed workforce.
Required Qualifications
Location/Work Style: Proven track record of high autonomy and successful delivery in a 100% remote engineering environment.
Experience: 8+ years in SRE, Production Engineering, or Distributed Systems infrastructure roles.
Networking Expertise (L1-L7): Deep technical knowledge and debugging mastery across all OSI layers, including:
L1-L3: Physical/fiber infrastructure awareness, switching, and advanced routing protocols (BGP, OSPF).
L4: Transport layer tuning (TCP congestion control algorithms, UDP, QUIC).
L5-L7: Session management, TLS termination, DNS architecture, and advanced application protocols (HTTP/3, gRPC).
Orchestration & Containerization: Expert-level mastery of Google Kubernetes Engine (GKE) internals, custom controllers, multi-cluster networking, and GitOps workflows.
Data Infrastructure: Proven track record managing high-throughput Apache Kafka pipelines and large-scale data environments across PostgreSQL, AlloyDB, and BigQuery.
Grafana Ecosystem: Deep, hands-on experience deploying and managing Grafana Enterprise/Cloud, Prometheus/Mimir, Loki, and Tempo at scale.
AIOps Implementation: Track record applying AI/ML techniques for time-series anomaly detection, log clustering, and correlation (e.g., Grafana Adaptive Metrics, BigPanda).
Infrastructure as Code: Advanced, production-scale expertise utilizing HashiCorp Terraform exclusively to provision and manage multi-region GCP cloud architectures.
Programming: High proficiency in Go and Python for building custom infrastructure tooling, Kubernetes operators, and data integration scripts.
Preferred Attributes
Remote Communicator: Exceptional written and verbal communication skills, with an emphasis on creating clear documentation for asynchronous alignment.
GCP Expert: Deep knowledge of Google Cloud architectural best practices, Cloud SDN, Cloud Armor, Interconnect, Identity and Access Management (IAM), and cost optimization.
Systems Thinker: Deep understanding of Linux internals, eBPF-based monitoring, kernel-level networking, and packet analysis tools (Wireshark, tcpdump).
Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.