About the opportunity
We are hiring on behalf of a well-established global IT consulting and implementation firm with offices across North America, Europe, and India (HITEC City, Hyderabad). The organisation delivers technology solutions across Cloud, DevOps, SAP, and AI for enterprise clients globally and has a strong people-first, learning-oriented culture.
Role overview
We are looking for a Site Reliability Engineer with a strong Observability specialisation to drive service reliability, reduce operational toil, and build best-in-class monitoring and alerting infrastructure. The ideal candidate brings deep Grafana expertise, will take ownership of SLO/SLA definition and distributed system visibility, and will drive the shift from reactive to proactive operations.
Key responsibilities
• Define, track, and report on Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets across platform services
• Build, maintain, and optimise observability infrastructure using Grafana, Prometheus, Loki, Tempo, and related open-source tooling
• Develop dashboards and alerting rules that provide actionable, low-noise insights for engineering and operations teams
• Lead blameless post-incident reviews (PIRs) and drive systemic reliability improvements from learnings
• Partner with engineering teams to instrument applications with distributed tracing, structured logging, and custom metrics
• Reduce operational toil through automation: scripting runbooks, auto-remediation workflows, and self-healing infrastructure
• Define on-call practices, escalation policies, and runbooks; contribute to a sustainable on-call culture
• Evaluate and implement new observability tooling as the stack evolves (e.g., OpenTelemetry, Jaeger, VictoriaMetrics)
Required skills & experience
• 8+ years of combined SRE / DevOps / Platform Engineering experience
• Strong hands-on expertise with Grafana: dashboards, alerting, data sources
• Proficiency in Prometheus: PromQL, exporters, Alertmanager
• Experience with log aggregation using Loki, ELK stack, or equivalent
• Solid understanding of distributed systems principles, microservices architecture, and container orchestration (Kubernetes)
• Proficiency in Python, Go, or Bash for automation and tooling
• Strong analytical thinking for root cause analysis and capacity planning
Good to have
• Hands-on experience with OpenTelemetry instrumentation
• Exposure to Grafana OnCall, Grafana Incident, or PagerDuty for incident management
• Familiarity with eBPF-based observability tools (Cilium, Parca)
• Azure or AWS certifications
What's on offer
• End-to-end ownership of observability, not just dashboard maintenance
• Hybrid work flexibility from HITEC City, Hyderabad
• Exposure to global-scale distributed systems for international clients
• Certification reimbursement and structured learning pathways
Location: Hyderabad (Hybrid)
Experience: 8+ years
Employment type: Full-time
Specialisation: Observability – Grafana, Prometheus, Loki stack