The Senior Site Reliability Engineer (SRE) ensures the reliability, availability, performance, and operability of production systems across our platforms by applying software engineering practices to operations, with a focus on automation, observability, and incident response.
Responsibilities:
- Own and improve the reliability, availability, and performance of production services in Google Cloud Platform (GCP).
- Participate in incident management, including detection, triage, mitigation, escalation, and recovery.
- Use and improve incident workflows and tooling (e.g., ServiceNow) to ensure clear ownership and timely communication.
- Design, implement, and operate observability solutions including monitoring, logging, tracing, synthetics, and dashboards (e.g., Splunk Observability, OpenTelemetry).
- Reduce operational toil through automation and engineering-led solutions, proactively introducing and driving SRE best practices.
- Support on-call rotations across multiple time zones, contributing to a sustainable 24/7 support model.
- Define, monitor, and report SLIs, SLOs, and error budgets for critical services.
- Drive and be accountable for best-in-class service availability through SRE principles, automation, and proactive reliability engineering.
Essential skills and/or certifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Strong experience with cloud-native concepts and technologies, with a clear preference for Google Cloud Platform (GCP) and Kubernetes (GKE).
- Proven experience with Site Reliability Engineering and production incident management, ideally using platforms such as ServiceNow.
- Experience with monitoring and observability tools, including metrics, logs, traces, and synthetics (e.g., Splunk Observability, OpenTelemetry).
- Exposure to reliability testing, resilience engineering, or cost optimisation initiatives.
- Excellent analytical and problem-solving skills, with the ability to diagnose complex production issues quickly.
- Software development or automation experience using Python, shell scripts, or similar languages.
- Hands-on experience operating production cloud infrastructure at scale.
- Experience managing multi-region, high-availability production systems with a focus on scalability, resilience, and minimising service disruption during failures.
- Proficiency in the Microsoft Office suite.
- Ownership mindset: be a problem solver, stay curious and proactive, take action, and seek ways to collaborate and connect with people and teams to drive success.
- Continuous growth mindset: keep learning through social experiences and relationships with stakeholders, experts, colleagues, and mentors, and broaden your competencies through structured courses and programs.
- Where applicable, fluency in English and languages relevant to the working market.