Lead Platform/Site Reliability Engineer

Job Type: Full Time


Job Description - Lead Platform/Site Reliability Engineer

What You'll Do:

As a Lead SRE, you'll be instrumental in shaping our systems' future. Your responsibilities will include:

  • System Reliability Leadership: Develop and execute strategies to achieve unparalleled service reliability and availability. You'll implement cutting-edge best practices, design resilient monitoring solutions, and conduct comprehensive failure injection and failover testing.
  • Advanced Automation: Spearhead automation initiatives to streamline complex operational tasks, enhancing efficiency and reducing manual interventions. You'll advocate for treating "operations as a software problem" throughout the organization.
  • Comprehensive Monitoring & Performance: Design and maintain advanced monitoring and alerting systems to assess system health, performance, and user experience. You'll conduct in-depth analysis of metrics and logs to proactively identify and resolve complex issues.
  • Incident Management & Prevention: Lead during critical incidents, ensuring rapid resolution and clear communication. You'll conduct thorough post-mortem analyses, implement sustainable solutions, and share insights to prevent recurrence. Expect to participate in on-call rotations as a primary escalation point.
  • Strategic Collaboration: Work closely with development and operations teams to embed reliability principles throughout the software development lifecycle. You'll provide expert guidance, promote SRE best practices, and foster a culture of shared ownership for system reliability.
  • Capacity Planning & Optimization: Monitor and analyze system capacity and performance data, forecast future demands, and lead efforts to scale infrastructure efficiently to meet growth.
  • Continuous Improvement & Innovation: Identify areas for systemic improvement in systems, tools, and processes. You'll lead the design and implementation of innovative solutions to enhance reliability, performance, and operational efficiency.
  • Mentorship & Leadership: Provide technical leadership and mentorship to SREs and other team members, fostering growth and skill development. You'll also contribute to hiring and onboarding processes for new team members.

What You'll Bring:

We're looking for a highly experienced and passionate SRE leader with:

  • 12+ years of experience in Site Reliability Engineering, DevOps, or a related critical operations role, with a proven track record of leading significant reliability initiatives.
  • A Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent extensive practical experience.
  • Exceptional proficiency in scripting and programming languages (e.g., Python, Go, Java, Ruby, Bash) for developing advanced automation, tooling, and system integrations.
  • Extensive hands-on experience with major cloud platforms (e.g., AWS, Google Cloud Platform, Azure) and deep expertise in containerization technologies (Docker, Kubernetes).
  • Profound understanding of Linux/Unix systems internals, networking protocols, and distributed system architectures.
  • Expertise in designing and managing CI/CD pipelines and robust version control systems (e.g., Git), advocating for GitOps principles.
  • Mastery of monitoring, logging, and alerting tools (e.g., Datadog, Prometheus, Grafana, ELK stack, OpenTelemetry).
  • Superior problem-solving skills, critical thinking, and meticulous attention to detail, especially under pressure.
  • Outstanding communication, interpersonal, and collaboration skills, with the ability to influence and lead cross-functional teams.
  • Proven ability to thrive and lead in a fast-paced, highly dynamic, and complex technical environment.
  • Expert-level debugging and root cause analysis capabilities across complex distributed systems.

Bonus Points For:

  • Extensive experience with infrastructure as code (IaC) tools (e.g., Terraform, Ansible, Pulumi).
  • Deep knowledge of various database systems (relational and NoSQL) and advanced data management strategies.
  • Significant experience designing, implementing, and operating microservices architectures.
  • Contributions to open-source projects related to SRE, operations, or cloud-native technologies.

This role offers a unique opportunity to make a significant impact on our core services and directly influence our engineering culture around reliability.

Original job "Lead Platform/Site Reliability Engineer" posted on GrabJobs.

Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.