Lead Platform/Site Reliability Engineer

Company : Io Tech Solutions Limited

Job Type : Full Time

Location : Hong Kong

Job Description - Lead Platform/Site Reliability Engineer

What You'll Do:

As a Lead SRE, you'll be instrumental in shaping our systems' future. Your responsibilities will include:

System Reliability Leadership: Develop and execute strategies to achieve unparalleled service reliability and availability. You'll implement cutting-edge best practices, design resilient monitoring solutions, and conduct comprehensive failure injection and failover testing.
Advanced Automation: Spearhead automation initiatives to streamline complex operational tasks, enhancing efficiency and reducing manual interventions. You'll advocate for treating "operations as a software problem" throughout the organization.
Comprehensive Monitoring & Performance: Design and maintain advanced monitoring and alerting systems to assess system health, performance, and user experience. You'll conduct in-depth analysis of metrics and logs to proactively identify and resolve complex issues.
Incident Management & Prevention: Lead during critical incidents, ensuring rapid resolution and clear communication. You'll conduct thorough post-mortem analyses, implement sustainable solutions, and share insights to prevent recurrence. Expect to participate in on-call rotations as a primary escalation point.
Strategic Collaboration: Work closely with development and operations teams to embed reliability principles throughout the software development lifecycle. You'll provide expert guidance, promote SRE best practices, and foster a culture of shared ownership for system reliability.
Capacity Planning & Optimization: Monitor and analyze system capacity and performance data, forecast future demands, and lead efforts to scale infrastructure efficiently to meet growth.
Continuous Improvement & Innovation: Identify areas for systemic improvement in systems, tools, and processes. You'll lead the design and implementation of innovative solutions to enhance reliability, performance, and operational efficiency.
Mentorship & Leadership: Provide technical leadership and mentorship to SREs and other team members, fostering growth and skill development. You'll also contribute to hiring and onboarding processes for new team members.

What You'll Bring:

We're looking for a highly experienced and passionate SRE leader with:
12+ years of experience in Site Reliability Engineering, DevOps, or a related critical operations role, with a proven track record of leading significant reliability initiatives.
A Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent extensive practical experience.
Exceptional proficiency in scripting and programming languages (e.g., Python, Go, Java, Ruby, Bash) for developing advanced automation, tooling, and system integrations.
Extensive hands-on experience with major cloud platforms (e.g., AWS, Google Cloud Platform, Azure) and deep expertise in containerization technologies (Docker, Kubernetes).
Profound understanding of Linux/Unix systems internals, networking protocols, and distributed system architectures.
Expertise in designing and managing CI/CD pipelines and robust version control systems (e.g., Git), advocating for GitOps principles.
Mastery of monitoring, logging, and alerting tools (e.g., Datadog, Prometheus, Grafana, ELK stack, OpenTelemetry).
Superior problem-solving skills, critical thinking, and meticulous attention to detail, especially under pressure.
Outstanding communication, interpersonal, and collaboration skills, with the ability to influence and lead cross-functional teams.
Proven ability to thrive and lead in a fast-paced, highly dynamic, and complex technical environment.
Expert-level debugging and root cause analysis capabilities across complex distributed systems.

Bonus Points For:

Extensive experience with infrastructure as code (IaC) tools (e.g., Terraform, Ansible, Pulumi).
Deep knowledge of various database systems (relational and NoSQL) and advanced data management strategies.
Significant experience designing, implementing, and operating microservices architectures.
Contributions to open-source projects related to SRE, operations, or cloud-native technologies.

This role offers a unique opportunity to make a significant impact on our core services and directly influence our engineering culture around reliability.

Original job Lead Platform/Site Reliability Engineer posted on GrabJobs ©.
