
Lead Site Reliability Engineer

Company: Jobgether
Job Type: Full Time


Job Description - Lead Site Reliability Engineer










This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Lead Site Reliability Engineer based in Canada.


This is a high-impact technical leadership role focused on improving reliability across large-scale distributed systems that directly impact millions of customers. You will sit at the core of incident response and production stability, working across engineering teams to identify systemic failure patterns and eliminate them at the root. The role blends hands-on engineering with cross-functional influence, requiring you to translate real production incidents into durable architectural and operational improvements. You will help define and elevate reliability standards across the organization, shaping how systems are built, deployed, and operated. Beyond incident response, you will drive long-term resilience through observability, automation, and safer deployment practices. This is a highly collaborative environment where influence matters as much as execution, and where your work compounds across teams and services. You will also help mature a growing SRE practice, moving it from reactive incident handling to proactive reliability engineering.










Accountabilities:



  • Identify and analyze recurring failure patterns across production systems by reviewing incidents, postmortems, and operational data to uncover systemic root causes.

  • Prioritize reliability improvements based on their impact on MTTR, MTTD, and overall customer-facing stability, focusing on high-leverage engineering interventions.

  • Drive the design and adoption of reliability patterns such as resilience mechanisms, safe deployment strategies, observability standards, and dependency protection across engineering teams.

  • Collaborate directly with product and platform teams through code contributions, reviews, and hands-on engineering support to implement durable fixes.

  • Lead technical discussions in incident reviews and operational forums, ensuring clear identification of gaps in monitoring, recovery, and system design.

  • Influence engineering teams across the organization without direct authority to adopt reliability best practices and shared standards.

  • Evangelize and document reliability improvements, ensuring knowledge is shared and scaled across teams rather than remaining isolated fixes.

  • Participate in on-call rotations and incident leadership, supporting real-time response when critical issues arise.

  • Contribute to the evolution of internal SRE practices by mentoring engineers and helping shift the organization toward proactive reliability engineering.


Requirements:



  • 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Infrastructure Engineering in large-scale distributed systems.

  • Strong programming ability in at least one language (e.g., Python, Go, Java, or TypeScript) with production-grade code delivery experience.

  • Proven experience improving system reliability through cross-team adoption of engineering practices such as observability, resilience patterns, or deployment safety mechanisms.

  • Strong systems thinking with the ability to identify root causes of complex production issues and apply simple, high-impact solutions.

  • Hands-on experience with cloud environments (AWS, GCP, or Azure), CI/CD pipelines, infrastructure-as-code, and modern observability platforms.

  • Experience defining and operating with SLOs/SLIs and improving system performance based on reliability metrics.

  • Ability to influence and drive change across teams without direct managerial authority.

  • Clear and effective communicator, able to translate complex technical issues into actionable recommendations.

  • Preferred: experience with chaos engineering, incident management tooling, or internal developer platforms.

  • Preferred: exposure to AI/LLM tools in engineering workflows or production systems.


Benefits:



  • Competitive compensation aligned with experience and market benchmarks

  • Remote-first flexibility within Canada

  • Comprehensive health, dental, and wellness benefits

  • Equity and long-term incentive programs (subject to eligibility)

  • Paid time off and flexible work arrangements

  • Learning and development support for technical growth

  • Opportunity to shape and mature a high-impact SRE function

  • Exposure to large-scale distributed systems and global engineering teams


How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the top-fitting candidates, and this shortlist is then shared directly with the hiring company. The final decision and next steps (interviews, assessments) are managed by their internal team.

We appreciate your interest and wish you the best!


 

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

 

 

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.
Original job posting: Lead Site Reliability Engineer, posted on GrabJobs.

Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.