This position is listed on behalf of a partner company, which manages all applications and next steps. Our partner is looking for a Senior Site Reliability Engineer based in Canada.

You will join a global, fully distributed engineering organization building critical infrastructure that powers complex, high-scale platforms used across multiple countries. In this role, you will be responsible for ensuring the reliability, scalability, and security of production systems that support modern cloud-native services and AI-driven workflows. You will design and operate infrastructure that must remain resilient across high-demand, multi-tenant environments. The work is deeply technical and highly impactful, requiring strong ownership of system reliability, observability, and automation. You will collaborate closely with engineering, product, and security teams to build systems that are performant, cost-efficient, and safe by design. This is a role for someone who thrives in asynchronous, remote-first environments and enjoys solving large-scale infrastructure challenges with autonomy and precision.

Accountabilities:

Design, implement, and maintain Infrastructure-as-Code using Terraform and Kubernetes to support scalable, production-grade environments.

Build and improve observability systems, including monitoring, logging, alerting, and dashboards to ensure system visibility and reliability.

Lead incident response processes, perform root cause analysis, and drive post-incident improvements to reduce system downtime.

Collaborate with security teams to embed security and compliance requirements into infrastructure design across global jurisdictions.

Optimize system performance, cloud resource utilization, and infrastructure costs while maintaining reliability standards.

Identify and eliminate operational toil through automation, improving engineering efficiency and platform scalability.

Support and enhance CI/CD pipelines and deployment strategies to ensure safe, reliable, and repeatable releases.

Work closely with platform and product engineering teams to improve API reliability, developer experience, and system observability.

Produce clear documentation, runbooks, and operational guidelines to support engineering excellence and knowledge sharing.

Requirements:

Senior-level experience in Site Reliability Engineering, DevOps, or Systems Engineering roles operating production systems at scale.

Strong hands-on experience with Kubernetes in production environments.

Solid expertise with AWS cloud services, including compute, networking, storage, and managed services.

Proficiency with Infrastructure-as-Code tools such as Terraform.

Experience building and managing CI/CD pipelines using tools like GitHub Actions, GitLab CI, or Jenkins.

Strong Linux systems knowledge, including debugging, scripting (Bash), and log analysis.

Experience designing and operating observability stacks (e.g., Prometheus, Grafana, Datadog, ELK).

Strong communication skills, with the ability to explain complex technical topics to both technical and non-technical audiences.

Nice to have: experience with backend programming languages (Go, Python, Java, Node.js, etc.).

Nice to have: experience in multi-tenant platforms, consultancy environments, or large-scale distributed systems.

Benefits:

Annual salary range: USD $54,000 – $150,000 (based on experience and location).

Fully remote work from anywhere in the world.

Flexible working hours with an async-first culture.

Flexible paid time off policy.

16 weeks of paid parental leave.

Stock options as part of the compensation package.

Home office and IT equipment budget.

Learning and professional development budget.

Mental health and wellness support services.

Budget for local coworking spaces or in-person team gatherings.

Inclusive, global work environment with strong emphasis on autonomy and work-life balance.

How Jobgether works:

We use an AI-powered matching process to ensure your application is reviewed quickly, objectively, and fairly against the role's core requirements. Our system identifies the best-matching candidates and shares this shortlist directly with the hiring company. The hiring company's internal team manages the final decision and next steps (interviews, assessments).

We appreciate your interest and wish you the best!

Data Privacy Notice: By submitting your application, you acknowledge that Jobgether will process your personal data to evaluate your candidacy and share relevant information with the hiring employer. This processing is based on legitimate interest and pre-contractual measures under applicable data protection laws (including GDPR). You may exercise your rights (access, rectification, erasure, objection) at any time.

We may use artificial intelligence (AI) tools to support parts of the hiring process, such as reviewing applications, analyzing resumes, assessing responses, and identifying potential inconsistencies or verification signals in application materials. These tools assist our recruitment team but do not replace human judgment. Final hiring decisions are made by humans. If you would like more information about how your data is processed, please contact us.
