Senior Site Reliability Engineer (SRE)

Salary: $10,000 - $20,000 monthly

Job Type: Full Time

Job Description - Senior Site Reliability Engineer (SRE)

Responsibilities

We are looking for an experienced Site Reliability Engineer who is passionate about building reliable, scalable, and automated infrastructure to support mission-critical platform services.

What You'll Do

  • Ensure the reliability, availability, and operational excellence of critical platform services and infrastructure.
  • Design, deploy, maintain, and optimize cloud-native infrastructure based on Kubernetes and Docker.
  • Build and improve observability systems including monitoring, alerting, logging, and distributed tracing.
  • Participate in architecture reviews and provide reliability-focused recommendations for high-concurrency, low-latency distributed systems.
  • Develop and maintain CI/CD pipelines to improve engineering productivity and deployment quality.
  • Lead capacity planning, performance tuning, disaster recovery planning, and resilience engineering initiatives.
  • Drive Infrastructure as Code (IaC) adoption and automation to reduce operational overhead and human error.
  • Define and continuously improve SLI/SLO/SLA frameworks across critical services.
  • Participate in incident response, root cause analysis (RCA), and postmortem reviews for production issues.
  • Collaborate closely with engineering, QA, product, and security teams to continuously improve platform reliability, scalability, and efficiency.
  • Leverage AI-powered tools (e.g., Cursor, Claude Code, ChatGPT) to enhance operational automation, troubleshooting, and productivity.

Requirements

Must-Have Skills

  • Bachelor's degree or above in Computer Science or a related field.
  • 5+ years of experience in SRE, DevOps, Infrastructure Engineering, or related roles.
  • Strong knowledge of Linux systems and performance optimization.
  • Proficiency in at least one programming language such as Go, Python, Java, or Rust.
  • Hands-on experience with Kubernetes, Docker, and cloud-native ecosystems.
  • Experience with CI/CD tools such as GitHub Actions, GitLab CI, or Jenkins.
  • Solid understanding of networking fundamentals including TCP/IP, HTTP, and WebSocket.
  • Strong troubleshooting, performance analysis, and capacity planning skills.
  • Experience building automation tools and operational platforms.
  • Demonstrated proficiency in AI-assisted development and operations tools such as Cursor and Claude Code.

Technical Stack

Container Platforms

  • Kubernetes
  • Docker

Observability

  • Prometheus
  • Grafana
  • Loki
  • ELK
  • OpenTelemetry

Messaging Systems

  • Kafka
  • RocketMQ
  • Redis

Databases

  • MySQL
  • PostgreSQL
  • ClickHouse
  • Time-Series Databases

Infrastructure Automation

  • Terraform
  • Ansible
  • Helm

Cloud Platforms

  • AWS
  • GCP
  • Alibaba Cloud
  • Tencent Cloud

CI/CD

  • GitHub Actions
  • GitLab CI
  • Jenkins

Preferred Experience

  • Experience in large-scale internet, SaaS, fintech, e-commerce, or mission-critical platform environments.
  • Experience supporting high-concurrency distributed systems.
  • Strong understanding of distributed system architecture, scalability, and reliability engineering principles.
  • Experience operating multi-region or multi-datacenter infrastructure.

Nice to Have

  • Experience managing large-scale Kubernetes clusters (1,000+ nodes).
  • Hands-on experience with Service Mesh technologies (e.g., Istio) and OpenTelemetry.
  • Expertise in Kafka, ClickHouse, and large-scale distributed system optimization.
  • Experience implementing Chaos Engineering practices.
  • Strong background in incident management and large-scale production recovery.
  • Experience with AIOps, intelligent alerting, and automated fault diagnosis systems.

Original job: Senior Site Reliability Engineer (SRE), posted on GrabJobs ©.

Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.