OVERVIEW
We're hiring a Site Reliability Engineer to support a key global technology client. You'll join a modern, cloud-native engineering environment and partner closely with development teams to improve the reliability, scalability, and automation of distributed platforms. The role blends software engineering with reliability ownership: you'll design and build internal services and tooling, streamline CI/CD, implement Infrastructure-as-Code at scale, and strengthen observability so issues are found and fixed before they impact users.
This position offers high autonomy and visibility. You'll work across well-documented systems and established tooling, prepare proofs of concept to influence change, and drive pragmatic automation (in Go or Python) that reduces manual effort and makes releases safer and faster. If you enjoy hands-on engineering, diagnosing complex problems, and landing improvements in real production environments, this is an opportunity to make a clear and measurable impact.
DESCRIPTION
As a Site Reliability Engineer, you will:
Build internal platforms, services, and APIs that enable self-service provisioning, safe deployments, and efficient day-to-day operations.
Enhance CI/CD workflows (e.g., Jenkins or similar) to increase deployment reliability, add guardrails, and improve developer experience and velocity.
Implement and evolve Infrastructure-as-Code using Terraform (and related patterns) to standardize environments, reduce configuration drift, and improve repeatability.
Define and operationalize SLIs/SLOs and error budgets, build actionable dashboards, and tune alerts to reflect user experience and business risk.
Operate Kubernetes workloads at scale; improve resilience, performance, and cost-efficiency through sound engineering and automation.
Strengthen observability (metrics, logs, traces) using Prometheus and complementary platforms; drive root-cause analysis and preventative fixes.
Automate routine work and periodic upgrade cycles (preferably in Go/Python) to eliminate toil and reduce change risk.
Troubleshoot complex incidents across compute, networking, containers, and deployments; participate in a shared on-call rotation and contribute to post-incident reviews.
Collaborate with engineers, architects, and product stakeholders to translate requirements into secure, observable, and scalable infrastructure solutions.
Document patterns and best practices; mentor teams on reliability-first ways of working and platform standards.
QUALIFICATIONS
Strong hands-on experience with AWS (production environments) and cloud-native architectures; familiarity with hybrid or multi-cloud concepts is a plus.
Practical expertise operating Kubernetes (deployments, day-2 operations, and troubleshooting).
Solid CI/CD skills with Jenkins or similar tools (pipeline design, release safety, rollbacks).
Proficiency in Infrastructure-as-Code (Terraform) and Git-based workflows for environment management.
Programming/automation in Go and/or Python (production-quality code; tooling and services, not just scripts).
Observability experience with Prometheus and dashboards/alerting tuned to SLIs/SLOs; familiarity with platforms such as Grafana, Datadog, or CloudWatch is welcome.
Networking fundamentals for distributed systems: DNS, load balancing, VPC design, security groups, and layer-7 routing/proxies.
Sound understanding of secure system design (least privilege, secrets management, change control) and performance/reliability trade-offs.
Excellent communication skills and the ability to operate independently in distributed, asynchronous teams while influencing stakeholders through clear proposals and POCs.
7+ years in SRE/DevOps/Infrastructure/Software Engineering with a track record of operating production-grade systems at scale.
PROFESSIONAL ATTRIBUTES
Ownership: You're accountable across both build and run; you close the loop with measurable outcomes.
Automation first: You remove toil with durable solutions, not quick fixes.
Engineering rigor: You apply design patterns, testing, and code reviews to platform work.
Influence without authority: You use documentation, POCs, and calm communication to align teams.
Proactive and visible: You work independently across time zones and keep stakeholders informed.
We regret that only shortlisted candidates will be contacted.
EA Registration No: R21103843, Andrew Jonas Matthew
Allegis Group Singapore Pte Ltd, Company Reg No. 200909448N, EA License No. 10C4544