Maintain high availability of production systems by focusing on resilient cloud architecture, fast detection and mitigation of incidents, postmortem-driven improvements, and automated routine checks for risks.
Leverage AI assistants and coding agents to reduce toil and improve incident diagnosis, infrastructure automation, knowledge management, and operational efficiency.
Participate in on-call rotations and respond to production incidents to ensure service continuity. Compensatory time off will be provided in accordance with company policy.
Perform server administration and cloud operations, including troubleshooting, monitoring, and capacity management of Linux cloud virtual servers and Kubernetes workloads.
Analyze distributed system issues, including performance and cost-efficiency bottlenecks, and drive improvements in reliability, performance, scalability, and cost optimization.
Define and monitor service level objectives (SLOs) and continuously improve system resilience and availability.
Perform proof-of-concept evaluations, including setup, testing, and production validation of cloud solutions and services before mass adoption.
Deploy and configure cloud solutions in production environments, including resource provisioning, configuration, monitoring, and operational support across cloud platforms.
Job Requirements
Bachelor’s degree in Computer Science, Information Technology, Programming & Systems Analysis, Science (Computer Studies), or a related field.
Alternatively, 3–5 years of relevant experience in Site Reliability Engineering, Cloud Engineering, DevOps, or related roles.
Proficiency in at least one programming or scripting language such as Python, Go, or Bash.
Experience with cloud platforms such as Alibaba Cloud, AWS, Azure, or equivalent; experience in multi-cloud or hybrid cloud environments is preferred.
Strong understanding of Linux systems (kernel, memory, process, etc.), networking (TCP/IP, DNS, TLS), load balancing, high availability architecture, and observability platforms such as Prometheus, Grafana, Loki and the ELK stack.
Expert knowledge and hands-on experience in incident handling, especially in Kubernetes and container environments.
Hands-on experience with nginx, including configuration, troubleshooting, and performance tuning, is preferred.
Experience in deploying and operating large-scale production distributed systems, including server administration, microservice architecture, cloud load balancing (L4/L7), IP routing, reverse proxy architecture, and cloud support.
Experience with automation of cloud operations and infrastructure, including scripting and CI/CD processes, is preferred.
Experience using AI-assisted engineering tools and coding agents to improve automation, incident response, troubleshooting, and operational efficiency is an advantage.
A strong team player with good communication skills, responsible, self-driven, and highly motivated.
Ability to communicate in both Mandarin and English to support coordination and collaboration with Mandarin-speaking stakeholders, teams, and business partners across regional markets.