Site Reliability Engineer

Job Type: Full Time

Job Description - Site Reliability Engineer

Role Overview: As a Site Reliability Engineer (SRE), you will be responsible for designing, implementing, and maintaining highly reliable and scalable software systems and infrastructure. You will apply software engineering principles to automate operations tasks, improve system performance, and enhance reliability, while collaborating closely with development, operations, and other cross-functional teams to achieve organizational goals.

Key Responsibilities:

  1. Reliability and Availability Engineering:
    • Design, implement, and maintain highly available and fault-tolerant systems and services to meet Service Level Agreements (SLAs) and business requirements.
    • Implement monitoring, alerting, and incident response mechanisms to proactively detect and mitigate system failures, anomalies, and performance issues.
    • Conduct post-incident reviews and root cause analyses to identify systemic issues and implement preventive measures.
  2. Automation and Tooling:
    • Develop and maintain automation scripts, tools, and infrastructure to streamline operations tasks, deployment processes, and system management.
    • Implement infrastructure as code (IaC) practices to automate provisioning, configuration, and deployment of infrastructure resources.
    • Build and maintain CI/CD pipelines to automate software builds, testing, and deployment processes.
  3. Scalability and Performance Optimization:
    • Design and implement solutions to improve system scalability, performance, and efficiency to handle increasing workloads and user traffic.
    • Conduct capacity planning and performance tuning exercises to optimize resource utilization and mitigate bottlenecks.
    • Implement caching, load balancing, and other optimization techniques to improve application and system performance.
  4. Incident Management and Response:
    • Participate in the on-call rotation and respond to system incidents, outages, and emergencies to minimize downtime and service disruptions.
    • Coordinate and collaborate with cross-functional teams to troubleshoot and resolve complex technical issues and incidents.
    • Document incident response procedures, best practices, and lessons learned for future reference and improvement.
  5. Security and Compliance:
    • Implement and enforce security measures and controls to protect systems, networks, and data from cyber threats and vulnerabilities.
    • Conduct security reviews, audits, and vulnerability assessments to identify and remediate security risks.
    • Ensure compliance with industry regulations, standards, and best practices related to data privacy and security.

Qualifications and Skills:

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • Proven experience in software engineering, systems administration, or infrastructure operations roles.
  • Strong programming and scripting skills with proficiency in languages such as Python, Go, or Java.
  • Experience with cloud computing platforms (e.g., AWS, Azure, GCP) and containerization technologies (e.g., Docker, Kubernetes).
  • Knowledge of infrastructure as code (IaC) tools and practices (e.g., Terraform, Ansible, Chef, Puppet).
  • Familiarity with monitoring and observability tools (e.g., Prometheus, Grafana, ELK stack) for system performance and health monitoring.
  • Excellent problem-solving, analytical, and troubleshooting skills.
  • Effective communication and collaboration skills with cross-functional teams and stakeholders.

Additional Requirements:

  • Certification in relevant technologies and platforms (e.g., AWS Certified Solutions Architect, Certified Kubernetes Administrator) is a plus.
  • Experience with distributed systems, microservices architecture, and cloud-native technologies.
  • Knowledge of DevOps practices, SRE methodologies, and agile development methodologies.
  • Willingness to learn new technologies and adapt to evolving industry trends and advancements.

Original job Site Reliability Engineer posted on GrabJobs.

Location: Pasig
