About the Role
We are seeking a proactive and detail-oriented Site Reliability Engineer (SRE) with 3+ years
of experience to ensure high availability, reliability, and performance of production systems.
This role focuses on automation, incident management, and cross-team coordination to drive
operational excellence.
Key Responsibilities
• Maintain reliable, scalable, and secure production environments.
• Implement and manage monitoring, alerting, and logging solutions.
• Contribute to defining and tracking SLIs/SLOs and support error budget practices.
• Automate operational tasks to improve efficiency and reduce manual effort.
• Perform troubleshooting and Root Cause Analysis (RCA) for production incidents.
• Optimize system performance, availability, and capacity.
• Maintain SOPs and incident documentation in Confluence.
• Adhere to change management, deployment governance, and disaster recovery
standards.
• Support incident response for critical production services.
Collaboration & Tools
• Coordinate with external vendors and internal cross-functional teams.
• Work closely with Engineering, Product Owners, and Operations teams.
• Manage incidents and changes using ServiceNow & JIRA.
• Collaborate through Slack and structured communication channels.
Technical Skills
Systems & Cloud
• Strong knowledge of Windows and Linux/Unix systems
• Solid understanding of networking fundamentals (DNS, TCP/IP, Load Balancing,
Firewalls).
• Experience with at least one cloud platform (AWS, Azure, or GCP).
Automation & CI/CD
• Proficiency in one scripting/programming language (Python, Go, Bash, PowerShell, or
Java).
• Understanding of CI/CD pipelines and automation practices.
Containers
• Hands-on experience with Docker and Kubernetes
• Experience with monitoring tools such as Power BI.
• Ability to analyze logs, metrics, and traces for troubleshooting.
ITSM & Documentation
• Experience with ServiceNow & JIRA (incident/change/problem workflows)
• Working knowledge of Confluence for technical documentation and knowledge
management.
Additional Experience (Preferred)
• Background in DevOps, Cloud Engineering, or Platform Engineering
• Understanding of security best practices and compliance standards.
• Familiarity with AI-assisted engineering tools (Claude Code, Jellyfish, GitHub Copilot).
• Exposure to large-scale or production-grade systems.
Soft Skills
• Strong analytical and troubleshooting mindset
• Excellent written and verbal communication skills
• Ownership-driven and composed during high-severity incidents
Accessibility & Inclusion Statement
We are committed to creating an inclusive environment for all employees, including persons
with disabilities. Reasonable accommodations will be provided upon request.