How you'll add value:
Execution & Collaboration:
- Respond to production incidents, perform triage and troubleshooting, and contribute to post-incident analysis.
- Identify and automate manual processes to improve efficiency and reduce risk.
- Enhance and evolve monitoring tools and platforms to improve observability.
- Promote and apply best practices for reliability, scalability, and performance across engineering.
- Implement and support cloud automation using Terraform, Ansible, or CloudFormation.
- Work within change management protocols to maximize uptime for production systems.
- Participate in the on-call rotation, providing 24x7 incident support and contributing to root cause analysis.
- Partner with developers, architects, vendors, and IT teams to ensure reliable system operations.
- Research and remediate vulnerabilities in coordination with security teams.
- Maintain documentation of infrastructure, monitoring, runbooks, and incident response procedures.
Standards & Process:
- Apply company policies and procedures when handling operational tasks and incidents.
- Suggest and implement improvements to operational processes and monitoring practices.
- Contribute to technical diagrams, documentation, and runbooks for system reliability.
Learning & Growth:
- Expand expertise in cloud services (Azure, AWS, or GCP) and container platforms (EKS, ECS, AKS).
- Build proficiency with observability and monitoring tools (Prometheus, Grafana, ELK, Site24x7, Nagios).
- Develop scripting and automation skills using Python, Bash, PowerShell, or similar.
- Participate in planning discussions by contributing technical input on system stability and reliability.
What you'll need to be successful in this role:
- BS in Computer Science, Information Systems, or related field (or equivalent experience).
- 2–4 years of experience in site reliability engineering, DevOps, or cloud operations.
- Experience with cloud platforms (Azure or AWS), including services such as AKS, ECS/EKS, Functions/Lambda, S3, and Blob Storage.
- Proficiency with infrastructure-as-code and automation (Terraform, Ansible, YAML, Python, Bash, PowerShell).
- Strong Linux engineering skills; working knowledge of Windows administration.
- Experience supporting production environments and participating in on-call rotations.
- Familiarity with web servers and middleware (Nginx, Apache Tomcat).
- Experience with CI/CD tools (GitLab, Git, or similar).
- Strong written, oral, and interpersonal communication skills.
- Experience with monitoring tools (Prometheus, Grafana, ELK, Site24x7, Nagios).
- Knowledge of performance analysis and system vulnerability remediation.
- Cloud certification (AWS or Azure) preferred.
- Familiarity with restaurant industry SaaS platforms and customer-facing applications.
R365 Team Member Benefits & Compensation
- This position has an expected salary range of $98,583–$138,016 annually. The actual salary may vary based on several factors, including, but not limited to, relevant skills and experience, time in the role, business line, and geographic location. Restaurant365 focuses on equitable pay for our team and aims for transparency in our pay practices.
- Comprehensive medical benefits, 100% paid for the employee
- 401(k) + matching
- Equity Option Grant
- Unlimited PTO + Company holidays
- Wellness initiatives