Site Reliability Engineer

Salary: $130,000 - $150,000 yearly

Company: Uscs External Positions

Job Type: Full Time

2 Aquarium Drive, Camden

Job Description - Site Reliability Engineer

Site Reliability Engineer (SRE)

Engineer Reliability into the Systems That Move the Nation’s Food Supply

Who We Are

US Cold owns and operates one of the most complex temperature-controlled logistics networks in North America. Every day, our systems coordinate the storage and movement of food at national scale across a network of state-of-the-art distribution centers, including multiple highly automated warehouse facilities.

We continue to advance our core warehouse and logistics platforms, with a current focus on modular, event-driven, API-first, and cloud architectures. We are also strengthening our SRE and AI practices to improve reliability and accelerate engineering productivity. This is a large investment in innovation aimed at driving operational excellence at our facilities.

If you want to build durable systems that operate in the physical world at scale, this is that opportunity.

The Role

The Site Reliability Engineer is a founding member of US Cold’s SRE practice.

This role exists to move the organization from reactive operations to engineered reliability. You will study how our most critical systems fail — particularly our Phenix WMS and facility automation interfaces — and design controls, automation, and observability that reduce incidents over time.

Success in this role means fewer false alerts, faster recovery, less manual intervention, and systems that heal themselves when possible.

You will work closely with application, infrastructure, and operations teams and participate directly in on-call and incident response.

What You Will Own

Reliability of the Phenix WMS and its integration with facility automation systems (robotics, conveyors, and control interfaces)

Definition and implementation of SLIs and SLOs that measure meaningful system health, not just availability

Observability across the full stack, correlating cloud services, APIs, and on-premise facility operations

Automation to eliminate operational toil, including patching, data corrections, restarts, and recovery tasks

Development of self-healing behaviors for common failure modes

Participation in on-call rotations and leadership of blameless post-incident reviews

Design and execution of disaster recovery tests across SaaS, cloud, and on-premise environments

This is hands-on reliability engineering. The systems you improve will directly impact daily warehouse operations.

Technical Environment

Hybrid environments spanning cloud and on-premise infrastructure

Azure cloud services

Warehouse Management Systems (Phenix WMS) and facility automation interfaces

Observability tooling across logs, metrics, and alerting

Automation using Python, PowerShell, Bash, or Ansible

CI/CD tools and modern deployment practices

Exposure to containerized and distributed systems environments

What We’re Looking For

3+ years of experience in SRE, DevOps, Systems Engineering, or related roles

Strong Linux and Windows systems administration and troubleshooting skills

Hands-on experience with automation and scripting

Experience designing and operating monitoring, alerting, and observability solutions

Practical experience working in Azure environments

Strong analytical skills and a bias toward eliminating root causes, not symptoms

Ability to collaborate across application, infrastructure, and operations teams

Experience supporting warehouse management systems or industrial automation platforms

Exposure to Kubernetes, microservices, or container orchestration

Hands-on experience with infrastructure-as-code tools such as Terraform or Ansible

Understanding of distributed systems and high-availability design

Experience with SRE practices such as SLO-based operations, runbook automation, or chaos testing

Why This Role Is Different

This is not an inherited SRE function. There is no mature framework to maintain.

You will:

Help define what reliability means at US Cold

Work on systems that operate in the physical world

Engineer solutions that reduce toil and operational load

See the direct impact of your work on warehouse uptime and performance

Build practices that scale as the platform modernizes

This is an opportunity to grow as an SRE while helping establish the reliability foundation of a mission-critical platform.

Compensation & Structure

Location: Hybrid – Camden, NJ

Reports to: IT – Site Reliability Engineering Manager

Salary Range: $130,000 - $150,000

Operational Context

Systems operate continuously across warehouse facilities

Reliability failures have physical and operational consequences

On-call participation is part of the role

Work occurs across cloud, SaaS, and on-premise environments

Original job Site Reliability Engineer posted on GrabJobs.
