Reliability Engineer

Company: Systems Engineering Solutions Corporation

Job Type: Full Time

Boston, Massachusetts

Job Description - Reliability Engineer

SES currently has an opening for a Reliability Engineer to support the U.S. Air Force Cloud One Architecture and Common Shared Services contract. The Reliability Engineer is responsible for ensuring the availability, performance, scalability, and resiliency of mission‑critical systems. This role applies software engineering principles to infrastructure and operations, with a strong emphasis on automation, monitoring, incident response, and continuous reliability improvement. The Reliability Engineer serves as the bridge between development, operations, and platform teams to ensure that production systems consistently meet defined service level objectives (SLOs) while supporting rapid, safe delivery of new capabilities.

Location: This position is hybrid remote. Candidates will be required to work onsite as needed. Candidates located near Hanscom AFB (Boston, MA) are preferred.

System Reliability & Availability

Design, implement, and maintain highly available, fault-tolerant systems in cloud and hybrid environments
Define, measure, and report Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets
Identify reliability risks and implement mitigation strategies across the system lifecycle
Conduct capacity planning and performance modeling to ensure systems scale to meet demand

Monitoring, Observability & Alerting

Implement and manage monitoring, logging, and tracing solutions to provide full system observability
Define actionable alerting thresholds that minimize noise and enable rapid incident detection
Analyze trends and metrics to proactively identify potential reliability issues

Incident Response & Problem Management

Participate in on‑call rotations and lead incident response activities for production systems
Coordinate troubleshooting efforts across development, infrastructure, and security teams
Conduct post‑incident reviews (PIRs) and develop corrective and preventive action plans
Track recurring issues and ensure root causes are resolved

Automation & Engineering Excellence

Automate operational tasks to reduce manual intervention and operational risk
Develop scripts, tools, and services that improve system reliability and reduce mean time to recovery (MTTR)
Promote “automation over toil” and standardize operational workflows

Reliability‑Focused Engineering

Participate in architecture and design reviews with an emphasis on reliability, resiliency, and recoverability
Validate disaster recovery (DR) and business continuity plans; test failover mechanisms
Support chaos engineering, fault injection testing, and resilience validation where appropriate

Collaboration & Governance

Partner with DevOps, Platform, and Security teams to ensure reliability aligns with delivery and compliance objectives
Document system reliability standards, runbooks, and operational procedures
Support compliance and audit activities (e.g., FedRAMP, FISMA, internal operational controls)

Required Skills:

· Bachelor's degree and eight (8) or more years of experience, or Master's degree and six (6) or more years of experience. Additional experience may be accepted in lieu of a degree.

· Active Secret clearance, at a minimum, is required to start

· US citizenship required

· Experience with cloud platforms (AWS, Azure, OCI, or GCP), including managed services

· Experience with containerized environments (Docker, Kubernetes)

· Familiarity with CI/CD pipelines and deployment automation

· SLOs and error budgets

· Capacity modeling and performance testing

· Strong understanding of:

    · Distributed systems and high‑availability architectures

    · Linux/Windows system administration

    · Networking fundamentals (DNS, TCP/IP, load balancing)

· Hands-on experience with:

    · Monitoring and observability tools (e.g., Prometheus, Grafana, ELK/Elastic, Datadog, Azure Monitor)

    · Infrastructure as Code (Terraform, ARM, CloudFormation)

    · Scripting or programming languages (Python, Bash, Go, PowerShell, or similar)

· Experience supporting incident management and on‑call operations

Preferred Skills:

Experience with USAF Cloud One or Platform 1
Experience with Zero Trust Architecture
Cloud certifications in AWS, Azure, Google, or Oracle clouds

SES provides a competitive salary and the following benefits:

Medical
Dental
Vision
AD&D
STD
LTD
Company-paid Life Insurance
401(k) with employer contribution
Paid Time Off
Pet Insurance

Original job Reliability Engineer posted on GrabJobs ©.