Principal Software Engineer, At-Scale Reliability and Fleet Intelligence CSP Engagements

Salary: $272,000 - $431,250 yearly

Company: NVIDIA

Job Type: Full Time

United States


Job Description - Principal Software Engineer, At-Scale Reliability and Fleet Intelligence CSP Engagements

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for fleet-scale reliability, working directly with the engineering teams of key CSP / hyperscale customers to ensure NVIDIA platforms achieve their target MTBI (Mean Time Between Interruptions) in production. In this role, you will augment NVIDIA's internal software/firmware and quality teams with a dedicated CSP-facing focus. You will drive work streams with CSP engineering teams to build a shared understanding of reliability software/firmware architecture and methodology, incorporate their fleet telemetry and failure data into NVIDIA's improvement priorities, and validate that reliability improvements measured in the lab translate to real customer environments. Your cross-CSP visibility will let you distinguish systemic architectural gaps from environmental or configuration-specific issues, a distinction that no single customer engagement could make alone.

What you'll be doing:

Drive reliability work streams with CSP engineering teams — ensuring shared understanding of MTBI measurement methodology, failure classification, and health monitoring architecture
Gather and synthesize CSP fleet reliability data — identify failure patterns that appear across multiple customers and champion improvements back into NVIDIA's firmware, driver, and hardware teams
Define consistent MTBI measurement methodology that works across different CSP monitoring environments and operational practices
Conduct fleet-scale failure pattern analysis using statistical methods (Pareto, survival analysis, Weibull) to classify failures as systemic, environmental, or configuration-specific
Drive fleet health monitoring integration architecture — ensure NVIDIA's health agents, telemetry, and reporting align with CSP operational workflows and automation
Define burn-in reliability test environment and cluster certification criteria in collaboration with quality teams, validating with customers that criteria are meaningful
Collaborate with CSPs to ensure reliability-related integration work (health monitoring deployment, telemetry pipeline, alerting configuration) is complete ahead of at-scale launch
Develop predictive failure models using fleet telemetry and validate their effectiveness in customer environments

What we need to see:

15+ years of experience in systems software at datacenter scale, or reliability engineering with focus on at-scale challenges.
BS or MS in Computer Science, Electrical Engineering, Statistics, or related field (or equivalent experience)
Deep expertise in multi-NUMA, rack-scale system software and firmware
Statistical failure analysis methods: MTBF/MTBI calculation, Pareto analysis, root cause classification
Experience with fleet-level telemetry and observability systems: time-series databases, anomaly detection, health scoring, event correlation
Understanding of hardware failure modes in large-scale GPU/accelerator deployments — ability to classify and prioritize across compute, interconnect, memory, power, and thermal domains
Experience defining or operating burn-in, stress testing, or certification frameworks for complex hardware systems
Familiarity with predictive maintenance or anomaly detection approaches applied to fleet health data
Customer obsession: genuine passion for understanding fleet reliability challenges at scale and translating them into actionable engineering priorities
Strong communication: ability to present statistical reliability findings to both deep technical audiences and executive leadership
Demonstrated success driving cross-functional improvements across hardware, firmware, and software teams without direct authority

Ways to stand out from the crowd:

Experience in fleet reliability or hardware health at a leading CSP or hyperscaler
Familiarity with NVIDIA GPU error taxonomy (Xid errors, NVLink error counters, thermal events, CPER records)
Experience building health scoring or predictive failure models for accelerator or HPC infrastructure
Background in defining MTBI/MTBF measurement standards or certification programs for complex multi-component systems
Understanding of how reliability data flows from device firmware through telemetry pipelines to fleet-level dashboards and automated remediation

NVIDIA is leading the way in groundbreaking developments in Artificial Intelligence, High-Performance Computing and Visualization. The GPU, our invention, serves as the visual cortex of modern computers and is at the heart of our products and services. We have some of the most forward-thinking and hardworking people on the planet working for us. If you're creative, hardworking and self-motivated, we want to hear from you!

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 30, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering an inclusive work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

Original job posting "Principal Software Engineer, At-Scale Reliability and Fleet Intelligence CSP Engagements" posted on GrabJobs.
