Staff Observability Platform Engineer

Company: Nscale

Job Type: Full Time

United States

Job Description - Staff Observability Platform Engineer

About Nscale

Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale simplifies AI development while enabling superior results, supporting strategic business outcomes such as cost management, rapid innovation, and environmental responsibility.

We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency while contributing to the technology that powers the future.

About the Role

As a Staff Observability Platform Engineer, you'll play a critical role in building and evolving Nscale's observability platform, enabling deep visibility into GPU clusters, AI workloads, and the infrastructure that powers them.

You view observability as a product, not simply a collection of tools. You'll help define and implement scalable, reliable observability solutions that empower engineering teams to understand system behavior, diagnose issues quickly, and operate complex distributed systems with confidence.

You'll combine technical leadership with hands-on engineering, partnering across SRE, infrastructure, platform, and AI/ML teams to improve reliability, operational efficiency, and developer experience. You'll influence architectural decisions, establish engineering best practices, and help drive the evolution of observability capabilities across the organization.

This is a role for someone who enjoys solving difficult infrastructure problems, building platforms that scale, and helping engineering teams succeed through better visibility and operational insight.

What You'll Do

Design, build, and evolve observability platforms across metrics, logs, traces, alerting, and telemetry pipelines.

Lead the implementation of scalable observability solutions that support Nscale's growing GPU and AI infrastructure.

Partner with SRE, infrastructure, platform, and AI/ML teams to ensure observability is embedded throughout the software and infrastructure lifecycle.

Drive improvements in monitoring coverage, alert quality, service health visibility, and incident response effectiveness.

Develop standards, frameworks, and reusable patterns that simplify observability adoption across engineering teams.

Identify reliability risks and operational blind spots, helping teams proactively address them before they impact customers.

Contribute to architectural decisions around telemetry collection, storage, retention, cardinality management, and performance optimization.

Lead technical initiatives and projects that improve platform scalability, reliability, and operational efficiency.

Mentor engineers and provide technical guidance through design reviews, code reviews, and knowledge sharing.

Participate in incident investigations and postmortems, translating operational learnings into durable platform improvements.

Evaluate new observability technologies and practices, balancing innovation with operational simplicity and long-term maintainability.

About You

6+ years of experience in SRE, platform engineering, infrastructure engineering, observability engineering, or related disciplines.

Strong experience building and operating observability platforms in cloud-native, distributed environments.

Deep hands-on experience with several of the following technologies: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic, or similar platforms.

Strong software engineering skills with proficiency in Go, Python, or equivalent languages.

Experience operating and troubleshooting Kubernetes-based platforms at scale.

Strong understanding of monitoring, logging, tracing, telemetry pipelines, and modern observability practices.

Experience designing systems with scalability, reliability, performance, and operational simplicity in mind.

Proficiency with Infrastructure-as-Code tools such as Terraform, Ansible, or equivalent.

Ability to lead technical initiatives and influence engineering decisions across multiple teams.

Excellent communication skills with the ability to explain technical tradeoffs and align stakeholders around pragmatic solutions.

Preferred

Experience operating observability systems in GPU, AI/ML, HPC, or large-scale compute environments.

Familiarity with Slurm, Kubernetes GPU scheduling, or AI infrastructure platforms.

Experience with high-volume telemetry pipelines and streaming technologies such as Kafka, Vector, or Fluent Bit.

Knowledge of observability challenges related to model training, inference workloads, GPU utilization, and distributed AI systems.

Experience mentoring engineers and helping grow technical capability across teams.

Equal Opportunities Statement

We strongly encourage applications from people of color, the LGBTQ+ community, people with disabilities, neurodivergent individuals, parents, carers, and people from lower socio-economic backgrounds.

If there's anything we can do to accommodate your specific situation, please let us know.

Note: Responsibilities outlined are not exhaustive and may evolve as business needs change.

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice.

Original job posting: Staff Observability Platform Engineer, posted on GrabJobs.