Staff Observability Platform Engineer

Company: Nscale
Job Type: Full Time

Job Description - Staff Observability Platform Engineer

About Nscale


Nscale is the GPU cloud engineered for AI. We provide cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers. Nscale simplifies AI development while enabling superior results, supporting strategic business outcomes such as cost management, rapid innovation, and environmental responsibility.


We thrive on a culture of relentless innovation, ownership, and accountability, where every team member takes pride in their work and drives it with excellence and urgency. As an Nscaler, you'll build trust through openness and transparency while contributing to the technology that powers the future.


About the Role


As a Staff Observability Platform Engineer, you'll play a critical role in building and evolving Nscale's observability platform, enabling deep visibility into GPU clusters, AI workloads, and the infrastructure that powers them.


You view observability as a product, not simply a collection of tools. You'll help define and implement scalable, reliable observability solutions that empower engineering teams to understand system behavior, diagnose issues quickly, and operate complex distributed systems with confidence.


You'll combine technical leadership with hands-on engineering, partnering across SRE, infrastructure, platform, and AI/ML teams to improve reliability, operational efficiency, and developer experience. You'll influence architectural decisions, establish engineering best practices, and help drive the evolution of observability capabilities across the organization.


This is a role for someone who enjoys solving difficult infrastructure problems, building platforms that scale, and helping engineering teams succeed through better visibility and operational insight.


What You'll Do

  • Design, build, and evolve observability platforms across metrics, logs, traces, alerting, and telemetry pipelines.
  • Lead the implementation of scalable observability solutions that support Nscale's growing GPU and AI infrastructure.
  • Partner with SRE, infrastructure, platform, and AI/ML teams to ensure observability is embedded throughout the software and infrastructure lifecycle.
  • Drive improvements in monitoring coverage, alert quality, service health visibility, and incident response effectiveness.
  • Develop standards, frameworks, and reusable patterns that simplify observability adoption across engineering teams.
  • Identify reliability risks and operational blind spots, helping teams proactively address them before they impact customers.
  • Contribute to architectural decisions around telemetry collection, storage, retention, cardinality management, and performance optimization.
  • Lead technical initiatives and projects that improve platform scalability, reliability, and operational efficiency.
  • Mentor engineers and provide technical guidance through design reviews, code reviews, and knowledge sharing.
  • Participate in incident investigations and postmortems, translating operational learnings into durable platform improvements.
  • Evaluate new observability technologies and practices, balancing innovation with operational simplicity and long-term maintainability.




About You

  • 6+ years of experience in SRE, platform engineering, infrastructure engineering, observability engineering, or related disciplines.
  • Strong experience building and operating observability platforms in cloud-native, distributed environments.
  • Deep hands-on experience with several of the following technologies: Prometheus, Thanos, VictoriaMetrics, Grafana, Loki, Tempo, OpenTelemetry, ClickHouse, Elastic, or similar platforms.
  • Strong software engineering skills with proficiency in Go, Python, or equivalent languages.
  • Experience operating and troubleshooting Kubernetes-based platforms at scale.
  • Strong understanding of monitoring, logging, tracing, telemetry pipelines, and modern observability practices.
  • Experience designing systems with scalability, reliability, performance, and operational simplicity in mind.
  • Proficiency with Infrastructure-as-Code tools such as Terraform, Ansible, or equivalent.
  • Ability to lead technical initiatives and influence engineering decisions across multiple teams.
  • Excellent communication skills with the ability to explain technical tradeoffs and align stakeholders around pragmatic solutions.




Preferred

  • Experience operating observability systems in GPU, AI/ML, HPC, or large-scale compute environments.
  • Familiarity with Slurm, Kubernetes GPU scheduling, or AI infrastructure platforms.
  • Experience with high-volume telemetry pipelines and streaming technologies such as Kafka, Vector, or Fluent Bit.
  • Knowledge of observability challenges related to model training, inference workloads, GPU utilization, and distributed AI systems.
  • Experience mentoring engineers and helping grow technical capability across teams.




Equal Opportunities Statement


We strongly encourage applications from people of color, the LGBTQ+ community, people with disabilities, neurodivergent individuals, parents, carers, and people from lower socio-economic backgrounds.


If there's anything we can do to accommodate your specific situation, please let us know.


Note: Responsibilities outlined are not exhaustive and may evolve as business needs change.

For information on how Nscale handles candidate personal data, please see our Employee & Candidate Privacy Notice: Here.

Original job Staff Observability Platform Engineer posted on GrabJobs ©. To flag any issues with this job please use the Report Job button on GrabJobs.
