HPC Performance and Validation Engineer

Company: NorthMark Strategies

Job Type: Full Time

Dallas, United States

Job Description - HPC Performance and Validation Engineer

The Company

NorthMark Compute & Cloud (NMC²) is backed by dedicated leadership and investment, with a clear mission as it operates at the bleeding edge of technology. Its goal is to scale and enhance the high-performance computing (HPC) and cloud infrastructure that supports its clients' research, production, and delivery, enabling breakthroughs that shape the industries of tomorrow. Its engineers build critical infrastructure to eliminate friction in scientific research, simulations, analysis, and decision-making, accelerating discovery and driving faster innovation.

The Position

As an HPC Performance and Validation Engineer at NMC², you will take ownership of the validation and optimization of our HPC CPU and GPU calc farms. This critical role involves developing a validation and performance baselining framework that ensures system readiness for AI/ML and HPC workloads across multiple architectures. You will be essential in providing continuous performance benchmarking, real-time observability, and long-term strategic readiness. You will drive the implementation of advanced tooling and frameworks, maintaining an infrastructure that is crucial to our cutting-edge research efforts. You will be accountable for providing data-driven performance metrics to support architectural design choices as we continue to scale our datacenter footprint globally. We are looking for someone with deep technical expertise in compute, storage or networking optimization and performance engineering who can develop solutions that scale with our growing infrastructure. This role demands a forward-thinking engineer who can anticipate industry trends and adopt emerging architectures and strategies to keep NMC² at the forefront of innovation.

Responsibilities:

Architecting and implementing a validation framework to certify the readiness and utilization of GPU nodes across a large, distributed HPC environment.
Defining methodologies to continually assess performance and optimise infrastructure across AI/ML workloads
Developing and executing comprehensive performance testing using industry- and customer-specific benchmarks, ensuring optimal performance across HPC compute, storage and networking
Contributing to research reports that describe the benchmarking findings and evaluate overall hardware performance and efficiency
Leading efforts to identify, debug and resolve bottlenecks in system performance
Building robust, scalable tools for automated validation and testing, utilising Python, Go, Kubernetes and CI/CD pipelines to streamline continuous validation and benchmarking processes
Implementing monitoring solutions using Prometheus, Grafana and other modern monitoring technologies to track performance metrics and real-time health of the cluster
Defining and implementing best practices for continuous performance validation, ensuring that the infrastructure remains reliable and efficient as new technologies emerge
Staying informed on industry trends and advancements to ensure long-term strategic alignment
Working cross-functionally with engineering, infrastructure and research teams to align validation efforts with the broader business objectives, ensuring that the platform meets evolving research demands

Requirements:

Accelerator performance experience, including profiling and tuning with large-scale GPU clusters
In-depth understanding of NVIDIA ClusterKit, Nsight and Validation Suite, MLPerf and DCGM tools for GPUs and DPUs
Networking & Storage performance experience, including profiling and optimisation with NVIDIA ClusterKit, iPerf or equivalent across InfiniBand/RoCE network implementations
System benchmarking experience across Linux and familiarity with the Phoronix Test Suite or equivalent
Experience with HPC workloads across distributed global locations, bringing data-driven performance data to complement key architectural decisions
Strong proficiency in developing automation tools and microbenchmarking frameworks for validation using Python, Go, and Kubernetes in an Ubuntu Linux environment
Expertise with key monitoring platforms including OTEL, Prometheus, ELK and Grafana, and in defining and implementing the overall observability strategy for HPC validation and performance monitoring
A deep understanding of emerging technologies, architectures and strategies, with the ability to assess their potential impact on infrastructure and adopt them as part of a long-term plan
Proven ability to lead complex technical projects, influence decisions and engage with stakeholders across technical and research teams

Original job HPC Performance and Validation Engineer posted on GrabJobs.
