Member of Technical Staff, Model Optimization and Inference (New Grad)

Salary: $200,000 - $300,000 yearly

Company: Nuance Labs, Inc.

Job Type: Full Time

Seattle, United States

About Nuance Labs

Nuance Labs is building photorealistic, real-time AI avatars with emotional intelligence: a full-duplex audiovisual system that can listen, speak, react, interrupt, and respond like a real person.

We're a Series A company ($60M raised) backed by Lightspeed, Accel, South Park Commons, NVentures, and Define Ventures. Our team includes PhDs from MIT, UW, Oxford, CMU, and Johns Hopkins, with industry experience from Apple, Meta, Amazon AGI, and Discord. The team is small, the work is real, and the problems are unsolved.

How Nuance Differentiates

Most conversational AI avatars today are hacks: a face slapped on a speech-to-speech pipeline and stuck in the uncanny valley (emotionless, mechanical, one-turn-at-a-time). Current systems take 2–5 seconds to respond; natural conversation requires sub-500ms. That's a 4x to 10x improvement, and it demands rethinking the entire stack.

That rethinking starts with full-duplex: an AI that listens and speaks simultaneously, perceives emotion in real time, and responds with a face that actually reflects it. It's an extremely hard problem, and we're developing foundation models designed for it from the ground up.

About the Role

We can train a great model. The next problem is making it fast enough to actually use in a real-time conversation — and that gap is enormous. A model that responds in 3 seconds is a demo. A model that responds in under 500ms is a product.

We’re looking for someone who’s excited about taking trained models and squeezing every last millisecond out of them. You understand — or want to deeply understand — the full stack from model weights to serving infrastructure: quantization, KV cache optimization, kernel-level acceleration, batching strategies. You’ve worked with vLLM, SGLang, or similar frameworks (through coursework, research, internships, or open-source) and have opinions about where they fall short.

This posting is aimed at early-career engineers finishing or recently finished with a BS, MS, or PhD. We don’t require a PhD — we care about systems intuition, engineering chops, and the appetite to go deep.

Our stack is more complex than a standard LLM deployment: we’re serving a full-duplex multimodal system that must satisfy strict real-time latency constraints. There’s a lot of unsolved optimization work here, and we want someone who finds that genuinely exciting and is ready to grow fast alongside people who’ve built these systems before.

What You’ll Do

Contribute to end-to-end inference optimization across our model stack — LLMs, audio models, and diffusion-based components

Implement and tune KV cache strategies for long-context conversations, including eviction policies, compression, and memory-efficient attention

Work with inference serving frameworks (vLLM, SGLang, TensorRT-LLM, etc.) and extend them for our specific workloads

Profile and benchmark end-to-end latency and throughput; identify and systematically eliminate bottlenecks

Build internal tooling that makes optimization work faster and more rigorous — profiling viewers, end-to-end inference test harnesses, and other infrastructure that helps the team move quickly

Accelerate diffusion model inference — consistency models, step distillation, caching strategies, and custom kernel optimizations

Apply quantization techniques (INT8, INT4, GPTQ, AWQ, and beyond) to reduce memory footprint and increase throughput without meaningfully degrading quality

Work closely with research and infrastructure to ensure new models ship with optimized serving from day one

What We’re Looking For

BS, MS, or PhD in CS, ML, or a related field — completed or in the final stretch

Strong fundamentals in LLM inference or ML systems — KV caching, memory layout, attention kernels, batching, or serving — picked up through coursework, research, internships, or open-source. You don’t need to have shipped at production scale yet; you do need to learn fast and go deep.

Exposure to inference serving frameworks (vLLM, SGLang, TensorRT-LLM, or similar) — even at a research or hobby level

Strong Python and PyTorch skills; familiarity with CUDA or Triton is a significant plus

A systematic approach to profiling and optimization — you measure first, then optimize

Curiosity about diffusion inference, speculative decoding, quantization, or other inference-time acceleration techniques

Bonus Points

Internship or research experience with LLM inference, ML systems, or model serving

Contributions to open-source inference frameworks (vLLM, SGLang, TensorRT-LLM, etc.)

CUDA / Triton kernel work, even at a research or hobby scale

Publications or research projects in MLSys, model compression, or inference optimization

Familiarity with multimodal or streaming inference architectures

Experience with hard latency SLAs in any real-time system

Compensation

$200,000 – $300,000 base salary, plus meaningful equity. We think long-term ownership matters and structure equity accordingly.

Logistics

Location: In-person in Seattle, five days a week — we believe in the compounding value of working shoulder-to-shoulder.

Visa sponsorship: We sponsor visas (O-1, H-1B, green card) from day one.

AI-native tooling: Do your best work with the best tools, including unlimited tokens.

Benefits

Health: HSA plan with ~$2,000 in annual company contributions — roughly 2x what most big tech companies put in.

Time off: 15 days of PTO plus public holidays, and we close the office for a full week at year-end.

Food: Lunch, drinks, and snacks on us every workday — the small thing that quietly makes the day better.

Commuter benefits: We help cover the cost of getting to the office.

401(k): In the works.

Nuance Labs is an equal opportunity employer. We believe diverse teams build better AI.
