GPU/ML Systems Engineer
Experience: 3–7 years | Hands-on GPU optimization required
Location: Office – Coimbatore/Bengaluru
About Aivar Innovations
Aivar is an AI-first technology partner where cutting-edge technology meets industry expertise to supercharge your projects. Our AI-augmented teams accelerate development, reduce time-to-market, and deliver exceptional code quality. We bring together the best minds in tech to craft scalable, repeatable solutions that drive real momentum for your business.
Technical Focus
The specialist who takes AI deployments from “it works” to “sub-second latency at 40% lower cost.” Own vLLM/Triton configurations, model quantization (INT8, FP16, 4-bit), tensor parallelism on multi-GPU instances, AWS Inferentia optimization, and performance benchmarking. Proven results: 40% cost reduction on Whisper ASR, 0.41 s time to first token (TTFT) on Llama 70B, 85% throughput gain on YOLO via Inferentia.
Functional Expectations
- Deploy and tune vLLM with multi-GPU tensor parallelism, dynamic batching, PagedAttention, and KV cache optimization for LLMs
- Configure NVIDIA Triton for production multi-model serving with custom backends and model ensembles
- Build TensorRT-LLM optimized model binaries for maximum throughput on L40S, A100, and H100 GPUs
- Implement AWS Inferentia deployments using Neuron SDK: model compilation, operator support, and performance tuning
- Run comprehensive load testing (Locust) to map performance cliffs, optimal concurrency, and scaling thresholds
- Execute model quantization (INT8, FP16, GPTQ, AWQ) with rigorous quality-accuracy tradeoff analysis
- Produce detailed benchmark reports with instance selection, scaling strategy, and cost-per-token recommendations
- Neuron: Experience optimizing models for custom accelerators such as AWS Inferentia and Trainium
Must-Have Technical Skills
- GPU-accelerated ML workloads in production (3+ years)
- LLM serving: vLLM, TensorRT-LLM, or Triton Inference Server (hands-on)
- GPU architecture: memory hierarchy, tensor cores, NVLink, NCCL multi-GPU communication
- Model quantization: INT8, FP16, mixed precision, GPTQ/AWQ
- CUDA ecosystem: drivers, cuDNN, NVIDIA container toolkit
- Performance engineering: profiling (Nsight, nvidia-smi, DCGM), bottleneck analysis, load testing
- AWS GPU instances: G-series (L40S), P-series (A100), instance selection methodology
Core Tech Stack
vLLM, NVIDIA Triton, TensorRT-LLM, KServe, CUDA/cuDNN/NCCL/DCGM, AWS Inferentia/Neuron SDK, GPTQ/AWQ/bitsandbytes, Locust, Nsight Systems, Prometheus + DCGM Exporter, AWS (EC2 GPU, EKS, Capacity Blocks)
Benefits
Why You’ll Love Working at Aivar
- Learn from Experts: Work directly with former AWS leaders and AI pioneers.
- Direct Ownership: Lead high-impact "greenfield" projects from concept to global launch.
- Modern Tech: Master the latest Generative AI frameworks and cloud-native architectures.
- Real-World Impact: Build mission-critical systems used by major global enterprises.
- Rapid Growth: Scale your career quickly in a high-speed
Diversity and Inclusion
Aivar Innovations is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to gender, gender identity, sexual orientation, religion, disability, age, marital status, caste, or any other protected characteristic, and we are committed to building a diverse, inclusive, and respectful workplace for everyone.