Key Responsibilities
- Develop analytical performance models for GPU kernels and inference workloads.
- Build and validate a simulator to estimate theoretical hardware performance limits.
- Compare measured kernel performance against architectural peak throughput.
- Identify performance bottlenecks in compute, memory, communication, and scheduling.
- Analyze GPU execution using NVIDIA Nsight Systems and Nsight Compute.
- Investigate PTX and SASS code generation to understand low-level execution behavior.
- Collaborate with researchers and engineers to optimize inference kernels for transformer-based models.
- Evaluate utilization of Tensor Cores, memory bandwidth, caches, and instruction pipelines.
- Design profiling methodologies for NVIDIA Hopper and Blackwell GPU architectures.
- Document findings and provide actionable recommendations for performance improvements.
Academic Qualifications
Preferred Qualifications
- Experience with CUDA programming and GPU kernel development.
- Understanding of NVIDIA GPU architecture and memory hierarchy.
- Familiarity with performance profiling tools such as Nsight Systems and Nsight Compute.
- Knowledge of PTX, SASS, and low-level GPU execution.
- Experience optimizing CUDA kernels for throughput and latency.
- Understanding of roofline analysis, performance modeling, and hardware utilization metrics.
- Experience with deep learning frameworks such as PyTorch or TensorFlow.
- Strong programming skills in C++, CUDA, and Python.
Desired Skills
- A performance engineering mindset.
- Strong analytical and debugging abilities.
- Interest in AI systems, inference optimization, and hardware-software co-design.
- Ability to work independently on research and engineering challenges.
- Excellent written and verbal communication skills.