Key Responsibilities
- Design, implement, and maintain scalable video data pipelines to support large-scale training.
- Develop data preprocessing, transformation, and synthesis workflows to support world model training.
- Contribute to building high-quality data annotation pipelines to ensure accurate and consistent labels across large-scale datasets.
- Support the training of multimodal foundation models (e.g., video diffusion models, world models) by developing and optimizing distributed training systems.
- Improve inference and serving efficiency for real-time interaction through model optimization and system tuning.
- Monitor system health and performance, and contribute to debugging and optimization at scale.
- Work closely with research teams to understand experimental goals and translate ideas into reliable and maintainable infrastructure and tools.
- Integrate novel research prototypes into production-ready systems and ensure reproducibility at scale.
- Participate in design and code reviews, ensuring code quality, efficiency, and compliance with best practices.
- Contribute to the development of tools and infrastructure to evaluate model performance using rigorous quantitative benchmarks, including metrics for physical accuracy and controllability.
- Maintain and extend shared codebases, contribute to internal documentation, and support onboarding of new team members or collaborators.
- Write clean, efficient, and well-tested code for components across the model development lifecycle.
- Contribute to research papers and demos where engineering work plays a significant role.
- Help represent the team’s engineering excellence in internal and external forums when appropriate.
Academic Qualifications
- MSc or PhD in Machine Learning or Computer Science, or equivalent industry experience.
Professional Experience - Required
- Proficient in data collection, cleaning, and transformation at scale, including designing robust pipelines for multimodal datasets (e.g., video, audio, text).
- Practical experience with web scraping and crawling frameworks (e.g., Scrapy, Selenium, Playwright, BeautifulSoup) to collect and curate high-quality web-scale datasets.
- Experience with large-scale model training (e.g., LLMs or diffusion models) on large clusters.
- Hands-on experience with state-of-the-art video generative models (e.g., Sora, Veo 2, Movie Gen, CogVideoX).
- Experience building and optimizing large-scale video data pipelines.
- Experience accelerating diffusion model inference.
- Exceptional problem-solving and troubleshooting skills to tackle complex technical challenges.
- Strong systems and engineering expertise in deep learning frameworks such as PyTorch.
- Strong communication and collaboration skills for effective cross-functional teamwork.
- Demonstrated ability to solve complex system-level challenges and debug failures across the training/inference stack (e.g., memory issues, deadlocks, I/O bottlenecks).