Responsibilities
Build and refine datasets for video understanding and multimodal reasoning, including temporal QA, action recognition, event prediction, and spatial understanding.
Evaluate video-language models (Video-LLMs) and audio-visual datasets, including those derived from large-scale sources such as HowTo100M.
Conduct experiments on long-context modeling efficiency, compression strategies, and data optimization techniques.
Contribute to benchmark standardization efforts and assist in setting up public leaderboards for evaluation and comparison.
Qualifications
Strong background in computer vision, video analytics, or multimodal learning.
Proficiency in building and managing video data processing pipelines.
Understanding of transformer-based temporal models (e.g., TimeSformer, VideoGPT).
(Preferred) Experience with video-QA, action recognition, or multimodal reasoning datasets.
(Preferred) Relevant publications in top-tier conferences.