Responsibilities
Design workflows to curate high-quality image editing and generation datasets for controllable diffusion and instruction tuning.
Evaluate vision-language models on image understanding, caption alignment, and editing precision.
Assist in the training and evaluation of diffusion models or reward models.
Explore visual reasoning datasets that bridge images and text prompts.
Qualifications
Strong background in computer science, data engineering, artificial intelligence, or related fields, with hands-on experience in large-scale vision data systems.
1+ years of experience in computer vision or multimodal machine learning, with tools and models such as PyTorch, Diffusers, CLIP, or BLIP.
Solid understanding of image-text alignment and latent-space editing.
(Preferred) Familiarity with aesthetic models, diffusion-based editing, vision-language models (VLMs), or visual question answering (VQA) tasks.
(Preferred) Relevant publications at top conferences.