AI Training Infras Engineer

Salary: $15,000 - $18,000 monthly

Job Type: Full Time

Job Description - AI Training Infras Engineer

About Fragment Works and Moment AI

FRAGMENT WORKS PTE. LTD. is a Singapore-incorporated technology company operating the Moment AI platform at https://momentvideo.ai/.

Moment AI is a leading VC-backed video AI foundation model company developing advanced video AI technologies and business-to-business infrastructure for enterprise customers. The company focuses on building scalable, production-grade video AI systems that support real-world commercial applications across video generation, video understanding, model deployment, and model-sales operations.

Through API-based capabilities, Moment AI serves companies in the short-video, creator economy, advertising, media, and AI application sectors. Its platform is designed to help enterprises integrate video AI into their products and workflows efficiently, enabling automated content creation, intelligent video analysis, and next-generation AI-powered media experiences.

Moment AI was founded by a team of video industry veterans with deep operational and technical experience in scaling short-video platforms to millions of users. The founding team brings together battle-tested entrepreneurs, infrastructure builders, and AI researchers with hands-on experience across video platforms, creator ecosystems, high-concurrency video infrastructure, and multimodal AI.

About the Role

We are building our own foundation model for video generation, based on DiT and Flow Matching architectures. We are looking for a Training Infrastructure Engineer who can turn cutting-edge research code into a stable, scalable, and high-throughput training system running on large-scale GPU clusters.

This role is ideal for an engineer who enjoys solving deep systems problems at the intersection of distributed training, CUDA performance, video data pipelines, model training stability, and large-scale ML infrastructure. You will work closely with researchers and platform engineers to ensure that our video generation training stack can reliably produce results at the thousand-GPU scale.

Key Responsibilities

  • You will design, optimise, and maintain large-scale distributed training systems for video generation foundation models. This includes implementing and improving training strategies such as FSDP, tensor parallelism, context parallelism, and Ulysses-style sequence parallelism, with a strong focus on improving throughput, scaling efficiency, and MFU.
  • You will build and optimise PB-scale video data pipelines, including NVDEC-based video decoding, VAE latent caching, variable-resolution bucket sampling, and efficient data loading for high-throughput model training.
  • You will work on memory and performance optimisation across the training stack, including FlashAttention, FP8 mixed precision, Triton kernels, CUDA-aware profiling, activation checkpointing strategies, and communication-computation overlap.
  • You will also be responsible for training stability and reliability. This includes identifying the root causes of loss spikes, divergence, slow nodes, communication bottlenecks, checkpoint failures, and data-related instability, as well as designing mechanisms for fast checkpoint recovery and automatic exclusion of problematic nodes.

Requirements

  • The ideal candidate has strong hands-on experience with PyTorch distributed training and a solid understanding of CUDA architecture, GPU memory hierarchy, NCCL communication, and performance profiling.
  • You should have source-level familiarity with at least one major large-scale training framework, such as Megatron-LM, DeepSpeed, PyTorch FSDP, or TorchTitan, and be comfortable reading, modifying, and debugging framework internals.
  • You should have at least one year of practical experience training models on large GPU clusters of 256 GPUs or more, with proven experience in debugging distributed training failures and improving system-level training efficiency.
  • Strong candidates will be able to reason across the full training stack, from data ingestion and model parallelism to kernel-level optimisation and fault-tolerant training operations.

Preferred Qualifications

  • Experience with DiT, diffusion models, Flow Matching, or video generation models would be highly advantageous.
  • Experience processing large-scale video datasets, building video decoding pipelines, or working with VAE latent caching systems would be a strong plus.
  • Hands-on experience writing or optimising Triton, CUDA, or CUTLASS kernels would be valuable.
  • Familiarity with open-source video generation projects such as HunyuanVideo, Wan, CogVideoX, or similar systems at source-code level would also be beneficial.

What We Offer

You will work with real large-scale compute resources, including access to thousand-GPU-level training infrastructure. You will join a team that treats large-scale model training as a rigorous engineering discipline, not just a research experiment.

We provide an environment where engineers can work on technically meaningful infrastructure problems, collaborate closely with frontier model researchers, and contribute to open-source or publication work where appropriate and within applicable compliance boundaries.

Employment Practices

We are committed to fair and merit-based hiring. All candidates will be assessed based on job-related skills, experience, and ability to perform the role. We welcome applications from qualified candidates and do not discriminate on the basis of age, race, gender, religion, marital status, family responsibilities, disability, or other non-job-related characteristics.

Original job AI Training Infras Engineer posted on GrabJobs ©. To flag any issues with this job please use the Report Job button on GrabJobs.

Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.