AI Training Infras Engineer

Salary: $15,000 - 18,000 monthly

Company: FRAGMENT WORKS PTE. LTD.

Job Type: Full Time

Location: 20 Anson Road, Twenty Anson, 079912

Job Description - AI Training Infras Engineer

About Fragment Works and Moment AI

FRAGMENT WORKS PTE. LTD. is a Singapore-incorporated technology company operating the Moment AI platform at https://momentvideo.ai/.

Moment AI is a leading VC-backed video AI foundation model company developing advanced video AI technologies and business-to-business infrastructure for enterprise customers. The company focuses on building scalable, production-grade video AI systems that support real-world commercial applications across video generation, video understanding, model deployment, and model-sales operations.

Through API-based capabilities, Moment AI serves companies in the short-video, creator economy, advertising, media, and AI application sectors. Its platform is designed to help enterprises integrate video AI into their products and workflows efficiently, enabling automated content creation, intelligent video analysis, and next-generation AI-powered media experiences.

Moment AI was founded by a team of video industry veterans with deep operational and technical experience in scaling short-video platforms to millions of users. The founding team brings together battle-tested entrepreneurs, infrastructure builders, and AI researchers with hands-on experience across video platforms, creator ecosystems, high-concurrency video infrastructure, and multimodal AI.

About the Role

We are building our own foundation model for video generation, based on DiT and Flow Matching architectures. We are looking for a Training Infrastructure Engineer who can turn cutting-edge research code into a stable, scalable, and high-throughput training system running on large-scale GPU clusters.

This role is ideal for an engineer who enjoys solving deep systems problems at the intersection of distributed training, CUDA performance, video data pipelines, model training stability, and large-scale ML infrastructure. You will work closely with researchers and platform engineers to ensure that our video generation training stack can reliably produce results at the thousand-GPU scale.

Key Responsibilities

You will design, optimise, and maintain large-scale distributed training systems for video generation foundation models. This includes implementing and improving training strategies such as FSDP, tensor parallelism, context parallelism, and Ulysses-style sequence parallelism, with a strong focus on improving throughput, scaling efficiency, and MFU.
You will build and optimise PB-scale video data pipelines, including NVDEC-based video decoding, VAE latent caching, variable-resolution bucket sampling, and efficient data loading for high-throughput model training.
You will work on memory and performance optimisation across the training stack, including FlashAttention, FP8 mixed precision, Triton kernels, CUDA-aware profiling, activation checkpointing strategies, and communication-computation overlap.
You will also be responsible for training stability and reliability. This includes identifying the root causes of loss spikes, divergence, slow nodes, communication bottlenecks, checkpoint failures, and data-related instability, as well as designing mechanisms for fast checkpoint recovery and automatic exclusion of problematic nodes.

Requirements

The ideal candidate has strong hands-on experience with PyTorch distributed training and a solid understanding of CUDA architecture, GPU memory hierarchy, NCCL communication, and performance profiling.
You should have source-level familiarity with at least one major large-scale training framework, such as Megatron-LM, DeepSpeed, PyTorch FSDP, or TorchTitan, and be comfortable reading, modifying, and debugging framework internals.
You should have at least one year of practical experience training models on large GPU clusters of 256 GPUs or more, with proven experience in debugging distributed training failures and improving system-level training efficiency.
Strong candidates will be able to reason across the full training stack, from data ingestion and model parallelism to kernel-level optimisation and fault-tolerant training operations.

Preferred Qualifications

Experience with DiT, diffusion models, Flow Matching, or video generation models would be highly advantageous.
Experience processing large-scale video datasets, building video decoding pipelines, or working with VAE latent caching systems would be a strong plus.
Hands-on experience writing or optimising Triton, CUDA, or CUTLASS kernels would be valuable.
Familiarity with open-source video generation projects such as HunyuanVideo, Wan, CogVideoX, or similar systems at source-code level would also be beneficial.

What We Offer

You will work with real large-scale compute resources, including access to thousand-GPU-level training infrastructure. You will join a team that treats large-scale model training as a rigorous engineering discipline, not just a research experiment.

We provide an environment where engineers can work on technically meaningful infrastructure problems, collaborate closely with frontier model researchers, and contribute to open-source or publication work where appropriate and within applicable compliance boundaries.

Employment Practices

We are committed to fair and merit-based hiring. All candidates will be assessed based on job-related skills, experience, and ability to perform the role. We welcome applications from qualified candidates and do not discriminate on the basis of age, race, gender, religion, marital status, family responsibilities, disability, or other non-job-related characteristics.

Original job AI Training Infras Engineer posted on GrabJobs.
