Machine Learning Engineer (Distributed Training)

Company: Cloudwalk
Job Type: Full-Time

Job Description - Machine Learning Engineer (Distributed Training)

Who we are:


Who We’re Looking For:

We’re looking for a Machine Learning Engineer to own and evolve our distributed training pipeline for large language models. You’ll work inside our GPU cluster to help researchers train and scale foundation models using frameworks like Hugging Face Transformers, Accelerate, DeepSpeed, FSDP, and others. Your focus will be distributed training: from designing sharding strategies and multi-node orchestration to optimizing throughput and managing checkpoints at scale.

This role is not research - it's about building and scaling the systems that let researchers move fast and models grow big. You’ll work closely with MLOps, infra, and model developers to make our training runs efficient, resilient, and reproducible.

What You'll Do:

    • Own the architecture and maintenance of our distributed training pipeline;
    • Train LLMs using tools like DeepSpeed, FSDP, and Hugging Face Accelerate;
    • Design and debug multi-node/multi-GPU training runs (Kubernetes-based);
    • Optimize training performance: memory usage, speed, throughput, and cost;
    • Help manage experiment tracking, artifact storage, and resume logic;
    • Build reusable, scalable training templates for internal use;
    • Collaborate with researchers to bring their training scripts into production shape.

What We’re Looking For:

    • Expertise in distributed training: Experience with DeepSpeed, FSDP, or Hugging Face Accelerate in real-world multi-GPU or multi-node setups;
    • Strong PyTorch background: Comfortable writing custom training loops, schedulers, or callbacks;
    • Hugging Face stack experience: Transformers, Datasets, Accelerate - you know the ecosystem and how to bend it;
    • Infra literacy: You understand how GPUs, containers, and job schedulers work together. You can debug cluster issues, memory bottlenecks, or unexpected slowdowns;
    • Resilience mindset: You write code that can checkpoint, resume, log correctly, and keep running when things go wrong;
    • Collaborative builder: You don’t mind digging into other people’s scripts, making them robust, and helping everyone train faster.

Bonus Points:

    • Experience with Kubernetes-based GPU clusters and Ray;
    • Experience with experiment tracking (MLflow, W&B);
    • Familiarity with mixed precision, ZeRO stages, model parallelism;
    • Comfort with CLI tooling, profiling, logging, and telemetry;
    • Experience with dataloading bottlenecks and dataset streaming.

How We Hire:

    • Online assessment: technical logic and fundamentals (Math/Calculus, Statistics, Probability, Machine Learning/Deep Learning, Code)
    • Technical interview: deep dive into distributed training theory and reasoning (no code)
    • Cultural interview

If you are not willing to take an online quiz, do not apply.

If you’ve trained LLMs before - or helped others do it better - this role is for you. Even if you don’t check every box, if you’re confident working with distributed compute and real-world LLM workloads, we want to hear from you.
Original job Machine Learning Engineer (Distributed Training) posted on GrabJobs.

Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.