Logo-of-Black-Forest-Labs-hiring-for-jobs-in-Deutschland-on-GrabJobs

Member of Technical Staff - Training Cluster Engineer

icon building Unternehmen : Black Forest Labs
icon briefcase Auftragstyp : Vollzeit

Anzahl der Bewerber

 : 

000+

Click to reveal the number of candidates who applied for this job.
icon loader
icon loader

Let AI Supercharge Your Job Hunt!

JobCopilot scans 500,000+ company career sites daily to find jobs for you

Never miss an opportunity Save hours by auto-filling applications forms Land more interviews with tailored applications
happy man
thunder iconActivate JobCopilot

Arbeitsbeschreibung - Member of Technical Staff - Training Cluster Engineer

Black Forest Labs is a cutting-edge startup pioneering generative image and video models. Our team, which invented Stable Diffusion, Stable Video Diffusion, and FLUX.1, is currently looking for a strong candidate to join us in developing and maintaining our large GPU training clusters.


Role & Responsibilities



  • Design, deploy, and maintain large-scale ML training clusters running SLURM for distributed workload orchestration

  • Implement comprehensive node health monitoring systems with automated failure detection and recovery workflows

  • Partner with cloud and colocation providers to ensure cluster availability and performance

  • Establish and enforce security best practices across the ML infrastructure stack (network, storage, compute)

  • Build and maintain developer-facing tools and APIs that streamline ML workflows and improve researcher productivity

  • Collaborate directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning


Required Experience



  • Production experience managing SLURM clusters at scale, including job scheduling policies, resource allocation, and federation

  • Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments

  • Proven track record managingGPU clusters, including driver management and DCGM monitoring


Preferred Qualifications



  • Understanding of distributed training patterns, checkpointing strategies, and data pipeline optimization

  • Experience with Kubernetes for containerized workloads, particularly for inference or mixed compute environments

  • Experience with high-performance interconnects (InfiniBand, RoCE) and NCCL optimization for multi-node training

  • Track record of managing 1000+ GPU training runs, with deep understanding of failure modes and recovery patterns

  • Familiarity with high-performance storage solutions (VAST, blob storage) and their performance characteristics for ML workloads

  • Experience running hybrid training/inference infrastructure with appropriate resource isolation

  • Strong scripting skills (Python, Bash) and infrastructure-as-code experience

Original job Member of Technical Staff - Training Cluster Engineer posted on GrabJobs ©. To flag any issues with this job please use the Report Job button on GrabJobs.
Share Job
Share Job

Auto-Apply to Technical Staff Jobs with your AI JobCopilot

thunder icon Auto-Apply with AI

Similar Technical Staff Jobs in Germany

GrabJobs ist das führende Jobportal in Germany und verbindet Sie schnell mit Tausenden von -Jobs! Finden Sie die besten -Jobs in Germany, bewerben Sie sich mit einem Klick und sichern Sie sich noch heute einen Job!

Mobile Apps

Copyright © 2026 Grabjobs Pte.Ltd. All Rights Reserved.