Software Engineer - Distributed Training Infrastructure

Company: Clockwork.io

Job Type: Full Time

Palo Alto, California

Job Description - Software Engineer - Distributed Training Infrastructure

Clockwork.io is a Silicon Valley startup that delivers state-of-the-art AI compute acceleration.

We were founded by Stanford researchers and veteran systems engineers with a shared belief: distributed systems powering modern AI require a new approach to managing time, reliability, and performance. Unlike traditional solutions that rely on specialized hardware or embedded telemetry in switches, Clockwork’s system brings deep visibility, resilience, acceleration, and efficiency to the network layer entirely through software. As AI workloads continue to scale in size, urgency, and impact, networks must evolve to keep up. Clockwork exists to make that evolution possible.

About Us

Clockwork.io – A Software-Driven Revolution in AI Networking

Clockwork Systems was founded by Stanford researchers and veteran systems engineers who share a vision for redefining the foundations of distributed computing. As AI workloads grow increasingly complex, traditional infrastructure struggles to meet the demands of performance, reliability, and precise coordination. Clockwork is pioneering a software-driven approach to AI networking, delivering deterministic time, ultra-low latency, and seamless scalability for modern distributed systems.

To learn more, visit www.clockwork.io.

About the Role

We are looking for an experienced software engineer to help build, optimize, and maintain large-scale distributed training infrastructure based on the PyTorch ecosystem. This role focuses on production-grade training workflows involving multi-GPU and multi-node orchestration, high-performance communication layers, and advanced parallelism strategies.

You’ll work alongside infrastructure and machine learning teams to ensure training jobs are efficient, scalable, and resilient.

What You Will Do

Develop and support distributed PyTorch training jobs using torch.distributed / c10d

Integrate and maintain frameworks like Megatron-LM, DeepSpeed, and related LLM training stacks

Diagnose and resolve distributed training issues (e.g., NCCL hangs, OOM, checkpoint corruption)

Optimize performance across communication, I/O, and memory bottlenecks

Implement fault tolerance, checkpointing, and recovery mechanisms for long-running jobs

Write tooling and scripts to streamline training workflows and experiment management

Collaborate with ML engineers to ensure compatibility with orchestration and container environments (e.g., Slurm, Kubernetes)

What We’re Looking For

Deep experience with PyTorch and torch.distributed (c10d)

Hands-on experience with at least one of: Megatron-LM, DeepSpeed, or FairScale

Proficiency in Python and Linux shell scripting

Experience with multi-node GPU clusters using Slurm, Kubernetes, or similar

Strong understanding of NCCL, collective communication, and GPU topology

Familiarity with debugging tools and techniques for distributed systems

Preferred Skills

Experience scaling LLM training across 8+ GPUs and multiple nodes

Knowledge of tensor, pipeline, and data parallelism

Familiarity with containerized training environments (Docker, Singularity)

Exposure to HPC environments or cloud GPU infrastructure

Experience with training workload orchestration tools or custom job launchers

Comfort with large-scale checkpointing, resume/restart logic, and model I/O

Bonus Skills

Profiling tools: PyTorch Profiler, Nsight, nvprof, or equivalent

Experience with performance tuning in distributed training environments

Contributions to ML infrastructure open-source projects

Familiarity with storage, networking, or RDMA/GPUDirect technologies

Understanding of observability in ML pipelines (metrics, logs, dashboards)

Enjoy

Challenging projects.

A friendly and inclusive workplace culture.

Competitive compensation.

A great benefits package.

Catered lunch.

Clockwork is assembling world-class teams to build cutting-edge software. We look for bright people from all walks of life, and we grow together. All qualified applicants will receive consideration for employment without regard to race, color, ancestry, religion, age, sex, sexual orientation, gender identity, national origin, or protected veteran status and will not be discriminated against on the basis of disability.

Original job Software Engineer - Distributed Training Infrastructure posted on GrabJobs ©.
