ML System Design
expert
~55 hours
Distributed Training Dashboard with Profiling
Build a training orchestration tool that profiles GPU utilization, communication overhead, and Model FLOPs Utilization (MFU) across different parallelism strategies: data parallel, tensor parallel, and pipeline parallel.
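MFU compares achieved training throughput against the hardware's theoretical peak. A minimal sketch of the calculation, assuming a dense transformer where one training step (forward plus backward) costs roughly 6 FLOPs per parameter per token; the function name and signature are illustrative, not part of the spec:

```python
def estimate_mfu(n_params: float, tokens_per_step: float,
                 step_time_s: float, peak_flops_per_s: float) -> float:
    """Estimate Model FLOPs Utilization (MFU).

    Assumes a dense transformer, where a training step costs roughly
    6 * n_params FLOPs per token (forward + backward).
    """
    achieved_flops_per_s = 6 * n_params * tokens_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s

# Example: a 7B-parameter model processing 2M tokens every 12 s
# on 64 GPUs with ~312 TFLOP/s BF16 peak each.
mfu = estimate_mfu(7e9, 2e6, 12.0, 64 * 312e12)
```

With these illustrative numbers the result is an MFU of roughly 0.35, i.e. about 35% of the cluster's peak throughput.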
Skills Demonstrated
Distributed training
GPU profiling
Parallelism strategies
Performance optimization
Implementation Steps
- Implement data-parallel training with PyTorch DDP
- Add FSDP (Fully Sharded Data Parallel) as an alternative strategy
- Build profiling hooks for GPU utilization, memory usage, and communication time
- Calculate and display MFU (Model FLOPs Utilization) in real time
- Create comparison dashboard across parallelism strategies
- Implement gradient accumulation with micro-batch scheduling
- Add an automatic batch size finder with memory profiling
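The gradient accumulation step above reduces to a small scheduling calculation: each data-parallel rank runs several micro-batches per optimizer step until the global batch size is reached. A sketch under that assumption (the function name is hypothetical):

```python
def plan_accumulation(global_batch: int, micro_batch: int, world_size: int) -> int:
    """Return the number of gradient-accumulation steps per optimizer step.

    Each of the `world_size` data-parallel ranks processes `micro_batch`
    samples per forward/backward pass; gradients are accumulated until
    `global_batch` samples have been seen, then the optimizer steps.
    """
    samples_per_pass = micro_batch * world_size
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must divide evenly across ranks and micro-batches")
    return global_batch // samples_per_pass
```

For example, a global batch of 1024 with micro-batches of 4 on 8 ranks requires 32 accumulation steps per optimizer step.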
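One common shape for the batch size finder is a doubling search followed by a binary search, probing each candidate with a trial training step and treating an out-of-memory error as failure. A minimal sketch where `fits` stands in for that caller-supplied probe (all names here are illustrative):

```python
from typing import Callable

def find_max_batch_size(fits: Callable[[int], bool],
                        start: int = 1, limit: int = 4096) -> int:
    """Find the largest batch size for which fits(bs) returns True.

    `fits` should attempt one training step at the given batch size and
    report whether it completed without running out of memory. Doubles
    the candidate until failure, then binary-searches the remaining gap.
    Returns 0 if even `start` does not fit.
    """
    lo, bs = 0, start
    while bs <= limit and fits(bs):
        lo, bs = bs, bs * 2
    hi = min(bs, limit + 1)  # smallest known-failing (or out-of-range) size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

In the real tool, `fits` would wrap a forward/backward pass in a try/except for CUDA OOM and clear the allocator cache between probes; assuming memory use grows monotonically with batch size, the search needs only O(log n) probes.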
Interview Relevance
Why this project matters for interviews
Large-scale training infrastructure is a major bottleneck for AI progress. Understanding parallelism strategies, and being able to profile and compare them, is essential for infrastructure roles at labs such as Anthropic, Google, Meta, and xAI.