ML System Design
expert
~55 hours
Distributed Training Dashboard with Profiling
Build a training orchestration tool that profiles GPU utilization, communication overhead, and Model FLOPs Utilization (MFU) across different parallelism strategies: data parallel, tensor parallel, and pipeline parallel.
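MFU compares achieved training throughput against the hardware's theoretical peak. A minimal sketch of the calculation, assuming a dense transformer where one training step (forward plus backward) costs roughly 6 FLOPs per parameter per token; the function name and signature are illustrative, not part of the spec:

```python
def estimate_mfu(n_params: float, tokens_per_step: float,
                 step_time_s: float, peak_flops_per_s: float) -> float:
    """Estimate Model FLOPs Utilization (MFU).

    Assumes a dense transformer, where a training step costs roughly
    6 * n_params FLOPs per token (forward + backward).
    """
    achieved_flops_per_s = 6 * n_params * tokens_per_step / step_time_s
    return achieved_flops_per_s / peak_flops_per_s

# Example: a 7B-parameter model processing 2M tokens every 12 s
# on 64 GPUs with ~312 TFLOP/s BF16 peak each.
mfu = estimate_mfu(7e9, 2e6, 12.0, 64 * 312e12)
```

With these illustrative numbers the result is an MFU of roughly 0.35, i.e. about 35% of the cluster's peak throughput.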
Skills Demonstrated
Distributed training
GPU profiling
Parallelism strategies
Performance optimization
Implementation Steps
- Implement data-parallel training with PyTorch DDP
- Add FSDP (Fully Sharded Data Parallel) as an alternative strategy
- Build profiling hooks for GPU utilization, memory usage, and communication time
- Calculate and display MFU (Model FLOPs Utilization) in real time
- Create comparison dashboard across parallelism strategies
- Implement gradient accumulation with micro-batch scheduling
- Add an automatic batch size finder with memory profiling
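The gradient accumulation step above reduces to a small scheduling calculation: each data-parallel rank runs several micro-batches per optimizer step until the global batch size is reached. A sketch under that assumption (the function name is hypothetical):

```python
def plan_accumulation(global_batch: int, micro_batch: int, world_size: int) -> int:
    """Return the number of gradient-accumulation steps per optimizer step.

    Each of the `world_size` data-parallel ranks processes `micro_batch`
    samples per forward/backward pass; gradients are accumulated until
    `global_batch` samples have been seen, then the optimizer steps.
    """
    samples_per_pass = micro_batch * world_size
    if global_batch % samples_per_pass != 0:
        raise ValueError("global batch must divide evenly across ranks and micro-batches")
    return global_batch // samples_per_pass
```

For example, a global batch of 1024 with micro-batches of 4 on 8 ranks requires 32 accumulation steps per optimizer step.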
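One common shape for the batch size finder is a doubling search followed by a binary search, probing each candidate with a trial training step and treating an out-of-memory error as failure. A minimal sketch where `fits` stands in for that caller-supplied probe (all names here are illustrative):

```python
from typing import Callable

def find_max_batch_size(fits: Callable[[int], bool],
                        start: int = 1, limit: int = 4096) -> int:
    """Find the largest batch size for which fits(bs) returns True.

    `fits` should attempt one training step at the given batch size and
    report whether it completed without running out of memory. Doubles
    the candidate until failure, then binary-searches the remaining gap.
    Returns 0 if even `start` does not fit.
    """
    lo, bs = 0, start
    while bs <= limit and fits(bs):
        lo, bs = bs, bs * 2
    hi = min(bs, limit + 1)  # smallest known-failing (or out-of-range) size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

In the real tool, `fits` would wrap a forward/backward pass in a try/except for CUDA OOM and clear the allocator cache between probes; assuming memory use grows monotonically with batch size, the search needs only O(log n) probes.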
Interview Relevance
Why this project matters for interviews
Large-scale training infrastructure is a major bottleneck for AI progress. Understanding parallelism strategies, and being able to profile and compare them, is essential for infrastructure roles at labs such as Anthropic, Google, Meta, and xAI.