ML System Design · Expert · ~55 hours
Distributed Training Dashboard with Profiling
Build a training orchestration tool that profiles GPU utilization, communication overhead, and MFU across different parallelism strategies (data parallel, tensor parallel, pipeline parallel).

Skills Demonstrated

Distributed training · GPU profiling · Parallelism strategies · Performance optimization

Implementation Steps

  1. Implement data-parallel training with PyTorch DDP
  2. Add FSDP (Fully Sharded Data Parallel) as alternative strategy
  3. Build profiling hooks: GPU utilization, memory, communication time
  4. Calculate and display MFU (Model FLOPs Utilization) in real-time
  5. Create comparison dashboard across parallelism strategies
  6. Implement gradient accumulation with micro-batch scheduling
  7. Add automatic batch size finder with memory profiling
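Step 1 can be sketched as a minimal DDP training step. This is an illustrative single-process version, assuming PyTorch is installed: it uses the CPU-friendly `gloo` backend with `world_size=1` so it runs anywhere, whereas a real multi-GPU run would be launched via `torchrun` with NCCL.

```python
# Minimal sketch of data-parallel training with PyTorch DDP (step 1).
# Single-process, gloo backend, CPU only -- for illustration; real runs
# use torchrun + NCCL across GPUs.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def ddp_train_step():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = DDP(torch.nn.Linear(16, 4))  # gradients are all-reduced across ranks during backward()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    x = torch.randn(8, 16)
    loss = model(x).pow(2).mean()
    loss.backward()   # DDP overlaps the gradient all-reduce with backward
    opt.step()

    dist.destroy_process_group()
    return loss.item()
```

With `torchrun --nproc_per_node=N`, the same code runs N ranks, each seeing a different shard of the data; DDP keeps replicas in sync by averaging gradients.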
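For step 3, `torch.profiler` is one way to build the profiling hooks. A minimal CPU-only sketch (on GPU you would add `ProfilerActivity.CUDA`, and communication kernels such as NCCL all-reduces then show up as their own events, giving the communication-time breakdown):

```python
# Sketch of profiling hooks (step 3) using torch.profiler.
import torch
from torch.profiler import profile, record_function, ProfilerActivity

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
x = torch.randn(32, 64)

# profile_memory=True also records allocator activity per op.
with profile(activities=[ProfilerActivity.CPU], profile_memory=True) as prof:
    with record_function("forward"):      # named region for the dashboard
        y = model(x)
    with record_function("backward"):
        y.sum().backward()

# Aggregate per-op timings; a dashboard would parse these instead of printing.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

The named `record_function` regions are what the comparison dashboard can key on when attributing time to forward, backward, and communication phases.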
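Step 4's MFU computation reduces to a short formula. The sketch below uses the standard approximation of ~6 FLOPs per parameter per token for dense transformer training (forward + backward); the example numbers (1B parameters, 8 GPUs, 312 TFLOP/s peak) are illustrative, not from the source.

```python
# Model FLOPs Utilization (step 4): achieved FLOPs/s over aggregate peak.
def mfu(n_params, tokens_per_sec, n_gpus, peak_flops_per_gpu):
    # ~6 FLOPs per parameter per token covers forward (2) + backward (4)
    achieved_flops_per_sec = 6 * n_params * tokens_per_sec
    peak_flops_per_sec = n_gpus * peak_flops_per_gpu
    return achieved_flops_per_sec / peak_flops_per_sec

# Example: 1B-parameter model, 100k tokens/s, 8 GPUs at 312 TFLOP/s peak
print(f"MFU: {mfu(1_000_000_000, 100_000, 8, 312e12):.1%}")  # ~24%
```

For real-time display, `tokens_per_sec` is measured over a sliding window of recent steps rather than over the whole run.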
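Step 6's gradient accumulation can be sketched in a few lines: gradients from several micro-batches accumulate in `.grad` before a single optimizer step, simulating a larger global batch. The model and data here are toy placeholders.

```python
# Gradient accumulation with micro-batch scheduling (step 6), toy example.
import torch

model = torch.nn.Linear(16, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
w_before = model.weight.detach().clone()

accum_steps = 4                                    # micro-batches per optimizer step
micro_batches = [torch.randn(8, 16) for _ in range(8)]

for i, xb in enumerate(micro_batches):
    loss = model(xb).pow(2).mean() / accum_steps   # scale so summed grads average
    loss.backward()                                # grads accumulate in .grad
    if (i + 1) % accum_steps == 0:                 # flush every accum_steps
        opt.step()
        opt.zero_grad()
```

Under DDP, one would also wrap the non-final micro-batches in `model.no_sync()` so the gradient all-reduce fires only on the accumulation boundary.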
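Step 7, the automatic batch size finder, is essentially a doubling search over a "does this batch size fit?" probe. The generic search below is a sketch; `try_batch` is a hypothetical callable that, in practice, would run one forward/backward pass at the candidate size, catch a CUDA out-of-memory error, and free cached memory before returning.

```python
# Automatic batch size finder (step 7): doubling search over a fit probe.
def find_max_batch_size(try_batch, start=1, limit=1 << 16):
    """Grow the batch size until the probe fails, then binary-search
    the exact boundary. try_batch(bs) returns True if bs fits in memory."""
    bs = start                         # assumed to fit
    while bs < limit and try_batch(bs * 2):
        bs *= 2                        # double until first failure
    lo, hi = bs, min(bs * 2, limit)    # boundary lies in (lo, hi)
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if try_batch(mid):
            lo = mid
        else:
            hi = mid
    return lo
```

For example, with a probe that accepts any batch up to 100, `find_max_batch_size(lambda b: b <= 100)` returns 100.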

Interview Relevance

Why this project matters for interviews

Large-scale training infrastructure is the bottleneck for AI progress. Understanding parallelism strategies, and being able to profile them, is essential for roles at Anthropic, Google, Meta, and xAI.