Training-serving skew: features computed differently in batch training vs. real-time serving. Common culprits: timestamp handling, aggregation windows, null imputation, and feature transform implementations that differ between training and serving pipelines.
Use the same feature computation code for training and serving (shared library). Log serving-time features and compare distributions with training features (feature distribution monitoring). Implement shadow mode: run old and new models side-by-side, log predictions from both, compare before switching traffic.
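A minimal sketch of both mitigations, with illustrative feature names (`age`, `spend`) that are assumptions, not from the source: one shared transform function imported by both the training and serving pipelines, plus a Population Stability Index check between training-time and logged serving-time feature distributions.

```python
import math

# Shared feature transform: the SAME function is imported by the batch
# training pipeline and the real-time serving path, so null imputation
# and scaling have exactly one implementation.
def compute_features(raw, default_age=30.0):
    age = raw.get("age")
    age = default_age if age is None else float(age)
    return {"age": age, "log_spend": math.log1p(float(raw.get("spend", 0.0)))}

# Distribution monitoring: Population Stability Index between binned
# training counts and binned serving counts for one feature.
def psi(train_counts, serve_counts, eps=1e-6):
    t_total, s_total = sum(train_counts), sum(serve_counts)
    score = 0.0
    for t, s in zip(train_counts, serve_counts):
        p = max(t / t_total, eps)
        q = max(s / s_total, eps)
        score += (p - q) * math.log(p / q)
    return score  # common rule of thumb: > 0.2 suggests significant drift
```

Identical distributions yield a PSI of 0; an alert threshold around 0.2 is a widely used heuristic, not a universal constant.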
The backward pass costs approximately 2x the forward pass (gradients are computed with respect to both weights and activations), so total training compute is ~3x the forward pass, not 2x. On top of that, optimizer state updates and communication overhead in distributed training add more.
Budget training FLOPs as: forward (1x) + backward (2x) + optimizer (0.5-1x) = ~3.5-4x forward. For distributed training, add communication overhead (all-reduce scales with model size). Use activation checkpointing to trade compute for memory. Profile actual FLOP utilization (MFU) — typically 30-50% of theoretical peak.
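The budget above can be turned into a back-of-envelope calculator. This sketch uses the standard approximation that a forward pass costs ~2 FLOPs per parameter per token; the 2x backward multiplier and the optimizer factor come from the text, and the specific numbers in the usage note are assumptions for illustration.

```python
def training_flops(n_params, n_tokens, optimizer_factor=0.5):
    # Forward pass ~ 2 * params * tokens (one multiply + one add per weight).
    forward = 2 * n_params * n_tokens
    backward = 2 * forward      # gradients w.r.t. weights and activations
    optimizer = optimizer_factor * forward
    return forward + backward + optimizer  # ~3.5x forward at factor 0.5

def mfu(achieved_flops_per_s, peak_flops_per_s):
    # Model FLOP Utilization: fraction of theoretical hardware peak achieved.
    return achieved_flops_per_s / peak_flops_per_s
```

For example, `mfu(150e12, 312e12)` is about 0.48, i.e. near the top of the typical 30-50% range the text cites.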
100% train accuracy isn't always a terminal failure. Grokking shows that models can memorize first, then generalize: continued training past apparent overfitting can lead to sudden jumps in validation accuracy.
Check if validation accuracy is still improving (even slowly). Monitor the validation loss trend, not just accuracy. Use early stopping with patience (wait N epochs after best validation metric). Apply regularization (dropout, weight decay, data augmentation) rather than stopping prematurely. Consider whether the gap is acceptable for your use case.
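The "early stopping with patience" step above can be sketched as a small tracker; the class name and interface are illustrative, not from any particular library. Generous patience matters here precisely because of delayed generalization.

```python
class EarlyStopping:
    """Stop only after `patience` epochs with no new best validation loss."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0     # reset the counter on any improvement
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Tracking validation loss (rather than accuracy) makes the signal smoother, and a slow but steady decrease keeps resetting the counter.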
Linear learning rate scaling with batch size breaks down beyond a critical batch size. Large batches converge to sharp minima (poor generalization). Communication overhead (all-reduce) doesn't scale linearly.
Use learning rate warmup (gradual increase over first 5-10% of training). Apply LARS/LAMB optimizer for large-batch training. Monitor gradient noise scale to find the critical batch size. Use gradient accumulation instead of larger per-GPU batches if communication is the bottleneck.
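A minimal sketch of the warmup step above, assuming the common linear-warmup-then-cosine-decay recipe (the cosine tail is a typical pairing, not something the text prescribes):

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    # Linear warmup: ramp from ~0 to base_lr over the first warmup_steps
    # (e.g. the first 5-10% of training), then cosine decay to 0.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

The gradual ramp avoids taking large steps with a freshly initialized model and a large effective batch, which is where sharp-minimum problems tend to start.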
Feature stores that join multiple feature groups at serving time create fan-out queries. If any single feature group is slow (cache miss, cold partition), the entire request waits — tail latency is determined by the slowest feature.
Pre-materialize feature vectors: join features offline and store the complete vector per entity. Use a two-tier cache (in-memory L1 + Redis L2). Set timeouts per feature group with graceful degradation (use default values for missing features). Monitor per-feature-group latency independently.
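The per-feature-group timeout with graceful degradation can be sketched with a thread pool; the feature-group names and default values are hypothetical placeholders, and a production system would also emit per-group latency metrics at the timeout branch.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Hypothetical fallback values served when a feature group is too slow.
DEFAULTS = {"user_stats": {"clicks_7d": 0.0}, "item_stats": {"ctr": 0.01}}

def fetch_features(fetchers, timeout_s=0.05):
    """Fan out to each feature-group fetcher; on timeout, degrade to defaults
    so one cold partition cannot stall the whole request."""
    result = {}
    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
        for name, fut in futures.items():
            try:
                result[name] = fut.result(timeout=timeout_s)
            except FutureTimeout:
                result[name] = DEFAULTS[name]  # graceful degradation
    return result
```

This caps the request's tail latency at roughly the per-group timeout instead of the slowest group's actual latency, at the cost of occasionally serving stale defaults.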
ML System Design — 5 questions