ML System Design

hard ML System Design
The Training-Serving Skew Trap
Your recommendation model performs well offline (NDCG@10 = 0.45) but click-through rate doesn't improve after deployment. Offline and online metrics disagree. Diagnose the problem.
training-serving skew, feature monitoring, shadow deployment, NDCG

The Trap

Training-serving skew: features computed differently in batch training vs. real-time serving. Common culprits: timestamp handling, aggregation windows, null imputation, and feature transform implementations that differ between training and serving pipelines.

Correct Approach

Use the same feature computation code for training and serving (shared library). Log serving-time features and compare distributions with training features (feature distribution monitoring). Implement shadow mode: run old and new models side-by-side, log predictions from both, compare before switching traffic.
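The distribution comparison above can be sketched with a per-feature Population Stability Index (PSI) check between logged serving features and the training set. A minimal sketch, assuming NumPy; the thresholds and synthetic distributions are illustrative assumptions, not fixed standards.

```python
import numpy as np

def psi(train_values, serve_values, n_bins=10):
    """Population Stability Index between training and serving samples of
    one feature. Common rule of thumb (an assumption, tune per feature):
    PSI < 0.1 stable, 0.1-0.25 drifting, > 0.25 investigate for skew."""
    # Bin edges come from the training distribution (quantile bins).
    edges = np.quantile(train_values, np.linspace(0, 1, n_bins + 1))
    # Clip serving values into the training range so out-of-range values
    # land in the edge bins instead of being dropped.
    serve_clipped = np.clip(serve_values, edges[0], edges[-1])
    train_frac = np.histogram(train_values, bins=edges)[0] / len(train_values)
    serve_frac = np.histogram(serve_clipped, bins=edges)[0] / len(serve_values)
    # Small epsilon avoids log(0) when a bin is empty on one side.
    eps = 1e-6
    train_frac = np.clip(train_frac, eps, None)
    serve_frac = np.clip(serve_frac, eps, None)
    return float(np.sum((serve_frac - train_frac) * np.log(serve_frac / train_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 50_000)
same = rng.normal(0.0, 1.0, 50_000)    # same pipeline: PSI stays near zero
skewed = rng.normal(1.0, 1.0, 50_000)  # e.g. different null imputation at serving
print(psi(train, same), psi(train, skewed))
```

Running this per feature on a schedule (and alerting on the threshold) is one way to turn the "log and compare" advice into an automatic skew detector.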

Follow-up Questions

  • How do you detect skew automatically in production?
  • What's the role of feature schemas in preventing skew?
hard ML System Design
The Backprop FLOP Trap
An intern estimates training FLOPs as 2x the forward pass FLOPs. The actual training is 4x slower than predicted. Where's the gap?
training FLOPs, backward pass cost, MFU, activation checkpointing

The Trap

The backward pass is approximately 2x the forward pass (computing gradients for both weights and activations), making total training ~3x forward — not 2x. Plus optimizer state updates, gradient accumulation, and communication overhead add more.

Correct Approach

Budget training FLOPs as: forward (1x) + backward (2x) + optimizer (0.5-1x) = ~3.5-4x forward. For distributed training, add communication overhead (all-reduce scales with model size). Use activation checkpointing to trade compute for memory. Profile actual FLOP utilization (MFU) — typically 30-50% of theoretical peak.
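A back-of-envelope budget in this spirit can be sketched with the common ~6N FLOPs-per-token rule for dense models (forward ~2N, backward ~4N); the model size, token count, and GPU specs below are illustrative assumptions.

```python
def training_time_days(n_params, n_tokens, peak_flops_per_gpu, n_gpus, mfu=0.4):
    """Rough training-time estimate. Assumes forward ~= 2*N FLOPs per token
    for an N-parameter dense model and backward ~= 2x forward, so ~6*N
    FLOPs/token total; optimizer updates and communication overhead are
    folded into the MFU discount here rather than counted separately."""
    total_flops = 6 * n_params * n_tokens
    effective_flops_per_s = peak_flops_per_gpu * n_gpus * mfu
    return total_flops / effective_flops_per_s / 86_400  # 86,400 s per day

# Illustrative numbers: 7B params, 1T tokens, 64 GPUs at ~312 TFLOP/s peak
# (A100 BF16 dense), 40% MFU.
days = training_time_days(7e9, 1e12, 312e12, 64, mfu=0.4)
```

Note how halving MFU doubles the estimate: profiling actual utilization matters as much as counting FLOPs.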

Follow-up Questions

  • How does activation checkpointing change the compute-memory trade-off?
  • What's a good MFU target for different hardware?
medium ML System Design
The Overfitting Panic Trap
Your model reaches 100% training accuracy while validation accuracy is 85%. A junior engineer wants to stop training immediately. Is this the right call?
grokking, early stopping, regularization, train-val gap

The Trap

Reaching 100% training accuracy isn't automatically a failure signal. Grokking shows that models can memorize first and generalize later: continued training past the point of apparent overfitting can produce sudden jumps in validation accuracy.

Correct Approach

Check if validation accuracy is still improving (even slowly). Monitor the validation loss trend, not just accuracy. Use early stopping with patience (wait N epochs after best validation metric). Apply regularization (dropout, weight decay, data augmentation) rather than stopping prematurely. Consider whether the gap is acceptable for your use case.
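The patience-based early stopping described above can be sketched as follows; the patience value and the toy loss sequence are illustrative assumptions.

```python
class EarlyStopping:
    """Early stopping with patience: stop only after the validation metric
    has failed to improve for `patience` consecutive checks. Minimal sketch;
    a real trainer would also checkpoint the best weights."""
    def __init__(self, patience=10, min_delta=1e-4):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Returns True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
# Validation loss plateaus, then improves again (grokking-like): patience
# avoids stopping at the first plateau, and only fires on the final one.
losses = [1.0, 0.8, 0.8, 0.8, 0.5, 0.5, 0.5, 0.5]
stopped_at = next(i for i, l in enumerate(losses) if stopper.step(l))
```

The two-epoch plateau at 0.8 does not trigger a stop because patience is 3; stopping prematurely there would have missed the later improvement.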

Follow-up Questions

  • How does grokking relate to weight decay and regularization?
  • When should you actually be concerned about overfitting?
hard ML System Design
The Batch Size Scaling Trap
You scale training from 1 GPU to 64 GPUs with linear batch size scaling. Training speed is 40x (not 64x) and final accuracy drops 2%. What went wrong?
large-batch training, learning rate warmup, LARS/LAMB, gradient noise scale

The Trap

Linear learning rate scaling with batch size breaks down beyond a critical batch size. Large batches converge to sharp minima (poor generalization). Communication overhead (all-reduce) doesn't scale linearly.

Correct Approach

Use learning rate warmup (gradual increase over first 5-10% of training). Apply LARS/LAMB optimizer for large-batch training. Monitor gradient noise scale to find the critical batch size. Use gradient accumulation instead of larger per-GPU batches if communication is the bottleneck.
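The warmup advice can be sketched as a schedule function, assuming linear warmup followed by cosine decay (one common choice; the decay shape varies by recipe). With linear batch-size scaling, `base_lr` would itself already be scaled up, which is exactly why the warmup ramp is needed.

```python
import math

def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr over warmup_steps, then cosine decay to
    zero. Warmup gives the optimizer time to adapt before the full
    (batch-size-scaled) learning rate is applied."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Illustrative: 1000-step run, 10% warmup (within the 5-10% range above).
schedule = [lr_at_step(s, 0.1, 100, 1000) for s in range(1000)]
```

The peak is reached at the end of warmup, and the midpoint of the decay phase sits at half the base rate.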

Follow-up Questions

  • How do you calculate the critical batch size?
  • What's the relationship between batch size and generalization?
hard ML System Design
The Feature Store Latency Trap
Your ML system uses a feature store for real-time serving. P50 latency is 5ms but P99 is 500ms. The model inference itself is only 10ms. What's happening?
feature store, tail latency, pre-materialization, graceful degradation

The Trap

Feature stores that join multiple feature groups at serving time create fan-out queries. If any single feature group is slow (a cache miss, a cold partition), the entire request waits: tail latency is set by the slowest feature group, not the median one.

Correct Approach

Pre-materialize feature vectors: join features offline and store the complete vector per entity. Use a two-tier cache (in-memory L1 + Redis L2). Set timeouts per feature group with graceful degradation (use default values for missing features). Monitor per-feature-group latency independently.
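The per-group timeout with graceful degradation can be sketched with Python's `concurrent.futures`; the feature groups, default values, and simulated latencies below are hypothetical stand-ins for real feature-store lookups.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

# Hypothetical per-group defaults used when a lookup times out.
DEFAULTS = {"user_stats": [0.0, 0.0], "item_stats": [0.0]}

pool = ThreadPoolExecutor(max_workers=8)  # long-lived in a real server

def fetch_features(fetchers, timeout_s=0.05):
    """Fan out feature-group lookups in parallel; on timeout, substitute
    the group's default values instead of failing the whole request, so a
    single cold partition cannot drag P99 to the slowest group's latency."""
    futures = {name: pool.submit(fn) for name, fn in fetchers.items()}
    results = {}
    for name, fut in futures.items():
        try:
            results[name] = fut.result(timeout=timeout_s)
        except FuturesTimeout:
            results[name] = DEFAULTS[name]  # graceful degradation
    return results

fetchers = {
    "user_stats": lambda: [1.5, 0.3],                  # fast cache hit
    "item_stats": lambda: (time.sleep(0.5) or [9.9]),  # simulated cold partition
}
feats = fetch_features(fetchers)  # item_stats falls back to its default
```

Whether the model can tolerate default-valued features gracefully is exactly the follow-up question below about missing features and accuracy.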

Follow-up Questions

  • How do you handle feature freshness vs. latency trade-offs?
  • What's the impact of missing features on model accuracy?
