ML System Design Quiz — AI Interview Prep

medium ML System Design

The Overfitting Panic Trap

Your model reaches 100% training accuracy while validation accuracy is 85%. A junior engineer wants to stop training immediately. Is this the right call?

grokking early stopping regularization train-val gap

The Trap

100% train accuracy isn't always terminal failure. Grokking shows that models can memorize first, then generalize — continued training past apparent overfitting can lead to sudden validation accuracy jumps.

Correct Approach

Check if validation accuracy is still improving (even slowly). Monitor the validation loss trend, not just accuracy. Use early stopping with patience (wait N epochs after best validation metric). Apply regularization (dropout, weight decay, data augmentation) rather than stopping prematurely. Consider whether the gap is acceptable for your use case.

Follow-up Questions

How does grokking relate to weight decay and regularization?
When should you actually be concerned about overfitting?

Back to Categories

hard ML System Design

The Batch Size Scaling Trap

You scale training from 1 GPU to 64 GPUs with linear batch size scaling. Training speed is 40x (not 64x) and final accuracy drops 2%. What went wrong?

large-batch training learning rate warmup LARS/LAMB gradient noise scale

The Trap

Linear learning rate scaling with batch size breaks down beyond a critical batch size. Large batches converge to sharp minima (poor generalization). Communication overhead (all-reduce) doesn't scale linearly.

Correct Approach

Use learning rate warmup (gradual increase over first 5-10% of training). Apply LARS/LAMB optimizer for large-batch training. Monitor gradient noise scale to find the critical batch size. Use gradient accumulation instead of larger per-GPU batches if communication is the bottleneck.

Follow-up Questions

How do you calculate the critical batch size?
What's the relationship between batch size and generalization?

hard ML System Design

The Feature Store Latency Trap

Your ML system uses a feature store for real-time serving. P50 latency is 5ms but P99 is 500ms. The model inference itself is only 10ms. What's happening?

feature store tail latency pre-materialization graceful degradation

The Trap

Feature stores that join multiple feature groups at serving time create fan-out queries. If any single feature group is slow (cache miss, cold partition), the entire request waits — tail latency is determined by the slowest feature.

Correct Approach

Pre-materialize feature vectors: join features offline and store the complete vector per entity. Use a two-tier cache (in-memory L1 + Redis L2). Set timeouts per feature group with graceful degradation (use default values for missing features). Monitor per-feature-group latency independently.

Follow-up Questions

How do you handle feature freshness vs. latency trade-offs?
What's the impact of missing features on model accuracy?

hard ML System Design

The Training-Serving Skew Trap

Your recommendation model performs well offline (NDCG@10 = 0.45) but click-through rate doesn't improve after deployment. Offline and online metrics disagree. Diagnose the problem.

training-serving skew feature monitoring shadow deployment NDCG

The Trap

Training-serving skew: features computed differently in batch training vs. real-time serving. Common culprits: timestamp handling, aggregation windows, null imputation, and feature transform implementations that differ between training and serving pipelines.

Correct Approach

Use the same feature computation code for training and serving (shared library). Log serving-time features and compare distributions with training features (feature distribution monitoring). Implement shadow mode: run old and new models side-by-side, log predictions from both, compare before switching traffic.

Follow-up Questions

How do you detect skew automatically in production?
What's the role of feature schemas in preventing skew?

hard ML System Design

The Backprop FLOP Trap

An intern estimates training FLOPs as 2x the forward pass FLOPs. The actual training is 4x slower than predicted. Where's the gap?

training FLOPs backward pass cost MFU activation checkpointing

The Trap

The backward pass is approximately 2x the forward pass (computing gradients for both weights and activations), making total training ~3x forward — not 2x. Plus optimizer state updates, gradient accumulation, and communication overhead add more.

Correct Approach

Budget training FLOPs as: forward (1x) + backward (2x) + optimizer (0.5-1x) = ~3.5-4x forward. For distributed training, add communication overhead (all-reduce scales with model size). Use activation checkpointing to trade compute for memory. Profile actual FLOP utilization (MFU) — typically 30-50% of theoretical peak.

Follow-up Questions

How does activation checkpointing change the compute-memory trade-off?
What's a good MFU target for different hardware?

ML System Design

The Trap

Correct Approach

Follow-up Questions

The Trap

Correct Approach

Follow-up Questions

The Trap

Correct Approach

Follow-up Questions

The Trap

Correct Approach

Follow-up Questions

The Trap

Correct Approach

Follow-up Questions

Quiz Complete!