Without entropy regularization, policy gradient methods converge to deterministic policies prematurely. Once entropy collapses, gradients for unexplored actions vanish, making recovery nearly impossible.
Add an entropy bonus to the loss function (SAC builds this into its objective). Use PPO's clipping to prevent oversized policy updates. Monitor the KL divergence between successive policies. If entropy has already collapsed, restart with a higher initial entropy coefficient and anneal it slowly.
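A minimal sketch of the quantities above: the entropy term added to a policy-gradient loss and the KL monitor between successive policies. The function names and the `beta` coefficient are illustrative, not from any specific library.

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a discrete action distribution;
    # sliding toward 0 signals policy collapse
    return float(-np.sum(p * np.log(p + 1e-12)))

def kl(p, q):
    # KL divergence between successive policies; a spike here
    # signals an oversized update
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

def pg_loss(log_probs, advantages, probs, beta=0.01):
    # Policy-gradient loss minus an entropy bonus: minimizing it
    # trades off return against keeping the policy stochastic
    return float(-np.mean(log_probs * advantages) - beta * entropy(probs))
```

For a uniform policy over 4 actions, `entropy` returns roughly `log(4) ≈ 1.386`; as the policy becomes deterministic it approaches 0, which is the quantity to watch (or to anneal `beta` against).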
If your RLHF pipeline stalls on reasoning tasks, scaling up the reward model only amplifies generic preference signals; it does not supply the missing reasoning-specific reward signal.
Decompose the reward into components: helpfulness, correctness, safety, reasoning quality. Train separate reward models or use a multi-objective approach. Add process-based rewards (reward each reasoning step, not just the final answer). Use outcome-based verification for math/code.
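The decomposition can be sketched as a weighted blend of per-component reward-model scores, plus a process-based term that rewards each reasoning step rather than only the final answer. The component names and the `lam` mixing weight here are illustrative assumptions.

```python
def combined_reward(scores, weights):
    # Weighted sum of per-component reward-model scores
    # (helpfulness, correctness, safety, reasoning quality, ...)
    return sum(weights[k] * scores[k] for k in scores)

def process_reward(step_scores, outcome, lam=0.5):
    # Blend per-step (process) rewards with the final-answer (outcome)
    # reward; averaging over steps keeps long chains from dominating
    avg_step = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return lam * avg_step + (1.0 - lam) * outcome
```

For math or code, `outcome` would come from a verifier (unit tests, exact-match checking) rather than a learned preference model, which is what "outcome-based verification" refers to.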
With sparse rewards, the probability of randomly stumbling upon the goal decreases exponentially with task complexity. Standard exploration strategies (epsilon-greedy, entropy bonus) are insufficient.
Use intrinsic motivation: curiosity-driven exploration (ICM, RND), count-based exploration, or goal-conditioned RL with hindsight experience replay (HER). Shape intermediate rewards carefully (potential-based shaping preserves optimal policy). Consider curriculum learning to gradually increase difficulty.
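Two of the simpler ingredients above can be sketched directly: potential-based shaping, which adds dense guidance without changing the optimal policy, and a count-based exploration bonus. The potential function `phi` and the `beta` scale are assumptions the user supplies.

```python
import math
from collections import defaultdict

def shaped(r, s, s_next, phi, gamma=0.99):
    # Potential-based shaping F(s, s') = gamma*phi(s') - phi(s);
    # the shaping terms telescope along a trajectory, so the
    # optimal policy is preserved
    return r + gamma * phi(s_next) - phi(s)

class CountBonus:
    # Count-based intrinsic reward beta / sqrt(N(s)): decays as a
    # state is revisited, pushing the agent toward rarely seen states
    def __init__(self, beta=0.1):
        self.n = defaultdict(int)
        self.beta = beta

    def __call__(self, s):
        self.n[s] += 1
        return self.beta / math.sqrt(self.n[s])
```

For large or continuous state spaces the raw count `N(s)` is replaced by a density model or a prediction error (the role ICM and RND play), but the decaying-bonus structure is the same.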
The agent exploits simulator artifacts — unrealistic physics, deterministic state transitions, or missing noise — to achieve reward in ways that don't transfer to reality.
Use domain randomization: vary physics parameters, textures, dynamics during training. Add noise to observations and actions. Train with an ensemble of simulators. Use system identification to match sim to real dynamics. Implement robust RL objectives (worst-case over parameter distributions).
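A minimal sketch of the two cheapest pieces: re-drawing simulator parameters each episode and corrupting observations with noise. The parameter names and ranges are placeholders; real ranges come from system identification against the target hardware.

```python
import random

def sample_sim_params(ranges, rng):
    # Re-draw physics parameters at the start of each episode so the
    # policy cannot overfit one fixed dynamics configuration
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def perturb(obs, sigma, rng):
    # Gaussian observation noise so the policy cannot rely on a
    # perfectly clean, deterministic simulator signal
    return [o + rng.gauss(0.0, sigma) for o in obs]

# Illustrative ranges (not from any specific simulator):
RANGES = {"mass": (0.8, 1.2), "friction": (0.5, 1.5)}
```

The same `rng` discipline matters in practice: seed it per episode so randomized runs stay reproducible when debugging a transfer failure.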
What works as a differentiable trick in low-dimensional models becomes a VRAM-destroying, gradient-unstable nightmare when your action space is a 100K-token vocabulary.
Use hierarchical action decomposition: break the vocabulary into clusters (e.g., BPE merge tree), sample the cluster first, then the token within it. Alternatively, use REINFORCE with a learned baseline instead of reparameterization for discrete action spaces this large.
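The two-stage sampling above can be sketched as follows. The log-probability factorizes as log p(token) = log p(cluster) + log p(token | cluster), so a REINFORCE update only ever touches the small per-stage distributions, never a dense 100K-way softmax. The cluster/token sizes here are toy values.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def hierarchical_sample(cluster_logits, token_logits, rng):
    # Stage 1: sample a cluster; Stage 2: sample a token within it.
    # token_logits[c] holds the logits for the tokens in cluster c.
    cp = softmax(cluster_logits)
    c = rng.choice(len(cp), p=cp)
    tp = softmax(token_logits[c])
    t = rng.choice(len(tp), p=tp)
    # Factored log-prob, usable directly in a REINFORCE estimator
    return c, t, float(np.log(cp[c]) + np.log(tp[t]))
```

With a vocabulary split into √V clusters of √V tokens each, both stages stay around a few hundred logits wide even at V = 100K, which is what makes the memory footprint tractable.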