Reinforcement Learning

Difficulty: hard
The Dead Gradient Trap
You're training a policy gradient agent. Loss is decreasing but the policy entropy has collapsed to near zero — the agent always picks the same action. What happened?
Tags: entropy collapse, policy gradient, SAC, PPO clipping, KL divergence

The Trap

Without entropy regularization, policy gradient methods converge to deterministic policies prematurely. Once entropy collapses, gradients for unexplored actions vanish, making recovery nearly impossible.

Correct Approach

Add an entropy bonus to the loss function (SAC does this automatically via its maximum-entropy objective). Use PPO's clipping to prevent overly large policy updates. Monitor the KL divergence between successive policies as an early-warning signal. If the policy has already collapsed, restart training with a higher initial entropy coefficient and anneal it slowly.
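A minimal sketch of the entropy-bonus idea, in plain Python with hypothetical numbers: the per-step loss subtracts beta times the entropy of the action distribution, so gradient descent on the loss pushes entropy up and resists collapse.

```python
import math

def entropy(probs):
    """Shannon entropy of a categorical action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def pg_loss_with_entropy_bonus(log_prob_action, advantage, probs, beta=0.01):
    """Policy-gradient loss with an entropy bonus (illustrative form).

    Minimizing this loss maximizes advantage-weighted log-probability
    plus beta * entropy, discouraging premature convergence to a
    deterministic policy.
    """
    pg_loss = -log_prob_action * advantage
    return pg_loss - beta * entropy(probs)

# A collapsed policy has near-zero entropy; a uniform one is maximal.
collapsed = [0.999, 0.0005, 0.0005]
uniform = [1 / 3, 1 / 3, 1 / 3]
assert entropy(collapsed) < 0.05
assert abs(entropy(uniform) - math.log(3)) < 1e-9
```

Watching `entropy(probs)` averaged over a batch is exactly the collapse monitor the question describes: a healthy agent's entropy decays gradually, not to zero in the first few thousand steps.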

Follow-up Questions

  • How does SAC's automatic temperature tuning prevent this?
  • What's the relationship between entropy and exploration?
Difficulty: hard
The Reward Model Scaling Trap
Your RLHF pipeline produces a model that's polite but can't reason. You scale up the reward model 4x. Performance doesn't improve. Why?
Tags: RLHF, reward decomposition, process rewards, outcome verification

The Trap

If your RLHF pipeline stalls on reasoning, scaling the reward model only amplifies generic preference signals instead of fixing the missing reasoning-specific reward signal.

Correct Approach

Decompose the reward into components: helpfulness, correctness, safety, reasoning quality. Train separate reward models or use a multi-objective approach. Add process-based rewards (reward each reasoning step, not just the final answer). Use outcome-based verification for math/code.
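A sketch of the decomposition, assuming hypothetical component names and weights (none of these come from a specific library): the total reward is a weighted combination of per-aspect scores, and process rewards blend per-step scores with a final outcome check.

```python
def combined_reward(components, weights):
    """Weighted sum of decomposed reward components.

    components: dict like {"helpfulness": 0.9, "correctness": 0.2,
                           "safety": 1.0, "reasoning": 0.1}
    weights:    dict with the same keys, chosen per deployment.
    """
    return sum(weights[k] * components[k] for k in components)

def process_reward(step_scores, outcome_score, outcome_weight=0.5):
    """Blend per-step (process) rewards with a final outcome reward.

    step_scores rewards each reasoning step; outcome_score is the
    result of outcome-based verification (e.g. a unit test passing).
    """
    process = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return (1 - outcome_weight) * process + outcome_weight * outcome_score
```

The failure mode in the question shows up directly here: if "reasoning" is missing from `components`, no amount of reward-model scale changes the gradient signal for reasoning quality.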

Follow-up Questions

  • How do process rewards differ from outcome rewards in practice?
  • What's the risk of reward hacking with decomposed rewards?
Difficulty: hard
The Sparse Reward Trap
Your RL agent explores a complex environment but only gets reward upon task completion. After 10M steps, it hasn't learned anything. How do you fix this?
Tags: sparse rewards, intrinsic motivation, HER, curriculum learning, reward shaping

The Trap

With sparse rewards, the probability of randomly stumbling upon the goal decreases exponentially with task complexity. Standard exploration strategies (epsilon-greedy, entropy bonus) are insufficient.

Correct Approach

Use intrinsic motivation: curiosity-driven exploration (ICM, RND), count-based exploration, or goal-conditioned RL with hindsight experience replay (HER). Shape intermediate rewards carefully (potential-based shaping preserves optimal policy). Consider curriculum learning to gradually increase difficulty.
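The potential-based shaping claim can be made concrete with a short sketch. Using the form from Ng et al. (1999), the shaped reward is r + gamma * phi(s') - phi(s); because the potential terms telescope along any trajectory, the optimal policy is unchanged. The distance-to-goal potential below is a hypothetical example.

```python
def shaped_reward(r, s, s_next, phi, gamma=0.99, done=False):
    """Potential-based reward shaping: r' = r + gamma * phi(s') - phi(s).

    This form provably preserves the optimal policy; at terminal
    states the next-state potential is taken as zero.
    """
    next_potential = 0.0 if done else gamma * phi(s_next)
    return r + next_potential - phi(s)

# Hypothetical 1-D task: potential = negative distance to a goal at 10.
phi = lambda s: -abs(10 - s)

# A step toward the goal earns a positive shaping term even though
# the environment reward is still zero (sparse).
assert shaped_reward(0.0, s=4, s_next=5, phi=phi) > 0
```

This is why shaping helps with sparsity: the agent gets dense learning signal long before it ever reaches the goal, without changing which policy is optimal.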

Follow-up Questions

  • How does potential-based reward shaping guarantee policy invariance?
  • Compare curiosity-driven vs. count-based exploration methods.
Difficulty: hard
The Happy Path Trap
Your RL agent achieves high reward in simulation but fails completely in the real environment. Sim-to-real transfer is broken. Diagnose the issue.
Tags: sim-to-real, domain randomization, robust RL, system identification

The Trap

The agent exploits simulator artifacts — unrealistic physics, deterministic state transitions, or missing noise — to achieve reward in ways that don't transfer to reality.

Correct Approach

Use domain randomization: vary physics parameters, textures, dynamics during training. Add noise to observations and actions. Train with an ensemble of simulators. Use system identification to match sim to real dynamics. Implement robust RL objectives (worst-case over parameter distributions).
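A minimal sketch of per-episode domain randomization, with hypothetical parameter names and ranges (real ranges come from system identification against the target robot): each episode draws fresh physics and noise settings so the policy cannot overfit one simulator configuration.

```python
import random

def randomized_physics(rng):
    """Sample one episode's simulator configuration (illustrative ranges)."""
    return {
        "gravity": rng.uniform(9.0, 10.6),       # m/s^2, around Earth's 9.81
        "friction": rng.uniform(0.5, 1.5),       # multiplier on nominal value
        "mass_scale": rng.uniform(0.8, 1.2),     # +/- 20% body mass
        "obs_noise_std": rng.uniform(0.0, 0.05), # sensor noise injected per step
        "action_delay": rng.randint(0, 2),       # steps of actuation latency
    }

rng = random.Random(0)
episode_params = randomized_physics(rng)
assert 9.0 <= episode_params["gravity"] <= 10.6
```

Observation and action noise are sampled here rather than fixed, which directly targets the trap above: a policy that exploits deterministic transitions never sees a deterministic simulator twice.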

Follow-up Questions

  • How much domain randomization is too much?
  • Compare domain randomization vs. domain adaptation for transfer.
Difficulty: expert
The Gumbel-Softmax Trap
You're training an RL agent with a 100K-token vocabulary action space using Gumbel-Softmax for differentiable sampling. Training is unstable and VRAM usage is exploding. What's wrong?
Tags: Gumbel-Softmax, discrete action spaces, hierarchical actions, REINFORCE

The Trap

Gumbel-Softmax works as a differentiable sampling trick in low-dimensional action spaces, but it materializes a dense soft sample (and its gradient) over the entire vocabulary at every step, so memory scales with the 100K-dimensional action space. Worse, the low temperatures needed to make samples near-discrete produce high-variance, unstable gradients.

Correct Approach

Use hierarchical action decomposition: break the vocabulary into clusters (e.g., BPE merge tree), sample the cluster first, then the token within it. Alternatively, use REINFORCE with a learned baseline instead of reparameterization for discrete action spaces this large.
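A sketch of the two-stage sampling, on a toy vocabulary (the cluster structure and probabilities are made up for illustration): sample a cluster first, then a token within it, so each step touches only one cluster's distribution instead of all 100K logits.

```python
import random

def hierarchical_sample(cluster_probs, token_probs_by_cluster, rng):
    """Two-stage sampling over a factored vocabulary.

    cluster_probs:           {cluster_id: P(cluster)}
    token_probs_by_cluster:  {cluster_id: {token: P(token | cluster)}}
    P(token) factorizes as P(cluster) * P(token | cluster), so sampling
    never needs the full flat distribution in memory at once.
    """
    clusters = list(cluster_probs)
    cluster = rng.choices(
        clusters, weights=[cluster_probs[c] for c in clusters])[0]
    within = token_probs_by_cluster[cluster]
    tokens = list(within)
    token = rng.choices(tokens, weights=[within[t] for t in tokens])[0]
    return cluster, token

# Toy vocabulary split into two clusters.
cluster_probs = {"A": 0.7, "B": 0.3}
token_probs = {"A": {"the": 0.6, "a": 0.4}, "B": {"run": 1.0}}
cluster, token = hierarchical_sample(cluster_probs, token_probs, random.Random(0))
assert token in token_probs[cluster]
```

With roughly sqrt(100K) clusters of sqrt(100K) tokens each, both stages are distributions of a few hundred entries, which is why the memory blow-up disappears.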

Follow-up Questions

  • When is Gumbel-Softmax appropriate vs. REINFORCE?
  • How does temperature annealing affect training stability?

Reinforcement Learning — 5 questions