Without entropy regularization, policy gradient methods converge to deterministic policies prematurely. Once entropy collapses, gradients for unexplored actions vanish, making recovery nearly impossible.
Add an entropy bonus to the loss function (SAC builds this into its objective). Use PPO's clipping to prevent oversized policy updates. Monitor the KL divergence between successive policies. If entropy has already collapsed, restart with a higher initial entropy coefficient and anneal it slowly.
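A minimal sketch of the quantities above: the entropy term added to a policy-gradient loss and the KL monitor between successive policies. The function names and the `beta` coefficient are illustrative, not from any specific library.

```python
import numpy as np

def entropy(p):
    # Shannon entropy of a discrete action distribution;
    # sliding toward 0 signals policy collapse
    return float(-np.sum(p * np.log(p + 1e-12)))

def kl(p, q):
    # KL divergence between successive policies; a spike here
    # signals an oversized update
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))

def pg_loss(log_probs, advantages, probs, beta=0.01):
    # Policy-gradient loss minus an entropy bonus: minimizing it
    # trades off return against keeping the policy stochastic
    return float(-np.mean(log_probs * advantages) - beta * entropy(probs))
```

For a uniform policy over 4 actions, `entropy` returns roughly `log(4) ≈ 1.386`; as the policy becomes deterministic it approaches 0, which is the quantity to watch (or to anneal `beta` against).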
If your RLHF pipeline stalls on reasoning tasks, scaling up the reward model only amplifies generic preference signals; it does not supply the missing reasoning-specific reward signal.
Decompose the reward into components: helpfulness, correctness, safety, reasoning quality. Train separate reward models or use a multi-objective approach. Add process-based rewards (reward each reasoning step, not just the final answer). Use outcome-based verification for math/code.
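The decomposition can be sketched as a weighted blend of per-component reward-model scores, plus a process-based term that rewards each reasoning step rather than only the final answer. The component names and the `lam` mixing weight here are illustrative assumptions.

```python
def combined_reward(scores, weights):
    # Weighted sum of per-component reward-model scores
    # (helpfulness, correctness, safety, reasoning quality, ...)
    return sum(weights[k] * scores[k] for k in scores)

def process_reward(step_scores, outcome, lam=0.5):
    # Blend per-step (process) rewards with the final-answer (outcome)
    # reward; averaging over steps keeps long chains from dominating
    avg_step = sum(step_scores) / len(step_scores) if step_scores else 0.0
    return lam * avg_step + (1.0 - lam) * outcome
```

For math or code, `outcome` would come from a verifier (unit tests, exact-match checking) rather than a learned preference model, which is what "outcome-based verification" refers to.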
With sparse rewards, the probability of randomly stumbling upon the goal decreases exponentially with task complexity. Standard exploration strategies (epsilon-greedy, entropy bonus) are insufficient.
Use intrinsic motivation: curiosity-driven exploration (ICM, RND), count-based exploration, or goal-conditioned RL with hindsight experience replay (HER). Shape intermediate rewards carefully (potential-based shaping preserves optimal policy). Consider curriculum learning to gradually increase difficulty.
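Two of the simpler ingredients above can be sketched directly: potential-based shaping, which adds dense guidance without changing the optimal policy, and a count-based exploration bonus. The potential function `phi` and the `beta` scale are assumptions the user supplies.

```python
import math
from collections import defaultdict

def shaped(r, s, s_next, phi, gamma=0.99):
    # Potential-based shaping F(s, s') = gamma*phi(s') - phi(s);
    # the shaping terms telescope along a trajectory, so the
    # optimal policy is preserved
    return r + gamma * phi(s_next) - phi(s)

class CountBonus:
    # Count-based intrinsic reward beta / sqrt(N(s)): decays as a
    # state is revisited, pushing the agent toward rarely seen states
    def __init__(self, beta=0.1):
        self.n = defaultdict(int)
        self.beta = beta

    def __call__(self, s):
        self.n[s] += 1
        return self.beta / math.sqrt(self.n[s])
```

For large or continuous state spaces the raw count `N(s)` is replaced by a density model or a prediction error (the role ICM and RND play), but the decaying-bonus structure is the same.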
The agent exploits simulator artifacts — unrealistic physics, deterministic state transitions, or missing noise — to achieve reward in ways that don't transfer to reality.
Use domain randomization: vary physics parameters, textures, dynamics during training. Add noise to observations and actions. Train with an ensemble of simulators. Use system identification to match sim to real dynamics. Implement robust RL objectives (worst-case over parameter distributions).
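A minimal sketch of the two cheapest pieces: re-drawing simulator parameters each episode and corrupting observations with noise. The parameter names and ranges are placeholders; real ranges come from system identification against the target hardware.

```python
import random

def sample_sim_params(ranges, rng):
    # Re-draw physics parameters at the start of each episode so the
    # policy cannot overfit one fixed dynamics configuration
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in ranges.items()}

def perturb(obs, sigma, rng):
    # Gaussian observation noise so the policy cannot rely on a
    # perfectly clean, deterministic simulator signal
    return [o + rng.gauss(0.0, sigma) for o in obs]

# Illustrative ranges (not from any specific simulator):
RANGES = {"mass": (0.8, 1.2), "friction": (0.5, 1.5)}
```

The same `rng` discipline matters in practice: seed it per episode so randomized runs stay reproducible when debugging a transfer failure.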
What works as a differentiable trick in low-dimensional models becomes a VRAM-destroying, gradient-unstable nightmare when your action space is a 100K-token vocabulary.
Use hierarchical action decomposition: break the vocabulary into clusters (e.g., BPE merge tree), sample the cluster first, then the token within it. Alternatively, use REINFORCE with a learned baseline instead of reparameterization for discrete action spaces this large.
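The two-stage sampling above can be sketched as follows. The log-probability factorizes as log p(token) = log p(cluster) + log p(token | cluster), so a REINFORCE update only ever touches the small per-stage distributions, never a dense 100K-way softmax. The cluster/token sizes here are toy values.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a logit vector
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()

def hierarchical_sample(cluster_logits, token_logits, rng):
    # Stage 1: sample a cluster; Stage 2: sample a token within it.
    # token_logits[c] holds the logits for the tokens in cluster c.
    cp = softmax(cluster_logits)
    c = rng.choice(len(cp), p=cp)
    tp = softmax(token_logits[c])
    t = rng.choice(len(tp), p=tp)
    # Factored log-prob, usable directly in a REINFORCE estimator
    return c, t, float(np.log(cp[c]) + np.log(tp[t]))
```

With a vocabulary split into √V clusters of √V tokens each, both stages stay around a few hundred logits wide even at V = 100K, which is what makes the memory footprint tractable.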