Reinforcement Learning
expert
~60 hours
RLHF Pipeline with Process Rewards
Implement a complete RLHF pipeline: supervised fine-tuning, reward-model training with per-step process rewards, and PPO optimization. Compare outcome-based and process-based reward models.
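The outcome-vs-process distinction at the heart of the project can be shown with a toy sketch. The trace, step texts, and correctness flags below are made-up illustrative values, not real model outputs:

```python
# Hypothetical toy sketch: the same 3-step reasoning trace scored two ways.
trace = [
    ("Step 1: let x be the unknown", True),
    ("Step 2: 2x + 3 = 11, so x = 4", True),
    ("Step 3: therefore the answer is 5", False),  # final answer is wrong
]

def outcome_reward(trace):
    """Outcome RM: one scalar for the whole trace, based only on the final answer."""
    final_correct = trace[-1][1]
    return 1.0 if final_correct else 0.0

def process_rewards(trace):
    """Process RM: one score per reasoning step, so credit is assigned locally."""
    return [1.0 if ok else 0.0 for _, ok in trace]

print(outcome_reward(trace))   # 0.0 — the whole trace is penalized
print(process_rewards(trace))  # [1.0, 1.0, 0.0] — only the bad step is penalized
```

The point of the comparison: with an outcome reward, two good steps get no credit because the final answer is wrong; a process reward localizes the error to step 3.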
Skills Demonstrated
- RLHF implementation
- Reward model training
- PPO with KL penalty
- Process vs outcome rewards
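The "PPO with KL penalty" skill is usually implemented as per-token reward shaping: the policy is penalized for drifting from the frozen reference model, and the reward model's scalar is added at the final token. A minimal sketch with made-up log-probabilities (real values would come from the policy and reference LMs):

```python
def kl_shaped_rewards(logp_policy, logp_ref, terminal_reward, beta=0.1):
    """Per-token PPO rewards: r_t = -beta * (log pi(a_t) - log pi_ref(a_t)),
    with the reward model's scalar added at the final token.
    beta controls how strongly the policy is kept near the reference model."""
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    rewards[-1] += terminal_reward
    return rewards

# Illustrative per-token log-probs for a 4-token completion (not from a real model)
logp_policy = [-1.0, -0.5, -2.0, -0.8]
logp_ref    = [-1.2, -0.6, -1.5, -0.8]
print(kl_shaped_rewards(logp_policy, logp_ref, terminal_reward=1.0))
```

Tokens where the policy is more confident than the reference (here token 3) receive a negative shaping term, which is what keeps PPO from collapsing onto reward-hacking outputs.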
Implementation Steps
- Fine-tune a small LM (GPT-2/Phi-2) on math reasoning data
- Collect pairwise preference data (human or AI-generated)
- Train outcome reward model (rates final answers)
- Train process reward model (rates each reasoning step)
- Implement PPO with KL divergence penalty
- Compare final model quality under both reward schemes
- Visualize reward model agreement and training dynamics
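The reward-model training steps above rest on a pairwise (Bradley-Terry) objective: the model should score the chosen response above the rejected one. A minimal sketch, with scalar scores standing in for real reward-model outputs:

```python
import math

def pairwise_rm_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss: -log sigmoid(r_chosen - r_rejected).
    Low when the chosen response is scored higher, high when the ranking is inverted."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Illustrative scores: the loss should be small when the ranking is right
print(pairwise_rm_loss(2.0, 0.5))  # chosen scored higher: small loss
print(pairwise_rm_loss(0.5, 2.0))  # ranking inverted: large loss
```

The same loss trains both reward models; for the process RM it is applied per step (or the per-step scores are aggregated before comparison), which is exactly the design choice this project asks you to evaluate.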
Interview Relevance
Why this project matters for interviews
RLHF is the core technique behind ChatGPT, Claude, and Gemini. Understanding process rewards shows depth beyond the basics, a key differentiator for research roles at Anthropic and OpenAI.