Reinforcement Learning · Expert · ~60 hours
RLHF Pipeline with Process Rewards
Implement a complete RLHF pipeline: supervised fine-tuning, reward model training with process rewards (per-step), and PPO optimization. Compare outcome-based vs process-based reward models.
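The core comparison the project makes is how credit is assigned to a reasoning trace: an outcome reward model scores only the final answer, while a process reward model scores every step. A minimal sketch of that distinction (function names and scoring values here are illustrative, not from any library):

```python
def score_outcome(steps, final_answer_correct):
    """Outcome RM: one scalar for the whole trajectory, placed on the last step."""
    # Sparse credit: intermediate steps receive no signal of their own.
    r = 1.0 if final_answer_correct else 0.0
    return [0.0] * (len(steps) - 1) + [r]

def score_process(steps, step_labels):
    """Process RM: one scalar per reasoning step (dense credit).

    step_labels[i] is True if step i is judged valid by the (hypothetical)
    per-step reward model.
    """
    return [1.0 if ok else -1.0 for ok in step_labels]

steps = ["2+3=5", "5*4=20", "20-1=19"]
print(score_outcome(steps, final_answer_correct=False))  # terminal signal only
print(score_process(steps, [True, True, False]))         # flags the bad last step
```

Note that under the outcome scheme a trajectory with a correct answer but flawed intermediate steps is indistinguishable from a fully sound one; the process scheme is what lets step 7's agreement analysis localize where the two models disagree.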

Skills Demonstrated

RLHF implementation, Reward model training, PPO with KL penalty, Process vs outcome rewards
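The "Reward model training" skill centers on fitting a scalar reward to pairwise preference data, conventionally with a Bradley-Terry loss: minimize −log σ(r_chosen − r_rejected). A minimal scalar sketch (the function name `bt_loss` is my own):

```python
import math

def bt_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model ranks the chosen response well above the
    rejected one; large when the ranking is inverted.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wider correct margin gives a lower loss; a tie gives log(2) ≈ 0.693.
print(bt_loss(4.0, 0.0), bt_loss(0.0, 0.0), bt_loss(0.0, 4.0))
```

In the actual pipeline the two scores come from a forward pass of the reward model over the chosen and rejected completions; the same loss applies to both the outcome and the process variants (per trajectory vs per step).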

Implementation Steps

  1. Fine-tune a small LM (GPT-2/Phi-2) on math reasoning data
  2. Collect pairwise preference data (human or AI-generated)
  3. Train outcome reward model (rates final answers)
  4. Train process reward model (rates each reasoning step)
  5. Implement PPO with KL divergence penalty
  6. Compare final model quality under both reward schemes
  7. Visualize reward model agreement and training dynamics

Interview Relevance

RLHF is the core technique behind ChatGPT, Claude, and Gemini. Understanding process rewards demonstrates depth beyond the basics and is a key differentiator for Anthropic and OpenAI research roles.