Core Techniques II: RL Algorithms — RLHF, PPO, GRPO
The RL half of post-training mechanics: reward signals and preference learning, DPO, the RLHF objective and its KL penalty, the four models of RLHF, PPO's clipped objective, GRPO's group-relative baseline, and the failure mode of mode collapse.
Reward signals: verifiers and reward models
RL needs something to score the model’s own generations. Two kinds:
- Verifiers: deterministic checks (“is this math answer correct?”, “does the code pass?”). No ambiguity, and far less to game than a learned model, though a weak or incomplete checker can still be exploited. DeepSeek-R1-Zero famously used just two rule-based rewards: answer accuracy and format compliance.
- Reward models: neural nets trained on human preference pairs to score open-ended outputs where no verifier exists.
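A verifier of the kind described above can be sketched in a few lines. This is a minimal illustration, not any lab's actual checker: the `<think>`/`<answer>` tag layout, the function names, and the exact-string comparison are all assumptions made for the example.

```python
import re

# Assumed (hypothetical) output layout: reasoning inside <think>...</think>,
# followed by the final answer inside <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the extracted answer equals the reference string exactly."""
    match = FORMAT_RE.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0


def verifier_reward(completion: str, gold_answer: str) -> float:
    """Sum of the two deterministic checks (range 0.0 to 2.0)."""
    return accuracy_reward(completion, gold_answer) + format_reward(completion)
```

The exact-string comparison is deliberately naive: it would reject `4.0` when the reference is `4`, which is the kind of incompleteness that makes a weak checker either unfair or exploitable.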
DPO (Direct Preference Optimization) takes a shortcut: it
skips the reward model entirely and trains the policy directly on chosen/rejected
pairs, with an implicit KL term (strength $\beta$) to a reference. One model, one fixed
dataset, no RL loop.
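To make the "no reward model" point concrete, here is a hedged sketch of the standard DPO loss for a single chosen/rejected pair. It assumes you already have summed sequence log-probabilities from the policy and the frozen reference; the function names and the default `beta=0.1` are illustrative choices, not values from this page.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed default, for illustration only
) -> float:
    """DPO loss for one preference pair, from summed sequence log-probs."""
    # Implicit rewards: how much more the policy likes each response
    # than the frozen reference does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) equals softplus(-logits).
    return softplus(-logits)
```

When the policy equals the reference, both margins are zero and the loss is $\log 2$; it falls only as the policy raises the chosen response relative to the rejected one, measured against the reference.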
The RLHF objective
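When you do run the online loop, the RLHF objective is reward maximization regularized toward a reference:
$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[\, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \Big]$$
The KL penalty (coefficient $\beta$) is load-bearing: it keeps the policy $\pi_\theta$ close to the frozen reference $\pi_{\mathrm{ref}}$, guarding against reward hacking and mode collapse. Too small and the policy drifts into degenerate high-reward regions; too large and it never improves.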
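In practice the KL term is often estimated from the sampled tokens themselves. The sketch below shows one common way to fold it into a per-sequence reward; the function name, the per-token log-prob inputs, and the default `beta=0.05` are assumptions for illustration.

```python
def kl_penalized_reward(
    reward: float,
    policy_logprobs: list[float],
    ref_logprobs: list[float],
    beta: float = 0.05,  # assumed value, for illustration only
) -> float:
    """Sequence reward minus beta times a sample-based KL estimate."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    # For tokens sampled from the policy, sum(log pi - log pi_ref) is a
    # single-sample estimate of KL(pi || pi_ref) for this sequence.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate
```

A sequence the policy now finds much more likely than the reference did pays a larger penalty, so a high raw reward only survives if it was earned without drifting far.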
The four models of RLHF
[FRI-2.5] Full PPO-based RLHF holds four models in memory at once:
- Policy — the LLM being trained.
- Reference — a frozen copy, for the KL term.
- Reward model — scores generated outputs.
- Value model (critic) — estimates expected future reward for advantage estimation.
Why it matters: four 7B+ models is enormous memory pressure — the primary reason the field moved toward GRPO, which drops the critic. [V] Verified
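A back-of-the-envelope calculation shows the scale. This sketch counts weights only and assumes 2 bytes per parameter (bf16 or fp16); the models being trained also carry gradients and optimizer state, which it ignores, so real savings from dropping the critic are larger than shown.

```python
def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights only, in GiB (2 bytes/param assumes bf16/fp16)."""
    return num_params * bytes_per_param / 2**30


per_model = weights_gib(7e9)   # one 7B model
ppo_total = 4 * per_model      # policy, reference, reward model, critic
grpo_total = 3 * per_model     # policy, reference, reward model

print(f"per model: {per_model:.1f} GiB")
print(f"PPO (4 models):  {ppo_total:.1f} GiB")
print(f"GRPO (3 models): {grpo_total:.1f} GiB")
```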
PPO and GRPO
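PPO (the algorithm behind ChatGPT’s RLHF) updates the policy with a clipped surrogate objective that discourages steps which move the policy too far:
$$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}$$
The probability ratio $\rho_t(\theta)$ measures how much the new policy changed; clipping (typically $\epsilon = 0.2$) holds it within $[1-\epsilon,\, 1+\epsilon]$ on the side that would otherwise inflate the objective. This is a soft trust region that removes the incentive to move far, not a hard cap on the update, so KL is still worth watching. PPO uses GAE with a learned critic (the fourth model) for per-token credit.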
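The one-sided nature of the clip is easiest to see numerically. Below is a minimal per-token sketch of the clipped surrogate, assuming log-probabilities of the sampled token under the new and old policies and a precomputed advantage; the function name and the `eps=0.2` default are illustrative.

```python
import math


def ppo_clip_objective(
    logp_new: float, logp_old: float, advantage: float, eps: float = 0.2
) -> float:
    """Per-token clipped surrogate objective (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Taking the minimum makes the objective pessimistic: the clip only
    # applies when it lowers the objective, never when it raises it.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a ratio of 1.5 and a positive advantage the objective is capped at `1.2 * advantage`, so there is no further gain from pushing the ratio higher. With the same ratio and a negative advantage the unclipped term is kept, so the policy is still penalized in full for having raised the probability of a bad action.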
GRPO (introduced by DeepSeek in DeepSeekMath and used to train R1) removes the critic. For each prompt it samples a group of $G$ outputs and uses the group mean as the baseline:
$$\hat{A}_i = r_i - \frac{1}{G}\sum_{j=1}^{G} r_j$$
Each output’s advantage is simply how much better than its peers it scored. There is no value model, so memory drops from four models to three. (The original DeepSeekMath GRPO also divides by the group’s standard deviation, $\hat{A}_i = \big(r_i - \mathrm{mean}(r)\big) / \mathrm{std}(r)$; we use the mean-only form here for clarity, since the sign and ranking carry the lesson.)
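Both forms of the group-relative advantage fit in one small function. This is a sketch under stated assumptions: the name `grpo_advantages`, the use of the population standard deviation, and the small `eps` added to avoid division by zero are choices made here, and the reward values in the usage note are hypothetical.

```python
from statistics import mean, pstdev


def grpo_advantages(
    rewards: list[float], normalize_std: bool = False, eps: float = 1e-8
) -> list[float]:
    """Group-relative advantages for the rollouts of one prompt."""
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize_std:
        return centered  # mean-only form
    # Assumption: population std plus a small eps to avoid dividing by zero.
    scale = pstdev(rewards) + eps
    return [c / scale for c in centered]
```

For hypothetical rewards `[3.0, 1.0, 1.0, -1.0]` the baseline is 1.0 and the mean-only advantages are `[2.0, 0.0, 0.0, -2.0]`. Dividing by the standard deviation rescales these values but leaves their signs and ordering unchanged.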
Completion problem. The same prompt is rolled out times, now scoring . Fill the blanks: the group baseline is ___, and the four advantages are ___.
Now you. What are the advantages when every rollout succeeds ($r_i = 1$ for every $i$), and why does that make the prompt useless for this GRPO step?
PPO vs GRPO
| Aspect | PPO | GRPO |
|---|---|---|
| Advantage | Learned critic (GAE) | Group-mean baseline |
| Credit assignment | Per-token (GAE) | One advantage per sequence |
| Models in memory | 4 (policy, ref, reward, critic) | 3 (policy, ref, reward) |
| Complexity | Higher | Lower |
| Notable use | ChatGPT | DeepSeek-R1 |
| Reward | Reward model | Verifiers or reward model |
PPO vs GRPO
- Need token-level credit (long-form generation)? → PPO with GAE.
- Memory-constrained (can’t fit four models)? → GRPO (drops the critic).
- Verifiable rewards (math, code)? → GRPO with verifiers — the DeepSeek R1 recipe.
- General-purpose assistant with a learned reward model? → PPO remains the proven path.
- Want simplicity/stability? → GRPO — the field is trending toward simpler algorithms.
Beyond PPO and GRPO, the 2025–2026 landscape has proliferated into a family of variants — DAPO, KTO, SimPO, and the broader RLVR (RL with verifiable rewards) paradigm. You don’t need each by name; the useful skill is recognizing which of the three axes a new method is turning: the reward source, the baseline, and online vs offline.
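Mode collapse
Mode collapse is the RL-side failure mode in which reward pressure drives the policy to converge on a single output mode, losing diversity. The usual guards are the KL penalty, an entropy bonus, and GRPO’s group-relative comparison.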
Where practitioners disagree
GRPO ≈ DPO? Recent analysis argues GRPO and DPO optimize closely related objectives (both push probability mass toward preferred outputs and away from dispreferred ones), and the practical gap is mostly online vs offline: GRPO samples fresh rollouts each step, DPO trains once on a fixed preference set. So “which is better” is less a clean algorithmic win than a question of whether you can afford an online loop and have a live grader. [I] Inference
A second open question is how far RL can push a base model: there is evidence that RL largely elicits capabilities already latent in the base model rather than creating new ones, which would make base-model choice, not the RL algorithm, the dominant lever. In an interview, name the axis (online/offline, reward source, latent-vs-new capability) rather than declaring a winner.
Retrieval check
Answer from memory, then expand to check — or go deeper in the practice questions.
Write the Bradley-Terry reward-model loss and say what it learns (relative vs absolute). FRI-2.5
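$\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]$, where $y_w$ is preferred over $y_l$. It learns relative preference (which output is better), never an absolute goodness score; relative judgments are all that humans reliably annotate.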
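The "relative, not absolute" point can be checked directly. Here is a minimal sketch of the Bradley-Terry loss for one preference pair, taking two scalar reward-model scores as input; the function name is illustrative.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    diff = score_preferred - score_rejected
    # -log(sigmoid(d)) = log(1 + exp(-d)), written to avoid overflow.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

Only the difference between the two scores enters the loss, so `bradley_terry_loss(2.0, 1.0)` and `bradley_terry_loss(102.0, 101.0)` are identical. Adding a constant to every score changes nothing, which is why the learned reward has no meaningful absolute scale.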
Name the four RLHF models and which one GRPO removes. FRI-2.5
Policy, reference (frozen, for the KL term), reward model, and value/critic. GRPO removes the value/critic, replacing it with a group-mean baseline → three models.
Compute GRPO advantages for rewards [1, 1, 0, 0]; when does a prompt give zero gradient? FRI-2.5
Baseline = mean = 0.5 → advantages $[+0.5,\, +0.5,\, -0.5,\, -0.5]$. A prompt gives zero gradient when all rollouts score the same (the baseline equals every reward), so GRPO needs prompts of intermediate difficulty.
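The zero-gradient condition is simple to test for before spending an update on a prompt. The helper below is an illustrative sketch, not part of GRPO as described on this page; the function names and the prompt labels in the usage note are hypothetical.

```python
def has_learning_signal(rewards: list[float], tol: float = 1e-12) -> bool:
    """True if at least one rollout's reward differs from the group mean."""
    baseline = sum(rewards) / len(rewards)
    return any(abs(r - baseline) > tol for r in rewards)


def filter_prompts(groups: dict[str, list[float]]) -> dict[str, list[float]]:
    """Keep only prompts whose rollouts do not all score the same."""
    return {
        prompt: rewards
        for prompt, rewards in groups.items()
        if has_learning_signal(rewards)
    }
```

Given hypothetical groups `{"easy": [1, 1, 1, 1], "mixed": [1, 1, 0, 0], "hard": [0, 0, 0, 0]}`, only `"mixed"` survives: the always-solved and never-solved prompts both produce all-zero advantages.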
State the role of the KL penalty and what happens if beta is too small. FRI-2.5, FRI-2.6
The KL penalty, weighted by $\beta$, keeps the policy near the frozen reference, guarding against reward hacking and mode collapse. Too small → the policy drifts and games the reward; too large → it can’t improve.
Distinguish catastrophic forgetting from mode collapse, with one mitigation each. FRI-2.6
Forgetting (SFT side): narrow-data training overwrites general capabilities → mitigate with LoRA / a general-data mix / low epochs. Mode collapse (RL side): reward pressure converges on one output mode → mitigate with the KL penalty / entropy bonus / GRPO.
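One crude way to watch for mode collapse during training is to track how diverse the rollouts for a prompt are. The sketch below computes the empirical entropy over distinct sampled outputs; it is a monitoring heuristic assumed for illustration, not a mitigation named in this chapter, and exact-string matching will miss outputs that differ only superficially.

```python
import math
from collections import Counter


def sample_entropy(outputs: list[str]) -> float:
    """Empirical entropy (in nats) of the distinct outputs in one group."""
    counts = Counter(outputs)
    n = len(outputs)
    return sum(-(c / n) * math.log(c / n) for c in counts.values())
```

A group of identical rollouts has entropy 0, while $G$ distinct rollouts give $\log G$. A value that trends toward 0 across training steps suggests the policy is converging on a single mode.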
Summary
RL post-training optimizes a scalar reward on the model’s own generations, regularized toward a reference by a KL penalty. Rewards come from deterministic verifiers (clean and much harder to game, though a weak checker can still be exploited) or learned reward models trained with the Bradley-Terry preference loss; DPO skips the reward model and trains offline on preference pairs. The online algorithms trade off along clear axes: PPO uses a learned critic for token-level credit (four models), while GRPO replaces the critic with a group-mean baseline (three models, sequence-level) and powers verifier-driven reasoning models like DeepSeek R1. The recurring failure mode is mode collapse, a loss of diversity under reward pressure, held off by the KL penalty, entropy bonuses, and group-relative comparison. That completes the core mechanics; later chapters turn to evaluation, data, and production.