Part 1 Chapter 2 Last verified 2026-06-18

Core Techniques II: RL Algorithms — RLHF, PPO, GRPO

The RL half of post-training mechanics: reward signals and preference learning, DPO, the RLHF objective and its KL penalty, the four models of RLHF, PPO's clipped objective, GRPO's group-relative baseline, and the failure mode of mode collapse.

On this page
  1. Reward signals: verifiers and reward models
  2. The RLHF objective
  3. PPO and GRPO
  4. PPO vs GRPO
  5. Mode collapse
  6. Where practitioners disagree
  7. Retrieval check
  8. Summary

Reward signals: verifiers and reward models

RL needs something to score the model’s own generations. Two kinds:

  • Verifiers: deterministic checks (“is this math answer correct?”, “does the code pass?”). No ambiguity, and far less to game than a learned model, though a weak or incomplete checker can still be exploited. DeepSeek-R1-Zero famously used just two rule-based rewards: answer accuracy (e.g. math correctness) and format compliance.
  • Reward models: neural nets trained on human preference pairs to score open-ended outputs where no verifier exists.
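As a rough illustration of the verifier idea, here is a minimal sketch of two rule-based checks. The function names, the "last number wins" rule, and the `<think>`/`<answer>` tag layout are assumptions made for this example, not a description of any specific training pipeline.

```python
import re

def math_verifier(completion: str, expected: str) -> float:
    """Return 1.0 if the last number in the completion equals `expected`, else 0.0.

    Hypothetical rule for illustration: only the final number is checked.
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(expected) else 0.0

def format_verifier(completion: str) -> float:
    """Return 1.0 if the completion follows an assumed <think>...</think><answer>...</answer> layout."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, completion, flags=re.DOTALL) else 0.0

# Total reward is just the sum of the two checks in this sketch.
sample = "<think>6 * 7 = 42</think><answer>42</answer>"
reward = math_verifier(sample, "42") + format_verifier(sample)
```

Note how the "last number" rule is exactly the kind of weak checker that can be exploited: a completion such as "maybe 7 or 42" would also score 1.0 on correctness.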

DPO (Direct Preference Optimization) takes a shortcut: it skips the reward model entirely and trains the policy directly on chosen/rejected pairs, with an implicit KL term (strength $\beta$) to a reference. One model, one fixed dataset, no RL loop.
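To make the "no reward model" point concrete, here is a minimal sketch of the standard DPO loss for a single preference pair, written on plain floats. The argument names are assumptions for this example; each `logp` stands for the summed log-probability of a whole response under the policy or the frozen reference.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin difference)."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) == log(1 + exp(-z)), computed in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: loss is log(2).
at_reference = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted mass toward the chosen response: loss goes down.
after_update = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The reference log-probabilities are what supply the implicit KL pull: the loss only rewards movement relative to the reference, scaled by `beta`.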

The RLHF objective

When you do run the online loop, the RLHF objective is reward maximization regularized toward a reference:

$$\max_\theta\; \mathbb{E}_{x \sim D,\; y \sim \pi_\theta(\cdot|x)} \Bigl[\, r_\phi(x, y) \;-\; \beta\,\mathrm{KL}\bigl(\pi_\theta(\cdot|x)\,\|\,\pi_{\text{ref}}(\cdot|x)\bigr) \,\Bigr]$$

The KL penalty (coefficient $\beta$) is load-bearing: it keeps the policy close to the frozen reference $\pi_{\text{ref}}$, guarding against reward hacking and mode collapse. Too small and the policy drifts into degenerate high-reward regions; too large and it never improves.
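The trade-off can be seen on a toy problem. The sketch below evaluates the regularized objective exactly for one prompt with four possible outputs; the distributions, rewards, and $\beta$ values are made-up numbers chosen only to illustrate the effect.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outputs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def rlhf_objective(policy, reference, rewards, beta):
    """Expected reward under the policy minus beta times KL(policy || reference)."""
    expected_reward = sum(pi * ri for pi, ri in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

reference = [0.25, 0.25, 0.25, 0.25]   # frozen reference: uniform over four outputs
rewards = [1.0, 0.0, 0.0, 0.0]         # only the first output is rewarded
peaked = [0.97, 0.01, 0.01, 0.01]      # a policy that has nearly collapsed onto it

for beta in (0.1, 1.0):
    stay = rlhf_objective(reference, reference, rewards, beta)
    move = rlhf_objective(peaked, reference, rewards, beta)
    print(f"beta={beta}: stay at reference={stay:.3f}, peaked policy={move:.3f}")
```

With the small $\beta$ the near-collapsed policy scores higher than staying at the reference; with the large $\beta$ the KL cost outweighs the reward gain and this particular peaked policy scores lower.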

Key concept

The four models of RLHF

FRI-2.5

Full PPO-based RLHF holds four models in memory at once:

  1. Policy $\pi_\theta$: the LLM being trained.
  2. Reference $\pi_{\text{ref}}$: a frozen copy, for the KL term.
  3. Reward model $r_\phi$: scores generated outputs.
  4. Value model (critic): estimates expected future reward for advantage estimation.

Why it matters: four 7B+ models is enormous memory pressure — the primary reason the field moved toward GRPO, which drops the critic. [V] Verified
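For a rough sense of scale, here is a back-of-the-envelope sketch of weight memory alone. It assumes all models are the same size and stored at 2 bytes per parameter (e.g. bf16); in practice the reward model and critic may differ in size, and optimizer state, gradients, activations, and KV cache add substantially on top.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

per_model = weight_memory_gb(7e9)   # one 7B model at 2 bytes/param
ppo_weights = 4 * per_model         # policy + reference + reward model + critic
grpo_weights = 3 * per_model        # same, minus the critic

print(f"per model: {per_model:.0f} GB, PPO: {ppo_weights:.0f} GB, GRPO: {grpo_weights:.0f} GB")
```

Under these assumptions the figures come out to roughly 14 GB per model, 56 GB for the four-model setup, and 42 GB without the critic. Because the critic is also trained, dropping it saves its gradients and optimizer state as well, which this weights-only estimate does not show.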

PPO and GRPO

PPO (the algorithm behind ChatGPT’s RLHF) updates the policy with a clipped surrogate objective that discourages steps which move the policy too far:

$$\mathcal{L}^{\text{PPO}} = \min\!\Bigl(\rho_t\,\hat{A}_t,\; \mathrm{clip}(\rho_t,\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\Bigr), \quad \rho_t = \frac{\pi_\theta(a|s)}{\pi_{\text{old}}(a|s)}$$

The probability ratio $\rho_t$ measures how much the new policy changed; clipping (typically $\epsilon = 0.2$) holds it within $[1-\epsilon,\,1+\epsilon]$ on the side that would otherwise inflate the objective. This is a soft trust region that removes the incentive to move far, not a hard cap on the update (so KL is still worth watching). PPO uses GAE with a learned critic (the 4th model) for per-token credit.
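The one-sided nature of the clip is easiest to see by evaluating the surrogate for a single token. This is a minimal sketch on plain floats; the function name and the example ratios are assumptions for illustration.

```python
def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one token: min(rho * A, clip(rho, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# ratio, advantage -> which branch of the min is active
cases = {
    "raised prob of a good action": ppo_clipped_term(1.5, +1.0),   # clipped at 1.2
    "lowered prob of a good action": ppo_clipped_term(0.5, +1.0),  # unclipped, 0.5
    "raised prob of a bad action": ppo_clipped_term(1.5, -1.0),    # unclipped, -1.5
    "lowered prob of a bad action": ppo_clipped_term(0.5, -1.0),   # clipped at -0.8
}
```

Only moves in the "helpful" direction are clipped: once the ratio has already moved past the bound in the direction the advantage favors, the objective is flat and provides no further incentive. Moves in the harmful direction keep their full penalty.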

GRPO (DeepSeek’s algorithm for R1) removes the critic. For each prompt it samples a group of $G$ outputs and uses the group mean as the baseline:

$$\hat{A}_i = r(y_i) - \frac{1}{G}\sum_{j=1}^{G} r(y_j)$$

Each output’s advantage is simply how much better than its peers it scored. There is no value model, so memory drops from four models to three. (The original DeepSeekMath GRPO also divides by the group’s standard deviation, $\hat{A}_i = (r_i - \text{mean})/\text{std}$; we use the mean-only form here for clarity, since the sign and ranking carry the lesson.)
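The group-relative baseline fits in a few lines. The sketch below implements the mean-only form used here, with the standard-deviation variant as an option; the function name, the use of the population standard deviation, and the small `eps` guard are assumptions for this example (implementations differ on those details).

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, normalize=False, eps=1e-8):
    """Group-relative advantages: each reward minus the group mean.

    With normalize=True, also divide by the group's standard deviation
    (population std plus a small eps, both assumptions of this sketch).
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    std = pstdev(rewards)
    return [c / (std + eps) for c in centered]

# One success out of four rollouts: baseline 0.25.
advantages = grpo_advantages([1, 0, 0, 0])   # [0.75, -0.25, -0.25, -0.25]
```

The advantages always sum to zero within a group, so the lone success is pushed up strongly while each failure is pushed down a little.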

Completion problem. The same prompt is rolled out $G = 4$ times, now scoring $r = [1, 1, 1, 0]$. Fill the blanks: the group baseline is ___, and the four advantages are ___.

Now you. What are the advantages when every rollout succeeds ($r = [1, 1, 1, 1]$), and why does that make the prompt useless for this GRPO step?

PPO vs GRPO

| Aspect | PPO | GRPO |
|---|---|---|
| Advantage | Learned critic (GAE) | Group-mean baseline |
| Credit assignment | Per-token (GAE) | One advantage per sequence |
| Models in memory | 4 (policy, ref, reward, critic) | 3 (policy, ref, reward) |
| Complexity | Higher | Lower |
| Notable use | ChatGPT | DeepSeek-R1 |
| Reward | Reward model | Verifiers or reward model |

Decision tree

PPO vs GRPO

  • Need token-level credit (long-form generation)? → PPO with GAE.
  • Memory-constrained (can’t fit four models)? → GRPO (drops the critic).
  • Verifiable rewards (math, code)? → GRPO with verifiers (the DeepSeek R1 recipe).
  • General-purpose assistant with a learned reward model? → PPO remains the proven path.
  • Want simplicity/stability? → GRPO; the field is trending toward simpler algorithms.

Beyond PPO and GRPO, the 2025–2026 landscape has proliferated into a family of variants — DAPO, KTO, SimPO, and the broader RLVR (RL with verifiable rewards) paradigm. You don’t need each by name; the useful skill is recognizing which of the three axes a new method is turning: the reward source, the baseline, and online vs offline.

Mode collapse

Where practitioners disagree

GRPO ≈ DPO? Recent analysis argues GRPO and DPO optimize closely related objectives (both push probability mass toward preferred outputs and away from dispreferred ones), and the practical gap is mostly online vs offline: GRPO samples fresh rollouts each step, DPO trains once on a fixed preference set. So “which is better” is less a clean algorithmic win than a question of whether you can afford an online loop and have a live grader. [I] Inference

A second open question is how far RL can push a base model: there is evidence that RL largely elicits capabilities already latent in the base model rather than creating new ones, which would make base-model choice, not the RL algorithm, the dominant lever. In an interview, name the axis (online/offline, reward source, latent-vs-new capability) rather than declaring a winner.

Retrieval check

Answer from memory, then expand to check — or go deeper in the practice questions.

Write the Bradley-Terry reward-model loss and say what it learns (relative vs absolute). FRI-2.5

$\mathcal{L} = -\log\sigma\!\left(r(y_w) - r(y_l)\right)$, where $y_w$ is the preferred output and $y_l$ the rejected one. It learns relative preference (which output is better), which is all humans reliably annotate, and never an absolute goodness score.
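The "relative, not absolute" point can be checked numerically. This is a minimal sketch of the loss above on plain floats; the function name and the example scores are assumptions for illustration.

```python
import math

def bradley_terry_loss(reward_preferred: float, reward_rejected: float) -> float:
    """-log sigmoid(r(y_w) - r(y_l)), computed in a numerically stable form."""
    margin = reward_preferred - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Only the difference matters: shifting both scores by 100 leaves the loss unchanged.
small_scale = bradley_terry_loss(2.0, 1.0)
shifted = bradley_terry_loss(102.0, 101.0)
# Ranking the pair the wrong way round costs more.
wrong_order = bradley_terry_loss(1.0, 2.0)
```

Because the loss depends only on the margin, the reward model's raw scores have no fixed zero point, which is one reason group-relative or baseline-subtracted advantages are used downstream.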

Name the four RLHF models and which one GRPO removes. FRI-2.5

Policy, reference (frozen, for the KL term), reward model, and value/critic. GRPO removes the value/critic, replacing it with a group-mean baseline → three models.

Compute GRPO advantages for rewards [1, 1, 0, 0]; when does a prompt give zero gradient? FRI-2.5

Baseline = mean = 0.5 → advantages $[+0.5, +0.5, -0.5, -0.5]$. A prompt gives zero gradient when all rollouts score the same (the baseline equals every reward), so GRPO needs prompts of intermediate difficulty.
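One practical consequence is that groups with identical rewards can be detected before the update. The sketch below is a hypothetical helper, not part of any specific GRPO implementation; the group names and rewards are made up.

```python
def has_learning_signal(rewards) -> bool:
    """True if at least two rollouts in the group scored differently."""
    return max(rewards) != min(rewards)

groups = {
    "too_easy": [1, 1, 1, 1],   # every rollout succeeds: all advantages are 0
    "too_hard": [0, 0, 0, 0],   # every rollout fails: all advantages are 0
    "mixed": [1, 1, 0, 0],      # intermediate difficulty: nonzero advantages
}
useful = [name for name, rewards in groups.items() if has_learning_signal(rewards)]
```

Here only the `mixed` group would contribute to the step; whether to skip, resample, or keep the other prompts is a design choice this sketch does not make.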

State the role of the KL penalty and what happens if beta is too small. FRI-2.5, FRI-2.6

$-\beta\,\mathrm{KL}(\pi_\theta \| \pi_{\text{ref}})$ keeps the policy near the frozen reference, guarding against reward hacking and mode collapse. Too small → the policy drifts and games the reward; too large → it can’t improve.

Distinguish catastrophic forgetting from mode collapse, with one mitigation each. FRI-2.6

Forgetting (SFT side): narrow-data training overwrites general capabilities → mitigate with LoRA / a general-data mix / low epochs. Mode collapse (RL side): reward pressure converges on one output mode → mitigate with the KL penalty / entropy bonus / GRPO.
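One simple way to watch for mode collapse is to track the diversity of sampled outputs per prompt. The sketch below computes the empirical entropy of a batch of samples; it is an assumed diagnostic for illustration, not the entropy bonus itself (which is computed on the policy's token distribution), and the sample strings are made up.

```python
import math
from collections import Counter

def empirical_entropy(samples) -> float:
    """Shannon entropy (in nats) of the empirical distribution over distinct samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

diverse = ["proof by induction", "direct proof", "counterexample", "case split"]
collapsed = ["direct proof"] * 4

diverse_entropy = empirical_entropy(diverse)       # log(4): four distinct outputs
collapsed_entropy = empirical_entropy(collapsed)   # zero: a single output mode
```

A steady fall in this number across training steps, for the same prompts and sampling settings, is a hint that reward pressure is concentrating the policy on one mode.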

Summary

RL post-training optimizes a scalar reward on the model’s own generations, regularized toward a reference by a KL penalty. Rewards come from deterministic verifiers (clean and hard to game, though a weak checker can still be exploited) or learned reward models trained with the Bradley-Terry preference loss; DPO skips the reward model and trains offline on preference pairs. The online algorithms trade off along clear axes: PPO uses a learned critic for token-level credit (four models), while GRPO replaces the critic with a group-mean baseline (three models, sequence-level) and powers verifier-driven reasoning models like DeepSeek R1. The recurring failure mode is mode collapse, a loss of diversity under reward pressure, held off by the KL penalty, entropy bonuses, and group-relative comparison. That completes the core mechanics; later chapters turn to evaluation, data, and production.