Part 1 Chapter 1 Last verified 2026-06-18

Post-Training Overview

Where post-training sits in the LLM pipeline, the SFT-vs-RL tradeoff, the data and grading each needs, how reasoning and safety alignment are trained, and how reward hacking arises and is mitigated.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to SFT vs RL: the core tradeoff; if any is shaky, read closely — each is developed below.

What are the three stages of LLM training, and what does each optimize for?
In one sentence each, how do SFT and RL differ in what signal they learn from?
Why can a base model not be deployed as a chat assistant as-is?
What is reward hacking, and name one mitigation.

Check your answers

Pre-training (next-token prediction → broad knowledge), mid-training (curated data → fluency, modalities, longer context), post-training (SFT + RL → instruction-following, safety, reasoning).
SFT learns from (input, target-output) demonstrations (imitation); RL learns from a scalar reward scoring its own generations (trial-and-error).
It predicts plausible next tokens but has no notion of following instructions, refusing harmful requests, or showing structured reasoning.
The policy maximizes the reward as specified rather than as intended, exploiting shortcuts. Mitigations: KL penalty to the SFT baseline, deterministic verifiers, reward ensembles, diversity monitoring.

Why post-training matters

A pre-trained model predicts the next token well but has no concept of instructions, safety boundaries, or structured reasoning. Post-training closes that gap. Supervised fine-tuning (SFT) is the mature, efficient technique, and parameter-efficient methods such as LoRA can bring it within reach of a single GPU for many model sizes. Reinforcement learning (RL) can push capabilities beyond what the available human demonstrations teach. The meta-skill tying it together is error analysis: train, evaluate, diagnose, fix the data, repeat.

The arc shows accelerating sophistication: basic task fine-tuning (2018–2020) → InstructGPT / RLHF (2022), which showed that RL-based alignment sharply improves helpfulness and safety → DPO (2023), which simplified the preference-optimization pipeline → reasoning models such as DeepSeek R1 (Jan 2025), which push RL past imitation → frontier labs now running multi-stage SFT→RL→SFT→RL pipelines.

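The three-stage training pipeline

Pre-training optimizes next-token prediction on broad text and yields general knowledge. Mid-training continues on curated data to refine fluency, add modalities, and extend context length. Post-training applies SFT and RL to produce instruction-following, safety, and reasoning.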

SFT vs RL: the core tradeoff

The tradeoff in one table — worth being able to reproduce from memory:

| Axis | SFT | RL |
|---|---|---|
| Signal | (input, output) demonstrations | Scalar reward on own outputs |
| Stability | High: gradient on gold labels | Lower: policy exploration |
| Ceiling | Bounded by demonstrations | Can exceed human demos |
| Data | Curated pairs | Reward signal / verifier |
| Compute | Moderate (fwd + bwd) | High (generate + score + update) |
| Maturity | Very mature; LoRA well-understood | Fast-evolving (PPO → GRPO) |
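The "Signal" row can be made concrete with a toy sketch. It assumes per-token log-probabilities are already available as plain floats, and it uses the basic REINFORCE surrogate as a stand-in for the RL side; PPO and GRPO add clipping and other machinery not shown here. The function names are illustrative, not taken from any library.

```python
import math


def sft_loss(target_logprobs: list[float]) -> float:
    """Imitation signal: mean negative log-likelihood of the gold target tokens."""
    return -sum(target_logprobs) / len(target_logprobs)


def reinforce_loss(sampled_logprobs: list[float], reward: float,
                   baseline: float = 0.0) -> float:
    """Trial-and-error signal: REINFORCE surrogate loss for one sampled sequence.

    Minimizing it raises the log-probability of the model's own sample when
    reward > baseline and lowers it when reward < baseline.
    """
    return -(reward - baseline) * sum(sampled_logprobs)


# SFT scores the tokens of a human-written target under the model.
gold_target = [math.log(0.5), math.log(0.25)]
imitation = sft_loss(gold_target)

# RL scores tokens the model sampled itself, weighted by a scalar reward.
own_sample = [math.log(0.5), math.log(0.5)]
exploration = reinforce_loss(own_sample, reward=1.0, baseline=0.5)
```

The only difference between the two functions is where the tokens come from and what weights them: gold labels with unit weight for SFT, the model's own samples with a reward-derived weight for RL.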

When to use which

Decision tree

SFT vs RL

Do you have demonstrations that already capture ceiling performance? → Use SFT; RL adds cost with little upside.
Have demonstrations but want to exceed their quality? → SFT first for a strong baseline, then RL.
No demonstrations but a reliable reward signal? → RL directly (rare; see DeepSeek R1 Zero).
Is the task verifiable (code, math, lookup)? → RL with deterministic verifiers — strongest signal, least reward hacking.
Quality is subjective (creative, open-ended)? → SFT + optional RL with LLM-as-judge, but watch for reward hacking.
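The decision points above can be written down as a small function, which makes the ordering of the checks explicit. This is one possible reading of the list, not a canonical rule: the `Task` fields, the returned strings, and the fallback branch for a task with neither demonstrations nor a reward signal are all assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    has_demos: bool
    demos_at_ceiling: bool = False   # demonstrations already capture the target quality
    verifiable: bool = False         # code, math, lookup: a deterministic check exists
    reliable_reward: bool = False    # some other trustworthy reward signal exists


def choose_method(task: Task) -> str:
    """Map the decision list to a recommendation string (illustrative only)."""
    if task.has_demos and task.demos_at_ceiling:
        return "SFT"
    if task.has_demos and task.verifiable:
        return "SFT -> RL (deterministic verifier)"
    if task.has_demos:
        return "SFT -> optional RL (LLM-as-judge, watch for reward hacking)"
    if task.verifiable or task.reliable_reward:
        return "RL directly"
    # Not covered by the list above; added so the function is total.
    return "undecided: no demonstrations and no reward signal"


recommendation = choose_method(Task(has_demos=True, verifiable=True))
```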

Data and grading: the two ingredients

Every method needs data to train on and a grading mechanism to score outputs. SFT and RL differ sharply in both.

SFT data is (input, target) pairs as multi-turn chat: chat-history pairs, chain-of-thought derivations, RAG-recovery patterns, and guardrail/refusal demonstrations.

RL grading scores model-generated outputs, in order of reliability: deterministic verifiers (code tests, math checkers — no ambiguity) → an LLM-as-judge (scalable but driftable/gameable) → full environment feedback (task-completion reward from acting in a tool/web environment).
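As an illustration of the most reliable tier, here is a minimal sketch of a deterministic math verifier. It assumes the model is prompted to end its output with a line of the form `Answer: <number>`; that marker and the function names are hypothetical conventions, not a standard. Answers are compared as exact fractions so that `0.5` and `1/2` count as equal.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(text: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*([^\n]+)", text)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, gold: str) -> float:
    """Binary reward: 1.0 if the final answer equals the gold value, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        return 1.0 if Fraction(answer) == Fraction(gold) else 0.0
    except (ValueError, ZeroDivisionError):
        # Unparseable answers (e.g. "four", "1/0") earn no reward.
        return 0.0


reward = math_reward("Half of one is 0.5.\nAnswer: 0.5", gold="1/2")
```

Because the check is a pure function of the output, the same completion always receives the same score, which is what makes this tier harder to game than a learned judge.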

Reward hacking

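Reward hacking occurs when the policy maximizes the reward as specified rather than as intended, exploiting shortcuts in the grader. The KL penalty is the regularizer that keeps the policy $\pi_\theta$ close to the reference $\pi_{\text{ref}}$ (the SFT model): the objective adds a term $-\beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right)$, so straying far enough to chase a hacked reward is itself penalized. Other mitigations are deterministic verifiers, reward ensembles, and diversity monitoring.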
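A minimal sketch of how the penalty enters the reward for a single sampled sequence follows. It assumes per-token log-probabilities of the sampled tokens under both models are available, and it uses the simple single-sample estimate of the KL term (sum of log-probability differences); real trainers differ in which estimator they use and whether they apply it per token. The function names and the default `beta` are illustrative.

```python
def kl_estimate(policy_logprobs: list[float], ref_logprobs: list[float]) -> float:
    """Single-sample estimate of KL(pi_theta || pi_ref) for a sequence drawn
    from pi_theta: sum over tokens of log pi_theta(token) - log pi_ref(token).
    An individual estimate can be negative; the expectation cannot."""
    return sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))


def penalized_reward(reward: float, policy_logprobs: list[float],
                     ref_logprobs: list[float], beta: float = 0.1) -> float:
    """Task reward minus beta times the estimated divergence from the reference."""
    return reward - beta * kl_estimate(policy_logprobs, ref_logprobs)


# The policy is far more confident in these tokens than the SFT reference was,
# so part of the raw reward is taken back.
shaped = penalized_reward(1.0, [-0.1, -0.2], [-0.5, -0.7], beta=0.1)
```

A larger `beta` ties the policy more tightly to the SFT model and limits both hacking and legitimate improvement; a smaller one does the opposite.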

Reasoning: SFT and RL working together

SFT teaches a process — given a problem, produce a step-by-step CoT then the answer. It learns the format and common strategies from demonstrations, but cannot discover approaches absent from the data. RL with a verifier (e.g. checking the final answer) lets the model explore the space of reasoning paths and find shortcuts or novel decompositions no human demonstrated.

Key concept

DeepSeek R1 Zero

FRI-1.4

RL-only training (no SFT) on a base model produced emergent chain-of-thought reasoning: the model learned to “think out loud” purely from the reward signal, without ever being shown a CoT demonstration. This showed that RL alone can induce structured reasoning.

When it breaks: RL-only is unstable (language mixing, repetition loops). DeepSeek added SFT stages before and after RL; pure RL-only remains a research result, not a production recipe. [V] Verified

Safety alignment: Constitutional AI / RLAIF

Key concept

Constitutional AI / RLAIF

FRI-1.4

Anthropic’s Constitutional AI replaces human preference labels with AI-generated preferences, scaling alignment without proportional annotation cost:

Write a constitution: principles for desired behavior (“be helpful, harmless, honest”).
Self-critique → SFT data: the model critiques and revises its own responses against the constitution; the revisions become SFT data.
RLAIF: the model samples pairs of responses and an AI judge picks the better one against the constitution; a reward model trains on those AI preferences, then drives RL.

Result (per the paper): equally helpful, significantly more harmless than RLHF baselines, with far fewer human labels.

When it breaks: AI preferences inherit the judge’s blind spots — a vague constitution or weak critic amplifies bias instead of correcting it. Human spot-checks remain essential. [I] Inference
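The self-critique step described above can be sketched as a loop that turns one prompt into one SFT example. This is a structural illustration only: `generate`, `critique`, and `revise` are hypothetical callables standing in for model calls, and applying the principles one at a time in order is an assumption rather than a description of Anthropic's implementation.

```python
from typing import Callable


def build_revision_example(
    prompt: str,
    principles: list[str],
    generate: Callable[[str], str],
    critique: Callable[[str, str, str], str],
    revise: Callable[[str, str, str], str],
) -> dict:
    """Draft a response, then critique and revise it once per principle.

    Returns an (input, target) pair whose target is the final revision,
    ready to be added to an SFT dataset.
    """
    draft = generate(prompt)
    for principle in principles:
        note = critique(prompt, draft, principle)   # what violates the principle?
        draft = revise(prompt, draft, note)         # rewrite to address the note
    return {"input": prompt, "target": draft}


# Stub callables so the sketch runs without a model.
example = build_revision_example(
    "How do I pick a lock?",
    ["be harmless", "be helpful"],
    generate=lambda p: "draft",
    critique=lambda p, d, principle: f"check against: {principle}",
    revise=lambda p, d, note: d + " (revised)",
)
```

Logging each critique note alongside the revision gives the human spot-checks something concrete to audit.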

Hands-on: safety evaluation with Llama Guard

Llama Guard is a classifier that labels a prompt–response pair across a set of harm categories. Its text output is either "safe", or "unsafe" followed on the next line by the violated category codes (for example S2). A minimal parser:

import re

def parse_llama_guard_response(response: str) -> dict:
    """'unsafe\\nS2' -> {'safe': False, 'categories': ['S2']}"""
    lines = response.strip().split("\n")
    # The first line carries the verdict: "safe" or "unsafe".
    is_safe = lines[0].strip().lower() == "safe"
    categories = []
    if not is_safe:
        # A category line may hold several comma-separated codes
        # (e.g. "S1,S10"), so collect every match, not just the first.
        for line in lines[1:]:
            categories.extend(re.findall(r"\bS\d{1,2}\b", line))
    return {"safe": is_safe, "categories": categories}
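For an evaluation run, the parsed verdicts usually need to be aggregated. The sketch below assumes a list of dicts in the shape the parser above returns (`{"safe": bool, "categories": [...]}`); the function name and the returned keys are illustrative choices.

```python
from collections import Counter


def summarize_verdicts(parsed: list[dict]) -> dict:
    """Aggregate parsed verdicts into an unsafe rate and per-category counts."""
    total = len(parsed)
    unsafe = [p for p in parsed if not p["safe"]]
    by_category = Counter(code for p in unsafe for code in p["categories"])
    return {
        "total": total,
        "unsafe_rate": len(unsafe) / total if total else 0.0,
        "by_category": dict(by_category),
    }


report = summarize_verdicts([
    {"safe": True, "categories": []},
    {"safe": False, "categories": ["S2"]},
    {"safe": False, "categories": ["S2", "S10"]},
])
```

Tracking counts per category rather than a single unsafe rate makes it easier to see which kind of refusal data the next SFT round needs.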

Post-training in the wild

Three flagship open pipelines, three design choices:

DeepSeek R1: base → cold-start CoT SFT → GRPO RL on math/code verifiers → rejection-sampling SFT (best-of-N) → final alignment RL. Notable for R1 Zero (RL-only reasoning) and open-sourcing the recipe; distilled from the 671B model into 1.5B–70B models.
LLaMA 3.1/3.3 (Meta): 15T+ token pre-train → mid-training (long-context extension to 128K + multilingual data) → SFT → rejection sampling → DPO, repeated over several rounds with iterative eval and data refresh; the Llama 3 report describes choosing DPO over PPO. 3.3 (70B) approached 3.1 405B quality on Meta's reported benchmarks, which Meta attributes to improved post-training.
Qwen 2.5 (Alibaba): multilingual pre-train → SFT (code/math/chat) → DPO → domain RL. Shows DPO as a lightweight PPO alternative, consistent across 0.5B–72B.
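The GRPO step in the DeepSeek R1 pipeline listed above replaces a learned value function with a per-prompt baseline: several completions are sampled for the same prompt, and each one's advantage is its reward normalized against the group. A minimal sketch of that normalization follows; the use of the population standard deviation and the `eps` guard are implementation assumptions, and the clipped policy update that consumes these advantages is not shown.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize the rewards of completions sampled for one prompt.

    Each advantage is (reward - group mean) / (group std + eps), so completions
    better than their siblings get a positive weight and worse ones a negative
    weight, without training a separate critic.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one math prompt, scored 1.0 (correct) or 0.0 (wrong)
# by a deterministic verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every completion in a group gets the same reward, all advantages are zero and the prompt contributes no gradient, which is one reason prompt difficulty matters for this kind of training.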

Retrieval check

Answer from memory, then expand to check — or go deeper in the practice questions.

Name the three training stages and what each optimizes for. FRI-1.1

Pre-training (next-token on internet text → broad knowledge) → mid-training (curated data → fluency, modalities, longer context) → post-training (SFT + RL → instruction-following, safety, reasoning).

Reproduce the SFT-vs-RL table — at least four axes. FRI-1.2

Signal (demonstrations vs scalar reward), Stability (high vs lower), Ceiling (bounded by demos vs can exceed), Data (curated pairs vs reward/verifier), Compute (moderate vs high). RL trades stability for a higher ceiling.

List the three RL grading mechanisms in reliability order, with an example each. FRI-1.3

Deterministic verifiers (code tests, math checkers — no ambiguity) → LLM-as-judge (scalable but driftable/gameable) → full environment feedback (task-completion reward from acting in a tool/web environment).

State reward hacking in one sentence; give two mitigations. FRI-1.6

The policy maximizes the reward as specified rather than as intended, exploiting shortcuts. Mitigations (any two): a KL penalty to the SFT baseline, deterministic verifiers, reward ensembles, diversity monitoring.

What is the "frontier recipe" for reasoning, in four stages? FRI-1.4

From the base model: (1) cold-start CoT SFT → (2) RL with verifiers (GRPO) → (3) rejection-sampling SFT (best-of-N) → (4) final alignment RL. The frontier alternates SFT and RL.

Where practitioners disagree

The frontier recipe is not settled. RL-only vs SFT→RL: R1 Zero showed RL-only can induce reasoning, yet the production pipelines described above all reintroduce SFT for stability, so whether SFT is necessary is contested in principle even though using it is near-universal in practice. LLM-as-judge reliability: judges are scalable but gameable and tend to favor verbose, confident answers; teams disagree on when a judge is trustworthy enough to drive RL rather than only offline eval. When you cite a recipe in an interview, name the tradeoff you’re accepting, not just the steps.

Summary

Post-training turns raw intelligence into an aligned, usable system. The three-stage pipeline separates concerns: broad knowledge, refined fluency, targeted behavior. SFT is stable imitation bounded by demonstration quality; RL is exploration bounded by reward quality. The frontier alternates them, each covering the other’s weakness. Safety scales via Constitutional AI / RLAIF, and the open ecosystem (DeepSeek, LLaMA, Qwen) has made these techniques broadly accessible.