Part 1 Chapter 1 Last verified 2026-06-18

Post-Training Overview

Where post-training sits in the LLM pipeline, the SFT-vs-RL tradeoff, the data and grading each needs, how reasoning and safety alignment are trained, and how reward hacking arises and is mitigated.

On this page
  1. Why post-training matters
  2. The three-stage training pipeline
  3. SFT vs RL: the core tradeoff
  4. When to use which
  5. Data and grading: the two ingredients
  6. Reward hacking
  7. Reasoning: SFT and RL working together
  8. Safety alignment: Constitutional AI / RLAIF
  9. Hands-on: safety evaluation with Llama Guard
  10. Post-training in the wild
  11. Retrieval check
  12. Where practitioners disagree
  13. Summary

Why post-training matters

A pre-trained model predicts the next token well but has no concept of instructions, safety boundaries, or structured reasoning. Post-training closes that gap. Supervised fine-tuning (SFT) is the mature, efficient technique — LoRA makes it feasible on a single GPU. Reinforcement learning (RL) unlocks capabilities beyond what any human demonstration can teach. The meta-skill tying it together is error analysis: train, evaluate, diagnose, fix the data, repeat.

The arc shows accelerating sophistication: basic task fine-tuning (2018–2020) → InstructGPT / RLHF (2022), the breakthrough that RL alignment sharply improves helpfulness and safety → DPO (2023) simplifying the RL pipeline → reasoning models like DeepSeek R1 (Jan 2025) that push RL past imitation → frontier labs now running multi-stage SFT→RL→SFT→RL pipelines.

The three-stage training pipeline

SFT vs RL: the core tradeoff

The tradeoff in one table — worth being able to reproduce from memory:

| Axis | SFT | RL | |---|---|---| | Signal | (input, output) demonstrations | scalar reward on own outputs | | Stability | High — gradient on gold labels | Lower — policy exploration | | Ceiling | Bounded by demonstrations | Can exceed human demos | | Data | Curated pairs | Reward signal / verifier | | Compute | Moderate (fwd + bwd) | High (generate + score + update) | | Maturity | Very mature; LoRA well-understood | Fast-evolving (PPO → GRPO) |

When to use which

Decision tree

SFT vs RL

  • Do you have demonstrations that already capture ceiling performance? → Use SFT; RL adds cost with little upside.
  • Have demonstrations but want to exceed their quality?SFT first for a strong baseline, then RL.
  • No demonstrations but a reliable reward signal?RL directly (rare; see DeepSeek R1 Zero).
  • Is the task verifiable (code, math, lookup)? → RL with deterministic verifiers — strongest signal, least reward hacking.
  • Quality is subjective (creative, open-ended)? → SFT + optional RL with LLM-as-judge, but watch for reward hacking.

Data and grading: the two ingredients

Every method needs data to train on and a grading mechanism to score outputs. SFT and RL differ sharply in both.

SFT data is (input, target) pairs as multi-turn chat: chat-history pairs, chain-of-thought derivations, RAG-recovery patterns, and guardrail/refusal demonstrations.

RL grading scores model-generated outputs, in order of reliability: deterministic verifiers (code tests, math checkers — no ambiguity) → an LLM-as-judge (scalable but driftable/gameable) → full environment feedback (task-completion reward from acting in a tool/web environment).

Reward hacking

The KL penalty is the regularizer that keeps the policy πθ\pi_\theta close to the reference πref\pi_{\text{ref}} (the SFT model): the objective adds a term βKL ⁣(πθπref)-\beta\, \mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right), so straying far enough to chase a hacked reward is itself penalized.

Reasoning: SFT and RL working together

SFT teaches a process — given a problem, produce a step-by-step CoT then the answer. It learns the format and common strategies from demonstrations, but cannot discover approaches absent from the data. RL with a verifier (e.g. checking the final answer) lets the model explore the space of reasoning paths and find shortcuts or novel decompositions no human demonstrated.

Key concept

DeepSeek R1 Zero

FRI-1.4

RL-only training (no SFT) on a base model produced emergent chain-of-thought reasoning — the model learned to “think out loud” purely from reward signal, never shown a CoT demonstration. This proved RL alone can induce structured reasoning.

When it breaks: RL-only is unstable (language mixing, repetition loops). DeepSeek added SFT stages before and after RL; pure RL-only remains a research result, not a production recipe. [V] Verified

Safety alignment: Constitutional AI / RLAIF

Key concept

Constitutional AI / RLAIF

FRI-1.4

Anthropic’s Constitutional AI replaces human preference labels with AI-generated preferences, scaling alignment without proportional annotation cost:

  1. Write a constitution — principles for desired behavior (“be helpful, harmless, honest”).
  2. Self-critique → SFT data — the model critiques and revises its own responses against the constitution; revisions become SFT data.
  3. RLAIF — the model generates preference pairs; a reward model trains on those AI preferences, then drives RL.

Result (per the paper): equally helpful, significantly more harmless than RLHF baselines, with far fewer human labels.

When it breaks: AI preferences inherit the judge’s blind spots — a vague constitution or weak critic amplifies bias instead of correcting it. Human spot-checks remain essential. [I] Inference

Hands-on: safety evaluation with Llama Guard

Llama Guard is a classifier that labels a prompt–response pair safe or unsafe\nS{code} across a set of harm categories. A minimal parser:

import re

def parse_llama_guard_response(response: str) -> dict:
    """'unsafe\\nS2' -> {'safe': False, 'categories': ['S2']}"""
    lines = response.strip().split("\n")
    is_safe = lines[0].strip().lower() == "safe"
    categories = []
    if not is_safe:
        for line in lines[1:]:
            m = re.match(r"(S\d{1,2})", line.strip())
            if m:
                categories.append(m.group(1))
    return {"safe": is_safe, "categories": categories}

Post-training in the wild

Three flagship open pipelines, three design choices:

  • DeepSeek R1 — base → cold-start CoT SFT → GRPO RL on math/code verifiers → rejection-sampling SFT (best-of-N) → final alignment RL. Notable for R1 Zero (RL-only reasoning) and open-sourcing the recipe; distilled 671B → 1.5B–70B.
  • LLaMA 3.1/3.3 (Meta) — 15T+ token pre-train → mid-training (long-context extension to 128K + multilingual data) → SFT → RLHF (PPO) → iterative eval + data refresh. 3.3 (70B) neared 405B quality via aggressive distillation.
  • Qwen 2.5 (Alibaba) — multilingual pre-train → SFT (code/math/chat) → DPO → domain RL. Shows DPO as a lightweight PPO alternative, consistent across 0.5B–72B.

Retrieval check

Answer from memory, then expand to check — or go deeper in the practice questions.

Name the three training stages and what each optimizes for. FRI-1.1

Pre-training (next-token on internet text → broad knowledge) → mid-training (curated data → fluency, modalities, longer context) → post-training (SFT + RL → instruction-following, safety, reasoning).

Reproduce the SFT-vs-RL table — at least four axes. FRI-1.2

Signal (demonstrations vs scalar reward), Stability (high vs lower), Ceiling (bounded by demos vs can exceed), Data (curated pairs vs reward/verifier), Compute (moderate vs high). RL trades stability for a higher ceiling.

List the three RL grading mechanisms in reliability order, with an example each. FRI-1.3

Deterministic verifiers (code tests, math checkers — no ambiguity) → LLM-as-judge (scalable but driftable/gameable) → full environment feedback (task-completion reward from acting in a tool/web environment).

State reward hacking in one sentence; give two mitigations. FRI-1.6

The policy maximizes the reward as specified rather than as intended, exploiting shortcuts. Mitigations (any two): a KL penalty to the SFT baseline, deterministic verifiers, reward ensembles, diversity monitoring.

What is the "frontier recipe" for reasoning, in four stages? FRI-1.4

From the base model: (1) cold-start CoT SFT → (2) RL with verifiers (GRPO) → (3) rejection-sampling SFT (best-of-N) → (4) final alignment RL. The frontier alternates SFT and RL.

Where practitioners disagree

The frontier recipe is not settled. RL-only vs SFT→RL: R1 Zero showed RL-only can induce reasoning, yet every production system reintroduces SFT for stability — so “is SFT necessary?” is contested in principle but near-universal in practice. LLM-as-judge reliability: judges are scalable but gameable and biased toward verbose, confident answers; teams disagree on when a judge is trustworthy enough to drive RL versus only offline eval. When you cite a recipe in an interview, name the tradeoff you’re accepting, not just the steps.

Summary

Post-training turns raw intelligence into an aligned, usable system. The three-stage pipeline separates concerns: broad knowledge, refined fluency, targeted behavior. SFT is stable imitation bounded by demonstration quality; RL is exploration bounded by reward quality. The frontier alternates them, each covering the other’s weakness. Safety scales via Constitutional AI / RLAIF, and the open ecosystem (DeepSeek, LLaMA, Qwen) has made these techniques broadly accessible.