Post-Training Overview
Where post-training sits in the LLM pipeline, the SFT-vs-RL tradeoff, the data and grading each needs, how reasoning and safety alignment are trained, and how reward hacking arises and is mitigated.
On this page
- Why post-training matters
- The three-stage training pipeline
- SFT vs RL: the core tradeoff
- When to use which
- Data and grading: the two ingredients
- Reward hacking
- Reasoning: SFT and RL working together
- Safety alignment: Constitutional AI / RLAIF
- Hands-on: safety evaluation with Llama Guard
- Post-training in the wild
- Retrieval check
- Where practitioners disagree
- Summary
Why post-training matters
A pre-trained model predicts the next token well but has no concept of instructions, safety boundaries, or structured reasoning. Post-training closes that gap. Supervised fine-tuning (SFT) is the mature, efficient technique — LoRA makes it feasible on a single GPU. Reinforcement learning (RL) unlocks capabilities beyond what any human demonstration can teach. The meta-skill tying it together is error analysis: train, evaluate, diagnose, fix the data, repeat.
The arc shows accelerating sophistication: basic task fine-tuning (2018–2020) → InstructGPT / RLHF (2022), the breakthrough that RL alignment sharply improves helpfulness and safety → DPO (2023) simplifying the RL pipeline → reasoning models like DeepSeek R1 (Jan 2025) that push RL past imitation → frontier labs now running multi-stage SFT→RL→SFT→RL pipelines.
The three-stage training pipeline
SFT vs RL: the core tradeoff
The tradeoff in one table — worth being able to reproduce from memory:
| Axis | SFT | RL | |---|---|---| | Signal | (input, output) demonstrations | scalar reward on own outputs | | Stability | High — gradient on gold labels | Lower — policy exploration | | Ceiling | Bounded by demonstrations | Can exceed human demos | | Data | Curated pairs | Reward signal / verifier | | Compute | Moderate (fwd + bwd) | High (generate + score + update) | | Maturity | Very mature; LoRA well-understood | Fast-evolving (PPO → GRPO) |
When to use which
SFT vs RL
- Do you have demonstrations that already capture ceiling performance? → Use SFT; RL adds cost with little upside.
- Have demonstrations but want to exceed their quality? → SFT first for a strong baseline, then RL.
- No demonstrations but a reliable reward signal? → RL directly (rare; see DeepSeek R1 Zero).
- Is the task verifiable (code, math, lookup)? → RL with deterministic verifiers — strongest signal, least reward hacking.
- Quality is subjective (creative, open-ended)? → SFT + optional RL with LLM-as-judge, but watch for reward hacking.
Data and grading: the two ingredients
Every method needs data to train on and a grading mechanism to score outputs. SFT and RL differ sharply in both.
SFT data is (input, target) pairs as multi-turn chat: chat-history pairs, chain-of-thought derivations, RAG-recovery patterns, and guardrail/refusal demonstrations.
RL grading scores model-generated outputs, in order of reliability: deterministic verifiers (code tests, math checkers — no ambiguity) → an LLM-as-judge (scalable but driftable/gameable) → full environment feedback (task-completion reward from acting in a tool/web environment).
Reward hacking
The KL penalty is the regularizer that keeps the policy close to the reference (the SFT model): the objective adds a term , so straying far enough to chase a hacked reward is itself penalized.
Reasoning: SFT and RL working together
SFT teaches a process — given a problem, produce a step-by-step CoT then the answer. It learns the format and common strategies from demonstrations, but cannot discover approaches absent from the data. RL with a verifier (e.g. checking the final answer) lets the model explore the space of reasoning paths and find shortcuts or novel decompositions no human demonstrated.
DeepSeek R1 Zero
FRI-1.4RL-only training (no SFT) on a base model produced emergent chain-of-thought reasoning — the model learned to “think out loud” purely from reward signal, never shown a CoT demonstration. This proved RL alone can induce structured reasoning.
When it breaks: RL-only is unstable (language mixing, repetition loops). DeepSeek added SFT stages before and after RL; pure RL-only remains a research result, not a production recipe. [V] Verified
Safety alignment: Constitutional AI / RLAIF
Constitutional AI / RLAIF
FRI-1.4Anthropic’s Constitutional AI replaces human preference labels with AI-generated preferences, scaling alignment without proportional annotation cost:
- Write a constitution — principles for desired behavior (“be helpful, harmless, honest”).
- Self-critique → SFT data — the model critiques and revises its own responses against the constitution; revisions become SFT data.
- RLAIF — the model generates preference pairs; a reward model trains on those AI preferences, then drives RL.
Result (per the paper): equally helpful, significantly more harmless than RLHF baselines, with far fewer human labels.
When it breaks: AI preferences inherit the judge’s blind spots — a vague constitution or weak critic amplifies bias instead of correcting it. Human spot-checks remain essential. [I] Inference
Hands-on: safety evaluation with Llama Guard
Llama Guard is a classifier that labels a
prompt–response pair safe or unsafe\nS{code} across a set of harm categories.
A minimal parser:
import re
def parse_llama_guard_response(response: str) -> dict:
"""'unsafe\\nS2' -> {'safe': False, 'categories': ['S2']}"""
lines = response.strip().split("\n")
is_safe = lines[0].strip().lower() == "safe"
categories = []
if not is_safe:
for line in lines[1:]:
m = re.match(r"(S\d{1,2})", line.strip())
if m:
categories.append(m.group(1))
return {"safe": is_safe, "categories": categories}
Post-training in the wild
Three flagship open pipelines, three design choices:
- DeepSeek R1 — base → cold-start CoT SFT → GRPO RL on math/code verifiers → rejection-sampling SFT (best-of-N) → final alignment RL. Notable for R1 Zero (RL-only reasoning) and open-sourcing the recipe; distilled 671B → 1.5B–70B.
- LLaMA 3.1/3.3 (Meta) — 15T+ token pre-train → mid-training (long-context extension to 128K + multilingual data) → SFT → RLHF (PPO) → iterative eval + data refresh. 3.3 (70B) neared 405B quality via aggressive distillation.
- Qwen 2.5 (Alibaba) — multilingual pre-train → SFT (code/math/chat) → DPO → domain RL. Shows DPO as a lightweight PPO alternative, consistent across 0.5B–72B.
Retrieval check
Answer from memory, then expand to check — or go deeper in the practice questions.
Name the three training stages and what each optimizes for. FRI-1.1
Pre-training (next-token on internet text → broad knowledge) → mid-training (curated data → fluency, modalities, longer context) → post-training (SFT + RL → instruction-following, safety, reasoning).
Reproduce the SFT-vs-RL table — at least four axes. FRI-1.2
Signal (demonstrations vs scalar reward), Stability (high vs lower), Ceiling (bounded by demos vs can exceed), Data (curated pairs vs reward/verifier), Compute (moderate vs high). RL trades stability for a higher ceiling.
List the three RL grading mechanisms in reliability order, with an example each. FRI-1.3
Deterministic verifiers (code tests, math checkers — no ambiguity) → LLM-as-judge (scalable but driftable/gameable) → full environment feedback (task-completion reward from acting in a tool/web environment).
State reward hacking in one sentence; give two mitigations. FRI-1.6
The policy maximizes the reward as specified rather than as intended, exploiting shortcuts. Mitigations (any two): a KL penalty to the SFT baseline, deterministic verifiers, reward ensembles, diversity monitoring.
What is the "frontier recipe" for reasoning, in four stages? FRI-1.4
From the base model: (1) cold-start CoT SFT → (2) RL with verifiers (GRPO) → (3) rejection-sampling SFT (best-of-N) → (4) final alignment RL. The frontier alternates SFT and RL.
Where practitioners disagree
The frontier recipe is not settled. RL-only vs SFT→RL: R1 Zero showed RL-only can induce reasoning, yet every production system reintroduces SFT for stability — so “is SFT necessary?” is contested in principle but near-universal in practice. LLM-as-judge reliability: judges are scalable but gameable and biased toward verbose, confident answers; teams disagree on when a judge is trustworthy enough to drive RL versus only offline eval. When you cite a recipe in an interview, name the tradeoff you’re accepting, not just the steps.
Summary
Post-training turns raw intelligence into an aligned, usable system. The three-stage pipeline separates concerns: broad knowledge, refined fluency, targeted behavior. SFT is stable imitation bounded by demonstration quality; RL is exploration bounded by reward quality. The frontier alternates them, each covering the other’s weakness. Safety scales via Constitutional AI / RLAIF, and the open ecosystem (DeepSeek, LLaMA, Qwen) has made these techniques broadly accessible.