Data-Driven Post-Training
Data is the engine room of post-training: how much you need at each scale, the iterative counterexample loop, RL rollouts and reward shaping, the DeepSeek R1 pipeline, synthetic data generation and filtering, and Constitutional AI / RLAIF for alignment without armies of human labelers.
How much data do you need?
The first question every practitioner asks is deceptively simple: how much data? The answer depends entirely on what you are training and how.
Pre-training is data-hungry by design. The Chinchilla scaling rule — roughly 20 tokens per parameter — sets the appetite: a 7B model wants about 140B tokens. Post-training is the opposite: surgically precise. You need far less data, but every example carries more weight.
SFT data follows a practical ladder:
| Scale | What it buys you |
|---|---|
| ~20 examples | Formatting and structure only (JSON, tone, markdown) |
| Hundreds | A noticeable behavioral shift |
| Thousands | The sweet spot for most production tasks |
| Tens of thousands | Frontier-lab scale across diverse tasks |
| Hundreds of thousands | An entirely new domain or language |
InstructGPT (ChatGPT’s predecessor, and the one with published figures) used roughly 13K high-quality SFT demonstrations plus roughly 33K prompts with human-ranked outputs for the reward model: a remarkably small dataset for a model line that changed an industry. That is the whole lesson of post-training data: quality over quantity.
Different data types also saturate at very different scales.
When you use LoRA, data requirements are similar, but use the lowest rank that works: fewer trainable parameters means less capacity to memorize noise on a small set.
Data quality is multiplicative
FRI-4.1: As a rough heuristic (not a measured law), a 10× increase in data quantity can yield under ~1.3× capability gain if quality is poor. A 10× increase in curation (keeping the top 10%, removing the bottom 10%) often yields 2–3× gains. Frontier labs spend as much effort filtering out bad data as generating new data. [V] Verified
When it breaks: over-filtering on one axis (e.g. only long, detailed responses) introduces distribution bias — excellent at one style, narrow everywhere else. Balance thresholds across correctness, fluency, format, and diversity.
How much SFT data do I need?
- Only adjusting output format (JSON, markdown, tone)? → 20–50 examples. Evaluate before adding more.
- Need domain-specific instruction following? → hundreds to low thousands; grow the set with iterative error analysis.
- Building a production assistant for a well-defined task? → 1K–5K curated examples — the sweet spot.
- Training a general-purpose frontier model? → 10K–100K+ across diverse tasks; dedicated data teams.
- Teaching an entirely new language or domain? → 100K+; ask whether continued pre-training fits better.
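The decision list above can be written down as a small lookup, which is handy when a data plan has to be checked in code. This is only a sketch: the goal keys and both function names are hypothetical labels introduced here, the numeric bounds for "domain" are an assumed reading of "hundreds to low thousands", and an open-ended upper bound is represented as `None`.

```python
from typing import Optional, Tuple

# Ranges copied from the decision list; the goal keys are hypothetical labels.
SFT_RANGES = {
    "format": (20, 50),              # output format only (JSON, markdown, tone)
    "domain": (100, 3_000),          # assumed reading of "hundreds to low thousands"
    "production": (1_000, 5_000),    # well-defined production task
    "frontier": (10_000, None),      # 10K-100K+ across diverse tasks
    "new_domain": (100_000, None),   # 100K+; consider continued pre-training
}


def recommended_sft_examples(goal: str) -> Tuple[int, Optional[int]]:
    """Return (low, high) example counts for a goal; high is None when open-ended."""
    if goal not in SFT_RANGES:
        raise ValueError(f"unknown goal {goal!r}; expected one of {sorted(SFT_RANGES)}")
    return SFT_RANGES[goal]


def is_underfilled(goal: str, n_examples: int) -> bool:
    """True when the dataset is smaller than the low end of the suggested range."""
    low, _ = recommended_sft_examples(goal)
    return n_examples < low
```

The ranges are starting points, not targets: the list itself says to evaluate before adding more.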
The iterative data loop
Fine-tuning data is not collected once and forgotten. The most effective approach is an iterative refinement loop: (1) train, (2) evaluate on held-out sets, (3) identify the worst failure patterns, (4) add targeted counterexamples — then repeat until metrics plateau.
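The four steps can be sketched as a loop with a plateau check. Everything here is illustrative: `train`, `evaluate`, `find_failures`, and `write_counterexamples` are hypothetical callables you would supply, and the `min_gain` and `max_rounds` defaults are assumptions rather than recommended values.

```python
from typing import Callable, List, Sequence


def iterative_data_loop(
    train: Callable[[Sequence[dict]], object],
    evaluate: Callable[[object], float],
    find_failures: Callable[[object], List[str]],
    write_counterexamples: Callable[[List[str]], List[dict]],
    dataset: List[dict],
    min_gain: float = 0.005,   # assumed plateau threshold
    max_rounds: int = 10,      # assumed safety cap
) -> List[dict]:
    """Repeat train -> evaluate -> find failures -> add counterexamples until the metric plateaus."""
    data = list(dataset)
    best = float("-inf")
    for _ in range(max_rounds):
        model = train(data)                           # (1) train
        score = evaluate(model)                       # (2) evaluate on held-out sets
        if score - best < min_gain:                   # metric has plateaued: stop
            break
        best = score
        failures = find_failures(model)               # (3) worst failure patterns
        data.extend(write_counterexamples(failures))  # (4) targeted counterexamples
    return data
```

In practice `evaluate` would return the full eval-suite score rather than a single task metric, so that a fix for one failure mode cannot hide a regression elsewhere.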
Counterexamples beat volume
FRI-4.2: Not all training examples are equally valuable. After an initial round the model has specific failure modes: hallucinated dates, dropped guardrails on certain prompts, verbose answers to simple questions. Rather than adding random data, mine the failure distribution and write examples that directly counter the most common failures. Ten well-chosen counterexamples can outperform a thousand random additions. [V] Verified
When it breaks: over-indexing on counterexamples can cause catastrophic forgetting of previously-correct behavior. Re-run the full eval suite after every targeted addition — fixing one failure mode must not silently regress another.
This is the supply side of error analysis: Chapter 3 decides what to fix (cluster failures, rank by frequency × severity ÷ effort); this chapter supplies the fix (counterexamples that target the top cluster).
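The ranking rule quoted above (frequency × severity ÷ effort) is simple enough to sketch directly. The `FailureCluster` class, its field scales, and the example values used with it are assumptions made for illustration, not figures from any real eval.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FailureCluster:
    name: str
    frequency: float  # share of eval failures in this cluster, 0..1
    severity: float   # assumed scale: 1 (cosmetic) to 5 (harmful)
    effort: float     # estimated cost of writing counterexamples; must be > 0

    @property
    def priority(self) -> float:
        # frequency x severity / effort, as in the ranking rule above
        return self.frequency * self.severity / self.effort


def rank_clusters(clusters: Iterable[FailureCluster]) -> List[FailureCluster]:
    """Highest-priority cluster first; counterexamples target the top of this list."""
    return sorted(clusters, key=lambda c: c.priority, reverse=True)
```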
For multi-step tasks, mix in chain-of-thought traces — a practical ratio is about 40% reasoning traces, the rest direct-answer. This teaches the model when to reason, not just how.
Data for RL: rollouts and reward shaping
RL data has a fundamentally different structure than SFT pairs. It operates on rollouts: an input prompt, one or more sampled outputs, and a scalar reward for each. A sensible default is 8 outputs per prompt — more gives better signal but costs linearly in compute; fewer risks noisy gradients. Prompt-set scale ladders from 1K–10K (probing) to 10K–20K (default start) to 100K–1M (frontier labs).
For tasks with objectively checkable answers, skip human preference labeling and use a verifier — a deterministic check:
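import re

def check_accuracy(solution: str, ground_truth: float) -> float:
    """Return 1.0 if the last number in the solution is within 0.01 of the ground truth, else 0.0."""
    # Treat the final number that appears in the text as the model's answer.
    nums = re.findall(r"[-+]?\d+\.?\d*", solution)
    if not nums:
        return 0.0
    return 1.0 if abs(float(nums[-1]) - ground_truth) < 0.01 else 0.0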
Verifier rewards are cheaper, more reliable, and free from reward-model overfitting — but they only exist where there is a clear right answer. For everything else you build a composite reward: a weighted blend across dimensions, each scored by whatever mechanism fits — verifiers for correctness, learned models for helpfulness, rule-based checks for safety.
def score(accuracy, completeness, verification=0.0):
    # per-template weights: correctness dominates, others shape style
    return 0.5 * accuracy + 0.3 * completeness + 0.2 * verification
Finally, offline vs online is a data-flow distinction: offline methods (DPO) train on a fixed preference set; online methods (PPO, GRPO) score freshly generated rollouts in real time. Online is more expensive but avoids the distribution shift between stale data and the current policy. Cache reward scores aggressively — the same prompt–output pair should never be scored twice.
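One minimal way to avoid scoring the same pair twice is to memoize on a hash of the prompt and output. The sketch below assumes the reward function is deterministic and takes `(prompt, output)` strings; the `RewardCache` class is a hypothetical helper, not part of any RL library.

```python
import hashlib
from typing import Callable, Dict


class RewardCache:
    """Memoize reward scores keyed by a hash of (prompt, output)."""

    def __init__(self, reward_fn: Callable[[str, str], float]):
        self._reward_fn = reward_fn
        self._scores: Dict[str, float] = {}
        self.misses = 0  # number of real reward_fn calls made so far

    @staticmethod
    def _key(prompt: str, output: str) -> str:
        # Length-prefix the prompt so ("ab", "c") and ("a", "bc") hash differently.
        payload = f"{len(prompt)}:{prompt}{output}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def score(self, prompt: str, output: str) -> float:
        key = self._key(prompt, output)
        if key not in self._scores:
            self.misses += 1
            self._scores[key] = self._reward_fn(prompt, output)
        return self._scores[key]
```

A cache like this is only valid while the scorer is unchanged: if the reward model is retrained or the reward weights are re-tuned, the cached scores are stale and should be discarded.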
The DeepSeek R1 pipeline
Real production pipelines assemble data from multiple sources of differing quality and purpose. DeepSeek R1 is one of the most transparent case studies.
Synthetic data pipelines
When you cannot collect enough real data — novel task, expensive annotation, missing edge-case diversity — synthetic data fills the gap. A pipeline combines four operations: generate (templates, varied temperature, persona sampling), filter (the critical step), transform (restyle, adjust difficulty), and score (multi-axis quality for weighting or selection).
Template engineering is how you inject diversity without changing the task. A template is a prompt skeleton with variable slots filled programmatically:
TEMPLATE = (
    "You are a {persona} helping a {audience} with {topic}. "
    "The user asks: {question}. Give a {style} response."
)
PERSONAS = ["patient teacher", "senior engineer", "research scientist"]
AUDIENCES = ["beginner", "ML engineer", "executive"]
# 3 personas × 3 audiences × N styles × M questions → a large, diverse pool
The same question answered by a “patient teacher” and a “terse senior engineer” produces structurally different examples, teaching the model stylistic range. Templates are also matched to failure patterns: if the model fails on multi-step math, you template multi-step math problems; if it fails on safety edge cases, you template adversarial personas.
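Filling the slots programmatically is a Cartesian product over the slot values. The sketch below repeats the template so that it runs on its own; `STYLES`, `QUESTIONS`, and `expand_prompts` are hypothetical additions for illustration.

```python
from itertools import product

TEMPLATE = (
    "You are a {persona} helping a {audience} with {topic}. "
    "The user asks: {question}. Give a {style} response."
)
PERSONAS = ["patient teacher", "senior engineer", "research scientist"]
AUDIENCES = ["beginner", "ML engineer", "executive"]
STYLES = ["concise", "step-by-step"]                                # hypothetical values
QUESTIONS = [("gradient descent", "how to pick a learning rate")]   # hypothetical (topic, question)


def expand_prompts():
    """Fill every combination of slot values into the template."""
    prompts = []
    for persona, audience, style, (topic, question) in product(
        PERSONAS, AUDIENCES, STYLES, QUESTIONS
    ):
        prompts.append(
            TEMPLATE.format(
                persona=persona,
                audience=audience,
                topic=topic,
                question=question,
                style=style,
            )
        )
    return prompts
```

With these values the pool holds 3 × 3 × 2 × 1 = 18 distinct prompts; each added question multiplies the pool by the same factor.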
In production this becomes a data flywheel: deploy → collect real queries → identify failure patterns → generate synthetic counterexamples → retrain → redeploy. It accelerates as the model improves, because better models generate higher-quality synthetic data for the next turn.
Filtering: where quality is created
Generation is easy; filtering is where the value is created. A pipeline that generates 100K examples and keeps 10K after rigorous filtering will outperform one that trains on all 100K.
def filter_examples(examples, min_scores, seen):
    """Keep examples that clear every score threshold and are not duplicates of kept ones."""
    kept = []
    for ex in examples:
        # 1. multi-axis rejection: every dimension must clear its threshold
        if any(ex.scores.get(d, 0) < t for d, t in min_scores.items()):
            continue
        # 2. crude collapse detection: drop responses whose first 200 characters
        #    (case-insensitive) match one we already kept
        key = ex.response[:200].lower().strip()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept  # typical yield: 10–20% of candidates
The four methods worth knowing — and when each fails:
- Rejection sampling: generate many candidates, keep only those above a quality threshold. Simple and effective; fails when the threshold is set on one axis and silently biases the distribution.
- LLM-as-judge: a stronger model (or a judge prompt) scores candidates against a rubric. Fails on the judge’s own biases: verbosity, confident tone, self-preference.
- Majority vote: sample multiple answers, keep those where the majority agrees (more likely correct). Fails when a shared misconception makes the wrong answer the consensus.
- Collapse detection: flag and remove near-identical outputs that signal the generator has fallen into a repetitive mode. Fails silently when skipped: your “diverse” set is thousands of memorizable near-duplicates.
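Of the four methods, majority vote is the quickest to sketch. The version below assumes answers are short strings that can be compared after trimming and lower-casing; `majority_answer` and its `min_agreement` default are hypothetical choices, not a standard API.

```python
from collections import Counter
from typing import List, Optional


def majority_answer(answers: List[str], min_agreement: float = 0.5) -> Optional[str]:
    """Return the most common normalized answer if its share exceeds min_agreement, else None."""
    if not answers:
        return None
    normalized = [a.strip().lower() for a in answers]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer if count / len(normalized) > min_agreement else None
```

Returning `None` when no answer clears the bar lets the pipeline drop the prompt instead of keeping a weak consensus. It does nothing about the failure mode named above: a shared misconception still wins the vote.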
Completion problem. You generate 8,000 synthetic examples and filter to keep 1,600. Fill the blank: the yield is ___%.
Now you. Training on the filtered 1,600 reaches 50% accuracy; training on all 8,000 reaches 44%. Which is more efficient per example, and roughly by how much?
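One way to check both answers is to do the arithmetic explicitly. "Accuracy points per training example" is a deliberately crude efficiency measure assumed here for the exercise; the variable names are illustrative.

```python
generated, kept = 8_000, 1_600
yield_pct = 100 * kept / generated              # share of candidates that survive filtering

acc_filtered, acc_all = 50.0, 44.0              # accuracy in percent, from the exercise
per_example_filtered = acc_filtered / kept      # accuracy points per training example
per_example_all = acc_all / generated
ratio = per_example_filtered / per_example_all  # how much more each filtered example buys

print(f"yield: {yield_pct:.0f}%")
print(f"filtered set is about {ratio:.1f}x more efficient per example")
```

The yield works out to 20%, and the filtered set comes out roughly 5.7× more efficient per example while also reaching the higher absolute accuracy.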
Constitutional AI and RLAIF
Constitutional AI (CAI) is Anthropic’s approach to alignment without extensive human labeling. Instead of humans providing preference labels, the model critiques and revises its own outputs against a written constitution — explicit, auditable principles encoding values like helpfulness, harmlessness, and honesty. The payoff is not just cost: a constitution is transparent. You can read, audit, and version-control the rules that govern behavior, and trace an unexpected output back to a specific principle.
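The pipeline runs in seven steps: write the constitution, generate responses, critique them against the constitution, revise, run SFT on the revisions, generate preference data from the constitution, and run RL on those AI preferences. That final step is RLAIF (RL from AI feedback). CAI pioneered it, but the paradigm has since expanded well beyond a fixed constitution: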
- Self-play preference generation: the model produces both candidate responses and the preference labels, then trains on its own judgments (Meta’s Self-Rewarding Language Models, 2024).
- Principle-conditioned generation: condition on specific principles per query (prioritize harmlessness for safety-adjacent prompts, helpfulness for factual ones) instead of one fixed constitution.
- Binary-feedback simplifiers: an AI judge assigns one good/bad label per item instead of ranking constructed preference pairs, which cuts labeling overhead (the exact saving depends on how pairs were sampled).
The common thread: AI feedback is cheaper and more consistent than human feedback — but requires careful calibration against human preferences on a validation set to prevent systematic bias from compounding.
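Calibration against a human validation set can start with something as simple as an agreement rate. The sketch assumes both label lists use the same encoding (for example "A" or "B" for the preferred response) and are aligned by index; `judge_agreement` is a hypothetical helper.

```python
from typing import Sequence


def judge_agreement(ai_prefs: Sequence[str], human_prefs: Sequence[str]) -> float:
    """Fraction of validation items where the AI judge picks the same response as the human."""
    if len(ai_prefs) != len(human_prefs):
        raise ValueError("label lists must be aligned and of equal length")
    if not ai_prefs:
        raise ValueError("validation set is empty")
    matches = sum(a == h for a, h in zip(ai_prefs, human_prefs))
    return matches / len(ai_prefs)
```

A single rate can hide systematic bias, so it is worth computing it per slice as well (by topic, by response length, by position of the preferred answer) and watching for slices where agreement drops.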
Balancing data and rewards
The last challenge is mixing every source and signal into one coherent recipe. There is no universal optimal mixing ratio — it tracks the target capability:
| Target capability | Task-specific | General |
|---|---|---|
| General assistant | 40–60% | 40–60% |
| Math / reasoning | 60–80% | 20–40% |
| Translation | 70–90% | 10–30% |
| Code generation | 50–70% | 30–50% |
| Safety-critical | 30–50% | 50–70% |
Reading the table: start at the midpoint of the range for your capability, then let eval-regression move it — if a general metric drops, shift toward general; if the target task lags, shift toward task-specific.
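The "start at the midpoint" rule can be turned into example counts as follows. The ranges are the task-specific column of the table, stored as percentages; the capability keys and `starting_mix` are hypothetical names.

```python
# Task-specific share ranges in percent, copied from the table; keys are hypothetical labels.
TASK_SHARE_RANGES = {
    "general_assistant": (40, 60),
    "math_reasoning": (60, 80),
    "translation": (70, 90),
    "code_generation": (50, 70),
    "safety_critical": (30, 50),
}


def starting_mix(capability: str, total_examples: int) -> dict:
    """Split a budget of examples at the midpoint of the task-specific range."""
    low, high = TASK_SHARE_RANGES[capability]
    task_pct = (low + high) / 2
    n_task = round(total_examples * task_pct / 100)
    return {"task_specific": n_task, "general": total_examples - n_task}
```

For a math-focused run with 10,000 examples this gives 7,000 task-specific and 3,000 general, which eval regressions then move in either direction.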
Three failures recur, each with a standing fix:
- Capability regression — over-specializing on one task while losing general ability. Fix: keep 20–40% general-purpose data in every mix.
- Reward-dimension collapse — one reward axis dominates, producing single-axis outputs. Fix: normalize dimensions to similar scales before weighting.
- Stale data mix — reusing the same ratios across runs. Fix: re-tune mixing ratios every iteration against current eval results.
Specialization is a genuine trade-off, not a bug: heavy math data will degrade other capabilities. Track regression on a broad eval suite and decide explicitly which capabilities you are willing to sacrifice. And treat composite-reward weights as hyperparameters — small weight changes cause large behavioral shifts, so tune them systematically, never set-and-forget.
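The fix for reward-dimension collapse, normalizing before weighting, can be sketched with a per-batch z-score. Z-scoring is one normalization choice among several and is an assumption here; `zscore` and `composite_rewards` are hypothetical helpers, and every dimension is assumed to hold one score per rollout in the same order.

```python
from statistics import mean, pstdev
from typing import Dict, List


def zscore(values: List[float]) -> List[float]:
    """Rescale one reward dimension to mean 0 and unit spread within the batch."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant dimension carries no signal
    return [(v - mu) / sigma for v in values]


def composite_rewards(raw: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted sum of per-dimension z-scores, one value per rollout."""
    normalized = {dim: zscore(vals) for dim, vals in raw.items()}
    n = len(next(iter(raw.values())))
    return [sum(weights[dim] * normalized[dim][i] for dim in raw) for i in range(n)]
```

Without the normalization step, a dimension measured on a 0–300 scale would swamp one measured on 0–1 regardless of the weights assigned to each.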
Where practitioners disagree
RLAIF vs RLHF — does AI feedback match human feedback? The case for AI feedback is strong: it is cheap, fast, and consistent (no inter-annotator variance), and it scales harmlessness to a breadth no human-labeling budget can reach — it powers CAI and self-rewarding models. The case against is equally real: a model judging its own outputs has no external ground truth, so it can entrench its own blind spots, and AI preferences can drift systematically from human values in ways a fixed constitution will not catch. RLHF grounds the signal in actual human preference, but is expensive, slow, and inconsistent across annotators. [I] Inference
| Axis | RLHF (human feedback) | RLAIF (AI feedback) |
|---|---|---|
| Cost / speed | Expensive, slow | Cheap, fast |
| Consistency | Varies by annotator | High (same model) |
| Grounding | Real human preference | No external ground truth |
| Scale | Bounded by labeling budget | Effectively unbounded |
| Failure mode | Annotator noise / fatigue | Compounding self-bias |
The defensible read is not “RLAIF wins” but a hybrid: AI feedback for breadth and consistency, anchored to a human-validation set that catches systematic drift, with human judgment reserved for the hardest value calls. The interview-grade answer names the trade-off — scalability and consistency against grounding — rather than declaring a winner.
Retrieval check
Answer from memory, then expand to check — or go deeper in the practice questions.
Give the SFT data-scale ladder, and explain why post-training saturates earlier than pre-training. FRI-4.1
~20 examples → formatting only; hundreds → behavioral shift; thousands → production sweet spot; tens of thousands → frontier scale; 100K+ → a new domain/language. Post-training saturates earlier because it steers an already-capable base model rather than teaching it from scratch — each example carries far more weight, so a few thousand high-quality demonstrations move the model more than millions of pre-training tokens would at this stage.
State the four steps of the iterative data loop, and why counterexamples beat random additions. FRI-4.2
Train → evaluate on held-out sets → identify the worst failure patterns → add targeted counterexamples; repeat until metrics plateau. Counterexamples win because they spend every example on the model’s actual failure distribution, while random data mostly re-teaches what the model already knows. Caveat: re-run the full eval each round to catch catastrophic forgetting.
Describe RL rollout structure, the default outputs-per-prompt, and what a widening alignment gap signals. FRI-4.3
A rollout is (prompt, sampled output, scalar reward); sample about 8 outputs per prompt by default. A widening alignment gap — reward-model score rising faster than true human preference — signals reward hacking: the policy is exploiting the reward model rather than genuinely improving. Mitigate with composite rewards, reward-model counterexamples, and halting on the gap.
Name the four operations of a synthetic data pipeline and which is most critical. FRI-4.4
Generate, filter, transform, score. Filtering is most critical: generation is easy and bad-data generation is easier still, so keeping the best 10–20% (multi-axis rejection + collapse detection) is where quality is actually created.
Sketch the seven steps of the Constitutional AI pipeline and what the constitution replaces. FRI-4.5
Write constitution → generate responses → constitutional critique → revision → SFT on revisions → generate preference data from the constitution → RL on those AI preferences (RLAIF). The constitution replaces the human preference labeler with explicit, auditable, version-controllable rules — cheaper, more consistent, and traceable.
Compare rejection sampling, LLM-as-judge, majority vote, and collapse detection — when does each fail? FRI-4.6
Rejection sampling (threshold-keep) fails by biasing the distribution when thresholded on one axis. LLM-as-judge fails on the judge’s biases (verbosity, confidence, self-preference). Majority vote fails when a shared misconception makes the wrong answer the consensus. Collapse detection fails silently when skipped — leaving memorizable near-duplicates. Use them in combination, on orthogonal axes.
Summary
Data is the engine room of post-training. Quantity follows a ladder — about 20 examples to change shape, thousands to change behavior, 100K+ to learn a new domain — and post-training saturates far earlier than pre-training because each example steers an already-capable model. The highest-leverage moves are iterative: mine the failure distribution and add targeted counterexamples rather than volume. RL reshapes data into rollouts scored by verifiers or composite rewards, with the alignment gap as the reward-hacking alarm. When real data runs out, synthetic pipelines generate at scale — but filtering, not generation, creates the quality, a claim the efficiency math makes concrete. Constitutional AI and RLAIF push further, replacing much of the human preference labeling with auditable principles, trading grounding for scale — the central live debate of modern alignment. Chapter 5 turns to production: serving, monitoring, and operating these models once they ship.