Part 1 Chapter 4 Last verified 2026-06-18

Data-Driven Post-Training

Data is the engine room of post-training: how much you need at each scale, the iterative counterexample loop, RL rollouts and reward shaping, the DeepSeek R1 pipeline, synthetic data generation and filtering, and Constitutional AI / RLAIF for alignment without armies of human labelers.

On this page

How much data do you need?
The iterative data loop
Data for RL: rollouts and reward shaping
The DeepSeek R1 pipeline
Synthetic data pipelines
Filtering: where quality is created
Constitutional AI and RLAIF
Balancing data and rewards
Where practitioners disagree
Retrieval check
Summary

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Constitutional AI and RLAIF; if any is shaky, read closely — each is developed below.

Your model already answers correctly but outputs prose where you need JSON. Do you start with about 20, 2,000, or 200,000 SFT examples?
You can add 1,000 random new examples or 50 that each target a known failure. Predict which raises eval scores more, and why.
A synthetic pipeline generates 100,000 candidates. Which single step creates most of the quality — generation or filtering?
In Constitutional AI, what replaces the human preference labeler?

Check your answers

About 20. Format-only changes need a handful of demonstrations; reach for thousands only when you are changing behavior, not just shape.
The 50 targeted ones win — they spend every example where the model is currently wrong, while random data mostly re-teaches what it already knows. Targeting the failure distribution is what lets a few counterexamples outperform bulk data.
Filtering. It is easy to generate synthetic data and dangerously easy to generate bad synthetic data — keeping the best 10–20% is where quality is created.
A written constitution — a set of auditable natural-language principles the model uses to critique and revise its own outputs (then to label its own preferences for RLAIF).

How much data do you need?

The first question every practitioner asks is deceptively simple: how much data? The answer depends entirely on what you are training and how.

Pre-training is data-hungry by design. The Chinchilla scaling rule — roughly 20 tokens per parameter — sets the appetite: a 7B model wants about 140B tokens. Post-training is the opposite: surgically precise. You need far less data, but every example carries more weight.

SFT data follows a practical ladder:

| Scale | What it buys you |
|---|---|
| ~20 examples | Formatting and structure only (JSON, tone, markdown) |
| Hundreds | A noticeable behavioral shift |
| Thousands | The sweet spot for most production tasks |
| Tens of thousands | Frontier-lab scale across diverse tasks |
| Hundreds of thousands | An entirely new domain or language |

InstructGPT (ChatGPT’s predecessor, and the one with published figures) used roughly 13K high-quality SFT demonstrations plus about 33K prompts with ranked comparisons for its reward model: a remarkably small dataset for a model line that changed an industry. That is the whole lesson of post-training data: quality over quantity.

Different data types also saturate at very different scales:

Post-training data saturation (schematic, log-x). SFT saturates earliest (~10K–100K); RL preference data lags; RL rollouts need 100K–1M+ for comparable saturation. Different data types have fundamentally different appetites — budget accordingly.

When you use LoRA, data requirements are similar, but use the lowest rank that works ($r=4$ or $r=8$): fewer trainable parameters mean less capacity to memorize noise on a small set.

Key concept

Data quality is multiplicative

FRI-4.1

As a rough heuristic (not a measured law), a 10× increase in data quantity can yield under ~1.3× capability gain if quality is poor. A 10× increase in curation — keeping the top 10%, removing the bottom 10% — often yields 2–3× gains. Frontier labs spend as much effort filtering out bad data as generating new data. [V] Verified

When it breaks: over-filtering on one axis (e.g. only long, detailed responses) introduces distribution bias — excellent at one style, narrow everywhere else. Balance thresholds across correctness, fluency, format, and diversity.

Decision tree

How much SFT data do I need?

Only adjusting output format (JSON, markdown, tone)? → 20–50 examples. Evaluate before adding more.
Need domain-specific instruction following? → hundreds to low thousands; grow the set with iterative error analysis.
Building a production assistant for a well-defined task? → 1K–5K curated examples — the sweet spot.
Training a general-purpose frontier model? → 10K–100K+ across diverse tasks; dedicated data teams.
Teaching an entirely new language or domain? → 100K+; ask whether continued pre-training fits better.

Worked example: when do diminishing returns end SFT?

You fine-tune on 20, 50, 100, 200, 500, and 1,000 curated GSM8K examples and measure accuracy in percent: 18, 31, 42, 51, 58, 61. Fit a power law $\text{acc} = a\cdot n^{b}$ (accuracy in percent) to estimate how many examples reach 65%.

Log-log regression ($\log \text{acc} = \log a + b\log n$) gives roughly $a \approx 8.9$, $b \approx 0.30$. Setting $8.9\cdot n^{0.30} = 65$ gives $n \approx 750$.

But read the data, not just the fit. At $n=1{,}000$ the fitted curve predicts roughly 71%, yet the measurement is 61%: accuracy is saturating faster than a power law. The data have already passed $n \approx 750$ without reaching 65%, so the estimate is optimistic, and 65% may need far more examples or never arrive efficiently via SFT at all. That is the real signal. Going 500 → 1,000 (+500 examples) bought only 3 points; the marginal return has collapsed. Once you are here, RL with a verifier is more efficient than collecting more SFT data, which is exactly why the next chapters exist. Knowing when to stop adding SFT data matters as much as knowing how to add it.
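The fit in this worked example can be reproduced with a few lines of standard-library Python. This is a minimal sketch: `fit_power_law` is a hypothetical helper name, and it does an ordinary least-squares fit in log-log space. With unrounded coefficients the 65% crossing comes out somewhat below the figure obtained from the rounded 8.9 and 0.30, but in the same neighborhood.

```python
import math

# Measurements from the worked example: (training examples, accuracy in percent).
points = [(20, 18), (50, 31), (100, 42), (200, 51), (500, 58), (1000, 61)]

def fit_power_law(points):
    """Least-squares fit of log(acc) = log(a) + b*log(n); returns (a, b)."""
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(acc) for _, acc in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = math.exp(my - b * mx)
    return a, b

a, b = fit_power_law(points)
n_for_65 = (65 / a) ** (1 / b)   # invert acc = a * n**b for acc = 65
pred_1000 = a * 1000 ** b        # what the fit claims at n = 1,000

print(f"a={a:.2f}  b={b:.2f}  n(65%)={n_for_65:.0f}  fit at n=1000: {pred_1000:.1f}%")
```

Comparing `pred_1000` with the measured 61% is the quickest way to see that the last point has fallen off the curve.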

The iterative data loop

Fine-tuning data is not collected once and forgotten. The most effective approach is an iterative refinement loop: (1) train, (2) evaluate on held-out sets, (3) identify the worst failure patterns, (4) add targeted counterexamples — then repeat until metrics plateau.
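As a rough sketch, the four-step loop can be written as a small driver function. Everything here is an assumption for illustration: `train`, `evaluate`, and `write_counterexamples` are hypothetical callables you would supply, and "plateau" is taken to mean a gain below `min_gain` points between rounds.

```python
from typing import Callable, List

def iterative_data_loop(
    train: Callable[[List[dict]], object],
    evaluate: Callable[[object], float],
    write_counterexamples: Callable[[object], List[dict]],
    seed_data: List[dict],
    min_gain: float = 0.5,
    max_rounds: int = 10,
) -> List[float]:
    """Repeat train -> evaluate -> mine failures -> add counterexamples until the
    held-out score improves by less than `min_gain`; return the score history."""
    data = list(seed_data)
    history: List[float] = []
    for _ in range(max_rounds):
        model = train(data)                        # (1) train
        score = evaluate(model)                    # (2) evaluate on held-out data
        plateaued = bool(history) and score - history[-1] < min_gain
        history.append(score)
        if plateaued:
            break
        data.extend(write_counterexamples(model))  # (3) + (4) target the worst failures
    return history

# Toy stand-ins so the sketch runs: the "model" is just the dataset size,
# and accuracy saturates at 60.
history = iterative_data_loop(
    train=lambda data: len(data),
    evaluate=lambda model: float(min(60, 40 + model)),
    write_counterexamples=lambda model: [{"prompt": "...", "response": "..."}] * 5,
    seed_data=[{}] * 10,
)
print(history)
```

In a real run the stopping rule would more likely watch several metrics at once, not a single score.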

Key concept

Counterexamples beat volume

FRI-4.2

Not all training examples are equally valuable. After an initial round the model has specific failure modes — hallucinated dates, dropped guardrails on certain prompts, verbose answers to simple questions. Rather than adding random data, mine the failure distribution and write examples that directly counter the most common failures. Ten well-chosen counterexamples can outperform a thousand random additions. [V] Verified

When it breaks: over-indexing on counterexamples can cause catastrophic forgetting of previously-correct behavior. Re-run the full eval suite after every targeted addition — fixing one failure mode must not silently regress another.

This is the supply side of error analysis: Chapter 3 decides what to fix (cluster failures, rank by frequency × severity ÷ effort); this chapter supplies the fix (counterexamples that target the top cluster).
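The ranking rule mentioned above (frequency × severity ÷ effort) is simple enough to sketch directly. The cluster names echo the failure modes listed earlier; the numbers, the 1 to 5 severity scale, and the effort unit are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class FailureCluster:
    name: str
    frequency: float  # share of eval failures that fall in this cluster (0..1)
    severity: float   # hypothetical 1-5 scale
    effort: float     # hypothetical estimate, e.g. hours to write counterexamples

    @property
    def priority(self) -> float:
        return self.frequency * self.severity / self.effort

# Illustrative values only.
clusters = [
    FailureCluster("hallucinated dates", frequency=0.30, severity=4, effort=2),
    FailureCluster("verbose answers to simple questions", frequency=0.45, severity=2, effort=1),
    FailureCluster("dropped guardrails", frequency=0.10, severity=5, effort=4),
]

ranked = sorted(clusters, key=lambda c: c.priority, reverse=True)
for c in ranked:
    print(f"{c.priority:.3f}  {c.name}")
```

The top cluster is where the next batch of counterexamples goes; the rest wait for a later round.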

For multi-step tasks, mix in chain-of-thought traces — a practical ratio is about 40% reasoning traces, the rest direct-answer. This teaches the model when to reason, not just how.
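One way to hold that ratio when assembling a training set is to sample from two pools. This is a sketch under assumptions: `mix_reasoning` is a hypothetical helper, and the two pools are assumed to be already curated and disjoint.

```python
import random

def mix_reasoning(traces, direct, total, trace_share=0.4, seed=0):
    """Sample `total` examples, about `trace_share` of them chain-of-thought traces."""
    rng = random.Random(seed)
    n_traces = round(total * trace_share)
    n_direct = total - n_traces
    if n_traces > len(traces) or n_direct > len(direct):
        raise ValueError("not enough examples in one of the pools")
    mixed = rng.sample(traces, n_traces) + rng.sample(direct, n_direct)
    rng.shuffle(mixed)  # avoid training on all traces first, then all direct answers
    return mixed

# Toy pools: tuples of (kind, index) stand in for real examples.
traces = [("cot", i) for i in range(100)]
direct = [("direct", i) for i in range(100)]
mixed = mix_reasoning(traces, direct, total=50)
```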

Data for RL: rollouts and reward shaping

RL data has a fundamentally different structure than SFT pairs. It operates on rollouts: an input prompt, one or more sampled outputs, and a scalar reward for each. A sensible default is 8 outputs per prompt — more gives better signal but costs linearly in compute; fewer risks noisy gradients. Prompt-set scale ladders from 1K–10K (probing) to 10K–20K (default start) to 100K–1M (frontier labs).
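To make the rollout structure concrete, here is a minimal sketch. `Rollout` and `collect_rollouts` are hypothetical names, and `sample` and `reward_fn` stand in for a real policy and a real verifier or reward model.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Rollout:
    prompt: str
    output: str
    reward: float

def collect_rollouts(
    prompts: List[str],
    sample: Callable[[str], str],             # hypothetical policy sampler
    reward_fn: Callable[[str, str], float],   # hypothetical verifier or reward model
    outputs_per_prompt: int = 8,
) -> List[Rollout]:
    """Sample several outputs per prompt and attach a scalar reward to each."""
    rollouts = []
    for prompt in prompts:
        for _ in range(outputs_per_prompt):
            output = sample(prompt)
            rollouts.append(Rollout(prompt, output, reward_fn(prompt, output)))
    return rollouts

# Toy policy that guesses among three answers, with an exact-match reward.
rng = random.Random(0)
rollouts = collect_rollouts(
    prompts=["What is 2 + 2?"],
    sample=lambda prompt: rng.choice(["3", "4", "5"]),
    reward_fn=lambda prompt, output: 1.0 if output == "4" else 0.0,
)
```

Compute cost scales with `len(prompts) * outputs_per_prompt`, which is the linear cost the paragraph above refers to.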

For tasks with objectively checkable answers, skip human preference labeling and use a verifier — a deterministic check:

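import re

def check_accuracy(solution: str, ground_truth: float) -> float:
    """Return 1.0 if the last number in `solution` matches `ground_truth`, else 0.0."""
    nums = re.findall(r"[-+]?\d+\.?\d*", solution)
    if not nums:
        return 0.0  # no numeric answer found
    # Treat the final number as the answer; allow a small absolute tolerance.
    return 1.0 if abs(float(nums[-1]) - ground_truth) < 0.01 else 0.0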

Verifier rewards are cheaper, more reliable, and free from reward-model overfitting — but they only exist where there is a clear right answer. For everything else you build a composite reward: a weighted blend across dimensions, each scored by whatever mechanism fits — verifiers for correctness, learned models for helpfulness, rule-based checks for safety.

def score(accuracy, completeness, verification=0.0):
    # per-template weights — correctness dominates, others shape style
    return 0.5 * accuracy + 0.3 * completeness + 0.2 * verification

Worked example: design a composite reward for code generation

The model should produce code that (a) passes unit tests, (b) is documented, and (c) is efficient. A defensible composite:

$$r = 0.6\,\text{test\_pass} + 0.2\,\text{doc\_score} + 0.1\,\text{efficiency} + 0.1\,\text{format}$$

Justification. Correctness must dominate (over 50% of the weight): without working code, documentation and efficiency are worthless. Note that even at 0.6 a pure weighted sum still pays 0.4 for broken code — when correctness is non-negotiable, gate it (multiply the shaped terms by a correctness factor) rather than only weighting it; anything less invites the model to optimize style at the expense of substance. doc_score (LLM-judged) is second because undocumented code is unmaintainable. Efficiency and formatting get small weights — real signals, but easy to fix post-hoc and not worth distorting the primary objective. The design rule: one dominant correctness term, secondary terms that shape without overwhelming it.
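The difference between weighting and gating is easiest to see side by side. This sketch assumes every component is already scaled to [0, 1]; `gated_reward` is one possible gating scheme (shaping terms multiplied by the test-pass rate), not the only one.

```python
def weighted_reward(test_pass, doc_score, efficiency, fmt):
    """Pure weighted sum from the worked example; all inputs in [0, 1]."""
    return 0.6 * test_pass + 0.2 * doc_score + 0.1 * efficiency + 0.1 * fmt

def gated_reward(test_pass, doc_score, efficiency, fmt):
    """Same weights, but shaping terms pay out only in proportion to correctness."""
    shaped = 0.2 * doc_score + 0.1 * efficiency + 0.1 * fmt
    return 0.6 * test_pass + test_pass * shaped

# Broken code with perfect documentation, efficiency, and formatting.
broken_but_pretty = dict(test_pass=0.0, doc_score=1.0, efficiency=1.0, fmt=1.0)
print(weighted_reward(**broken_but_pretty))  # still earns the shaping terms
print(gated_reward(**broken_but_pretty))     # earns nothing
```

Both functions agree when all tests pass; they diverge only where the weighted sum would reward style without substance.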

Finally, offline vs online is a data-flow distinction: offline methods (DPO) train on a fixed preference set; online methods (PPO, GRPO) score freshly generated rollouts in real time. Online is more expensive but avoids the distribution shift between stale data and the current policy. Cache reward scores aggressively — the same prompt–output pair should never be scored twice.
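A cache for reward scores can be as small as a dictionary keyed by a hash of the pair. The sketch below assumes the scorer is deterministic and does not change during the run (if the reward model is updated, the cache has to be cleared); `RewardCache` and its `misses` counter are hypothetical names.

```python
import hashlib
from typing import Callable, Dict

class RewardCache:
    """Memoize reward scores by (prompt, output) so each pair is scored once."""

    def __init__(self, score_fn: Callable[[str, str], float]):
        self._score_fn = score_fn          # hypothetical expensive scorer
        self._cache: Dict[str, float] = {}
        self.misses = 0                    # number of real scorer calls

    @staticmethod
    def _key(prompt: str, output: str) -> str:
        # Length-prefix the prompt so ("ab", "c") and ("a", "bc") cannot collide.
        payload = f"{len(prompt)}:{prompt}{output}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def score(self, prompt: str, output: str) -> float:
        key = self._key(prompt, output)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._score_fn(prompt, output)
        return self._cache[key]

# Toy scorer: reward equals output length.
cache = RewardCache(lambda prompt, output: float(len(output)))
first = cache.score("p", "abc")
second = cache.score("p", "abc")   # served from the cache
```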

The DeepSeek R1 pipeline

Real production pipelines assemble data from multiple sources of differing quality and purpose. DeepSeek R1 is one of the most transparent case studies.

Synthetic data pipelines

When you cannot collect enough real data — novel task, expensive annotation, missing edge-case diversity — synthetic data fills the gap. A pipeline combines four operations: generate (templates, varied temperature, persona sampling), filter (the critical step), transform (restyle, adjust difficulty), and score (multi-axis quality for weighting or selection).

Template engineering is how you inject diversity without changing the task. A template is a prompt skeleton with variable slots filled programmatically:

TEMPLATE = "You are a {persona} helping a {audience} with {topic}. " \
           "The user asks: {question}. Give a {style} response."

PERSONAS  = ["patient teacher", "senior engineer", "research scientist"]
AUDIENCES = ["beginner", "ML engineer", "executive"]
# 3 personas × 3 audiences × N styles × M questions → a large, diverse pool

The same question answered by a “patient teacher” and a “terse senior engineer” produces structurally different examples, teaching the model stylistic range. Templates are also matched to failure patterns: if the model fails on multi-step math, you template multi-step math problems; if it fails on safety edge cases, you template adversarial personas.
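Filling the slots programmatically is a one-liner with `itertools.product`. In this sketch `STYLES` and `QUESTIONS` are made-up values added only so the expansion runs; the template and the persona and audience lists are the ones shown above.

```python
from itertools import product

TEMPLATE = ("You are a {persona} helping a {audience} with {topic}. "
            "The user asks: {question}. Give a {style} response.")

PERSONAS = ["patient teacher", "senior engineer", "research scientist"]
AUDIENCES = ["beginner", "ML engineer", "executive"]
STYLES = ["concise", "step-by-step"]  # hypothetical
QUESTIONS = [("gradient descent", "why my training loss diverges")]  # hypothetical (topic, question)

# Cartesian product of all slot values: 3 personas x 3 audiences x 2 styles x 1 question.
prompts = [
    TEMPLATE.format(persona=p, audience=a, style=s, topic=t, question=q)
    for p, a, s, (t, q) in product(PERSONAS, AUDIENCES, STYLES, QUESTIONS)
]
print(len(prompts))
print(prompts[0])
```

A full cross product grows quickly, so in practice you would likely sample from it instead of generating every combination.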

In production this becomes a data flywheel: deploy → collect real queries → identify failure patterns → generate synthetic counterexamples → retrain → redeploy. It accelerates as the model improves, because better models generate higher-quality synthetic data for the next turn.

Filtering: where quality is created

Generation is easy; filtering is where the value is created. A pipeline that generates 100K examples and keeps 10K after rigorous filtering will outperform one that trains on all 100K.

def filter_examples(examples, min_scores, seen=None):
    """Keep examples that clear every score threshold and are not duplicates."""
    seen = set() if seen is None else seen
    kept = []
    for ex in examples:
        # 1. multi-axis rejection: every dimension must clear its threshold
        if any(ex.scores.get(d, 0) < t for d, t in min_scores.items()):
            continue
        # 2. collapse detection (cheap proxy): drop responses whose normalized
        #    first 200 characters match one we already kept
        key = ex.response.strip().lower()[:200]
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept   # typical yield: 10–20% of candidates

The four methods worth knowing — and when each fails:

Rejection sampling — generate $N$ , keep only those above a quality threshold. Simple and effective; fails when the threshold is set on one axis and silently biases the distribution.
LLM-as-judge — a stronger model (or a judge prompt) scores candidates against a rubric. Fails on the judge’s own biases — verbosity, confident tone, self-preference.
Majority vote — sample multiple answers, keep those where the majority agrees (more likely correct). Fails when a shared misconception makes the wrong answer the consensus.
Collapse detection — flag and remove near-identical outputs that signal the generator has fallen into a repetitive mode. Fails silently when skipped: your “diverse” set is thousands of memorizable near-duplicates.
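Of the four methods, majority vote is the shortest to sketch. The helper below is a hypothetical illustration: it assumes answers have already been normalized to comparable strings, and the 0.6 agreement threshold is an arbitrary choice.

```python
from collections import Counter

def majority_vote(answers, min_agreement=0.6):
    """Return the consensus answer if enough samples agree, otherwise None."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= min_agreement else None

print(majority_vote(["42", "42", "41", "42", "42"]))  # strong consensus
print(majority_vote(["1", "2", "3"]))                 # no consensus, example is dropped
```

Note that this keeps a confidently wrong consensus just as readily as a right one, which is the failure mode named in the list above.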

Worked example: how much does filtering buy you?

You generate 10,000 synthetic GSM8K examples. Filtering at threshold 0.8 keeps 2,100 (21% yield). Trained on the filtered set the model hits 54%; trained on all 10,000 unfiltered it hits 47%. Compare accuracy per training example:

Filtered: $54\% / 2{,}100 \approx 0.0257\%$ per example.
Unfiltered: $47\% / 10{,}000 \approx 0.0047\%$ per example.

Filtered data is about 5.5× more efficient per example — and scores 7 points higher in absolute terms. That 7-point gain comes from removing 7,900 low-quality examples, not from adding anything. This is the multiplicative-quality principle made quantitative: the filter, not the generator, is where accuracy is created.
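The arithmetic in this example is short enough to check in code. `per_example_efficiency` is a hypothetical helper; the inputs are the figures from the worked example.

```python
def per_example_efficiency(accuracy_pct, n_examples):
    """Accuracy (percentage points) per training example."""
    return accuracy_pct / n_examples

filtered = per_example_efficiency(54, 2_100)
unfiltered = per_example_efficiency(47, 10_000)

yield_pct = 100 * 2_100 / 10_000   # share of candidates that survive the filter
ratio = filtered / unfiltered      # how many times more efficient the filtered set is

print(f"yield={yield_pct:.0f}%  filtered={filtered:.4f}  "
      f"unfiltered={unfiltered:.4f}  ratio={ratio:.2f}x")
```

Per-example efficiency always flatters smaller sets to some degree, so read it alongside the absolute accuracy, which here also favors the filtered set.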

Completion problem. You generate 8,000 synthetic examples and filter to keep 1,600. Fill the blank: the yield is ___%.

Now you. Training on the filtered 1,600 reaches 50% accuracy; training on all 8,000 reaches 44%. Which is more efficient per example, and roughly by how much?

Constitutional AI and RLAIF

Constitutional AI (CAI) is Anthropic’s approach to alignment without extensive human labeling. Instead of humans providing preference labels, the model critiques and revises its own outputs against a written constitution — explicit, auditable principles encoding values like helpfulness, harmlessness, and honesty. The payoff is not just cost: a constitution is transparent. You can read, audit, and version-control the rules that govern behavior, and trace an unexpected output back to a specific principle.

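The full pipeline has seven steps: write the constitution, generate responses, critique them against the constitution, revise, run SFT on the revisions, generate preference data from the constitution, and run RL on those AI preferences. That final step is RLAIF (RL from AI feedback). CAI pioneered it, but the paradigm has since expanded well beyond a fixed constitution: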

Self-play preference generation: the model produces both candidate responses and the preference labels, then trains on its own judgments (Meta’s Self-Rewarding Language Models, 2024).
Principle-conditioned generation: condition on specific principles per query (prioritize harmlessness for safety-adjacent prompts, helpfulness for factual ones) instead of one fixed constitution.
Binary-feedback simplifiers: an AI judge assigns one good/bad label per response instead of ranking constructed preference pairs, which cuts labeling overhead (the exact saving depends on how the pairs were sampled).

The common thread: AI feedback is cheaper and more consistent than human feedback — but requires careful calibration against human preferences on a validation set to prevent systematic bias from compounding.
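Calibration against a human validation set can start with two numbers: raw agreement and Cohen's kappa, which corrects agreement for chance. The sketch below is one possible check, with made-up labels; what threshold counts as "calibrated enough" is a judgment call this code does not make.

```python
from collections import Counter

def agreement_and_kappa(human, judge):
    """Raw agreement and Cohen's kappa between two equal-length label lists."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return observed, kappa

# Illustrative labels: which of two responses (A or B) each rater preferred.
human = ["A", "A", "B", "B", "A", "B", "A", "B"]
judge = ["A", "A", "B", "A", "A", "B", "A", "A"]   # this judge leans toward "A"
observed, kappa = agreement_and_kappa(human, judge)
print(f"agreement={observed:.2f}  kappa={kappa:.2f}")
```

A gap between high raw agreement and modest kappa is a hint that the judge is leaning on a default answer, which is the kind of systematic bias that compounds.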

Balancing data and rewards

The last challenge is mixing every source and signal into one coherent recipe. There is no universal optimal mixing ratio — it tracks the target capability:

| Target capability | Task-specific | General |
|---|---|---|
| General assistant | 40–60% | 40–60% |
| Math / reasoning | 60–80% | 20–40% |
| Translation | 70–90% | 10–30% |
| Code generation | 50–70% | 30–50% |
| Safety-critical | 30–50% | 50–70% |

Reading the table: start at the midpoint of the range for your capability, then let eval-regression move it — if a general metric drops, shift toward general; if the target task lags, shift toward task-specific.
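That reading rule can be sketched as a small update function. The ranges come from the table; the 5-point `step`, the clamping to the table's range, and the name `next_task_share` are assumptions made for illustration.

```python
# Task-specific share of the mix, taken from the table above.
MIX_RANGES = {
    "general_assistant": (0.40, 0.60),
    "math_reasoning": (0.60, 0.80),
    "translation": (0.70, 0.90),
    "code_generation": (0.50, 0.70),
    "safety_critical": (0.30, 0.50),
}

def next_task_share(capability, current=None, general_regressed=False,
                    target_lagging=False, step=0.05):
    """Start at the range midpoint, then nudge by `step` based on eval regressions."""
    low, high = MIX_RANGES[capability]
    share = (low + high) / 2 if current is None else current
    if general_regressed:
        share -= step   # shift toward general data
    if target_lagging:
        share += step   # shift toward task-specific data
    return min(high, max(low, share))   # stay inside the table's range

start = next_task_share("math_reasoning")
after_regression = next_task_share("math_reasoning", start, general_regressed=True)
```

The general share is simply `1 - task_share`.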

Three failures recur, each with a standing fix:

Capability regression — over-specializing on one task while losing general ability. Fix: keep 20–40% general-purpose data in every mix.
Reward-dimension collapse — one reward axis dominates, producing single-axis outputs. Fix: normalize dimensions to similar scales before weighting.
Stale data mix — reusing the same ratios across runs. Fix: re-tune mixing ratios every iteration against current eval results.
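The fix for reward-dimension collapse can be illustrated with a z-score per dimension before weighting. This is a sketch of one normalization choice (batch-level z-scoring); `normalize_then_weight` and the example dimensions are hypothetical.

```python
from statistics import mean, pstdev

def normalize_then_weight(batch, weights):
    """Z-score each reward dimension across the batch, then take the weighted sum."""
    stats = {}
    for dim in weights:
        values = [r[dim] for r in batch]
        sd = pstdev(values)
        stats[dim] = (mean(values), sd if sd > 0 else 1.0)  # constant dimension maps to 0
    return [
        sum(w * (r[dim] - stats[dim][0]) / stats[dim][1] for dim, w in weights.items())
        for r in batch
    ]

# Accuracy lives in [0, 1]; length is in raw tokens, so it swamps the raw sum.
batch = [
    {"accuracy": 1.0, "length": 100},   # correct and short
    {"accuracy": 0.0, "length": 900},   # wrong and long
]
weights = {"accuracy": 0.8, "length": 0.2}

raw = [sum(w * r[d] for d, w in weights.items()) for r in batch]
scores = normalize_then_weight(batch, weights)
print(raw)     # the wrong answer wins on raw scale
print(scores)  # after normalization the correct answer wins
```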

Specialization is a genuine trade-off, not a bug: heavy math data will degrade other capabilities. Track regression on a broad eval suite and decide explicitly which capabilities you are willing to sacrifice. And treat composite-reward weights as hyperparameters — small weight changes cause large behavioral shifts, so tune them systematically, never set-and-forget.

Where practitioners disagree

RLAIF vs RLHF — does AI feedback match human feedback? The case for AI feedback is strong: it is cheap, fast, and consistent (no inter-annotator variance), and it scales harmlessness to a breadth no human-labeling budget can reach — it powers CAI and self-rewarding models. The case against is equally real: a model judging its own outputs has no external ground truth, so it can entrench its own blind spots, and AI preferences can drift systematically from human values in ways a fixed constitution will not catch. RLHF grounds the signal in actual human preference, but is expensive, slow, and inconsistent across annotators. [I] Inference

| Axis | RLHF (human feedback) | RLAIF (AI feedback) |
|---|---|---|
| Cost / speed | Expensive, slow | Cheap, fast |
| Consistency | Varies by annotator | High (same model) |
| Grounding | Real human preference | No external ground truth |
| Scale | Bounded by labeling budget | Effectively unbounded |
| Failure mode | Annotator noise / fatigue | Compounding self-bias |

The defensible read is not “RLAIF wins” but a hybrid: AI feedback for breadth and consistency, anchored to a human-validation set that catches systematic drift, with human judgment reserved for the hardest value calls. The interview-grade answer names the trade-off — scalability and consistency against grounding — rather than declaring a winner.

Retrieval check

Answer from memory, then expand to check — or go deeper in the practice questions.

Give the SFT data-scale ladder, and explain why post-training saturates earlier than pre-training. FRI-4.1

~20 examples → formatting only; hundreds → behavioral shift; thousands → production sweet spot; tens of thousands → frontier scale; 100K+ → a new domain/language. Post-training saturates earlier because it steers an already-capable base model rather than teaching it from scratch — each example carries far more weight, so a few thousand high-quality demonstrations move the model more than millions of pre-training tokens would at this stage.

State the four steps of the iterative data loop, and why counterexamples beat random additions. FRI-4.2

Train → evaluate on held-out sets → identify the worst failure patterns → add targeted counterexamples; repeat until metrics plateau. Counterexamples win because they spend every example on the model’s actual failure distribution, while random data mostly re-teaches what the model already knows. Caveat: re-run the full eval each round to catch catastrophic forgetting.

Describe RL rollout structure, the default outputs-per-prompt, and what a widening alignment gap signals. FRI-4.3

A rollout is (prompt, sampled output, scalar reward); sample about 8 outputs per prompt by default. A widening alignment gap — reward-model score rising faster than true human preference — signals reward hacking: the policy is exploiting the reward model rather than genuinely improving. Mitigate with composite rewards, reward-model counterexamples, and halting on the gap.

Name the four operations of a synthetic data pipeline and which is most critical. FRI-4.4

Generate, filter, transform, score. Filtering is most critical: generation is easy and bad-data generation is easier still, so keeping the best 10–20% (multi-axis rejection + collapse detection) is where quality is actually created.

Sketch the seven steps of the Constitutional AI pipeline and what the constitution replaces. FRI-4.5

Write constitution → generate responses → constitutional critique → revision → SFT on revisions → generate preference data from the constitution → RL on those AI preferences (RLAIF). The constitution replaces the human preference labeler with explicit, auditable, version-controllable rules — cheaper, more consistent, and traceable.

Compare rejection sampling, LLM-as-judge, majority vote, and collapse detection — when does each fail? FRI-4.6

Rejection sampling (threshold-keep) fails by biasing the distribution when thresholded on one axis. LLM-as-judge fails on the judge’s biases (verbosity, confidence, self-preference). Majority vote fails when a shared misconception makes the wrong answer the consensus. Collapse detection fails silently when skipped — leaving memorizable near-duplicates. Use them in combination, on orthogonal axes.

Summary

Data is the engine room of post-training. Quantity follows a ladder — about 20 examples to change shape, thousands to change behavior, 100K+ to learn a new domain — and post-training saturates far earlier than pre-training because each example steers an already-capable model. The highest-leverage moves are iterative: mine the failure distribution and add targeted counterexamples rather than volume. RL reshapes data into rollouts scored by verifiers or composite rewards, with the alignment gap as the reward-hacking alarm. When real data runs out, synthetic pipelines generate at scale — but filtering, not generation, creates the quality, a claim the efficiency math makes concrete. Constitutional AI and RLAIF push further, replacing much of the human preference labeling with auditable principles, trading grounding for scale — the central live debate of modern alignment. Chapter 5 turns to production: serving, monitoring, and operating these models once they ship.