Part 1 Chapter 2 Last verified 2026-06-18

Core Techniques I: Supervised Fine-Tuning & LoRA

The mechanics of supervised fine-tuning: data preparation and splitting, tokenization, the cross-entropy objective and its gradient, loss masking and teacher forcing, the hyperparameters that decide success, and parameter-efficient fine-tuning with LoRA and QLoRA.

On this page
  1. Data preparation: quality over quantity
  2. Tokenization: what the model actually sees
  3. The SFT objective: cross-entropy and its gradient
  4. Hyperparameters and diagnosis
  5. Parameter-efficient fine-tuning: LoRA
  6. LoRA vs QLoRA vs full fine-tuning
  7. Retrieval check
  8. Summary

Data preparation: quality over quantity

The single biggest lever in post-training is data quality, not quantity. Frontier labs routinely discard ~99% of candidate data, keeping only the cleanest subset that measurably improves the model. SFT data is (input, target-output) pairs — for reasoning tasks the target includes a chain-of-thought followed by the final answer.

Splitting decides whether your eval means anything:

  • Avoid random splits — near-duplicate examples leak across train/test and inflate scores.
  • Prefer time-based splits (train on 2024, test on 2025) to measure real generalization.
  • Always hold out an out-of-distribution set of long-tail inputs.
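A time-based split is simple to express in code. The sketch below assumes each example carries a `created` date; that field name, the sample records, and the cutoff are illustrative and do not come from any particular dataset.

```python
from datetime import date

def time_split(examples, cutoff):
    """Train on examples strictly before `cutoff`, test on everything after."""
    train = [ex for ex in examples if ex["created"] < cutoff]
    test = [ex for ex in examples if ex["created"] >= cutoff]
    return train, test

# Hypothetical records; "created" is an assumed field name.
examples = [
    {"text": "a", "created": date(2024, 3, 1)},
    {"text": "b", "created": date(2024, 11, 20)},
    {"text": "c", "created": date(2025, 2, 5)},
]
train, test = time_split(examples, cutoff=date(2025, 1, 1))
```

Because the test set lies strictly after the cutoff, a near-duplicate can only leak if it was re-posted later, which the deduplication pass should still catch.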

Deduplication catches leakage from similar, not just identical, examples. The standard pipeline is exact-hash → MinHash LSH (approximate near-duplicate detection) → manual review. This is where most of that 99% gets discarded.
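The first two stages of that pipeline can be sketched with the standard library alone. This is a toy version under stated assumptions: the normalization, 3-word shingles, 64 hash functions, and 0.7 threshold are illustrative choices, and it compares every signature against every kept one, whereas a real pipeline would bucket signatures with LSH banding to avoid the all-pairs scan.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def shingles(text, k=3):
    """Set of k-word shingles of the normalized text."""
    words = normalize(text).split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text, num_perm=64):
    """For each salted hash function, keep the minimum hash over all shingles."""
    sh = shingles(text)
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in sh
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(texts, threshold=0.7):
    seen_hashes = set()
    kept = []  # (text, signature) pairs
    for text in texts:
        h = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if h in seen_hashes:
            continue  # stage 1: exact duplicate after normalization
        seen_hashes.add(h)
        sig = minhash_signature(text)
        if any(estimated_jaccard(sig, ks) >= threshold for _, ks in kept):
            continue  # stage 2: approximate near-duplicate
        kept.append((text, sig))
    return [t for t, _ in kept]

docs = [
    "Explain how gradient descent updates the weights of a neural network during training on a large labelled dataset",
    "explain how  gradient descent updates the weights of a neural network during training on a large labelled dataset",
    "Explain how gradient descent updates the weights of a neural network during training on a large labelled corpus",
    "List three common causes of overfitting and one mitigation for each",
]
unique = dedup(docs)
```

Here the second document falls to the exact-hash stage and the third to the MinHash stage; anything near the threshold is what the manual-review stage is for.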

Tokenization: what the model actually sees

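Models never see text; they see tokens: subword units produced by a tokenizer and then mapped to vectors. Byte-Pair Encoding (BPE) starts from bytes or characters and iteratively merges the most frequent adjacent pair of symbols into a new token; vocabularies typically run from ~32k to over 100k tokens (GPT-2's is ~50k).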
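One BPE merge step can be sketched in a few lines. This toy version works on characters rather than raw bytes and uses a made-up four-word corpus with invented frequencies; production tokenizers add pre-tokenization, byte fallback, and special tokens.

```python
from collections import Counter

def most_frequent_pair(words):
    """words maps a tuple of symbols to its corpus frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    # On ties, max() keeps the pair that was counted first.
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word -> frequency (illustrative numbers).
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
pair = most_frequent_pair(words)
words = merge_pair(words, pair)
```

Repeating these two calls until the vocabulary reaches its target size yields the merge table that the tokenizer replays at encoding time.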

Two operational gotchas: sequences in a batch are padded to uniform length for GPU efficiency, and tokenizers are not interchangeable across model families — a tokenizer mismatch is a silent killer (the model trains on garbage indices with no error).
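Padding is easy to picture with a small sketch. It assumes right-padding and a pad id of 0, both illustrative; the real pad token and padding side come from the model's own tokenizer.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to the longest length and build the attention mask."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask

batch = [[101, 7, 8, 9], [101, 5]]  # made-up token ids
input_ids, attention_mask = pad_batch(batch, pad_id=0)
```

The ids only mean something relative to one tokenizer's vocabulary, which is why feeding them to a model from a different family fails silently rather than loudly.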

The SFT objective: cross-entropy and its gradient

For a target token $y_t$ at position $t$ with model probability $p(y_t \mid y_{<t}, x)$:

$$\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log\, p(y_t \mid y_{<t}, x)$$

Completion problem. The model places probability $p = 0.8$ on the correct token. Fill the two blanks: the cross-entropy loss is $-\ln(0.8) =$ ___ nats, and the gradient on the correct logit is $p - 1 =$ ___.

Now you. A different token gets only $p = 0.5$. Without computing logs, is its gradient magnitude larger or smaller than the $p = 0.8$ case, and what does that imply about which tokens dominate the update?
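The gradient used in these exercises, $p_i - \mathbb{1}[i = y]$ with respect to logit $z_i$, can be checked numerically. The sketch below compares it against central finite differences for a single position; the logits are arbitrary illustrative values, not the ones from the exercises.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Negative log-probability of the target token at one position."""
    return -math.log(softmax(logits)[target])

def ce_grad(logits, target):
    """Analytic gradient w.r.t. the logits: p_i - 1[i == target]."""
    p = softmax(logits)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        up[i] += eps
        down = list(logits)
        down[i] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

logits = [2.0, 0.5, -1.0]  # arbitrary values for a 3-token vocabulary
target = 0
analytic = ce_grad(logits, target)
numeric = numeric_grad(logits, target)
```

The two gradients agree to within finite-difference error, the entries sum to zero, and only the target logit receives a negative gradient (it is pushed up while every other logit is pushed down).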

Key concept

Loss masking: train only on completions

FRI-2.2

SFT feeds the model the full sequence (prompt + completion) but computes loss only on the completion tokens — the prompt is masked. In TRL this is completion_only_loss=True.

Why it matters: for instruction data the prompt is often ~80% of the sequence. Unmasked, roughly 80% of the per-token loss terms, and so most of the gradient signal, go to reproducing a prompt the model already received as input: a catastrophic misallocation. Masking concentrates the whole training signal on the response.

When it breaks: a few-shot in-context task may want the model to learn prompt patterns — there, a partial mask (mask the system prompt, keep the few-shot examples) is the middle ground.
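Mechanically, masking usually means giving prompt positions a label the loss ignores. The sketch below uses -100, the default `ignore_index` of PyTorch's `CrossEntropyLoss`; the token ids and per-token probabilities are made up, and the helper names are hypothetical rather than part of TRL.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_labels(prompt_ids, completion_ids):
    """Feed the full sequence, but label only the completion tokens."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

def masked_mean_nll(token_log_probs, labels):
    """Mean negative log-likelihood over unmasked positions only.

    token_log_probs[t] is the log-probability the model assigned to the true
    token at position t given the true tokens before it (teacher forcing).
    """
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept)

prompt_ids = [11, 12, 13, 14]   # made-up ids
completion_ids = [21, 22]
input_ids, labels = build_labels(prompt_ids, completion_ids)

# Illustrative per-token probabilities: 0.5 on prompt tokens, 0.25 on completion tokens.
token_log_probs = [math.log(0.5)] * 4 + [math.log(0.25)] * 2
loss = masked_mean_nll(token_log_probs, labels)
```

The prompt tokens still condition the prediction, since they remain in `input_ids`, but they contribute nothing to the loss. A partial mask is the same idea with `IGNORE_INDEX` applied to only part of the prompt.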

Hyperparameters and diagnosis

The learning rate is the single most impactful knob; the standard AdamW starting point is $5\times10^{-5}$. A linear warmup (ramp from 0 over the first 5–10% of steps) avoids large, destabilizing early updates to the pretrained weights and to any freshly initialized adapter parameters; cosine annealing then decays the rate toward zero.
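The warmup-then-cosine shape is a one-function schedule. This is a minimal sketch that assumes the rate decays all the way to zero; the function name and the step counts in the example are illustrative.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr=5e-5):
    """Linear warmup from 0 to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Example: 1000 steps with a 5% warmup.
schedule = [lr_at(s, total_steps=1000, warmup_steps=50) for s in range(1001)]
```

The rate peaks exactly at the end of warmup and falls monotonically afterwards, so a loss spike during the first few percent of steps points at the warmup length rather than the peak value.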

A loss curve is a diagnostic instrument — most failures are readable at a glance:

| Symptom | Likely cause | Fix |
|---|---|---|
| Loss → NaN | LR too high | Reduce LR 10× |
| Flat from step 0 | LR too low / data pipeline bug | Raise LR; verify the data |
| Train ↓, val ↑ | Overfitting | Fewer epochs, more data, regularize |
| Both plateau high | Underfitting | More epochs, higher LR, larger model |
| Mid-run spikes | Corrupted batch / bad examples | Inspect the data |
| Oscillating | LR too high / batch too small | Lower LR or raise batch |

Two reproducibility notes: seed all RNGs (random, numpy, torch + CUDA), and even then expect minor run-to-run drift from GPU floating-point atomics — report results averaged over 3–5 seeds.
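A seeding helper along these lines covers the RNGs listed above. It is a sketch: NumPy and PyTorch are imported only if installed, and the `summarize` helper for multi-seed reporting is a hypothetical convenience, not a library function.

```python
import random
import statistics

def seed_everything(seed):
    """Seed Python, NumPy, and PyTorch (CPU + CUDA) where they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored without CUDA
    except ImportError:
        pass

def summarize(scores):
    """Mean and sample standard deviation of per-seed eval scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

In use, each run calls `seed_everything(seed)` before building the model and data loader, and the per-seed scores go through `summarize` so the spread is reported alongside the mean.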

Parameter-efficient fine-tuning: LoRA

LoRA adds a trainable low-rank branch (A, B) alongside the frozen weight W. Only A and B receive gradients; their product is scaled by α/r and added to the frozen output.

The forward pass becomes $h = Wx + \tfrac{\alpha}{r}BAx$, where rank $r$ sets the capacity of the update and $\alpha$ scales it; the ratio $\alpha/r$ is what actually controls adaptation magnitude.
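The forward pass can be written out in plain Python for a single vector. The dimensions are toy values, and the initialization (small random `A`, zero `B`) follows the common LoRA convention, so the adapter contributes nothing at step 0; treat both as assumptions of this sketch.

```python
import random

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """h = W x + (alpha / r) * B (A x); W is frozen, only A and B would train."""
    base = matvec(W, x)                # frozen path
    update = matvec(B, matvec(A, x))   # low-rank path: d_in -> r -> d_out
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_out, d_in, r, alpha = 4, 6, 2, 4     # toy sizes
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]  # r x d_in
B = [[0.0] * r for _ in range(d_out)]                                 # d_out x r, zeros
x = [1.0] * d_in

h0 = lora_forward(W, A, B, x, alpha, r)  # equals W x while B is all zeros
```

Computing `B (A x)` rather than forming `BA` keeps the extra cost at roughly `r * (d_in + d_out)` multiply-adds per vector, which is also the number of trainable parameters the adapter adds.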

LoRA vs QLoRA vs full fine-tuning

QLoRA goes further: it stores the frozen base in 4-bit NF4 quantization while keeping the LoRA adapters in full precision — ~4× less memory for a slight quality cost. The memory story on a 7B model:

| Method | Trainable | Memory (7B) | Quality |
|---|---|---|---|
| Full fine-tuning | 100% | ~60 GB | Best |
| LoRA ($r = 8$) | ~0.1% | ~16 GB | Near-full |
| QLoRA ($r = 8$) | ~0.1% | ~6 GB | Slightly below LoRA |
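The ~4× figure is easiest to see on the frozen base weights alone. The arithmetic below assumes a 16-bit base for plain LoRA (the precision is an assumption here) and ignores activations, gradients, optimizer state, and quantization constants, so it will not reproduce the table's totals.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9
base_16bit_gb = weight_memory_gb(n_params, 16)  # frozen base under plain LoRA (assumed 16-bit)
base_4bit_gb = weight_memory_gb(n_params, 4)    # frozen base under QLoRA (4-bit NF4)
ratio = base_16bit_gb / base_4bit_gb
```

This gives 14 GB versus 3.5 GB for the base weights; the rest of each row in the table is working memory on top of that.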

Decision tree

Which fine-tuning method?

  • Multiple GPUs, >80 GB aggregate VRAM? Full fine-tuning for best quality.
  • Single 16–24 GB GPU? LoRA, $r = 8$–$16$, target Q+V (or all-linear if quality lags).
  • Single consumer GPU under 16 GB? QLoRA (4-bit, bitsandbytes).
  • Serving many task-specific models? LoRA adapters hot-swapped on one base model, a large serving win.

Start by targeting the query and value projections; expand to all-linear only if quality is insufficient. The minimal config:

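```python
from peft import LoraConfig, get_peft_model, TaskType

# base_model is assumed to be an already-loaded causal LM
# (for example via transformers' AutoModelForCausalLM.from_pretrained).
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling: alpha/r = 2
    target_modules=["q_proj", "v_proj"],  # start with Q + V
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable: 1,572,864 / 1,236,635,648 → 0.13%
```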
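The trainable count in that last comment follows from the adapter shapes: each targeted projection with input size `d_in` and output size `d_out` adds `r * (d_in + d_out)` parameters. The source does not name the base model, so the architecture below (24 layers, 2048×2048 query and value projections) is purely an assumption chosen because it is consistent with the printed number.

```python
def lora_param_count(n_layers, r, shapes):
    """shapes: (d_in, d_out) for each targeted projection in one layer."""
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

# Hypothetical architecture, not confirmed by the source.
trainable = lora_param_count(n_layers=24, r=8, shapes=[(2048, 2048), (2048, 2048)])
total = 1_236_635_648  # total from the printed summary above
fraction = trainable / total
```

Doubling `r` doubles this count, and adding the remaining linear layers to `target_modules` grows it by each layer's `r * (d_in + d_out)`.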

Retrieval check

Answer from memory, then expand to check — or go deeper in the practice questions.

Give two splitting rules that prevent leakage, and the one rule unique to RL test sets. FRI-2.1

Use time-based splits (train past, test future) and always hold out an out-of-distribution set; avoid random splits (near-duplicates leak across them). The RL-specific rule: the RL test set must be graded by a different reward model than training used.

Write the cross-entropy gradient at one output position, and read off its magnitude when p = 0.9. FRI-2.2

$\partial\mathcal{L}/\partial z_i = p_i - \mathbb{1}[i = y]$ (“predicted − target”). For the correct token at $p = 0.9$, the magnitude is $|0.9 - 1| = 0.1$, a gentle nudge; when the model is confidently wrong and the correct token sits at only $p = 0.1$, the magnitude is $0.9$.

Why does loss masking matter, and roughly how much of an instruction example is prompt? FRI-2.2

Masking the prompt focuses every gradient update on generating the completion — the model already receives the prompt as input, so training it to reproduce the prompt wastes capacity. For instruction data the prompt is often ~80% of the sequence, so unmasked training misallocates ~80% of updates.

Map four loss-curve symptoms to their fixes. FRI-2.3

NaN → LR too high, cut ~10×. Flat from step 0 → LR too low / data bug. Train ↓ but val ↑ → overfitting (fewer epochs, more data). Oscillating → LR too high or batch too small.

State the LoRA forward pass; explain why the alpha/r ratio (not alpha alone) is the knob. FRI-2.4

$h = Wx + \frac{\alpha}{r}BAx$. The ratio $\alpha/r$ scales the update, so doubling $r$ without rescaling $\alpha$ silently halves the scaling factor: the more expressive adapter can end up adapting less.

Summary

Supervised fine-tuning is stable imitation, and its quality is set long before training: by data curation, dedup, and honest splits. The objective is cross-entropy, whose gradient ($p - \mathbb{1}[i=y]$) self-scales with error; loss masking and teacher forcing make it both focused and fast. The learning rate dominates the hyperparameters, and the loss curve diagnoses most failures. LoRA exploits the low-rank structure of fine-tuning updates to cut trainable parameters ~1000×, which makes it the practitioner’s default, with QLoRA pushing it onto a single consumer GPU. Chapter 2b turns to RL: reward models, RLHF, and the PPO → GRPO line.