Part 1 Chapter 2 Last verified 2026-06-18

Core Techniques I: Supervised Fine-Tuning & LoRA

The mechanics of supervised fine-tuning: data preparation and splitting, tokenization, the cross-entropy objective and its gradient, loss masking and teacher forcing, the hyperparameters that decide success, and parameter-efficient fine-tuning with LoRA and QLoRA.

On this page

Data preparation: quality over quantity
Tokenization: what the model actually sees
The SFT objective: cross-entropy and its gradient
Hyperparameters and diagnosis
Parameter-efficient fine-tuning: LoRA
LoRA vs QLoRA vs full fine-tuning
Retrieval check
Summary

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Parameter-efficient fine-tuning: LoRA; if any is shaky, read closely — each is developed below.

Predict: if you trained SFT without masking the prompt (loss on every token), what behavior would degrade?
Predict: as the model’s probability on the correct token climbs $0.5 \to 0.99$ , does the gradient magnitude grow or shrink?
Your training loss suddenly jumps to NaN. What’s the first thing you change?
LoRA freezes $W$ and trains $A$ and $B$ . Roughly what fraction of parameters does that make trainable on a 7B model, and why does it still work?

Check your answers

It would spend a large share of updates learning to reproduce prompts (≈80% of an instruction example is prompt tokens), diluting the signal for generating responses — instruction-following degrades. Masking focuses every update on the completion.
Shrink. The gradient on the correct logit is $p - 1$ , so $|p-1|$ falls toward 0 as $p \to 1$ — the model updates least where it is already confident and correct.
Cut the learning rate ~10×. NaN loss most often means the LR is too high, so rule that out first.
~0.1% trainable. Fine-tuning updates are empirically low-rank, so a thin $BA$ decomposition captures most of the useful adaptation.

Data preparation: quality over quantity

The single biggest lever in post-training is data quality, not quantity. Frontier labs routinely discard ~99% of candidate data, keeping only the cleanest subset that measurably improves the model. SFT data is (input, target-output) pairs — for reasoning tasks the target includes a chain-of-thought followed by the final answer.

Splitting decides whether your eval means anything:

Avoid random splits — near-duplicate examples leak across train/test and inflate scores.
Prefer time-based splits (train on 2024, test on 2025) to measure real generalization.
Always hold out an out-of-distribution set of long-tail inputs.
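A minimal sketch of the time-based rule above, assuming each example carries a `date` field (a hypothetical schema, not something the chapter prescribes). Everything before the cutoff trains; everything on or after it tests.

```python
from datetime import date

def time_based_split(examples, cutoff):
    """Train on examples dated before `cutoff`, test on the rest."""
    train = [ex for ex in examples if ex["date"] < cutoff]
    test = [ex for ex in examples if ex["date"] >= cutoff]
    return train, test

# Illustrative records; the field names are assumptions for this sketch.
examples = [
    {"date": date(2024, 3, 1), "text": "a"},
    {"date": date(2024, 11, 20), "text": "b"},
    {"date": date(2025, 2, 5), "text": "c"},
]
train, test = time_based_split(examples, cutoff=date(2025, 1, 1))
print(len(train), len(test))
```

A split like this only helps if near-duplicates of the test period have not leaked into the training period, so it pairs naturally with deduplication.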

Deduplication catches leakage from similar, not just identical, examples. The standard pipeline is exact-hash → MinHash LSH (approximate near-duplicate detection) → manual review. This is where most of that 99% gets discarded.
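The first two stages of that pipeline can be sketched with the standard library alone. This is a simplified illustration, not a production pipeline: it compares MinHash signatures pairwise instead of bucketing them with LSH bands, and the shingle size, number of hashes, and 0.6 threshold are arbitrary choices for the example.

```python
import hashlib

def shingles(text, n=3):
    """Set of word n-grams; very short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash(text, num_hashes=64):
    """Signature: for each seeded hash function, the minimum hash over all shingles."""
    sh = shingles(text)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in sh
        )
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(texts, threshold=0.6):
    kept, seen_hashes, kept_sigs = [], set(), []
    for text in texts:
        # Stage 1: exact match after light normalization.
        exact = hashlib.sha256(text.strip().lower().encode()).hexdigest()
        if exact in seen_hashes:
            continue
        # Stage 2: near-duplicate check against everything kept so far.
        sig = minhash(text)
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue
        seen_hashes.add(exact)
        kept.append(text)
        kept_sigs.append(sig)
    return kept

docs = [
    "the quick brown fox jumps over the lazy dog near the old river bank today",
    "The quick brown fox jumps over the lazy dog near the old river bank today  ",
    "the quick brown fox jumps over the lazy dog near the old river bank tonight",
    "gradient clipping limits the norm of the update before the optimizer step runs",
]
unique = deduplicate(docs)
print(len(unique))
```

The pairwise comparison is quadratic in the number of kept documents; LSH banding exists precisely to avoid that cost at scale.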

Tokenization: what the model actually sees

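Models never see text. They see tokens: subword units produced by a tokenizer and then mapped to vectors. Byte-Pair Encoding starts from bytes (or characters) and iteratively merges the most frequent adjacent pair of symbols into a new token; vocabularies typically run to tens of thousands of tokens (GPT-2 uses ~50k), and some newer models use 100k or more.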

Two operational gotchas: sequences in a batch are padded to uniform length for GPU efficiency, and tokenizers are not interchangeable across model families — a tokenizer mismatch is a silent killer (the model trains on garbage indices with no error).
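Padding is easy to picture with plain lists. The sketch below assumes right-padding and the common convention that the attention mask is 1 for real tokens and 0 for padding; `pad_id=0` is a placeholder, since the real padding ID depends on the tokenizer.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-ID lists to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token
    return input_ids, attention_mask

# Token IDs here are made up for illustration.
input_ids, attention_mask = pad_batch([[5, 17, 42], [8, 3]], pad_id=0)
print(input_ids)
print(attention_mask)
```

Padded positions should also be excluded from the loss, otherwise the model is trained to predict padding.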

The SFT objective: cross-entropy and its gradient

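For a target token $y_t$ at position $t$, the model assigns probability $p(y_t \mid y_{<t}, x)$ given the input $x$ and the preceding target tokens. The loss sums the negative log-probabilities over all $T$ target positions:

$$\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log\, p(y_t \mid y_{<t}, x)$$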

Worked example: cross-entropy values and the gradient

Loss. The per-token loss is just $-\log p$ of the correct token:

$$-\log(0.9) = 0.11, \quad -\log(0.5) = 0.69, \quad -\log(0.1) = 2.30, \quad -\log(0.01) = 4.61$$

Low-confidence correct predictions are penalized far more heavily: the loss grows as $-\log p$, without bound as $p \to 0$. Note the diminishing returns: going $0.1 \!\to\! 0.5$ saves 1.61 nats (natural-log units), but $0.5 \!\to\! 0.9$ saves only 0.59. Most training effort therefore lands on the hard “long tail” of tokens, which is exactly where improvement matters.
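These values are quick to reproduce. The sketch below uses only the `math` module; `sequence_cross_entropy` is a hypothetical helper that sums the per-token terms the way the loss formula does.

```python
import math

def sequence_cross_entropy(token_probs):
    """Sum of -log p over the probabilities assigned to the correct tokens."""
    return -sum(math.log(p) for p in token_probs)

# Per-token losses in nats for the probabilities discussed above.
per_token = {p: round(-math.log(p), 2) for p in (0.9, 0.5, 0.1, 0.01)}
print(per_token)

# A two-token sequence: the losses simply add.
print(round(sequence_cross_entropy([0.9, 0.5]), 3))
```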

Gradient. With respect to the output logits $z_i$ , cross-entropy has an elegant form:

$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - \mathbb{1}[i = y]$$

It is literally predicted − target. If the model already assigns $p=0.95$ to the correct token, the gradient on that logit is $-0.05$: a gentle nudge upward. If it assigns only $0.1$, the gradient is $-0.9$: a hard correction. The update size self-scales with how wrong the model is.
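The "predicted minus target" form can be checked numerically. The sketch below compares the analytic gradient against a central finite difference on a three-logit example; the logits are arbitrary values chosen for illustration.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    return -math.log(softmax(logits)[target])

def analytic_grad(logits, target):
    """p_i - 1[i == target] for every logit."""
    probs = softmax(logits)
    return [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]

def numeric_grad(logits, target, eps=1e-6):
    """Central finite-difference estimate of dL/dz_i."""
    grads = []
    for i in range(len(logits)):
        up, down = logits[:], logits[:]
        up[i] += eps
        down[i] -= eps
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

logits, target = [2.0, 0.5, -1.0], 0
g_analytic = analytic_grad(logits, target)
g_numeric = numeric_grad(logits, target)
print(g_analytic)
print(g_numeric)
```

The gradient components sum to zero: whatever probability mass is pushed toward the correct token is taken from the others.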

Completion problem. The model places probability $p = 0.8$ on the correct token. Fill the two blanks: the cross-entropy loss is $-\ln(0.8) =$ ___ nats, and the gradient on the correct logit is $p - 1 =$ ___.

Now you. A different token gets only $p = 0.5$ . Without computing logs, is its gradient magnitude larger or smaller than the $p = 0.8$ case — and what does that imply about which tokens dominate the update?

Key concept

Loss masking: train only on completions

FRI-2.2

SFT feeds the model the full sequence (prompt + completion) but computes loss only on the completion tokens — the prompt is masked. In TRL this is completion_only_loss=True.

Why it matters: for instruction data the prompt is often ~80% of the sequence. Unmasked, roughly 80% of the loss terms (and so most of the gradient signal) go to reproducing a prompt the model is already given as input, a serious misallocation. Masking spends the entire training signal on the response.

When it breaks: a few-shot in-context task may want the model to learn prompt patterns — there, a partial mask (mask the system prompt, keep the few-shot examples) is the middle ground.
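Mechanically, masking usually means setting the label at every prompt position to a sentinel that the loss skips. The sketch below assumes the `-100` sentinel, which is the default `ignore_index` of PyTorch's cross-entropy loss; the token IDs and per-token losses are made up, and the one-position shift that real next-token training applies to labels is omitted for clarity.

```python
IGNORE_INDEX = -100  # label value skipped by the loss

def build_labels(prompt_ids, completion_ids):
    """Full sequence as input; labels masked over the prompt."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels

def masked_mean_loss(per_token_loss, labels):
    """Average the loss over unmasked (completion) positions only."""
    kept = [loss for loss, label in zip(per_token_loss, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)

prompt_ids = [101, 7, 8, 9, 10, 11, 12, 13]   # 8 prompt tokens (80% of the sequence)
completion_ids = [55, 66]                      # 2 completion tokens
input_ids, labels = build_labels(prompt_ids, completion_ids)

per_token_loss = [3.0] * 8 + [0.5, 1.5]        # illustrative values
masked = masked_mean_loss(per_token_loss, labels)
unmasked = sum(per_token_loss) / len(per_token_loss)
print(masked, unmasked)
```

In the unmasked average the prompt positions dominate, which is the misallocation described above.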

Hyperparameters and diagnosis

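The learning rate is the single most impactful knob; the standard AdamW starting point is $5\times10^{-5}$. A linear warmup (ramp from 0 over the first 5–10% of steps) avoids large, destabilizing updates early in training, when the optimizer's moment estimates are still noisy and any newly initialized parameters (such as LoRA adapters) have not yet settled; cosine annealing then decays the rate toward zero.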
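The warmup-then-cosine schedule is a few lines of arithmetic. This is a sketch of the shape described above, not any particular library's scheduler; the function name and the 1000-step, 50-warmup-step example are assumptions.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=5e-5):
    """Linear warmup from 0 to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

total_steps, warmup_steps = 1000, 50   # warmup is 5% of training
for step in (0, 25, 50, 525, 1000):
    print(step, lr_at_step(step, total_steps, warmup_steps))
```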

A loss curve is a diagnostic instrument — most failures are readable at a glance:

| Symptom | Likely cause | Fix |
|---|---|---|
| Loss → NaN | LR too high | Reduce LR 10× |
| Flat from step 0 | LR too low / data pipeline bug | Raise LR; verify the data |
| Train ↓, val ↑ | Overfitting | Fewer epochs, more data, regularize |
| Both plateau high | Underfitting | More epochs, higher LR, larger model |
| Mid-run spikes | Corrupted batch / bad examples | Inspect the data |
| Oscillating | LR too high / batch too small | Lower LR or raise batch |

Two reproducibility notes: seed all RNGs (random, numpy, torch + CUDA), and even then expect minor run-to-run drift, because some GPU kernels accumulate floating-point sums with atomic operations whose order varies between runs. Report results averaged over 3–5 seeds.
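A seeding helper along these lines covers the RNGs listed above. `seed_everything` is a hypothetical name; the NumPy and PyTorch imports are wrapped so the sketch still runs where those packages are not installed.

```python
import random

def seed_everything(seed):
    """Seed Python, and NumPy / PyTorch (CPU and CUDA) when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    except ImportError:
        pass

seed_everything(123)
first = random.random()
seed_everything(123)
second = random.random()
print(first == second)
```

Seeding makes the random draws repeatable; it does not remove the kernel-level nondeterminism, which is why averaging over several seeds is still worthwhile.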

Parameter-efficient fine-tuning: LoRA

LoRA adds a trainable low-rank branch (A, B) alongside the frozen weight W. Only A and B receive gradients; their product is scaled by α/r and added to the frozen output.

The forward pass becomes $h = Wx + \tfrac{\alpha}{r}BAx$ , where rank $r$ sets the capacity of the update and alpha $\alpha$ scales it — the ratio $\alpha/r$ is what actually controls adaptation magnitude.
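The forward pass is small enough to write with plain lists. In this sketch $A$ has shape $r \times d$ and $B$ has shape $d \times r$, so the rank is read off as the number of rows of $A$; the tiny 2×2 example values are arbitrary.

```python
def matvec(M, x):
    """Matrix-vector product for lists of rows."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """h = W x + (alpha / r) * B (A x), with r = number of rows of A."""
    r = len(A)
    base = matvec(W, x)                 # frozen path
    update = matvec(B, matvec(A, x))    # low-rank path
    return [b + (alpha / r) * u for b, u in zip(base, update)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity, for readability)
A = [[0.5, 0.5]]               # r x d, with r = 1
x = [1.0, 2.0]

B_init = [[0.0], [0.0]]        # B = 0: the adapter contributes nothing
B_trained = [[1.0], [-1.0]]    # stand-in for a trained B

h_init = lora_forward(W, A, B_init, x, alpha=2.0)
h_trained = lora_forward(W, A, B_trained, x, alpha=2.0)
print(h_init, h_trained)
```

With $B = 0$ the output equals $Wx$ exactly, so adding the adapter does not change the model's behavior until $B$ moves away from zero.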

Worked example: how few parameters is LoRA, really?

Take one attention projection, $d = 4096$ , adapted at rank $r = 8$ .

Full fine-tune of that matrix: $d^2 = 4096^2 \approx 16.8\text{M}$ parameters.
LoRA: $B$ is $d\times r$ and $A$ is $r\times d$ , so $2dr = 2 \cdot 4096 \cdot 8 = 65{,}536$ — about 0.39% of the full matrix.
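The same arithmetic as a small function, so other dimensions and ranks are easy to try. `lora_param_counts` is a hypothetical helper written for this example.

```python
def lora_param_counts(d_in, d_out, r):
    """Parameters in the full matrix versus the A (r x d_in) and B (d_out x r) pair."""
    full = d_in * d_out
    lora = r * d_in + d_out * r
    return full, lora

full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, f"{100 * lora / full:.2f}%")
```

Because the LoRA count grows linearly in $r$ while the full matrix is fixed, doubling the rank doubles the adapter size but still leaves it under 1% of this matrix.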

Across a whole 1.2B model adapting Q and V projections, print_trainable_parameters() reports ~1.57M trainable of 1.24B — 0.13%. Two practical consequences fall out of the math: (1) you train with a ~10× higher learning rate than full FT, because $A$ starts as small random values and $B$ starts at exactly zero (so $\Delta W = BA = 0$ at step 0 — no perturbation) and the small matrices need larger steps to move; (2) after training the adapter can be merged back into $W$ for zero inference overhead.
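Merging amounts to folding the scaled product into the frozen weight, $W' = W + \tfrac{\alpha}{r}BA$, after which a single matrix multiply reproduces the adapter's output. The sketch below checks that on a toy example with plain lists; the values are arbitrary and `merge_lora` is a hypothetical helper, not a library function.

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def merge_lora(W, A, B, alpha):
    """Return W + (alpha / r) * B A, with r = number of rows of A."""
    r = len(A)
    scale = alpha / r
    merged = []
    for i in range(len(W)):
        row = []
        for j in range(len(W[0])):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            row.append(W[i][j] + scale * delta)
        merged.append(row)
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]
B = [[1.0], [-1.0]]
x = [1.0, 2.0]
alpha = 2.0

# Two-branch output: frozen path plus scaled low-rank path.
adapter_out = [
    wx + (alpha / len(A)) * u
    for wx, u in zip(matvec(W, x), matvec(B, matvec(A, x)))
]
# Single-matrix output after merging.
W_merged = merge_lora(W, A, B, alpha)
merged_out = matvec(W_merged, x)
print(adapter_out, merged_out)
```

Keeping the adapter unmerged is what allows several adapters to share one base model; merging trades that flexibility for a plain dense layer at inference time.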

LoRA vs QLoRA vs full fine-tuning

QLoRA goes further: it stores the frozen base in 4-bit NF4 quantization while keeping the LoRA adapters in higher precision (typically 16-bit). That cuts the memory for the base weights roughly 4× relative to 16-bit storage, for a slight quality cost. The memory story on a 7B model:

| Method | Trainable | Memory (7B) | Quality |
|---|---|---|---|
| Full fine-tuning | 100% | ~60 GB | Best |
| LoRA ($r{=}8$) | ~0.1% | ~16 GB | Near-full |
| QLoRA ($r{=}8$) | ~0.1% | ~6 GB | Slightly below LoRA |
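The weight-storage part of those totals is simple arithmetic. The sketch below covers only the base weights of a 7B model; it deliberately ignores gradients, optimizer state, activations, and the small overhead of quantization constants, which is why its numbers are lower than the table's totals.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n_params = 7e9
mem_16bit = weight_memory_gb(n_params, 16)   # 16-bit base weights
mem_4bit = weight_memory_gb(n_params, 4)     # 4-bit quantized base weights
print(mem_16bit, mem_4bit, mem_16bit / mem_4bit)
```

The gap between these weight-only figures and the table is the rest of the training footprint, which LoRA and QLoRA shrink mainly by removing gradients and optimizer state for the frozen parameters.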

Decision tree

Which fine-tuning method?

Multiple GPUs, >80 GB aggregate VRAM? → Full fine-tuning for best quality.
Single 16–24 GB GPU? → LoRA, $r{=}8\text{–}16$ , target Q+V (or all-linear if quality lags).
Single consumer GPU under 16 GB? → QLoRA (4-bit, bitsandbytes).
Serving many task-specific models? → LoRA adapters hot-swapped on one base model — a large serving win.

Start by targeting the query and value projections; expand to all-linear only if quality is insufficient. The minimal config:

```python
from peft import LoraConfig, get_peft_model, TaskType

# base_model is assumed to be a causal LM that has already been loaded.
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling: alpha/r = 2
    target_modules=["q_proj", "v_proj"],  # start with Q + V
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable: 1,572,864 / 1,236,635,648 → 0.13%
```

Retrieval check

Answer from memory, then expand to check — or go deeper in the practice questions.

Give two splitting rules that prevent leakage, and the one rule unique to RL test sets. FRI-2.1

Use time-based splits (train past, test future) and always hold out an out-of-distribution set; avoid random splits (near-duplicates leak across them). The RL-specific rule: the RL test set must be graded by a different reward model than training used.

Write the cross-entropy gradient at one output position, and read off its magnitude when p = 0.9. FRI-2.2

$\partial\mathcal{L}/\partial z_i = p_i - \mathbb{1}[i = y]$ (“predicted − target”). For the correct token at $p = 0.9$, the magnitude is $|0.9 - 1| = 0.1$, a gentle nudge; a correct token the model assigns only $p = 0.1$ gets $0.9$.

Why does loss masking matter, and roughly how much of an instruction example is prompt? FRI-2.2

Masking the prompt focuses every gradient update on generating the completion — the model already receives the prompt as input, so training it to reproduce the prompt wastes capacity. For instruction data the prompt is often ~80% of the sequence, so unmasked training misallocates ~80% of updates.

Map four loss-curve symptoms to their fixes. FRI-2.3

NaN → LR too high, cut ~10×. Flat from step 0 → LR too low / data bug; raise the LR and verify the data. Train ↓ but val ↑ → overfitting (fewer epochs, more data). Oscillating → LR too high or batch too small; lower the LR or raise the batch size.

State the LoRA forward pass; explain why the alpha/r ratio (not alpha alone) is the knob. FRI-2.4

$h = Wx + \frac{\alpha}{r}BAx$. The ratio $\alpha/r$ scales the update, so doubling $r$ without rescaling $\alpha$ silently halves the scale on $BA$: the more expressive adapter can end up adapting less.

Summary

Supervised fine-tuning is stable imitation, and its quality is set long before training: by data curation, dedup, and honest splits. The objective is cross-entropy, whose gradient ( $p - \mathbb{1}[i=y]$ ) self-scales with error; loss masking and teacher forcing make it both focused and fast. The learning rate dominates the hyperparameters, and the loss curve diagnoses most failures. LoRA exploits the low-rank structure of fine-tuning updates to cut trainable parameters ~1000× — the practitioner’s default — with QLoRA pushing it onto a single consumer GPU. Chapter 2b turns to RL: reward models, RLHF, and the PPO → GRPO line.