Core Techniques I: Supervised Fine-Tuning & LoRA
The mechanics of supervised fine-tuning: data preparation and splitting, tokenization, the cross-entropy objective and its gradient, loss masking and teacher forcing, the hyperparameters that decide success, and parameter-efficient fine-tuning with LoRA and QLoRA.
Data preparation: quality over quantity
The single biggest lever in post-training is data quality, not quantity. Frontier labs routinely discard ~99% of candidate data, keeping only the cleanest subset that measurably improves the model. SFT data is (input, target-output) pairs — for reasoning tasks the target includes a chain-of-thought followed by the final answer.
Splitting decides whether your eval means anything:
- Avoid random splits — near-duplicate examples leak across train/test and inflate scores.
- Prefer time-based splits (train on 2024, test on 2025) to measure real generalization.
- Always hold out an out-of-distribution set of long-tail inputs.
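The time-based rule above can be sketched in a few lines. This is a minimal illustration, assuming each example carries a date; the `date` and `text` field names and the sample records are hypothetical.

```python
from datetime import date

def time_split(examples, cutoff):
    """Train on examples dated before `cutoff`, test on everything from it onward."""
    train = [ex for ex in examples if ex["date"] < cutoff]
    test = [ex for ex in examples if ex["date"] >= cutoff]
    return train, test

# Hypothetical records; field names are illustrative only.
examples = [
    {"text": "q1", "date": date(2024, 3, 1)},
    {"text": "q2", "date": date(2024, 11, 20)},
    {"text": "q3", "date": date(2025, 2, 5)},
]
train, test = time_split(examples, cutoff=date(2025, 1, 1))
print(len(train), len(test))  # 2 1
```

Because every test example postdates every training example, a near-duplicate can only leak if the same content was republished after the cutoff, which is what the deduplication pass is for.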
Deduplication catches leakage from similar, not just identical, examples. The standard pipeline is exact-hash → MinHash LSH (approximate near-duplicate detection) → manual review. This is where most of that 99% gets discarded.
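The first two stages of that pipeline might look like the sketch below. It is a simplification under stated assumptions: signatures are compared pairwise rather than bucketed with LSH banding (which is what makes the real thing scale), and the shingle size, number of hash functions, and 0.8 threshold are illustrative choices, not values from the text.

```python
import hashlib

def shingles(text, k=3):
    """Set of k-word shingles from lowercased, whitespace-split text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash(shingle_set, num_hashes=64):
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(texts, threshold=0.8):
    seen_hashes, kept, kept_sigs = set(), [], []
    for text in texts:
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen_hashes:  # stage 1: exact duplicate
            continue
        seen_hashes.add(digest)
        sig = minhash(shingles(text))
        if any(estimated_jaccard(sig, other) >= threshold for other in kept_sigs):
            continue  # stage 2: near duplicate
        kept.append(text)
        kept_sigs.append(sig)
    return kept

corpus = [
    "The cat sat on the mat today",
    "The cat sat on the mat today",   # exact duplicate
    "the cat sat on the mat TODAY",   # same words, different casing
    "Gradient descent minimizes a loss function",
]
unique = dedup(corpus)
```

Here `unique` keeps the first and last strings: the second is caught by the exact hash, the third by its identical shingle set. Whatever survives both stages is what goes to manual review.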
Tokenization: what the model actually sees
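Models never see text. They see tokens: subword units produced by a tokenizer and then mapped to vectors. Byte-Pair Encoding starts from bytes (or characters) and iteratively merges the most frequent adjacent pair into a new token; vocabularies run from tens of thousands of tokens (GPT-2 uses ~50k) to over 100k in recent models.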
Two operational gotchas: sequences in a batch are padded to uniform length for GPU efficiency, and tokenizers are not interchangeable across model families — a tokenizer mismatch is a silent killer (the model trains on garbage indices with no error).
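Padding is easy to picture with a toy batch. The sketch below assumes right-padding and a pad id of 0; both are illustrative, and the token ids are made up. Real tokenizers define their own pad token and padding side.

```python
def pad_batch(sequences, pad_id):
    """Right-pad token-id lists to the longest length and build an attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

batch = [[101, 7, 42], [101, 9]]  # hypothetical token ids
ids, mask = pad_batch(batch, pad_id=0)
print(ids)   # [[101, 7, 42], [101, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

The mask tells the model which positions are real. A cheap guard against the tokenizer-mismatch failure is to assert that every id in the batch is below the model's embedding table size before training starts, although that only catches out-of-range ids, not a same-size vocabulary with different contents.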
The SFT objective: cross-entropy and its gradient
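For a target token $y_t$ at position $t$ with model probability $p_t = p_\theta(y_t \mid x, y_{<t})$, the per-token loss and its gradient with respect to logit $z_i$ are:

$$\mathcal{L}_t = -\log p_t, \qquad \frac{\partial \mathcal{L}_t}{\partial z_i} = p_i - y_i$$

where $p_i$ is the softmax probability of vocabulary entry $i$ and $y_i$ is 1 for the target token and 0 otherwise.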
Completion problem. The model places probability $p$ on the correct token. Fill the two blanks in terms of $p$: the cross-entropy loss is ___ nats, and the gradient on the correct logit is ___.
Now you. A different token gets only a smaller probability $p' < p$. Without computing logs, is its gradient magnitude larger or smaller than in the $p$ case, and what does that imply about which tokens dominate the update?
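To check answers numerically, the loss and its "predicted − target" gradient can be computed directly. This is a small sketch with made-up logits over a three-token vocabulary; nothing here is specific to any library.

```python
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss_and_grad(logits, target):
    """Cross-entropy in nats and its gradient with respect to each logit."""
    p = softmax(logits)
    loss = -math.log(p[target])
    grad = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]
    return loss, grad

logits = [2.0, 0.5, -1.0]  # hypothetical logits
loss, grad = ce_loss_and_grad(logits, target=0)
print(round(loss, 4), [round(g, 4) for g in grad])
```

The gradient on the correct logit is always negative (push it up), the others are positive (push them down), and the entries sum to zero. The lower the probability on the correct token, the closer its gradient magnitude gets to 1.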
Loss masking: train only on completions
SFT feeds the model the full sequence (prompt + completion) but computes loss only on the completion tokens; the prompt is masked. In TRL this is `completion_only_loss=True`. FRI-2.2
Why it matters: for instruction data the prompt is often ~80% of the sequence. Unmasked, roughly 80% of the loss terms, and hence most of the gradient signal, go to reproducing a prompt the model already received, which is a catastrophic misallocation. Masking spends the whole update on the response.
When it breaks: a few-shot in-context task may want the model to learn prompt patterns — there, a partial mask (mask the system prompt, keep the few-shot examples) is the middle ground.
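Mechanically, masking usually means setting the label at each prompt position to an ignore value. The sketch below uses -100, the default `ignore_index` of PyTorch's cross-entropy loss; the helper name and token ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def build_labels(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

prompt = [11, 12, 13, 14, 15, 16, 17, 18]  # hypothetical prompt token ids
completion = [21, 22]                      # hypothetical completion token ids
input_ids, labels = build_labels(prompt, completion)

masked_fraction = labels.count(IGNORE_INDEX) / len(labels)
print(labels)           # [-100, -100, -100, -100, -100, -100, -100, -100, 21, 22]
print(masked_fraction)  # 0.8
```

A partial mask for the few-shot case follows the same pattern: only the system-prompt positions get `IGNORE_INDEX`, and the few-shot examples keep their real token ids as labels.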
Hyperparameters and diagnosis
The learning rate is the single most impactful knob; the standard AdamW starting point is . A linear warmup (ramp from 0 over the first 5–10% of steps) avoids destabilizing the randomly-initialized parameters, then cosine annealing decays toward zero.
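The warmup-then-cosine shape can be written as a single function of the step. This is a sketch, not a particular library's scheduler; `PEAK_LR` is a placeholder chosen for illustration only and is not the starting point recommended above.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

PEAK_LR = 1e-4       # placeholder value, an assumption for this example
TOTAL, WARMUP = 1000, 50  # warmup is 5% of the run

schedule = [lr_at(s, TOTAL, WARMUP, PEAK_LR) for s in range(TOTAL + 1)]
```

The schedule starts at zero, reaches `PEAK_LR` at step 50, and decays smoothly to zero at the final step.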
A loss curve is a diagnostic instrument — most failures are readable at a glance:
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss → NaN | LR too high | Reduce LR 10× |
| Flat from step 0 | LR too low / data pipeline bug | Raise LR; verify the data |
| Train ↓, val ↑ | Overfitting | Fewer epochs, more data, regularize |
| Both plateau high | Underfitting | More epochs, higher LR, larger model |
| Mid-run spikes | Corrupted batch / bad examples | Inspect the data |
| Oscillating | LR too high / batch too small | Lower LR or raise batch |
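A few rows of the table can be turned into automated checks. The sketch below is a coarse heuristic under stated assumptions: the labels, the `tol` threshold, and the idea of comparing only endpoints are illustrative simplifications, and it covers only the divergence, flat, and overfitting rows.

```python
import math

def diagnose(train, val, tol=0.01):
    """Return a coarse label for a pair of loss curves (lists of floats)."""
    if any(math.isnan(x) or math.isinf(x) for x in train):
        return "diverged"      # Loss → NaN: reduce LR
    train_drop = train[0] - train[-1]
    if abs(train_drop) <= tol:
        return "flat"          # raise LR, verify the data pipeline
    val_rise = val[-1] - min(val)
    if train_drop > tol and val_rise > tol:
        return "overfitting"   # train ↓, val ↑
    return "ok"

print(diagnose([2.0, 1.0, 0.5], [2.0, 1.5, 1.8]))  # overfitting
```

Checks like this are useful as alerts during long runs; they do not replace looking at the curve.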
Two reproducibility notes: seed all RNGs (random, numpy, torch + CUDA), and even
then expect minor run-to-run drift from GPU floating-point atomics — report results averaged
over 3–5 seeds.
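A seeding helper along these lines covers the RNGs listed above. It is a sketch: the function name is hypothetical, and the NumPy and PyTorch calls are guarded so the snippet also runs where those libraries are not installed.

```python
import random

def seed_everything(seed):
    """Seed Python, NumPy, and PyTorch (CPU and CUDA) RNGs where available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
    except ImportError:
        pass

seed_everything(0)
first = random.random()
seed_everything(0)
second = random.random()
print(first == second)  # True
```

Seeding makes the data order and initialization repeatable; it does not remove the GPU-level nondeterminism, which is why results are still averaged over several seeds.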
Parameter-efficient fine-tuning: LoRA
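LoRA freezes the pretrained weight $W_0$ and learns a low-rank update $BA$, with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. The forward pass becomes $h = W_0 x + \frac{\alpha}{r} B A x$, where the rank $r$ sets the capacity of the update and $\alpha$ scales it; the ratio $\alpha / r$ is what actually controls adaptation magnitude.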
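A toy version of that forward pass makes the scaling visible. This sketch uses plain Python lists and tiny hand-picked matrices; the helper names and values are illustrative, and a real implementation would use tensors.

```python
def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W0, A, B, x, alpha):
    """h = W0 x + (alpha / r) * B (A x), with r taken from the shape of A."""
    r = len(A)                         # A is r x k
    base = matvec(W0, x)               # frozen path
    update = matvec(B, matvec(A, x))   # low-rank path
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]

W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # frozen weight, 2 x 3
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # r x k with r = 2
B = [[1.0, 0.0], [0.0, 1.0]]              # d x r
x = [1.0, 2.0, 3.0]

h = lora_forward(W0, A, B, x, alpha=4.0)
print(h)  # [3.0, 6.0]

# With B initialized to zero, the adapted model starts identical to the base model.
B_zero = [[0.0, 0.0], [0.0, 0.0]]
print(lora_forward(W0, A, B_zero, x, alpha=4.0))  # [1.0, 2.0]
```

Here `alpha / r` is 2, so the low-rank path contributes twice `B A x`. Keeping `alpha` fixed while increasing `r` shrinks that multiplier.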
LoRA vs QLoRA vs full fine-tuning
QLoRA goes further: it stores the frozen base in 4-bit NF4 quantization while keeping the LoRA adapters in full precision, so the base weights take ~4× less memory than 16-bit storage, for a slight quality cost. The memory story on a 7B model:
| Method | Trainable | Memory (7B) | Quality |
|---|---|---|---|
| Full fine-tuning | 100% | ~60 GB | Best |
| LoRA | ~0.1% | ~16 GB | Near-full |
| QLoRA | ~0.1% | ~6 GB | Slightly below LoRA |
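The weight-storage part of those figures is simple arithmetic. The sketch below covers only the model weights, assuming 16-bit storage for LoRA's frozen base and 4-bit for QLoRA's, and 16-bit adapters; gradients, optimizer state, and activations make up the rest of each total and are not modeled here.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9

base_16bit = weight_memory_gb(N_PARAMS, 16)        # frozen base under LoRA
base_4bit = weight_memory_gb(N_PARAMS, 4)          # frozen base under QLoRA
adapters = weight_memory_gb(0.001 * N_PARAMS, 16)  # ~0.1% trainable, assumed 16-bit

print(base_16bit, base_4bit, adapters)  # 14.0 3.5 0.014
```

The 4× saving applies to the base weights (14 GB down to 3.5 GB). The adapters themselves are a rounding error, so the remaining gap to the table's totals comes from activations and training overhead.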
Which fine-tuning method?
- Multiple GPUs, >80 GB aggregate VRAM? → Full fine-tuning for best quality.
- Single 16–24 GB GPU? → LoRA, target Q+V (or all-linear if quality lags).
- Single consumer GPU under 16 GB? → QLoRA (4-bit, bitsandbytes).
- Serving many task-specific models? → LoRA adapters hot-swapped on one base model, a large serving win.
Start by targeting the query and value projections; expand to all-linear only if quality
is insufficient. The minimal config:
```python
from peft import LoraConfig, get_peft_model, TaskType

# base_model is assumed to be an already-loaded causal LM
# (e.g. from transformers.AutoModelForCausalLM.from_pretrained).
lora_config = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling: alpha/r = 2
    target_modules=["q_proj", "v_proj"],  # start with Q + V
    lora_dropout=0.05,
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable: 1,572,864 / 1,236,635,648 → 0.13%
```
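The trainable count can be sanity-checked by hand: each adapted matrix adds `r * (d_in + d_out)` parameters. The sketch below assumes a model shape of 24 layers with hidden size 2048 and square query and value projections. That shape is a guess, chosen because it reproduces the count shown above; the text does not name the model.

```python
def lora_param_count(d_in, d_out, r):
    """A is r x d_in and B is d_out x r, so the adapter adds r * (d_in + d_out)."""
    return r * (d_in + d_out)

# Assumed shape, not stated in the text.
HIDDEN, LAYERS, RANK = 2048, 24, 8
TOTAL_PARAMS = 1_236_635_648  # total reported in the comment above

per_layer = 2 * lora_param_count(HIDDEN, HIDDEN, RANK)  # q_proj + v_proj
trainable = LAYERS * per_layer

print(trainable)                                  # 1572864
print(round(100 * trainable / TOTAL_PARAMS, 2))   # 0.13
```

Expanding to all linear layers multiplies the number of adapted matrices per layer, which is why it is held back until Q+V proves insufficient.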
Retrieval check
Answer from memory, then check against the answer that follows each question.
Give two splitting rules that prevent leakage, and the one rule unique to RL test sets. FRI-2.1
Use time-based splits (train past, test future) and always hold out an out-of-distribution set; avoid random splits (near-duplicates leak across them). The RL-specific rule: the RL test set must be graded by a different reward model than training used.
Write the cross-entropy gradient at one output position, and read off its magnitude when p = 0.9. FRI-2.2
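$\partial \mathcal{L} / \partial z_i = p_i - y_i$ (“predicted − target”). For the correct token at $p = 0.9$, the magnitude is $|0.9 - 1| = 0.1$, a gentle nudge; a wrong-and-confident prediction (correct-token probability near 0) gets a magnitude near 1.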
Why does loss masking matter, and roughly how much of an instruction example is prompt? FRI-2.2
Masking the prompt focuses every gradient update on generating the completion — the model already receives the prompt as input, so training it to reproduce the prompt wastes capacity. For instruction data the prompt is often ~80% of the sequence, so unmasked training misallocates ~80% of updates.
Map four loss-curve symptoms to their fixes. FRI-2.3
NaN → LR too high, cut ~10×. Flat from step 0 → LR too low / data bug. Train ↓ but val ↑ → overfitting (fewer epochs, more data). Oscillating → LR too high or batch too small.
State the LoRA forward pass; explain why the alpha/r ratio (not alpha alone) is the knob. FRI-2.4
$h = W_0 x + \frac{\alpha}{r} B A x$. The ratio $\alpha / r$ scales the update, so doubling $r$ without rescaling $\alpha$ silently halves the effective adaptation: the more expressive adapter can end up adapting less.
Summary
Supervised fine-tuning is stable imitation, and its quality is set long before training: by data curation, dedup, and honest splits. The objective is cross-entropy, whose gradient ($p - y$) self-scales with error; loss masking and teacher forcing make it both focused and fast. The learning rate dominates the hyperparameters, and the loss curve diagnoses most failures. LoRA exploits the low-rank structure of fine-tuning updates to cut trainable parameters ~1000×, which makes it the practitioner’s default, with QLoRA pushing it onto a single consumer GPU. Chapter 2b turns to RL: reward models, RLHF, and the PPO → GRPO line.