Fine-tuning & RL for LLMs: Intro to Post-training
Study notes on DeepLearning.AI's (DLAI) course Fine-tuning & RL for LLMs: Intro to Post-training (Sharon Zhou, in partnership with AMD). The notes cover where post-training sits in the LLM pipeline, the SFT-vs-RL tradeoff, data and grading requirements, reasoning and safety alignment, evaluation, and production pipelines, and include applied practice questions, a glossary, and spaced-recall flashcards.
Chapters
- Post-Training Overview
Where post-training sits in the LLM pipeline, the SFT-vs-RL tradeoff, the data and grading each needs, how reasoning and safety alignment are trained, and how reward hacking arises and is mitigated.
- Core Techniques I: Supervised Fine-Tuning & LoRA
The mechanics of supervised fine-tuning: data preparation and splitting, tokenization, the cross-entropy objective and its gradient, loss masking and teacher forcing, the hyperparameters that decide success, and parameter-efficient fine-tuning with LoRA and QLoRA.
- Core Techniques II: RL Algorithms — RLHF, PPO, GRPO
The RL half of post-training mechanics: reward signals and preference learning, DPO, the RLHF objective and its KL penalty, the four models of RLHF (policy, reference, reward, and value), PPO's clipped objective, GRPO's group-relative baseline, and mode collapse as a failure mode.
- Evaluation as the North Star
Evaluation as the active steering mechanism for post-training: the eval-train loop, the metrics that matter (pass@k, calibration, efficiency), RL test environments and KL/alignment-tax monitoring, reward hacking and Goodhart's Law, error analysis via embedding clustering, choosing an eval strategy, and red teaming.
- Data-Driven Post-Training
Data is the engine room of post-training: how much you need at each scale, the iterative counterexample loop, RL rollouts and reward shaping, the DeepSeek R1 pipeline, synthetic data generation and filtering, and Constitutional AI / RLAIF for alignment without armies of human labelers.
- Production Considerations
Shipping post-trained models: the end-to-end DeepSeek R1 pipeline, agent post-training, promotion rules and rollback, the data-feedback flywheel, monitoring and drift, serving infrastructure and GPU cost, and distillation.
Compliance
This book complies with DLAI's Community Code of Conduct:
- Non-commercial (CC-BY-NC-4.0 on prose, MIT on code)
- DLAI cited; instructor Sharon Zhou named
- No quizzes, graded assignments, lab solutions, or verbatim transcripts reproduced
Takedown requests: brandon.m.behring@gmail.com (48-hour response).