Fine-tuning & RL for LLMs: Intro to Post-training

Instructor: Sharon Zhou · Partner: AMD
Source course: https://www.deeplearning.ai/short-courses/fine-tuning-rl-for-llms-intro-to-post-training/
Course existence verified: June 18, 2026

Study notes on DeepLearning.AI's (DLAI) Fine-tuning & RL for LLMs: Intro to Post-training (Sharon Zhou, in partnership with AMD). The notes cover where post-training sits in the LLM pipeline, the SFT-vs-RL tradeoff, data and grading requirements, reasoning and safety alignment, evaluation, and production pipelines. They also include applied practice questions, a glossary, and spaced-recall flashcards.

Chapters

Post-Training Overview
Where post-training sits in the LLM pipeline, the SFT-vs-RL tradeoff, the data and grading each needs, how reasoning and safety alignment are trained, and how reward hacking arises and is mitigated.
Core Techniques I: Supervised Fine-Tuning & LoRA
The mechanics of supervised fine-tuning: data preparation and splitting, tokenization, the cross-entropy objective and its gradient, loss masking and teacher forcing, the hyperparameters that decide success, and parameter-efficient fine-tuning with LoRA and QLoRA.
Core Techniques II: RL Algorithms — RLHF, PPO, GRPO
The RL half of post-training mechanics: reward signals and preference learning, DPO, the RLHF objective and its KL penalty, the four models of RLHF, PPO's clipped objective, GRPO's group-relative baseline, and the failure mode of mode collapse.
Evaluation as the North Star
Evaluation as the active steering mechanism for post-training: the eval-train loop, the metrics that matter (pass@k, calibration, efficiency), RL test environments and KL/alignment-tax monitoring, reward hacking and Goodhart's Law, error analysis via embedding clustering, choosing an eval strategy, and red teaming.
Data-Driven Post-Training
Data as the engine room of post-training: how much data you need at each scale, the iterative counterexample loop, RL rollouts and reward shaping, the DeepSeek R1 pipeline, synthetic data generation and filtering, and Constitutional AI / RLAIF for alignment without armies of human labelers.
Production Considerations
Shipping post-trained models: the end-to-end DeepSeek R1 pipeline, agent post-training, promotion rules and rollback, the data-feedback flywheel, monitoring and drift, serving infrastructure and GPU cost, and distillation.

Compliance

This book complies with DLAI's Community Code of Conduct:

Non-commercial (CC-BY-NC-4.0 on prose, MIT on code)
DLAI cited; instructor Sharon Zhou named
No quizzes, graded assignments, lab solutions, or verbatim transcripts reproduced

Takedown requests: brandon.m.behring@gmail.com (48-hour response).