Evaluating AI Agents

Instructors: John Gilhuly & Aman Khan · Partner: Arize AI
Source course: https://www.deeplearning.ai/short-courses/evaluating-ai-agents/
Course existence verified: June 19, 2026

Study notes on DLAI's Evaluating AI Agents (John Gilhuly & Aman Khan, in partnership with Arize AI). The notes cover why LLM systems break traditional testing, how to decompose agents into router, skills, and memory, observability and tracing, component and trajectory evaluation, and LLM-as-judge with production monitoring. They include applied practice questions, a glossary, and spaced-recall flashcards.

Chapters

Evaluation Foundations
Why non-determinism breaks traditional testing for LLM systems, the model-vs-system evaluation distinction, the three-layer (model / application / agent) view, decomposing an agent into router / skills / memory as independent evaluation targets, and the evaluation-driven development loop.
Observability & Tracing
Traces and spans as the units of agent observability, the span hierarchy that mirrors agent architecture (Agent ⊃ Chain ⊃ Tool ⊃ LLM), automatic vs manual OpenTelemetry instrumentation, collecting traces in Phoenix, and turning traces into the bedrock for evaluation, cost attribution, and regression detection.
Component Evaluations
The three evaluation techniques (code-based, LLM-as-a-judge, human annotation) and how to pick among them, why LLM judges need discrete rails rather than numeric scores, router evaluation as two independent dimensions (tool selection × parameter extraction), per-skill evaluation by output type, and Phoenix evaluation with suppress_tracing.
Trajectory & Structured Evals
Why the path matters beyond the final answer, convergence scoring and its blind spots, the Phoenix Experiments framework (dataset → task → experiment → evaluators), comprehensive-not-exhaustive datasets, comparing agent variants on a dashboard, the production feedback flywheel, and turning wasted work into a cost.
LLM-Judge & Monitoring
Meta-evaluation (measuring the judge against code-based ground truth), agreement rate vs false-positive rate as leniency indicators, four levers to improve a judge, golden datasets as CI shipping gates and how they evolve, production monitoring for silent regressions, and the self-improving pipeline with its feedback-poisoning risk.
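As a rough illustration of the meta-evaluation idea named in the last chapter, the sketch below compares a judge's pass/fail verdicts with code-based ground truth and reports an agreement rate and a false-positive rate. The function name judge_metrics and the sample verdicts are hypothetical, and the false-positive rate is assumed to follow the usual definition (judge passes / ground truth fails, divided by all ground-truth fails); the course's exact formulas may differ.

```python
from typing import Sequence


def judge_metrics(judge: Sequence[bool], truth: Sequence[bool]) -> dict[str, float]:
    """Compare LLM-judge verdicts with code-based ground truth (True = pass).

    Hypothetical helper for illustration; not part of Phoenix or the course code.
    """
    if len(judge) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and the same length")

    # Agreement: the judge and the ground truth give the same verdict.
    agree = sum(j == t for j, t in zip(judge, truth))
    # False positive: the judge passes an output the ground truth fails.
    false_pos = sum(j and not t for j, t in zip(judge, truth))
    negatives = sum(not t for t in truth)

    return {
        "agreement_rate": agree / len(truth),
        # Assumption: report 0.0 when there are no ground-truth failures.
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
    }


# Made-up verdicts for five agent outputs.
judge = [True, True, True, False, True]
truth = [True, False, True, False, False]
metrics = judge_metrics(judge, truth)
print(metrics)
```

A moderate agreement rate paired with a high false-positive rate would suggest the judge is lenient: most of its disagreements with ground truth come from passing outputs that should have failed.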

Compliance

This book complies with DLAI's Community Code of Conduct:

Non-commercial (CC-BY-NC-4.0 on prose, MIT on code)
DLAI cited; instructors John Gilhuly & Aman Khan named
No quizzes / graded assignments / lab solutions / verbatim transcripts

Takedown requests: brandon.m.behring@gmail.com (48-hour response).