Evaluating AI Agents
Study notes on the DeepLearning.AI (DLAI) course Evaluating AI Agents (John Gilhuly & Aman Khan, in partnership with Arize AI). The notes cover why LLM systems break traditional testing, decomposing agents into router/skills/memory, observability and tracing, component and trajectory evaluation, and LLM-as-judge with production monitoring. They include applied practice questions, a glossary, and spaced-recall flashcards.
Chapters
- Evaluation Foundations
Why non-determinism breaks traditional testing for LLM systems, the model-vs-system evaluation distinction, the three-layer (model / application / agent) view, decomposing an agent into router / skills / memory as independent evaluation targets, and the evaluation-driven development loop.
- Observability & Tracing
Traces and spans as the units of agent observability, the span hierarchy that mirrors agent architecture (Agent ⊃ Chain ⊃ Tool ⊃ LLM), automatic vs manual OpenTelemetry instrumentation, collecting traces in Phoenix, and turning traces into the bedrock for evaluation, cost attribution, and regression detection.
- Component Evaluations
The three evaluation techniques (code-based, LLM-as-a-judge, human annotation) and how to pick among them, why LLM judges need discrete rails rather than numeric scores, router evaluation as two independent dimensions (tool selection × parameter extraction), per-skill evaluation by output type, and Phoenix evaluation with suppress_tracing.
- Trajectory & Structured Evals
Why the path matters beyond the final answer, convergence scoring and its blind spots, the Phoenix Experiments framework (dataset → task → experiment → evaluators), comprehensive-not-exhaustive datasets, comparing agent variants on a dashboard, the production feedback flywheel, and turning wasted work into a cost.
- LLM-Judge & Monitoring
Meta-evaluation (measuring the judge against code-based ground truth), agreement rate vs false-positive rate as leniency indicators, four levers to improve a judge, golden datasets as CI shipping gates and how they evolve, production monitoring for silent regressions, and the self-improving pipeline with its feedback-poisoning risk.
Compliance
This book complies with DLAI's Community Code of Conduct:
- Non-commercial (CC-BY-NC-4.0 on prose, MIT on code)
- DLAI cited; instructors John Gilhuly & Aman Khan named
- No quizzes / graded assignments / lab solutions / verbatim transcripts
Takedown requests: brandon.m.behring@gmail.com (48-hour response).