Evaluation Foundations
Why non-determinism breaks traditional testing for LLM systems, the model-vs-system evaluation distinction, the three-layer (model / application / agent) view, decomposing an agent into router / skills / memory as independent evaluation targets, and the evaluation-driven development loop.
The evaluation problem
Building an LLM application is easy; knowing whether it
works well is the hard part. Traditional software testing rests on a guarantee
LLM systems do not honour: the same input produces the same output. Break that
guarantee and the whole apparatus of deterministic assertions — assertEqual,
golden files, a green test run as proof of correctness — stops meaning what you
think it means.
Picture a research agent: it searches the web, picks relevant sources, collects content, summarises, and iterates. Every step is an LLM call, and every call can produce different output on every run. Improve the summarisation prompt and you might quietly degrade the search step, which now receives slightly different queries. Without systematic evaluation, that regression hides until a user hits it in production.
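The testing gap described above can be made concrete with a small sketch. It assumes a hypothetical `summarise` function that stands in for an LLM call by picking one of three equivalent phrasings at random; the names `exact_match` and `states_key_facts` are also illustrative, not from the course. The point is only that a golden-answer assertion fails on outputs that are perfectly acceptable, while a criterion-based check does not.

```python
import random

REFERENCE = "Sales rose 12% in Q3."

def summarise(report: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: same input, varying wording."""
    phrasings = [
        "Sales rose 12% in Q3.",
        "In Q3, sales increased by 12%.",
        "Q3 sales were up 12%.",
    ]
    return rng.choice(phrasings)

def exact_match(output: str) -> bool:
    # The traditional assertion: equality with a single golden answer.
    return output == REFERENCE

def states_key_facts(output: str) -> bool:
    # A criterion-based check: does the output carry the facts that matter?
    return "12%" in output and "Q3" in output

rng = random.Random(0)  # seeded only so the sketch is repeatable
outputs = [summarise("quarterly sales report", rng) for _ in range(20)]
exact_rate = sum(exact_match(o) for o in outputs) / len(outputs)
criterion_rate = sum(states_key_facts(o) for o in outputs) / len(outputs)
print(f"exact match: {exact_rate:.0%}, key facts present: {criterion_rate:.0%}")
```

Every output here is a correct summary, so the criterion check passes on all of them, whereas the exact-match rate depends on which phrasing happened to be sampled.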
Model evaluation vs system evaluation
The first distinction to internalise — and a frequent interview opener — is between two layers that the word “evaluation” conflates.
LLM model evaluation measures an isolated model’s capability with standardised benchmarks — MMLU for knowledge, HumanEval for code, GSM8K for math. Model providers run these. They answer: how capable is this model in general?
LLM system evaluation measures how well your entire application performs on real-world, domain-specific data. It answers the only question that pays the bills: does my system solve the user’s actual problem?
| Dimension | Model eval | System eval |
| --- | --- | --- |
| Data source | Standardised benchmarks | Your domain data |
| What is tested | Isolated model capability | End-to-end application |
| Run frequency | Once, before model selection | On every prompt/code change |
Model eval tells you what a model *can* do; system eval tells you what your application *actually does*
EAA-1.2: A top-ranked model on MMLU can still produce a failing application if the prompts, retrieval, or tools around it are poor. Teams that evaluate only the model ship applications that “should work” but don’t. The catch: system evaluation needs representative test data, which is expensive to curate early, so start with a small synthetic test set and grow it as real traffic arrives.
State the difference between model evaluation and system evaluation, with one concrete example of each and when each is run. EAA-1.2
Model eval = isolated-model capability on standardised benchmarks (e.g. MMLU on GPT-4o), run once before selecting a model. System eval = end-to-end performance of your application on domain data (e.g. resolution rate of your support agent on 100 real tickets), run on every prompt or code change. Model eval filters candidates; system eval decides whether your system works.
Non-determinism and the testing gap
The course’s analogy is worth keeping. Testing traditional software is a train on a track: the same input follows a fixed path to a deterministic output. Testing an LLM system is driving through a busy city: the same start can take different routes depending on conditions. A binary pass/fail assertion cannot describe the second world.
Non-determinism has a concrete consequence: a prompt change that improves average quality can introduce occasional failures a single run never reveals. You sampled one path; the failure lives on a path you didn’t draw. The remedy is statistical evaluation over a representative dataset, run on every change — not one demo that happened to go well.
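One way to see why a single run says so little is to put an interval around the observed failure rate. The sketch below uses the Wilson score interval for a binomial proportion, a standard choice that is not prescribed by the course; `run_once` and its 5% failure rate are purely hypothetical stand-ins for running the system on one example.

```python
import math
import random

def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("need at least one run")
    p = failures / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

def run_once(rng: random.Random) -> bool:
    """Hypothetical system run: returns True on pass, fails about 5% of the time."""
    return rng.random() >= 0.05

# One passing run: the interval on the failure rate is still very wide.
single_low, single_high = wilson_interval(failures=0, n=1)

# Many runs: the interval narrows around the observed rate.
rng = random.Random(42)  # seeded only so the sketch is repeatable
results = [run_once(rng) for _ in range(200)]
failures = results.count(False)
low, high = wilson_interval(failures, len(results))
print(f"1 run:    failure rate in [{single_low:.2f}, {single_high:.2f}]")
print(f"200 runs: failure rate in [{low:.3f}, {high:.3f}]")
```

With zero failures in one run, the upper bound works out to z²/(1 + z²), roughly 0.79: a single green run is statistically compatible with a system that fails most of the time.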
Why is one passing test run insufficient evidence that an LLM system change is safe? EAA-1.3
Because identical inputs can yield different outputs across runs, a single run samples one of many possible behaviours. A change can raise average quality while adding rare failures the one run didn’t surface. Only statistical evaluation over a representative dataset estimates the real failure rate.
What is an AI agent?
An AI agent is a software system that takes actions on a user’s behalf using reasoning. Concretely, it composes three capabilities: reasoning (the LLM), routing (choosing a tool or skill), and action (executing it), then loops on the result. That action-taking-via-reasoning loop is what the ReAct line of work formalised.
Define an AI agent and name its three core capabilities. EAA-1.1
An AI agent is a software system that takes actions on a user’s behalf using reasoning. Its three capabilities are reasoning (LLM decides what to do), routing (selects which tool/skill to call), and action (executes the tool), iterating on the result. The reasoning-driven choice of what next is what separates it from a fixed prompt chain.
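The reason, route, act loop can be sketched in a few lines. Everything here is a hypothetical stand-in: `reason` uses simple rules where a real agent would call an LLM, and the two skills return canned strings. The sketch only shows the shape of the loop and the trajectory it leaves behind.

```python
from typing import Callable, Optional

# Hypothetical skills; real ones would query a database or call an LLM.
SKILLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda goal: "rows: 3",
    "analyse": lambda goal: "sales trend is upward",
}

def reason(goal: str, history: list[tuple[str, str]]) -> str:
    """Stand-in for the LLM: decide what is needed next from the history."""
    done = {skill for skill, _ in history}
    if "lookup" not in done:
        return "need data"
    if "analyse" not in done:
        return "need analysis"
    return "finished"

def route(intent: str) -> Optional[str]:
    """Map the reasoning output to a skill name, or None to stop."""
    return {"need data": "lookup", "need analysis": "analyse"}.get(intent)

def run_agent(goal: str, max_steps: int = 5) -> list[tuple[str, str]]:
    history: list[tuple[str, str]] = []
    for _ in range(max_steps):           # bounded loop guards against cycling
        intent = reason(goal, history)   # reasoning
        skill = route(intent)            # routing
        if skill is None:
            break
        result = SKILLS[skill](goal)     # action
        history.append((skill, result))  # the result feeds the next iteration
    return history

trajectory = run_agent("How did sales change last quarter?")
print([skill for skill, _ in trajectory])
```

The returned `trajectory` is what later evaluation would inspect: which skills were chosen, in what order, and with what results.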
The three layers of evaluation
Agents add a third layer of evaluation difficulty on top of model and application. Keeping the three-layer view in mind lets you place any evaluation activity correctly: name the layer first, and you avoid applying a model-layer benchmark to an agent-layer failure (the skill that EAA-1.7 and EAA-1.8 below exercise).
| Layer | What | Evaluation challenge |
| --- | --- | --- |
| 1 — Model | The isolated LLM | Fixed benchmarks; handled by the provider |
| 2 — Application | Your prompts + retrieval | Domain-specific, end-to-end |
| 3 — Agent | Multi-step routing | Dynamic, variable paths, state across steps |
Classify each as Layer 1, 2, or 3: (i) running MMLU on the base model, (ii) NDCG@5 on your RAG retrieval, (iii) measuring whether the agent picks the right tool. EAA-1.7
(i) Layer 1 (model) — a standardised benchmark on the isolated LLM. (ii) Layer 2 (application) — your domain-specific retrieval pipeline, end to end. (iii) Layer 3 (agent) — a routing decision, unique to multi-step agents. Placing each at its layer is what keeps you from testing the wrong thing.
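For readers unfamiliar with the Layer 2 metric named in item (ii), here is a minimal sketch of NDCG@k. It assumes the linear-gain variant and computes the ideal ordering from the graded relevance labels of the retrieved list only; other formulations (exponential gain, ideal ranking over the full corpus) exist, and the course may use a different one.

```python
import math

def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: relevance discounted by log2(rank + 1)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances: list[float], k: int = 5) -> float:
    """DCG normalised by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Graded relevance of retrieved documents, in the order the retriever returned them.
perfect_order = [3, 2, 1, 0, 0]
relevant_doc_second = [0, 1]

print(ndcg_at_k(perfect_order))
print(round(ndcg_at_k(relevant_doc_second), 3))
```

A perfectly ordered list scores 1.0; pushing the only relevant document from rank 1 to rank 2 lowers the score to 1/log2(3), about 0.631.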
A colleague proposes catching a tool-misrouting bug by benchmarking the base model on MMLU. Why is that the wrong layer, and where does the failure actually live? EAA-1.8
Misrouting is an agent-layer (routing) failure; MMLU measures model-layer knowledge and never exercises the routing step, so it can’t observe the bug. The measurement must live at the same layer as the failure — evaluate routing with a labelled message→expected-skill set, not a knowledge benchmark.
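A labelled message→expected-skill set of the kind mentioned in the answer can be scored with very little code. In this sketch the `route` function is a hypothetical rules-based router standing in for the agent's real routing step, and the five labelled messages are invented for illustration.

```python
from typing import Callable

def route(message: str) -> str:
    """Hypothetical rules-based router; a real one might use LLM function calling."""
    text = message.lower()
    if "chart" in text or "plot" in text:
        return "visualise"
    if "why" in text or "trend" in text:
        return "analyse"
    return "lookup"

# Invented labelled set: (user message, expected skill).
LABELLED = [
    ("Plot revenue by month", "visualise"),
    ("Why did sales dip in March?", "analyse"),
    ("List orders over $500", "lookup"),
    ("Show me the trend as a chart", "visualise"),
    ("What explains the drop in returns?", "analyse"),
]

def routing_report(router: Callable[[str], str], dataset: list[tuple[str, str]]):
    """Return routing accuracy plus every (message, expected, got) miss."""
    misses = []
    for message, expected in dataset:
        got = router(message)  # call once per example; routers may be non-deterministic
        if got != expected:
            misses.append((message, expected, got))
    accuracy = 1 - len(misses) / len(dataset)
    return accuracy, misses

accuracy, misses = routing_report(route, LABELLED)
print(f"routing accuracy: {accuracy:.0%}")
for message, expected, got in misses:
    print(f"  misrouted {message!r}: expected {expected}, got {got}")
```

The last message is misrouted to `lookup`, which is exactly the kind of failure a knowledge benchmark on the base model would never exercise.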
Decomposing the agent
Decomposition is not architectural tidiness — it is an evaluation strategy. Each component has different failure modes and therefore different evaluation criteria, so each becomes an independent target you can measure and fix in isolation.
- The router is the brain. It decides which skill or function to call. Routers span a spectrum from rules-based code (deterministic, limited) to LLM function calling (flexible, non-deterministic).
- Skills are the hands. Each is a capability — query a database, generate analysis, build a chart. A skill may itself contain several sub-steps, and each sub-step is its own evaluation target. An entire LLM application can be a single skill inside a larger agent.
- Memory and state is the notebook. Retrieved context, configuration, and the log of prior steps that later decisions depend on.
The cycle — router → skill → memory update → router — makes every transition an evaluation boundary.
Centralised vs distributed routing
In centralised routing a single router makes every decision: easy to trace, easy to evaluate, but a bottleneck for complex flows. In distributed routing (LangGraph, OpenAI Swarm-style graphs) routing logic is spread across the graph: more flexible, but harder to evaluate because routing decisions happen at many points, each a separate place a wrong turn can originate.
Compare centralised and distributed routing for evaluation coverage. When is the harder-to-evaluate option still the right choice? EAA-1.6
Centralised routing has a single decision point — easy to trace and to build one routing dataset for, but a bottleneck for complex flows. Distributed routing spreads decisions across many nodes, so evaluation must cover every node, not one — more work, but the right call when the workflow is genuinely too complex or parallel for a single router. The coverage cost is the price of the flexibility.
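The coverage cost of distributed routing shows up as soon as accuracy is broken down by node. The sketch below assumes routing decisions have already been logged and labelled as `(node, chosen_edge, expected_edge)` tuples; the node and edge names are invented, and no particular graph framework's API is used.

```python
from collections import defaultdict

# Hypothetical labelled routing decisions: (node, chosen_edge, expected_edge).
decisions = [
    ("triage", "billing", "billing"),
    ("triage", "tech", "tech"),
    ("billing", "refund", "refund"),
    ("billing", "refund", "invoice"),
    ("tech", "escalate", "escalate"),
]

def per_node_accuracy(records: list[tuple[str, str, str]]) -> dict[str, float]:
    """Routing accuracy computed separately for every decision point."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for node, chosen, expected in records:
        totals[node] += 1
        correct[node] += int(chosen == expected)
    return {node: correct[node] / totals[node] for node in totals}

by_node = per_node_accuracy(decisions)
overall = sum(c == e for _, c, e in decisions) / len(decisions)
print(f"overall: {overall:.0%}")
for node, acc in by_node.items():
    print(f"  {node}: {acc:.0%}")
```

With a centralised router there would be a single row; here the overall figure hides the fact that all the errors come from the `billing` node, which is why each node needs its own labelled examples.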
The course’s running example is a data-analysis assistant over a sales database, with three skills — data lookup (LLM-generated SQL run on DuckDB), data analysis (a single LLM call returning markdown), and data visualisation (two sequential calls: chart-config extraction, then code generation). Note that visualisation is itself two evaluation targets, illustrating how a single “skill” decomposes further into sub-steps you score independently.
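Scoring the two visualisation sub-steps independently might look like the sketch below. Both checks are assumptions chosen for illustration: the config step is scored by field-level agreement with an expected config, and the code-generation step only by whether its output parses as Python. The example config and generated code are invented.

```python
import ast

def score_chart_config(predicted: dict, expected: dict) -> float:
    """Fraction of expected config fields the extraction step got right."""
    if not expected:
        return 1.0
    hits = sum(predicted.get(key) == value for key, value in expected.items())
    return hits / len(expected)

def code_is_valid_python(source: str) -> bool:
    """Cheap correctness proxy for the code-generation step: does it parse?"""
    try:
        ast.parse(source)  # parses only; nothing is imported or executed
        return True
    except SyntaxError:
        return False

# Invented outputs from the two sub-steps for one test case.
predicted_config = {"chart_type": "bar", "x": "month", "y": "sales"}
expected_config = {"chart_type": "line", "x": "month", "y": "sales"}
generated_code = "import matplotlib.pyplot as plt\nplt.plot(months, sales)\n"

config_score = score_chart_config(predicted_config, expected_config)
code_ok = code_is_valid_python(generated_code)
print(f"config extraction: {config_score:.2f}, generated code parses: {code_ok}")
```

Here the code-generation step passes while the extraction step gets the chart type wrong, so the fix belongs in the first sub-step. A parse check is deliberately weak; executing the code in a sandbox would be a stricter criterion.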
Decomposition turns one opaque failure into localisable ones
EAA-1.4: End-to-end evaluation tells you that the agent failed. Component-wise evaluation tells you where and why. The router is scored on skill-selection accuracy, each skill on output quality, each sub-step on its own correctness criterion. The caveat: component evaluation can miss interaction effects (each piece passes, the composition still fails), so you need both component and end-to-end views.
Name the three components of agent decomposition and why each is a separate evaluation target. EAA-1.4
Router (chooses the skill — scored on selection accuracy), skills (execute capabilities — scored on output quality, sub-step by sub-step), and memory/state (carried context — scored on whether the right information persists). Each has distinct failure modes, so evaluating them separately localises a failure that end-to-end scoring would only report as “the agent was wrong.”
Evaluation-driven development
The difference from “testing” is when it runs. Evaluation-driven development treats eval runs as the primary feedback loop during development — not a gate you clear right before deployment. Teams that evaluate only before shipping catch regressions too late to be cheap. The supporting toolkit is small and familiar: trace instrumentation, an eval runner (often including LLM-as-a-judge), datasets for repeatable experiments, human annotation, and a prompt playground for fast iteration — all introduced in the coming chapters.
List the three steps of the evaluation-driven development loop and what makes it different from testing-before-deploy. EAA-1.5
Trace (capture each step’s inputs/outputs), evaluate (score each component against its criteria), iterate (fix the worst-scoring component) — repeated on every change. Unlike a pre-deploy gate, it runs continuously during development, so regressions surface immediately instead of after they ship.
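The evaluate and iterate steps of the loop reduce to aggregating scores per component and picking the weakest one. This sketch assumes traces have already been captured and scored; the record shape, component names, and scores are hypothetical.

```python
from statistics import mean

# Hypothetical scored trace records: one per component step, score in [0, 1].
trace_scores = [
    {"component": "router", "score": 1.0},
    {"component": "lookup", "score": 0.5},
    {"component": "router", "score": 1.0},
    {"component": "analysis", "score": 0.75},
    {"component": "lookup", "score": 0.25},
]

def component_means(records: list[dict]) -> dict[str, float]:
    """Evaluate: average the scores of each component across the trace."""
    by_component: dict[str, list[float]] = {}
    for record in records:
        by_component.setdefault(record["component"], []).append(record["score"])
    return {name: mean(scores) for name, scores in by_component.items()}

means = component_means(trace_scores)
worst = min(means, key=means.get)  # Iterate: this is the component to fix next
print(means)
print(f"fix next: {worst}")
```

Re-running the same aggregation after each change is what turns the three steps into a loop: the worst-scoring component should improve, and no other component should regress.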
Summary — what this sets up
The foundation is laid: LLM systems are non-deterministic, so evaluation must be statistical and continuous, not binary and final. The model/application/agent layering tells you where a question lives; the router/skills/memory decomposition tells you what to measure; evaluation-driven development tells you when (always). The rest of the guide builds the machinery:
- Chapter 2 — observability and tracing: how to capture the execution you need to evaluate.
- Chapter 3 — component evaluations: metrics for routers, skills, and tool calls.
- Chapter 4 — trajectory and structured evals: scoring the whole path, not just the endpoints.
- Chapter 5 — LLM-as-a-judge and monitoring: automated grading and keeping it honest in production.