Part 1 Chapter 1 Last verified 2026-06-19

Evaluation Foundations

Why non-determinism breaks traditional testing for LLM systems, the model-vs-system evaluation distinction, the three-layer (model / application / agent) view, decomposing an agent into router / skills / memory as independent evaluation targets, and the evaluation-driven development loop.

On this page

The evaluation problem
Model evaluation vs system evaluation
Non-determinism and the testing gap
What is an AI agent?
The three layers of evaluation
Decomposing the agent
Centralised vs distributed routing
Evaluation-driven development
Summary — what this sets up

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Decomposing the agent; if any is shaky, read closely — each is developed below.

Before reading, commit to an answer for each — you will check them against the chapter.

You change one prompt and an unrelated part of your agent regresses. Predict why a single passing test run would miss this.
A model tops the MMLU leaderboard. Estimate how much that tells you about whether it will work in your application. High, some, or almost none?
A four-step agent is 90% reliable at every step. Guess the end-to-end success rate — above 80%, around 65%, or below 50%?
If you could evaluate only one component of a tool-using agent first, predict which one returns the most signal per unit effort.

Check your answers

The single run sampled one of many possible paths; the regression shows up only on the other inputs/samples you didn’t draw. Non-determinism makes one run a sample, not a proof — you need statistical evaluation over a dataset.
Almost none, directly. Benchmarks measure isolated-model capability; your app’s success depends on prompts, retrieval, and tools. Use benchmarks to shortlist models, not to decide.
Around 65%: $0.9^4 \approx 0.66$. Per-step reliability compounds multiplicatively, so components that each “look fine” can still add up to a failing system.
The router. If it picks the wrong skill, every downstream step is wasted — routing errors cascade and are the cheapest to catch early.

The evaluation problem

Building an LLM application is easy; knowing whether it works well is the hard part. Traditional software testing rests on a guarantee LLM systems do not honour: the same input produces the same output. Break that guarantee and the whole apparatus of deterministic assertions — assertEqual, golden files, a green test run as proof of correctness — stops meaning what you think it means.

Picture a research agent: it searches the web, picks relevant sources, collects content, summarises, and iterates. Every step is an LLM call, and every call can produce different output on every run. Improve the summarisation prompt and you might quietly degrade the search step, which now receives slightly different queries. Without systematic evaluation, that regression hides until a user hits it in production.

Model evaluation vs system evaluation

The first distinction to internalise — and a frequent interview opener — is between two layers that the word “evaluation” conflates.

LLM model evaluation measures an isolated model’s capability with standardised benchmarks — MMLU for knowledge, HumanEval for code, GSM8K for math. Model providers run these. They answer: how capable is this model in general?

LLM system evaluation measures how well your entire application performs on real-world, domain-specific data. It answers the only question that pays the bills: does my system solve the user’s actual problem?

| Dimension | Model eval | System eval |
| --- | --- | --- |
| Data source | Standardised benchmarks | Your domain data |
| What is tested | Isolated model capability | End-to-end application |
| Run frequency | Once, before model selection | On every prompt/code change |

Key concept

Model eval tells you what a model can do; system eval tells you what your application actually does

EAA-1.2

A top-ranked model on MMLU can still produce a failing application if the prompts, retrieval, or tools around it are poor. Teams that evaluate only the model ship applications that “should work” but don’t. The catch: system evaluation needs representative test data, which is expensive to curate early — so start with a small synthetic test set and grow it as real traffic arrives. [V] Verified

State the difference between model evaluation and system evaluation, with one concrete example of each and when each is run. EAA-1.2

Model eval = isolated-model capability on standardised benchmarks (e.g. MMLU on GPT-4o), run once before selecting a model. System eval = end-to-end performance of your application on domain data (e.g. resolution rate of your support agent on 100 real tickets), run on every prompt or code change. Model eval filters candidates; system eval decides whether your system works.

Non-determinism and the testing gap

The course’s analogy is worth keeping. Testing traditional software is a train on a track: the same input follows a fixed path to a deterministic output. Testing an LLM system is driving through a busy city: the same start can take different routes depending on conditions. A binary pass/fail assertion cannot describe the second world.

Non-determinism has a concrete consequence: a prompt change that improves average quality can introduce occasional failures a single run never reveals. You sampled one path; the failure lives on a path you didn’t draw. The remedy is statistical evaluation over a representative dataset, run on every change — not one demo that happened to go well.
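To make "statistical evaluation over a representative dataset" concrete, here is a minimal sketch that runs each input several times and reports a pass rate with a Wilson score interval. The names `estimate_pass_rate`, `flaky_system`, and `exact_grader` are hypothetical stand-ins, not part of any library, and the 90% behaviour of the stand-in system is an assumption chosen purely for illustration.

```python
import math
import random
from typing import Callable, Sequence, Tuple


def wilson_interval(passes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = passes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def estimate_pass_rate(
    system: Callable[[str], str],
    grader: Callable[[str, str], bool],
    dataset: Sequence[str],
    runs_per_input: int = 5,
) -> Tuple[float, float, float]:
    """Run every input several times; return (pass rate, interval low, interval high)."""
    passes = trials = 0
    for item in dataset:
        for _ in range(runs_per_input):
            output = system(item)  # each call may take a different path
            passes += bool(grader(item, output))
            trials += 1
    low, high = wilson_interval(passes, trials)
    return passes / trials, low, high


# Hypothetical stand-in for an LLM system (assumption: right about 90% of the time).
_rng = random.Random(0)


def flaky_system(question: str) -> str:
    return "correct" if _rng.random() < 0.9 else "wrong"


def exact_grader(question: str, output: str) -> bool:
    return output == "correct"


rate, low, high = estimate_pass_rate(
    flaky_system, exact_grader, ["q1", "q2", "q3", "q4"], runs_per_input=25
)
print(f"pass rate {rate:.2f}, approx. 95% interval [{low:.2f}, {high:.2f}]")
```

The interval is the point of the exercise: a single run gives you one sample, while repeated runs over a dataset give you an estimate whose uncertainty you can see.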

Why is one passing test run insufficient evidence that an LLM system change is safe? EAA-1.3

Because identical inputs can yield different outputs across runs, a single run samples one of many possible behaviours. A change can raise average quality while adding rare failures the one run didn’t surface. Only statistical evaluation over a representative dataset estimates the real failure rate.

What is an AI agent?

An AI agent is a software system that takes actions on a user’s behalf using reasoning. Concretely, it composes three capabilities: reasoning (the LLM), routing (choosing a tool or skill), and action (executing it), then loops on the result. That action-taking-via-reasoning loop is what the ReAct line of work formalised. [V] Verified

Define an AI agent and name its three core capabilities. EAA-1.1

An AI agent is a software system that takes actions on a user’s behalf using reasoning. Its three capabilities are reasoning (LLM decides what to do), routing (selects which tool/skill to call), and action (executes the tool), iterating on the result. The reasoning-driven choice of what next is what separates it from a fixed prompt chain.

The three layers of evaluation

Agents add a third layer of evaluation difficulty on top of model and application. Naming the layer first lets you place any evaluation activity correctly, and it stops you from applying a model-layer benchmark to an agent-layer failure (the skill that EAA-1.7 and EAA-1.8 below exercise).

| Layer | What | Evaluation challenge |
| --- | --- | --- |
| 1 (Model) | The isolated LLM | Fixed benchmarks; handled by the provider |
| 2 (Application) | Your prompts + retrieval | Domain-specific, end-to-end |
| 3 (Agent) | Multi-step routing | Dynamic, variable paths, state across steps |

Classify each as Layer 1, 2, or 3: (i) running MMLU on the base model, (ii) NDCG@5 on your RAG retrieval, (iii) measuring whether the agent picks the right tool. EAA-1.7

(i) Layer 1 (model) — a standardised benchmark on the isolated LLM. (ii) Layer 2 (application) — your domain-specific retrieval pipeline, end to end. (iii) Layer 3 (agent) — a routing decision, unique to multi-step agents. Placing each at its layer is what keeps you from testing the wrong thing.

A colleague proposes catching a tool-misrouting bug by benchmarking the base model on MMLU. Why is that the wrong layer, and where does the failure actually live? EAA-1.8

Misrouting is an agent-layer (routing) failure; MMLU measures model-layer knowledge and never exercises the routing step, so it can’t observe the bug. The measurement must live at the same layer as the failure — evaluate routing with a labelled message→expected-skill set, not a knowledge benchmark.

Decomposing the agent

Decomposition is not architectural tidiness — it is an evaluation strategy. Each component has different failure modes and therefore different evaluation criteria, so each becomes an independent target you can measure and fix in isolation.

The router is the brain. It decides which skill or function to call. Routers span a spectrum from rules-based code (deterministic, limited) to LLM function calling (flexible, non-deterministic). [V] Verified
Skills are the hands. Each is a capability — query a database, generate analysis, build a chart. A skill may itself contain several sub-steps, and each sub-step is its own evaluation target. An entire LLM application can be a single skill inside a larger agent.
Memory and state is the notebook: retrieved context, configuration, and the log of prior steps that later decisions depend on.

The cycle — router → skill → memory update → router — makes every transition an evaluation boundary.
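The cycle can be sketched in a few lines to show where the evaluation boundaries fall. This is an illustrative toy, not the course's implementation: `State`, `rule_router`, `SKILLS`, and `run_agent` are hypothetical names, and the router here sits at the rules-based end of the spectrum so the example stays deterministic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class State:
    """Memory/state: the query plus the log of (skill, output) steps so far."""
    query: str
    steps: List[Tuple[str, str]] = field(default_factory=list)


def rule_router(state: State) -> str:
    """Hypothetical rules-based router: pick the next skill from the step log."""
    if not state.steps:
        return "lookup"
    if state.steps[-1][0] == "lookup":
        return "analyse"
    return "done"


# Hypothetical skills; a real skill would call a database or an LLM.
SKILLS: Dict[str, Callable[[State], str]] = {
    "lookup": lambda s: f"rows for {s.query!r}",
    "analyse": lambda s: f"summary of {s.steps[-1][1]}",
}


def run_agent(
    query: str,
    router: Callable[[State], str] = rule_router,
    skills: Dict[str, Callable[[State], str]] = SKILLS,
    max_steps: int = 5,
) -> State:
    state = State(query)
    for _ in range(max_steps):
        choice = router(state)                # boundary 1: routing decision
        if choice == "done":
            break
        output = skills[choice](state)        # boundary 2: skill output
        state.steps.append((choice, output))  # boundary 3: memory update
    return state


final = run_agent("Q3 sales by region")
print([name for name, _ in final.steps])
```

Because the router, each skill, and the state are separate callables and values, each one can be called and scored on its own without running the whole loop.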

Centralised vs distributed routing

In centralised routing a single router makes every decision: easy to trace, easy to evaluate, but a bottleneck for complex flows. In distributed routing (LangGraph, OpenAI Swarm-style graphs) routing logic is spread across the graph: more flexible, but harder to evaluate because routing decisions happen at many points, each a separate place a wrong turn can originate.

Compare centralised and distributed routing for evaluation coverage. When is the harder-to-evaluate option still the right choice? EAA-1.6

Centralised routing has a single decision point — easy to trace and to build one routing dataset for, but a bottleneck for complex flows. Distributed routing spreads decisions across many nodes, so evaluation must cover every node, not one — more work, but the right call when the workflow is genuinely too complex or parallel for a single router. The coverage cost is the price of the flexibility.
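One way to keep the coverage cost visible is to check that every routing node has labelled examples before trusting a routing score. The sketch below assumes a hypothetical data shape (each example is a dict with a `"node"` key) and hypothetical node names; it is not tied to LangGraph or any other framework's API.

```python
from typing import Dict, Iterable, List


def uncovered_routing_nodes(
    routing_nodes: Iterable[str],
    labelled_examples: Iterable[Dict[str, str]],
    min_examples: int = 1,
) -> List[str]:
    """Return routing nodes that have fewer than min_examples labelled cases."""
    counts = {node: 0 for node in routing_nodes}
    for example in labelled_examples:
        node = example["node"]
        if node in counts:
            counts[node] += 1
    return sorted(node for node, n in counts.items() if n < min_examples)


# Hypothetical distributed graph with three routing decision points.
nodes = ["triage", "billing_router", "refund_router"]
examples = [
    {"node": "triage", "message": "I was charged twice", "expected": "billing_router"},
    {"node": "billing_router", "message": "I was charged twice", "expected": "refund_router"},
]
print(uncovered_routing_nodes(nodes, examples))
```

With centralised routing the node list has a single entry, so one routing dataset closes the gap; with distributed routing the list grows with the graph.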

The course’s running example is a data-analysis assistant over a sales database, with three skills — data lookup (LLM-generated SQL run on DuckDB), data analysis (a single LLM call returning markdown), and data visualisation (two sequential calls: chart-config extraction, then code generation). Note that visualisation is itself two evaluation targets, illustrating how a single “skill” decomposes further into sub-steps you score independently.

Why component reliability compounds Worked example

Problem. An agent answers a query in one route step plus one skill step. The router picks the right skill 90% of the time; the skill produces a correct output 90% of the time. What is the end-to-end success rate? Now extend to a four-step agent at 90% per step.

Reasoning. End-to-end success requires every step to succeed. Assuming the steps fail independently, the probabilities multiply:

$P(\text{success}) = P(\text{route}) \times P(\text{skill}) = 0.9 \times 0.9 = 0.81.$

For a four-step chain at 90% each:

$P(\text{success}) = 0.9^4 \approx 0.66.$

Answer. Two “90% — looks fine” components yield an agent that fails roughly 1 in 5 times; four such steps fail 1 in 3. The lesson: per-component quality compounds, so end-to-end-only evaluation hides where the loss comes from, and you need high per-component reliability before a multi-step agent is trustworthy. This multiplicative reality is the quantitative case for component-wise evaluation.
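The arithmetic above is easy to check in code, and the same formula can be inverted to ask how reliable each step must be to hit an end-to-end target. This is a small sketch under the same independence assumption as the worked example; the function names are illustrative.

```python
import math
from typing import Sequence


def end_to_end_success(step_rates: Sequence[float]) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return math.prod(step_rates)


def required_per_step(target: float, n_steps: int) -> float:
    """Equal per-step reliability needed to reach an end-to-end target."""
    return target ** (1 / n_steps)


print(round(end_to_end_success([0.9, 0.9]), 4))   # route step + skill step
print(round(end_to_end_success([0.9] * 4), 4))    # four-step agent
print(round(required_per_step(0.9, 4), 3))        # per-step rate for 90% overall
```

The last line is the useful one in practice: a four-step agent that should succeed 90% of the time end to end needs each step at roughly 97.4%, not 90%.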

Key concept

Decomposition turns one opaque failure into localisable ones

EAA-1.4

End-to-end evaluation tells you that the agent failed. Component-wise evaluation tells you where and why. The router is scored on skill-selection accuracy, each skill on output quality, each sub-step on its own correctness criterion. The caveat: component evaluation can miss interaction effects (each piece passes, the composition still fails), so you need both component and end-to-end views.

Name the three components of agent decomposition and why each is a separate evaluation target. EAA-1.4

Router (chooses the skill — scored on selection accuracy), skills (execute capabilities — scored on output quality, sub-step by sub-step), and memory/state (carried context — scored on whether the right information persists). Each has distinct failure modes, so evaluating them separately localises a failure that end-to-end scoring would only report as “the agent was wrong.”

Evaluation-driven development

The difference from “testing” is when it runs. Evaluation-driven development treats eval runs as the primary feedback loop during development, not a gate you clear right before deployment. The loop has three steps, repeated on every change: trace (capture each step’s inputs and outputs), evaluate (score each component against its criteria), and iterate (fix the worst-scoring component). Teams that evaluate only before shipping catch regressions after they have become expensive to fix. The supporting toolkit is small and familiar: trace instrumentation, an eval runner (often including LLM-as-a-judge), datasets for repeatable experiments, human annotation, and a prompt playground for fast iteration, all introduced in the coming chapters.

Decompose and prioritise a support agent Worked example

Problem. A customer-support agent can (1) look up order status, (2) check return eligibility, (3) escalate to a human, and (4) draft a response. It uses GPT-4o function calling for routing. Identify the router, skills, and memory/state — then decide which component to evaluate first, and how.

Reasoning.

Router — the GPT-4o function-calling step choosing among the four skills.
Skills — order lookup, return-eligibility check, escalation, response drafting.
Memory/state — the full conversation history passed between turns.
Evaluate first — the router. If it selects the wrong skill, every downstream step is wasted effort, so routing errors are both the most common origin of agent failure and the cheapest to catch. Build a labelled set of customer messages paired with the expected skill, then measure per-category routing accuracy.

Answer. Router = GPT-4o function calling; skills = the four functions; memory = conversation history. Evaluate the router first with a labelled message→expected-skill dataset and per-category accuracy, because routing errors cascade and cost the most to debug downstream.
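A labelled message→expected-skill set with per-category accuracy might look like the sketch below. The keyword router is a deliberately crude, hypothetical stand-in for the GPT-4o function-calling step so the example runs without an API; the skill names and messages are invented for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple


def routing_accuracy_by_category(
    router: Callable[[str], str],
    labelled: Iterable[Tuple[str, str]],
) -> Dict[str, float]:
    """Accuracy of the router, grouped by the expected skill."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for message, expected in labelled:
        totals[expected] += 1
        if router(message) == expected:
            correct[expected] += 1
    return {skill: correct[skill] / totals[skill] for skill in totals}


def keyword_router(message: str) -> str:
    """Hypothetical stand-in for an LLM router (assumption, not the real system)."""
    text = message.lower()
    if "return" in text or "refund" in text:
        return "check_return_eligibility"
    if "order" in text:
        return "lookup_order_status"
    if "human" in text or "agent" in text:
        return "escalate_to_human"
    return "draft_response"


# Hypothetical labelled set: (customer message, expected skill).
LABELLED = [
    ("Where is my order?", "lookup_order_status"),
    ("Can I return these shoes?", "check_return_eligibility"),
    ("I want to return my order", "check_return_eligibility"),
    ("Let me talk to a person", "escalate_to_human"),
    ("Get me a human now", "escalate_to_human"),
    ("Thanks, that helps", "draft_response"),
]

for skill, acc in sorted(routing_accuracy_by_category(keyword_router, LABELLED).items()):
    print(f"{skill}: {acc:.2f}")
```

Grouping by expected skill matters because an overall accuracy figure can hide a category that fails badly; here the escalation category is the weak one, which an aggregate number would blur.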

List the three steps of the evaluation-driven development loop and what makes it different from testing-before-deploy. EAA-1.5

Trace (capture each step’s inputs/outputs), evaluate (score each component against its criteria), iterate (fix the worst-scoring component) — repeated on every change. Unlike a pre-deploy gate, it runs continuously during development, so regressions surface immediately instead of after they ship.
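The trace, evaluate, iterate steps can be sketched as a small scoring pass over captured traces. The trace format (a list of step dicts with `component`, `output`, and `expected` keys) and the exact-match evaluator are assumptions made for this example; real traces and criteria will be richer.

```python
from statistics import mean
from typing import Callable, Dict, List

Step = Dict[str, str]
Trace = List[Step]


def score_components(
    traces: List[Trace],
    evaluators: Dict[str, Callable[[Step], float]],
) -> Dict[str, float]:
    """Evaluate: average each component's score over every traced step."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for trace in traces:
        for step in trace:
            evaluator = evaluators.get(step["component"])
            if evaluator is not None:
                scores[step["component"]].append(evaluator(step))
    return {name: mean(vals) for name, vals in scores.items() if vals}


def worst_component(component_scores: Dict[str, float]) -> str:
    """Iterate: the lowest-scoring component is the next thing to fix."""
    return min(component_scores, key=component_scores.get)


def exact_match(step: Step) -> float:
    """Hypothetical criterion: 1.0 if the output equals the expected value."""
    return 1.0 if step["output"] == step["expected"] else 0.0


# Trace: two hypothetical runs, each with a router step and a skill step.
TRACES: List[Trace] = [
    [
        {"component": "router", "output": "lookup", "expected": "lookup"},
        {"component": "skill", "output": "42 rows", "expected": "42 rows"},
    ],
    [
        {"component": "router", "output": "lookup", "expected": "lookup"},
        {"component": "skill", "output": "0 rows", "expected": "17 rows"},
    ],
]

scores = score_components(TRACES, {"router": exact_match, "skill": exact_match})
print(scores, "-> fix first:", worst_component(scores))
```

Rerunning the same pass after each prompt or code change is what turns this from a one-off check into the development loop.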

Summary — what this sets up

The foundation is laid: LLM systems are non-deterministic, so evaluation must be statistical and continuous, not binary and final. The model/application/agent layering tells you where a question lives; the router/skills/memory decomposition tells you what to measure; evaluation-driven development tells you when (always). The rest of the guide builds the machinery:

Chapter 2 — observability and tracing: how to capture the execution you need to evaluate.
Chapter 3 — component evaluations: metrics for routers, skills, and tool calls.
Chapter 4 — trajectory and structured evals: scoring the whole path, not just the endpoints.
Chapter 5 — LLM-as-a-judge and monitoring: automated grading and keeping it honest in production.