Part 1 Chapter 4 Last verified 2026-06-19

Trajectory & Structured Evals

Why the path matters beyond the final answer, convergence scoring and its blind spots, the Phoenix Experiments framework (dataset → task → experiment → evaluators), comprehensive-not-exhaustive datasets, comparing agent variants on a dashboard, the production feedback flywheel, and turning wasted work into a cost.

On this page

What is a trajectory?
Convergence scoring
The Phoenix Experiments framework
Designing the dataset
Comparing variants on the dashboard
The production feedback flywheel
A cautionary tale
Summary — what this sets up

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Convergence scoring; if any is shaky, read closely — each is developed below.

Predict before reading — check each against the chapter.

Two agents return the same correct answer; one takes 3 steps, the other 9. Predict why an evaluation that only checks the final answer is missing something important.
An agent always takes the exact same 5-step path. Guess what a “convergence score” of 1.0 does and doesn’t tell you about correctness.
Variant V3 wins on routing and SQL accuracy but loses on clarity and latency. Predict whether a single aggregate score should decide the ship call.
Of 12 runs, the minimum is 4 steps and the total is 58 steps. Estimate the fraction of steps that were “wasted.”

Check your answers

The 9-step path costs ~3× the LLM calls, adds latency, and introduces more stochastic-failure opportunities — “correct” hides “wasteful,” which can make an agent unviable at scale.
Convergence 1.0 means every completed run finished in the same (minimum) number of steps: consistency, not correctness. The agent can be consistently wrong, so pair convergence with per-step correctness evals.
No — a single number hides the trade-off. Decide on the full row plus cost, latency, and statistical significance.
Optimal = $12 \times 4 = 48$; wasted = $58 - 48 = 10$; $10/58 \approx \mathbf{17\%}$.

What is a trajectory?

A trajectory is the ordered sequence of all steps — router decisions, tool calls, LLM completions — from input to final response. It captures not just what the agent produced but how it got there.

Two agents can reach the same correct answer through radically different paths. [V] Verified The longer path means more LLM calls, more latency, more cost, and more chances for a stochastic failure — so at scale the trajectory, not just the answer, decides viability.
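To make the definition concrete, here is a minimal sketch of a trajectory as a data structure. The `Step` and `Trajectory` classes and the step names are hypothetical, chosen only for illustration; a real tracing system will record far more per step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Step:
    kind: str  # "router", "tool", or "llm" (illustrative categories)
    name: str  # which router decision, tool, or completion ran


@dataclass
class Trajectory:
    query: str
    steps: List[Step] = field(default_factory=list)
    completed: bool = True  # False if the run errored part-way

    def step_count(self) -> int:
        return len(self.steps)


# Two hypothetical runs that end in the same answer via different paths.
short = Trajectory(
    "return item",
    [Step("router", "route"), Step("tool", "lookup_order"), Step("llm", "respond")],
)
long = Trajectory(
    "return item",
    [Step("router", "route")]
    + [Step("tool", "lookup_order"), Step("llm", "retry")] * 3
    + [Step("tool", "lookup_order"), Step("llm", "respond")],
)

print(short.step_count(), long.step_count())  # 3 9
```

An evaluator that only sees the final response cannot tell these two runs apart; one that sees `steps` can.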

Convergence scoring

The convergence score measures how often the agent takes the shortest observed path — by step count — across similar queries (so two different paths of the same length both count as “converged”; it checks count, not path identity):

$\text{convergence} = \frac{\#\{\text{runs at the minimum step count}\}}{\#\{\text{completed runs}\}} \in [0, 1].$

A rough reading: ≥ 0.8 is stable, 0.5–0.8 warrants investigation, below 0.5 is unreliable.
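The formula and the rough reading above can be sketched in a few lines. This is an illustration under stated assumptions: each run is represented as a `(step_count, completed)` pair, and the function names `convergence` and `reading` are made up for this example.

```python
from typing import Iterable, Tuple


def convergence(runs: Iterable[Tuple[int, bool]]) -> float:
    """Fraction of completed runs that hit the minimum observed step count."""
    completed = [steps for steps, ok in runs if ok]
    if not completed:
        raise ValueError("no completed runs to score")
    minimum = min(completed)
    at_minimum = sum(1 for steps in completed if steps == minimum)
    return at_minimum / len(completed)


def reading(score: float) -> str:
    """Map a score onto the rough bands from the text."""
    if score >= 0.8:
        return "stable"
    if score >= 0.5:
        return "investigate"
    return "unreliable"


# Three completed runs plus one crash after 2 steps, which is filtered out.
runs = [(4, True), (4, True), (6, True), (2, False)]
score = convergence(runs)
print(round(score, 3), reading(score))  # 0.667 investigate
```

Filtering before taking the minimum matters here: had the crashed run been kept, the minimum would be 2 and no completed run would count as converged.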

Convergence and the cost of wasted work Worked example

Problem. An agent handling “return item” requests produces these step counts over 12 runs: 4, 6, 4, 4, 8, 4, 4, 4, 6, 4, 4, 6. (a) Convergence? (b) Run 5 (8 steps) errored mid-way — exclude it and recompute. (c) At 10,000 queries/day and $0.03/step, what is the daily cost of the wasted steps?

Reasoning.

(a) Minimum = 4 steps; runs at 4 = 8 of 12 → convergence = $8/12 = \mathbf{0.667}$.
(b) Drop the errored run: 11 completed, min still 4, 8 at min → $8/11 = \mathbf{0.727}$. (Always filter to completed runs: a crash at 2 steps would falsely lower the minimum.)
(c) Using all 12 runs, total steps = 58; optimal = $12 \times 4 = 48$; wasted = $58 - 48 = 10$ steps, i.e. $10/58 \approx 17\%$ of step spend. The average run takes 58/12 ≈ 4.83 steps, so at 10,000 queries/day and $0.03/step the daily step spend is about $1,450. The wasted share is 10/12 ≈ 0.83 steps per query, so roughly $250/day is wasted work.

Answer. Convergence ≈ 0.67 (0.73 excluding the errored run), and roughly 17% of step cost — on the order of $250/day here — is spent on non-optimal paths. That dollar figure is what makes a trajectory inefficiency a shipping concern, not a curiosity.
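The arithmetic of this worked example can be reproduced directly. The sketch below uses only the numbers given in the problem; the variable names are illustrative.

```python
step_counts = [4, 6, 4, 4, 8, 4, 4, 4, 6, 4, 4, 6]
errored_index = 4  # run 5 (zero-based index 4) errored mid-way


def convergence(counts):
    minimum = min(counts)
    return counts.count(minimum) / len(counts)


# (a) all 12 runs, (b) completed runs only
conv_all = convergence(step_counts)
completed = [c for i, c in enumerate(step_counts) if i != errored_index]
conv_completed = convergence(completed)

# (c) wasted steps relative to every run taking the minimum
total_steps = sum(step_counts)                         # 58
optimal_steps = len(step_counts) * min(step_counts)    # 48
wasted_steps = total_steps - optimal_steps             # 10
wasted_fraction = wasted_steps / total_steps

queries_per_day = 10_000
cost_per_step = 0.03
daily_spend = total_steps / len(step_counts) * queries_per_day * cost_per_step
daily_waste = wasted_steps / len(step_counts) * queries_per_day * cost_per_step

print(round(conv_all, 3), round(conv_completed, 3))  # 0.667 0.727
print(round(wasted_fraction, 3))                     # 0.172
print(round(daily_spend), round(daily_waste))        # 1450 250
```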

Key concept

Convergence measures consistency, not correctness

EAA-4.2

A convergence of 1.0 means every completed run finishes in the same, minimum observed number of steps. It says nothing about whether that path is the correct one. High convergence on a wrong path is worse than low convergence with some correct runs, so always pair it with the per-step correctness evals from Chapter 3. Its three blind spots: universal waste (if every run includes the same unnecessary step, the minimum includes it and convergence reports 1.0), errored runs (filter to completed runs or a crash deflates the minimum), and parallel tool calls (decide whether 3 tools in one step counts as 1 or 3, and apply it uniformly). [V] Verified

Name convergence's three structural blind spots — the cases where a high score misleads. EAA-4.2

(1) Universal waste — if every run includes the same unnecessary step, that step sits inside the minimum, so convergence still reports 1.0. (2) Errored runs — a crash after 2 steps falsely lowers the minimum, so filter to completed runs first. (3) Parallel tool calls — decide whether 3 tools in one step count as 1 or 3, and apply that convention uniformly. Each is a way a high score can hide a real problem — which is why convergence is paired with per-step correctness evals.
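The parallel-tool-call blind spot comes down to picking a counting convention and never mixing the two. A minimal sketch, assuming a trajectory is recorded as a list of steps where each step lists the tools issued together (a hypothetical representation):

```python
from typing import List


def count_steps(trajectory: List[List[str]], parallel_as_one: bool = True) -> int:
    """Count steps under one explicit convention for parallel tool calls."""
    if parallel_as_one:
        return len(trajectory)                    # one step per turn
    return sum(len(step) for step in trajectory)  # one step per tool call


# The middle turn issues three tools in parallel.
run = [["route"], ["search", "lookup", "pricing"], ["respond"]]

print(count_steps(run))                         # 3
print(count_steps(run, parallel_as_one=False))  # 5
```

Either convention is defensible; what distorts the minimum, and therefore the score, is scoring some runs one way and some the other.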

The Phoenix Experiments framework

An experiment is a structured, versioned evaluation run. The workflow has four stages — and it maps cleanly onto pytest: [V] Verified

Create a dataset of test cases with input keys (forwarded to the agent) and optional output keys (forwarded to evaluators). (≈ test cases / parametrize.)
Define a task — a function taking one dataset row, running the agent, returning output + metadata. (≈ the test function.)
Run the experiment — iterate over the dataset, managing parallelism and errors.
Apply evaluators — scoring functions over each result. (≈ assertions.)
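The four stages can be sketched without any framework. The following is a plain-Python stand-in, not the Phoenix API: `run_local_experiment`, the canned `task`, and the `exact_match` evaluator are hypothetical names used only to show how the pieces fit together.

```python
from typing import Any, Callable, Dict, List

# 1. Dataset: input keys go to the agent, expected (output) keys to evaluators.
dataset = [
    {"input": {"question": "2 + 2"}, "expected": {"answer": "4"}},
    {"input": {"question": "3 * 5"}, "expected": {"answer": "15"}},
]


# 2. Task: run the agent on one row. A canned stub stands in for the agent;
#    its second answer is deliberately wrong so the evaluator has work to do.
def task(inputs: Dict[str, Any]) -> Dict[str, Any]:
    canned = {"2 + 2": "4", "3 * 5": "16"}
    return {"answer": canned.get(inputs["question"], ""), "steps": 3}


# 4. Evaluator: a scoring function over (output, expected).
def exact_match(output: Dict[str, Any], expected: Dict[str, Any]) -> float:
    return float(output["answer"] == expected["answer"])


# 3. Experiment: iterate over the dataset, capture errors, apply evaluators.
def run_local_experiment(
    rows: List[Dict[str, Any]],
    task_fn: Callable[[Dict[str, Any]], Dict[str, Any]],
    evaluators: Dict[str, Callable[[Dict[str, Any], Dict[str, Any]], float]],
) -> List[Dict[str, Any]]:
    results = []
    for row in rows:
        try:
            output, error = task_fn(row["input"]), None
        except Exception as exc:  # one failing row should not abort the run
            output, error = None, repr(exc)
        scores = {}
        if output is not None:
            scores = {name: fn(output, row["expected"]) for name, fn in evaluators.items()}
        results.append({"input": row["input"], "output": output, "error": error, "scores": scores})
    return results


results = run_local_experiment(dataset, task, {"exact_match": exact_match})
print([r["scores"]["exact_match"] for r in results])  # [1.0, 0.0]
```

The sketch runs rows sequentially; a real runner would also manage parallelism, as the third stage describes.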

Name the four stages of the Phoenix Experiments pipeline in order. EAA-4.3

Dataset (test cases with input/output keys) → task (a function that runs the agent on one row) → experiment (run the task over the whole dataset) → evaluators (scoring functions applied to each result). It’s the agent analogue of pytest: dataset = cases, task = test function, evaluators = assertions.

Designing the dataset

An evaluation dataset pairs input keys (sent to the agent) with output keys (sent to evaluators). The guiding principle is comprehensive, not exhaustive: 1–2 examples per input type, not 100 per category — the sweet spot for iteration speed.
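One way to keep a dataset comprehensive rather than exhaustive is to check its coverage per input type. The sketch below assumes each row carries a hypothetical `type` tag and that the team maintains a list of required input types; neither is prescribed by the framework.

```python
from collections import Counter
from typing import Any, Dict, List, Set

dataset = [
    {"type": "trend", "input": {"question": "How did sales move in Q3?"},
     "expected": {"trend": "up"}},
    {"type": "trend", "input": {"question": "Are returns rising?"},
     "expected": {"trend": "down"}},
    {"type": "lookup", "input": {"question": "Which stores sold item 42?"},
     "expected": {"tables": ["sales", "stores"]}},
]
required_types = {"trend", "lookup", "comparison"}  # illustrative categories


def coverage_report(rows: List[Dict[str, Any]], required: Set[str],
                    max_per_type: int = 2) -> Dict[str, List[str]]:
    """Flag input types with no example and types with more than the cap."""
    counts = Counter(row["type"] for row in rows)
    missing = sorted(required - set(counts))
    crowded = sorted(t for t, n in counts.items() if n > max_per_type)
    return {"missing": missing, "crowded": crowded}


print(coverage_report(dataset, required_types))
# {'missing': ['comparison'], 'crowded': []}
```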

To improve an agent there are five levers — prompts, tool definitions, router logic, skill structure, model selection — and the rule is change one at a time: modify the prompt and the model together and you can’t attribute the result to either.
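The one-lever rule is easy to enforce mechanically before an experiment runs. A minimal sketch, assuming each variant is described by a config dict with one entry per lever (the key names and values are hypothetical):

```python
from typing import Dict, List

LEVERS = ("prompt", "tool_definitions", "router_logic", "skill_structure", "model")


def changed_levers(baseline: Dict[str, str], variant: Dict[str, str]) -> List[str]:
    """Return the levers whose value differs between two variant configs."""
    return [lever for lever in LEVERS if baseline.get(lever) != variant.get(lever)]


v1 = {"prompt": "p1", "tool_definitions": "t1", "router_logic": "r1",
      "skill_structure": "s1", "model": "m1"}
v2 = {**v1, "prompt": "p2"}                          # one lever: attributable
confounded = {**v1, "prompt": "p2", "model": "m2"}   # two levers: not attributable

print(changed_levers(v1, v2))          # ['prompt']
print(changed_levers(v1, confounded))  # ['prompt', 'model']
```

A guard such as `assert len(changed_levers(baseline, variant)) == 1` at the top of an experiment script would reject the confounded comparison.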

Why should an evaluation dataset be 'comprehensive, not exhaustive', and what are the two key-types in it? EAA-4.4

Because iteration speed matters more than volume: 1–2 cases per input type cover the behaviour space well enough to catch regressions while keeping each experiment fast to run and cheap to maintain; 100 near-duplicates per category slow the loop without adding signal. The two key-types are input keys (forwarded to the agent) and output keys (forwarded to evaluators for comparison, e.g. expected SQL tables or trend direction).

Comparing variants on the dashboard

No single evaluator captures the full picture, so an experiment composes several — function-calling correctness, SQL accuracy, clarity (LLM judge), entity correctness, runnability — and lays the variants out on a dashboard:

| Variant | Routing | SQL | Clarity | Runnable | Latency |
| --- | --- | --- | --- | --- | --- |
| V1 (baseline) | 0.82 | 0.75 | 0.88 | 0.95 | 1.2 s |
| V2 (new prompt) | 0.85 | 0.78 | 0.91 | 0.95 | 1.3 s |
| V3 (new model) | 0.90 | 0.82 | 0.86 | 0.90 | 2.8 s |

Key concept

The dashboard reveals trade-offs a single metric hides

EAA-4.5

Comparing across dimensions at once surfaces trade-offs an aggregate score masks: V3 improves routing and SQL but regresses clarity, runnability, and latency. A shipping call needs the whole row, and the comparison breaks down if the evaluators are too coarse (binary pass/fail can’t show V3’s 2-point clarity dip from 0.88 to 0.86). [I] Inference
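A per-dimension comparison against the baseline makes the trade-offs explicit. The sketch below uses the dashboard figures from the table; the `regressions` helper and the metric key names are illustrative, and it reports any worsening without judging whether the gap is significant.

```python
from typing import Dict

dashboard = {
    "V1": {"routing": 0.82, "sql": 0.75, "clarity": 0.88, "runnable": 0.95, "latency_s": 1.2},
    "V2": {"routing": 0.85, "sql": 0.78, "clarity": 0.91, "runnable": 0.95, "latency_s": 1.3},
    "V3": {"routing": 0.90, "sql": 0.82, "clarity": 0.86, "runnable": 0.90, "latency_s": 2.8},
}
LOWER_IS_BETTER = {"latency_s"}


def regressions(baseline: Dict[str, float], variant: Dict[str, float]) -> Dict[str, float]:
    """Return each metric that got worse, with its raw change from baseline."""
    worse = {}
    for metric, base in baseline.items():
        change = round(variant[metric] - base, 2)
        got_worse = change > 0 if metric in LOWER_IS_BETTER else change < 0
        if got_worse:
            worse[metric] = change
    return worse


print(regressions(dashboard["V1"], dashboard["V2"]))
# {'latency_s': 0.1}
print(regressions(dashboard["V1"], dashboard["V3"]))
# {'clarity': -0.02, 'runnable': -0.05, 'latency_s': 1.6}
```

Averaging each row into one number would discard exactly this breakdown.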

Should you ship V3? Worked example

Problem. Using the dashboard above, decide whether to ship V2 or V3, and say what else you’d need to know.

Reasoning. V3 has the best routing (0.90) and SQL (0.82) but the worst clarity (0.86) and runnability (0.90), and its 2.8 s latency is more than double V1’s 1.2 s. V2 modestly improves routing, SQL, and clarity, holds runnability, and adds only 0.1 s of latency. There’s no dimension where V2 is clearly bad.

Answer. Ship V2: broad improvement with no meaningful regression. V3 trades correctness gains for clarity/runnability/latency losses that a latency SLO might reject outright. But don’t decide on the table alone — first get statistical significance (are these gaps real or noise on a small dataset?), a per-category breakdown, cost per query, and ideally a small A/B on user satisfaction. Aggregate scores need cost, latency, and significance context before they’re a decision.

Variant B scores higher on correctness but costs more, and the difference isn't statistically significant. How do you decide whether to ship it? EAA-4.7

Don’t ship on the higher score alone. With the difference not significant, the “improvement” may be noise — gather more data or a larger eval set first. Weigh the correctness gain against the extra cost (and any latency change) explicitly, check a per-category breakdown for hidden regressions, and confirm the gain holds on a significance test (or an A/B on real users) before committing. Ship only if a real gain justifies the added cost.
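One common way to check whether a gap between two variants is more than noise is a paired permutation (sign-flip) test on per-example scores. This is a sketch of that technique using only the standard library, not a method the chapter prescribes; the function name and the example scores are made up for illustration.

```python
import random
from typing import Sequence


def paired_permutation_p_value(a: Sequence[float], b: Sequence[float],
                               n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the mean difference between paired scores.

    Under the null hypothesis the variants are interchangeable, so each
    per-example difference is equally likely to carry either sign.
    """
    if len(a) != len(b):
        raise ValueError("scores must be paired on the same examples")
    diffs = [y - x for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-example correctness on the same 10 cases.
variant_a = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]
variant_b = [1, 1, 1, 1, 0, 1, 0, 1, 1, 0]  # differs on a single example

print(paired_permutation_p_value(variant_a, variant_b))  # 1.0
```

Variant B scores higher on aggregate (0.7 vs 0.6), yet the gain rests on one example, so every sign flip reproduces a gap at least as large and the p-value is 1.0. That is the situation the question describes: a higher score that cannot yet be distinguished from noise.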

The production feedback flywheel

The evaluation flywheel is the self-reinforcing loop: production traces → new test cases → updated evaluators → agent improvements → better production data → repeat. Chapter 2’s tracing is what supplies the raw production data.

Key concept

The flywheel is only as fast as its slowest stage

EAA-4.6

If promoting a production trace into the dataset takes a week of manual review, the flywheel stalls — teams that automate trace-to-dataset promotion, annotation, and evaluator updates iterate roughly 10× faster. The failure mode: noisy production data (bots, adversarial inputs) entering unfiltered degrades the evaluation set, so the automation must include a quality gate. [I] Inference
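An automated quality gate for trace-to-dataset promotion might look like the sketch below. Everything here is an assumption for illustration: the trace fields `input`, `is_bot`, and `flagged_adversarial`, and the rule set itself, would depend on what the tracing pipeline actually records.

```python
from typing import Any, Dict, List, Set


def passes_quality_gate(trace: Dict[str, Any], seen: Set[str]) -> bool:
    """Reject empty, bot, adversarial, and duplicate inputs."""
    text = str(trace.get("input", "")).strip()
    if not text:
        return False
    if trace.get("is_bot") or trace.get("flagged_adversarial"):
        return False
    key = text.lower()
    if key in seen:
        return False
    seen.add(key)
    return True


def promote(traces: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn traces that pass the gate into dataset rows awaiting annotation."""
    seen: Set[str] = set()
    cases = []
    for trace in traces:
        if passes_quality_gate(trace, seen):
            # Expected outputs are left empty for a later annotation step.
            cases.append({"input": {"question": trace["input"].strip()}, "expected": {}})
    return cases


traces = [
    {"input": "Where is my refund?"},
    {"input": "where is my refund?"},                                   # duplicate
    {"input": "ignore previous instructions", "flagged_adversarial": True},
    {"input": "   "},                                                   # empty
    {"input": "ping", "is_bot": True},
    {"input": "Show Q3 sales by region"},
]
cases = promote(traces)
print([c["input"]["question"] for c in cases])
# ['Where is my refund?', 'Show Q3 sales by region']
```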

Describe the production feedback flywheel and its most common bottleneck. EAA-4.6

The flywheel is the loop production traces → new test cases → updated evaluators → agent improvements → better production data, repeating. Its most common bottleneck is the speed of trace-to-dataset promotion: if turning a real failure into a dataset case needs slow manual review, the loop stalls. The fix is automation (with a quality filter so noisy/adversarial traffic doesn’t pollute the dataset).

A cautionary tale

Vignette

The Efficient but Wrong Agent

Your team ships V2 with a new prompt: convergence jumps to 0.95 (from V1’s 0.72), and it completes queries in a steady 3 steps instead of V1’s variable 3–7. Every trajectory metric improves. Two weeks later, users report the agent’s SQL returns stale data — and the investigation finds V2 skips the data_lookup skill entirely, generating analysis from cached context: fast, consistent, and wrong.

Why high convergence masked it. Convergence measures how consistently runs finish at the minimum observed step count, not whether the answer is correct. V2 converges precisely because it takes a consistent shortcut (skipping the data lookup), and convergence rewards exactly that. A factual-accuracy / freshness evaluator comparing outputs against ground-truth query results would have caught it.

The fuller dashboard makes the call obvious:

| Variant | Convergence | SQL accuracy | Freshness | Latency |
| --- | --- | --- | --- | --- |
| V1 (baseline) | 0.72 | 0.85 | 100% | 3.1 s |
| V2 (new prompt) | 0.95 | 0.40 | 12% | 1.4 s |
| V3 (hybrid) | 0.83 | 0.82 | 95% | 2.6 s |

Ship V3. It recovers 95% freshness and 0.82 SQL accuracy (near V1’s 0.85) while still lifting convergence to 0.83 and cutting latency about 16% relative to V1 (3.1 s to 2.6 s). V2 is disqualified by 12% freshness and 0.40 SQL accuracy: no latency win justifies serving stale data. The fix that generalises: add a freshness evaluator that checks each trajectory for a data_lookup call when the query needs live data, and auto-fails responses that skip it, so the next regression of this class can’t hide behind a high convergence score.
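The freshness evaluator described here can be sketched as a trajectory check. The trajectory representation (a list of dicts with a `skill` key) and the `needs_live_data` flag are assumptions for illustration; only the `data_lookup` skill name comes from the vignette.

```python
from typing import Dict, List


def freshness_check(trajectory: List[Dict[str, str]], needs_live_data: bool) -> bool:
    """Pass unless a live-data query was answered without a data_lookup call."""
    if not needs_live_data:
        return True
    return any(step.get("skill") == "data_lookup" for step in trajectory)


# A V1-style run performs the lookup; a V2-style run answers from cached context.
with_lookup = [{"skill": "router"}, {"skill": "data_lookup"}, {"skill": "analysis"}]
shortcut = [{"skill": "router"}, {"skill": "analysis"}, {"skill": "respond"}]

print(freshness_check(with_lookup, needs_live_data=True))  # True
print(freshness_check(shortcut, needs_live_data=True))     # False
print(freshness_check(shortcut, needs_live_data=False))    # True
```

Both example runs are three steps long, so step-count convergence treats them identically; this check is what separates them.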

Summary — what this sets up

You can now evaluate the path, not just the answer: convergence scores consistency (never mistake it for correctness), the Phoenix Experiments framework makes evaluation repeatable (dataset → task → experiment → evaluators), composed evaluators on a dashboard expose trade-offs, shipping decisions weigh the full row plus cost/latency/significance, and the flywheel feeds production reality back into the dataset — and turning wasted steps into a daily dollar figure makes efficiency a first-class shipping metric.

Chapter 5 — LLM-as-a-judge & monitoring: making the judge that powers many of these evaluators trustworthy, and watching all of this in production.