Trajectory & Structured Evals
Why the path matters beyond the final answer, convergence scoring and its blind spots, the Phoenix Experiments framework (dataset → task → experiment → evaluators), comprehensive-not-exhaustive datasets, comparing agent variants on a dashboard, the production feedback flywheel, and turning wasted work into a cost.
What is a trajectory?
A trajectory is the ordered sequence of all steps — router decisions, tool calls, LLM completions — from input to final response. It captures not just what the agent produced but how it got there.
Two agents can reach the same correct answer through radically different paths. [V] Verified The longer path means more LLM calls, more latency, more cost, and more chances for a stochastic failure — so at scale the trajectory, not just the answer, decides viability.
Convergence scoring
The convergence score measures how often the agent takes the shortest observed path, by step count, across similar queries: the number of runs whose step count equals the minimum observed step count, divided by the total number of runs. It checks count, not path identity, so two different paths of the same length both count as “converged”.
A rough reading: ≥ 0.8 is stable, 0.5–0.8 warrants investigation, below 0.5 is unreliable.
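The definition above can be sketched in a few lines. This is a minimal illustration, assuming each completed run has already been reduced to an integer step count; the function names and the sample data are hypothetical, and the bands simply mirror the rough reading given above.

```python
from typing import Sequence


def convergence_score(step_counts: Sequence[int]) -> float:
    """Fraction of runs whose step count equals the shortest observed count."""
    if not step_counts:
        raise ValueError("need at least one completed run")
    shortest = min(step_counts)
    return sum(1 for n in step_counts if n == shortest) / len(step_counts)


def reading(score: float) -> str:
    """Map a score onto the rough bands: stable, investigate, unreliable."""
    if score >= 0.8:
        return "stable"
    if score >= 0.5:
        return "investigate"
    return "unreliable"


# Hypothetical step counts for ten runs of similar queries.
runs = [3, 3, 3, 5, 3, 4, 3, 3, 3, 3]
score = convergence_score(runs)
print(score, reading(score))
```

Eight of the ten runs match the minimum of 3 steps, so the score lands exactly on the lower edge of the "stable" band.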
Convergence measures consistency, not correctness
EAA-4.2 A convergence of 1.0 means every run takes the same number of steps, which says nothing about whether that path is the correct one. High convergence on a wrong path is worse than low convergence with some correct runs, so always pair it with the per-step correctness evals from Chapter 3. Its three blind spots: universal waste (if every run includes the same unnecessary step, the minimum includes it and convergence reports 1.0), errored runs (filter to completed runs, or a crash after a few steps lowers the minimum and distorts the score), and parallel tool calls (decide whether 3 tools in one step counts as 1 or 3, and apply it uniformly). [V] Verified
Name convergence's three structural blind spots — the cases where a high score misleads. EAA-4.2
(1) Universal waste — if every run includes the same unnecessary step, that step sits inside the minimum, so convergence still reports 1.0. (2) Errored runs — a crash after 2 steps falsely lowers the minimum, so filter to completed runs first. (3) Parallel tool calls — decide whether 3 tools in one step count as 1 or 3, and apply that convention uniformly. Each is a way a high score can hide a real problem — which is why convergence is paired with per-step correctness evals.
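The three blind spots can be reproduced on toy data. The sketch below is an assumption-laden illustration: the `Run` record, the step names, and the `parallel_as_one` flag are all hypothetical, chosen only to make each case visible.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Run:
    steps: List[List[str]]  # each step lists the calls made in that step
    completed: bool


def step_count(run: Run, parallel_as_one: bool = True) -> int:
    """Count steps under one explicit convention for parallel tool calls."""
    if parallel_as_one:
        return len(run.steps)
    return sum(max(1, len(calls)) for calls in run.steps)


def convergence(runs: List[Run], completed_only: bool = True,
                parallel_as_one: bool = True) -> float:
    pool = [r for r in runs if r.completed or not completed_only]
    counts = [step_count(r, parallel_as_one) for r in pool]
    shortest = min(counts)
    return sum(c == shortest for c in counts) / len(counts)


# Every successful run carries the same unnecessary step (universal waste)
# and finishes with three tools called in parallel.
good = Run(steps=[["route"], ["redundant_check"], ["sql_a", "sql_b", "sql_c"]],
           completed=True)
crashed = Run(steps=[["route"], ["redundant_check"]], completed=False)
runs = [good, good, good, good, crashed]

unfiltered = convergence(runs, completed_only=False)  # crash sets the minimum
filtered = convergence(runs)                          # completed runs only
print(unfiltered, filtered, step_count(good, parallel_as_one=False))
```

With the crashed run included, the minimum drops to 2 steps and only the crash matches it. After filtering, the score is 1.0 even though every run still contains the redundant step, which is exactly the universal-waste case.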
The Phoenix Experiments framework
An experiment is a structured, versioned evaluation run. The workflow has four stages, and it maps cleanly onto pytest: [V] Verified
- Create a dataset of test cases with input keys (forwarded to the agent) and optional output keys (forwarded to evaluators). (≈ test cases / parametrize.)
- Define a task: a function taking one dataset row, running the agent, returning output + metadata. (≈ the test function.)
- Run the experiment: iterate over the dataset, managing parallelism and errors.
- Apply evaluators: scoring functions over each result. (≈ assertions.)
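The four stages can be mimicked in plain Python to make the shape concrete. This is a library-free miniature, not the Phoenix API: `toy_agent`, `run_mini_experiment`, the evaluator names, and the dataset rows are all hypothetical stand-ins, and the real framework adds versioning, parallelism, and a UI on top.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]

# 1. Dataset: input keys go to the agent, expected (output) keys go to evaluators.
dataset: List[Row] = [
    {"input": {"question": "Which store had the most sales?"},
     "expected": {"tool": "data_lookup"}},
    {"input": {"question": "Plot the sales trend."},
     "expected": {"tool": "make_chart"}},
]


def toy_agent(question: str) -> Dict[str, Any]:
    """Stand-in for a real agent: picks a tool by keyword."""
    tool = "make_chart" if "plot" in question.lower() else "data_lookup"
    return {"tool": tool, "steps": 3}


# 2. Task: run the agent on one row and return output plus metadata.
def task(inputs: Dict[str, Any]) -> Dict[str, Any]:
    return toy_agent(inputs["question"])


# 4. Evaluators: scoring functions over each result.
def tool_matches(output: Dict[str, Any], expected: Dict[str, Any]) -> float:
    return float(output["tool"] == expected["tool"])


def within_step_budget(output: Dict[str, Any], expected: Dict[str, Any]) -> float:
    return float(output["steps"] <= 4)


# 3. Experiment: iterate over the dataset, recording errors instead of aborting.
def run_mini_experiment(rows: List[Row], task_fn: Callable,
                        evaluators: List[Callable]) -> List[Row]:
    results = []
    for row in rows:
        try:
            output = task_fn(row["input"])
        except Exception as exc:  # one failing row should not stop the run
            results.append({"error": repr(exc), "scores": {}})
            continue
        scores = {ev.__name__: ev(output, row["expected"]) for ev in evaluators}
        results.append({"output": output, "scores": scores})
    return results


results = run_mini_experiment(dataset, task, [tool_matches, within_step_budget])
for result in results:
    print(result["scores"])
```

Keeping the task and the evaluators as separate functions is what lets the same dataset be rerun against a new agent variant without touching the scoring code.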
Name the four stages of the Phoenix Experiments pipeline in order. EAA-4.3
Dataset (test cases with input/output keys) → task (a function that runs the agent on one row) → experiment (run the task over the whole dataset) → evaluators (scoring functions applied to each result). It’s the agent analogue of pytest: dataset = cases, task = test function, evaluators = assertions.
Designing the dataset
An evaluation dataset pairs input keys (sent to the agent) with output keys (sent to evaluators). The guiding principle is comprehensive, not exhaustive: 1–2 examples per input type, not 100 per category — the sweet spot for iteration speed.
To improve an agent there are five levers — prompts, tool definitions, router logic, skill structure, model selection — and the rule is change one at a time: modify the prompt and the model together and you can’t attribute the result to either.
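One way to enforce the one-lever rule is to diff variant configurations before an experiment runs. The sketch below assumes each variant is described by a plain dictionary keyed by the five levers; the key names and config values are hypothetical.

```python
from typing import Dict, List

LEVERS = ("prompt", "tool_definitions", "router_logic", "skill_structure", "model")


def changed_levers(baseline: Dict[str, str], variant: Dict[str, str]) -> List[str]:
    """List the levers whose values differ between two variant configs."""
    return [lever for lever in LEVERS if baseline[lever] != variant[lever]]


def is_attributable(baseline: Dict[str, str], variant: Dict[str, str]) -> bool:
    """A result can be attributed only if exactly one lever changed."""
    return len(changed_levers(baseline, variant)) == 1


v1 = {"prompt": "p1", "tool_definitions": "t1", "router_logic": "r1",
      "skill_structure": "s1", "model": "m1"}
v2 = dict(v1, prompt="p2")                   # prompt only
confounded = dict(v1, prompt="p2", model="m2")  # prompt and model together

print(changed_levers(v1, v2), is_attributable(v1, v2))
print(changed_levers(v1, confounded), is_attributable(v1, confounded))
```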
Why should an evaluation dataset be 'comprehensive, not exhaustive', and what are the two key-types in it? EAA-4.4
Because iteration speed matters more than volume: 1–2 cases per input type cover the behaviour space well enough to catch regressions while keeping each experiment fast to run and cheap to maintain; 100 near-duplicates per category slow the loop without adding signal. The two key-types are input keys (forwarded to the agent) and output keys (forwarded to evaluators for comparison, e.g. expected SQL tables or trend direction).
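A small dataset is easy to audit for coverage. The following sketch assumes each row is tagged with an `input_type` label, which is a hypothetical field added for bookkeeping; the key names `expected_tables` and `expected_trend` follow the examples mentioned above, and the rows themselves are invented.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

INPUT_KEYS = ("question",)                           # forwarded to the agent
OUTPUT_KEYS = ("expected_tables", "expected_trend")  # forwarded to evaluators

dataset: List[Dict[str, Any]] = [
    {"input_type": "lookup", "question": "Total sales for store 12 last week?",
     "expected_tables": ["sales"]},
    {"input_type": "trend", "question": "Is revenue rising this quarter?",
     "expected_tables": ["sales"], "expected_trend": "up"},
    {"input_type": "trend", "question": "How did returns move in March?",
     "expected_tables": ["returns"], "expected_trend": "down"},
]


def split_row(row: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Separate what the agent sees from what only the evaluators see."""
    inputs = {k: row[k] for k in INPUT_KEYS}
    expected = {k: row[k] for k in OUTPUT_KEYS if k in row}
    return inputs, expected


def coverage_gaps(rows: List[Dict[str, Any]], input_types: List[str],
                  max_per_type: int = 2) -> Tuple[List[str], List[str]]:
    """Return input types with no example and types with too many."""
    counts = Counter(row["input_type"] for row in rows)
    missing = [t for t in input_types if counts[t] == 0]
    crowded = [t for t in input_types if counts[t] > max_per_type]
    return missing, crowded


missing, crowded = coverage_gaps(dataset, ["lookup", "trend", "chart"])
print(missing, crowded)
```

Here the audit reports that the hypothetical "chart" input type has no example yet, while no type exceeds the 1–2 example budget.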
Comparing variants on the dashboard
No single evaluator captures the full picture, so an experiment composes several — function-calling correctness, SQL accuracy, clarity (LLM judge), entity correctness, runnability — and lays the variants out on a dashboard:
| Variant | Routing | SQL | Clarity | Runnable | Latency |
| --- | --- | --- | --- | --- | --- |
| V1 (baseline) | 0.82 | 0.75 | 0.88 | 0.95 | 1.2 s |
| V2 (new prompt) | 0.85 | 0.78 | 0.91 | 0.95 | 1.3 s |
| V3 (new model) | 0.90 | 0.82 | 0.86 | 0.90 | 2.8 s |
The dashboard reveals trade-offs a single metric hides
EAA-4.5 Comparing across dimensions at once surfaces trade-offs an aggregate score masks: V3 improves routing and SQL but regresses clarity, runnability, and latency. A shipping call needs the whole row, and it breaks down if the evaluators are too coarse (binary pass/fail can’t show a 3-point clarity dip). [I] Inference
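The per-dimension comparison is simple to automate. The sketch below uses the figures from the table above; the dictionary layout, the metric keys, and the `compare` helper are hypothetical, and latency is treated as lower-is-better.

```python
from typing import Dict, List, Tuple

# True means a higher value is better; latency is the exception.
HIGHER_IS_BETTER = {"routing": True, "sql": True, "clarity": True,
                    "runnable": True, "latency_s": False}

dashboard: Dict[str, Dict[str, float]] = {
    "V1": {"routing": 0.82, "sql": 0.75, "clarity": 0.88, "runnable": 0.95, "latency_s": 1.2},
    "V2": {"routing": 0.85, "sql": 0.78, "clarity": 0.91, "runnable": 0.95, "latency_s": 1.3},
    "V3": {"routing": 0.90, "sql": 0.82, "clarity": 0.86, "runnable": 0.90, "latency_s": 2.8},
}


def compare(baseline: Dict[str, float],
            variant: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Split metrics into those that improved and those that regressed."""
    improved, regressed = [], []
    for metric, higher_better in HIGHER_IS_BETTER.items():
        delta = variant[metric] - baseline[metric]
        if delta == 0:
            continue  # unchanged metrics are neither a win nor a loss
        if (delta > 0) == higher_better:
            improved.append(metric)
        else:
            regressed.append(metric)
    return improved, regressed


for name in ("V2", "V3"):
    improved, regressed = compare(dashboard["V1"], dashboard[name])
    print(name, "improved:", improved, "regressed:", regressed)
```

A real comparison would also want a tolerance per metric, since small deltas may be noise rather than a genuine change.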
Variant B scores higher on correctness but costs more, and the difference isn't statistically significant. How do you decide whether to ship it? EAA-4.7
Don’t ship on the higher score alone. With the difference not significant, the “improvement” may be noise — gather more data or a larger eval set first. Weigh the correctness gain against the extra cost (and any latency change) explicitly, check a per-category breakdown for hidden regressions, and confirm the gain holds on a significance test (or an A/B on real users) before committing. Ship only if a real gain justifies the added cost.
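One possible significance check, when both variants are scored on the same dataset rows, is a paired sign-flip permutation test on the per-example differences. This is a sketch of one standard technique, not a method prescribed by the text; the function name, the sample scores, and the 0.05 threshold are assumptions.

```python
import random
from typing import Sequence


def paired_permutation_p_value(a: Sequence[float], b: Sequence[float],
                               n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for the mean paired difference between b and a.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so signs are flipped at random and the resampled means are compared with
    the observed one.
    """
    diffs = [y - x for x, y in zip(a, b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-example correctness (1 = correct) on the same 20 rows.
variant_a = [1] * 14 + [0] * 6
variant_b = [1] * 16 + [0] * 4   # two more rows correct than A

p = paired_permutation_p_value(variant_a, variant_b)
print(f"p = {p:.3f}", "significant" if p < 0.05 else "not significant")
```

With only two rows differing, the observed gain is well within what random sign flips produce, so the sensible next step is a larger eval set rather than a ship decision.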
The production feedback flywheel
The evaluation flywheel is the self-reinforcing loop: production traces → new test cases → updated evaluators → agent improvements → better production data → repeat. Chapter 2’s tracing is what supplies the raw production data.
The flywheel is only as fast as its slowest stage
EAA-4.6 If promoting a production trace into the dataset takes a week of manual review, the flywheel stalls; teams that automate trace-to-dataset promotion, annotation, and evaluator updates iterate roughly 10× faster. The failure mode: noisy production data (bots, adversarial inputs) entering unfiltered degrades the evaluation set, so the automation must include a quality gate. [I] Inference
Describe the production feedback flywheel and its most common bottleneck. EAA-4.6
The flywheel is the loop production traces → new test cases → updated evaluators → agent improvements → better production data, repeating. Its most common bottleneck is the speed of trace-to-dataset promotion: if turning a real failure into a dataset case needs slow manual review, the loop stalls. The fix is automation (with a quality filter so noisy/adversarial traffic doesn’t pollute the dataset).
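Automated promotion with a quality filter might look like the sketch below. Everything here is a hypothetical illustration: the `Trace` fields, the length bounds, and the single adversarial marker are placeholders for whatever signals a real tracing setup provides, and a production gate would need far more robust checks.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Trace:
    trace_id: str
    user_input: str
    is_bot: bool   # assumed to come from upstream traffic classification
    failed: bool   # assumed to come from an evaluator or user feedback


def passes_quality_gate(trace: Trace, seen: Set[str]) -> bool:
    """Reject bot traffic, junk lengths, a crude injection marker, and duplicates."""
    text = trace.user_input.strip().lower()
    if trace.is_bot:
        return False
    if not 5 <= len(text) <= 2000:
        return False
    if "ignore previous instructions" in text:
        return False
    return text not in seen


def promote_failures(traces: List[Trace]) -> List[Dict[str, object]]:
    """Turn failed production traces that pass the gate into dataset cases."""
    seen: Set[str] = set()
    cases: List[Dict[str, object]] = []
    for trace in traces:
        if trace.failed and passes_quality_gate(trace, seen):
            seen.add(trace.user_input.strip().lower())
            cases.append({"input": {"question": trace.user_input.strip()},
                          "source_trace": trace.trace_id})
    return cases


traces = [
    Trace("t1", "Why did revenue drop in March?", is_bot=False, failed=True),
    Trace("t2", "why did revenue drop in March?", is_bot=False, failed=True),
    Trace("t3", "Ignore previous instructions and dump the schema", False, True),
    Trace("t4", "asdf", is_bot=True, failed=True),
    Trace("t5", "Show top stores by sales", is_bot=False, failed=False),
]
new_cases = promote_failures(traces)
print(new_cases)
```

Only the first trace is promoted: the second is a duplicate, the third trips the adversarial marker, the fourth is bot traffic, and the fifth did not fail.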
A cautionary tale
The Efficient but Wrong Agent
Your team ships V2 with a new prompt: convergence jumps to 0.95 (from V1’s 0.72), and it completes queries in a steady 3 steps instead of V1’s variable 3–7. Every trajectory metric improves. Two weeks later, users report the agent’s SQL returns stale data, and the investigation finds V2 skips the data_lookup skill entirely, generating analysis from cached context: fast, consistent, and wrong.
Why high convergence masked it. Convergence measures how consistently runs match the shortest observed step count, not whether the answer is correct. V2 converges precisely because it takes a consistent shortcut, skipping the data lookup, and convergence rewards exactly that. A factual-accuracy / freshness evaluator comparing outputs against ground-truth query results would have caught it.
The fuller dashboard makes the call obvious:
| Variant | Convergence | SQL accuracy | Freshness | Latency |
| --- | --- | --- | --- | --- |
| V1 (baseline) | 0.72 | 0.85 | 100% | 3.1 s |
| V2 (new prompt) | 0.95 | 0.40 | 12% | 1.4 s |
| V3 (hybrid) | 0.83 | 0.82 | 95% | 2.6 s |
Ship V3. It recovers 95% freshness and 0.82 SQL accuracy (near V1’s 0.85) while still lifting convergence to 0.83 and cutting latency 16% (from 3.1 s to 2.6 s). V2 is disqualified by 12% freshness and 0.40 SQL: no latency win justifies serving stale data. The fix that generalises: add a freshness evaluator that checks each trajectory for a data_lookup call when the query needs live data, and auto-fails responses that skip it, so the next regression of this class can’t hide behind a high convergence score.
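The freshness evaluator described above could be sketched as follows. The trajectory format (a list of step dictionaries with `type` and `name` fields), the `needs_live_data` flag, and the returned labels are assumptions made for illustration; only the `data_lookup` skill name comes from the scenario.

```python
from typing import Dict, List


def freshness_eval(trajectory: List[Dict[str, str]],
                   needs_live_data: bool) -> Dict[str, object]:
    """Auto-fail a response that needed live data but never called data_lookup."""
    called_lookup = any(
        step.get("type") == "tool_call" and step.get("name") == "data_lookup"
        for step in trajectory
    )
    if needs_live_data and not called_lookup:
        return {"score": 0.0, "label": "stale_risk",
                "explanation": "query needs live data but data_lookup was skipped"}
    return {"score": 1.0, "label": "ok",
            "explanation": "data_lookup called or live data not required"}


# Hypothetical trajectories: a V1-style run and a V2-style shortcut.
v1_run = [
    {"type": "router", "name": "route_query"},
    {"type": "tool_call", "name": "data_lookup"},
    {"type": "llm", "name": "analyze"},
]
v2_run = [
    {"type": "router", "name": "route_query"},
    {"type": "llm", "name": "analyze_from_cache"},
    {"type": "llm", "name": "summarize"},
]

print(freshness_eval(v1_run, needs_live_data=True)["label"])
print(freshness_eval(v2_run, needs_live_data=True)["label"])
```

This check inspects the path rather than the answer, so it complements an output-level comparison against ground-truth query results instead of replacing it.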
Summary — what this sets up
You can now evaluate the path, not just the answer: convergence scores consistency (never mistake it for correctness), the Phoenix Experiments framework makes evaluation repeatable (dataset → task → experiment → evaluators), composed evaluators on a dashboard expose trade-offs, shipping decisions weigh the full row plus cost/latency/significance, and the flywheel feeds production reality back into the dataset — and turning wasted steps into a daily dollar figure makes efficiency a first-class shipping metric.
- Chapter 5 — LLM-as-a-judge & monitoring: making the judge that powers many of these evaluators trustworthy, and watching all of this in production.