Part 1 Chapter 3 Last verified 2026-06-19

Component Evaluations

The three evaluation techniques (code-based, LLM-as-a-judge, human annotation) and how to pick among them, why LLM judges need discrete rails not numeric scores, router evaluation as two independent dimensions (tool selection × parameter extraction), per-skill evaluation by output type, and Phoenix evaluation with suppress_tracing.

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Router evaluation; if any is shaky, read closely — each is developed below.

Predict before reading — check each against the chapter.

You must evaluate whether generated SQL is correct vs whether a summary is good. Predict which one a deterministic code check fits, and which needs a different technique.
A teammate proposes “have GPT-4o rate each answer 1–10 and average.” Estimate how reliable that average is.
A router picks the right tool 95% of the time and the right parameters 85% of the time. Guess the end-to-end router accuracy — about 90%, about 81%, or about 80%.
Tool-selection accuracy is great but parameter accuracy is poor. Predict what kind of test case isolates the parameter bug without also testing tool choice.

Check your answers

SQL → code-based (run it, compare result sets — deterministic). Summary quality → LLM-as-a-judge with discrete labels (no single correct string).
Unreliable. LLMs can’t calibrate a numeric scale: the gap between a 7 and an 8 out of 10 (or between 83 and 79 out of 100) is noise, and averaging uncalibrated scores aggregates that noise into false precision. Use discrete rails instead.
~81%: $0.95 \times 0.85 = 0.8075 \approx 0.81$. The two dimensions are independent, so their accuracies multiply.
A query whose correct tool is obvious but whose parameters are easy to mis-extract (e.g. an ID that could be read as the wrong field, or a date range that could be reversed) — tool choice is unambiguous, so only parameter extraction is under test.

The three evaluation techniques

Every agent evaluation reduces to one of three techniques, trading accuracy against scalability:

Code-based evaluation — programmatic checks (regex, JSON validation, exact match, cosine similarity) that classify output as correct or incorrect. 100% reproducible and free, but only works for codifiable criteria.
LLM-as-a-judge — a separate LLM prompt classifies the output. Scalable and handles nuance, but never 100% accurate.
Human annotation — human labellers or end-user feedback. Most accurate, but slowest to scale and subject to selection bias.

| Output type | Technique | Why |
| --- | --- | --- |
| Quantitative / codifiable | Code-based | Deterministic, free, fast |
| Qualitative / subjective | LLM-as-a-judge | Scalable, handles nuance |
| Safety-critical | Human annotation | Accuracy worth the cost |

The shortcut: if you can write an assert, use code-based; if the output is prose or judgment, use an LLM judge with discrete labels; if a wrong answer is dangerous, pay for humans.
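As a rough illustration of the "if you can write an assert" end of that shortcut, the sketch below shows three small code-based checks (exact match, JSON validation, regex). The function names and the `ORD-` ID format in the usage note are hypothetical, chosen only for this example.

```python
import json
import re


def exact_match(output: str, expected: str) -> bool:
    """Pass when the output equals the expected string, ignoring surrounding whitespace."""
    return output.strip() == expected.strip()


def is_valid_json(output: str, required_keys: tuple = ()) -> bool:
    """Pass when the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required_keys)


def matches_pattern(output: str, pattern: str) -> bool:
    """Pass when the whole (stripped) output matches the regular expression."""
    return re.fullmatch(pattern, output.strip()) is not None
```

For instance, `matches_pattern("ORD-12345", r"ORD-\d{5}")` passes while `is_valid_json("not json")` fails. Each check is deterministic, so the same output always gets the same verdict.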

Key concept

Discrete rails over raw numeric scores

EAA-3.4

For classification, an LLM judge should output discrete labels (rails like ["correct", "incorrect"]) rather than an uncalibrated numeric score. An LLM cannot reliably distinguish 83 from 79 on a 100-point scale, or a 7 from an 8 on a 1–10 scale, so averaging raw scores manufactures false precision out of noise. Use 2–4 labels. The limit: even discrete labels are imperfect, so safety-critical calls still need human annotation. [V] Verified

Why must an LLM judge use discrete labels rather than a 1–10 numeric score? EAA-3.4

Because an LLM can’t calibrate a fine numeric scale. It can’t reliably tell a 7 from an 8 out of 10, any more than 83 from 79 out of 100, so the numbers are noise dressed as precision, and averaging them aggregates that noise. Discrete rails (2–4 labels like correct/incorrect) ask the judge only what it can reliably do: classify. Numeric scoring produces confident-looking averages that don’t track real quality.
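One way to enforce rails on the consuming side is to snap whatever the judge returns onto the allowed label set and report label rates instead of a mean score. The sketch below is a minimal illustration, not a library API: `snap_to_rails`, `label_rates`, and the `"unparseable"` fallback label are hypothetical names.

```python
from collections import Counter


def snap_to_rails(raw_output, rails, fallback="unparseable"):
    """Map a judge's raw reply onto exactly one allowed label, or the fallback."""
    cleaned = raw_output.strip().strip(".!\"'").lower()
    for rail in rails:
        if cleaned == rail.lower():
            return rail
    return fallback  # anything off-rail (e.g. "8/10") is surfaced, not averaged


def label_rates(labels):
    """Share of each label in a list of snapped labels."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}
```

Reporting "75% correct, 25% incorrect" keeps the metric tied to a decision the judge can actually make, and a rising `"unparseable"` rate is itself a signal that the judge prompt needs attention.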

Router evaluation: two independent dimensions

The router is evaluated on two dimensions that fail independently:

Tool selection (function-calling choice) — did the agent pick the correct tool? A string comparison: predicted_tool == expected_tool. [V] Verified
Parameter extraction — even with the right tool, did it extract the right argument values? A customer asks about order #12345 and the agent passes it as a tracking_id instead of an order_id — right tool, wrong parameter.

Key concept

Router accuracy is the product of two independent tests

EAA-3.2

Tool selection and parameter extraction are separate checks, so end-to-end router correctness is their product: $P(\text{router correct}) = P(\text{tool}) \times P(\text{params} \mid \text{tool})$ . Tracking them separately is what tells you where to invest — a strong tool score can hide a weak parameter score. The hard part: for ambiguous queries, ground-truth labels need careful annotation guidelines.
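The two checks can be sketched as a single scoring pass that reports each dimension separately. This is an illustrative sketch: the dictionary keys (`predicted_tool`, `expected_tool`, `predicted_params`, `expected_params`) are assumed field names, and parameters are compared by exact equality, which may be too strict for some argument types.

```python
def score_router(cases):
    """Score tool selection and parameter extraction separately.

    Each case is a dict with predicted/expected tool names and parameter dicts.
    Parameter accuracy is measured only on cases where the tool was correct,
    which matches P(params | tool) in the formula above.
    """
    total = len(cases)
    tool_ok = [c for c in cases if c["predicted_tool"] == c["expected_tool"]]
    both_ok = [c for c in tool_ok if c["predicted_params"] == c["expected_params"]]
    return {
        "tool": len(tool_ok) / total,
        "params_given_tool": len(both_ok) / len(tool_ok) if tool_ok else 0.0,
        "end_to_end": len(both_ok) / total,
    }
```

On any labelled set, `tool * params_given_tool` equals `end_to_end` by construction, so the report shows both the headline number and which factor is dragging it down.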

Multiplicative router accuracy and the bottleneck Worked example

Problem. A support agent selects the right tool 95% of the time and extracts the right parameters 85% of the time, on independent dimensions. (a) End-to-end router accuracy? (b) Could improving either dimension alone reach 95% end-to-end?

Reasoning.

(a) Independent dimensions multiply: $0.95 \times 0.85 = 0.8075 \approx \mathbf{80.8\%}$ .
(b) Tool alone can’t: even at a perfect tool score, $1.0 \times 0.85 = 0.85 < 0.95$ (solving $x \times 0.85 = 0.95$ gives $x = 1.118$ , impossible). Params alone can — but only at perfection: $0.95 \times x = 0.95 \Rightarrow x = 1.00$ , i.e. parameter extraction would have to hit a flawless 100%, which is fragile.

Answer. End-to-end is 80.8%, below either component — because every independent step must succeed, the product is bounded by the weaker factor. Improving tool selection alone can’t reach 95% (capped at 0.85); improving parameters alone reaches it only at a brittle 100%. To get there by raising both equally, each needs ≈ 97.5% ( $\sqrt{0.95} \approx 0.975$ ). The lesson: report the two dimensions separately and fix the bottleneck (here, parameter extraction at 85%) first.
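The arithmetic in this worked example can be checked in a few lines. The variable names below are illustrative only.

```python
import math

tool_acc, param_acc, target = 0.95, 0.85, 0.95

end_to_end = tool_acc * param_acc    # both steps must succeed
tool_needed = target / param_acc     # tool accuracy required if params stay at 85%
param_needed = target / tool_acc     # param accuracy required if tool stays at 95%
equal_needed = math.sqrt(target)     # accuracy required if both are raised equally

print(f"end-to-end accuracy:        {end_to_end:.4f}")
print(f"tool accuracy needed alone: {tool_needed:.3f} (above 1.0, so unreachable)")
print(f"param accuracy needed alone: {param_needed:.2f}")
print(f"each needed if raised equally: {equal_needed:.3f}")
```

A required accuracy above 1.0 is the numeric form of "improving this dimension alone cannot reach the target".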

A router selects the right tool but passes the order ID where a product ID was expected. Which evaluation dimension catches this, and why isn't tool-selection accuracy enough? EAA-3.2

Parameter extraction catches it — the tool was correct, but an argument value was wrong. Tool-selection accuracy only checks which tool was chosen, so it would score this case as a pass and hide the bug. Because the two dimensions fail independently and multiply, you must measure them separately or a strong tool score masks a weak parameter score.

Skill evaluation by output type

The right technique follows the output type of each skill:

Database lookup (SQL) → code-based: execute the generated query and compare result sets to ground truth. (Or LLM-judged for query-logic assessment.)
Data analysis (prose) → LLM-as-a-judge: the output is qualitative, no single correct string; use rails like ["correct", "incorrect"].
Data visualisation (chart code) → code-based: run the generated code in a sandbox; if it executes, mark it runnable.
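For the chart-code case, a "does it run" check might look like the sketch below, which executes the generated code in a separate Python process with a timeout. `is_runnable` is a hypothetical helper, and a child process is an assumption made for brevity: it isolates crashes and hangs but is not a security sandbox, so untrusted code would still need a container or similar.

```python
import subprocess
import sys


def is_runnable(code: str, timeout_s: float = 10.0) -> dict:
    """Run generated code in a child Python process and report whether it exits cleanly."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"runnable": False, "reason": "timeout"}
    if proc.returncode != 0:
        # The last stderr line is usually the exception type and message.
        lines = proc.stderr.strip().splitlines()
        return {"runnable": False, "reason": lines[-1] if lines else "non-zero exit"}
    return {"runnable": True, "reason": "executed"}
```

Note that "runnable" only says the code executed without error. Whether the chart shows the right data is a separate question that this check does not answer.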

You must evaluate a chatbot's tone for empathy. Which technique, and why not code-based? EAA-3.6

LLM-as-a-judge with discrete rails (e.g. empathetic / neutral / dismissive). Empathy is qualitative — there’s no string to exact-match or assert on — so code-based evaluation can’t capture it, and human annotation, while most accurate, is too slow to run on every change. An LLM judge with a small label set scales while handling the nuance.

Implementing evaluations in Phoenix

The Phoenix workflow: export spans via SpanQuery, run an LLM judge with llm_classify() constrained to rails, wrap judge calls in suppress_tracing() so the evaluation’s own LLM calls don’t appear in your agent traces, then upload results with log_evaluations().

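import phoenix as px
from openinference.instrumentation import suppress_tracing  # from the openinference-instrumentation package
from phoenix.evals import llm_classify
from phoenix.trace import SpanEvaluations

# spans_df, eval_template and judge_model are assumed to be defined earlier
with suppress_tracing():                      # keep judge spans out of agent traces
    results = llm_classify(
        dataframe=spans_df,
        template=eval_template,
        model=judge_model,
        rails=["correct", "incorrect"],        # discrete labels, not a 1–10 score
    )

# log_evaluations takes an evaluations object; eval_name is an illustrative label
px.Client().log_evaluations(SpanEvaluations(eval_name="correctness", dataframe=results))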

Key concept

Suppress tracing on judge calls

EAA-3.3

Wrap LLM-as-a-judge calls in suppress_tracing() so the evaluator’s LLM calls don’t intermingle with the agent’s spans and corrupt trace analysis — a 500-span eval run would otherwise roughly double the trace store with evaluation artifacts. The exception: if you’re debugging the judge itself, temporarily remove suppression to see its calls. [V] Verified

A code-based SQL evaluator Worked example

Problem. Your data_lookup skill generates SQL. You have 200 test cases, each with a question, the expected result rows, and expected columns. Sketch a code-based evaluator and name the edge cases — especially why row ordering matters.

Reasoning. Execute the SQL, then compare rows order-independently but duplicate-aware (sort them — a raw list comparison fails on reordering, and a plain set would mask duplicate-row bugs); capture three distinct failure modes — execution error, wrong columns, wrong rows.

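import sqlite3

def eval_sql(generated_sql, expected_rows, expected_cols, db_path) -> dict:
    """Execute generated SQL and score it 1 or 0 against the expected columns and rows."""
    try:
        # Read-only URI so a generated DROP or DELETE cannot modify the test database.
        conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
        try:
            cur = conn.execute(generated_sql)
            actual_cols = [d[0] for d in cur.description]
            actual_rows = cur.fetchall()
        finally:
            conn.close()
    except Exception as e:
        return {"score": 0, "reason": f"execution error: {e}"}
    if set(actual_cols) != set(expected_cols):
        return {"score": 0, "reason": "wrong columns"}
    # Sort with key=repr: order-independent, duplicate-aware, and safe when rows contain NULLs.
    actual = sorted((tuple(r) for r in actual_rows), key=repr)
    expected = sorted((tuple(r) for r in expected_rows), key=repr)
    if actual != expected:
        return {"score": 0, "reason": "wrong rows"}
    return {"score": 1, "reason": "correct"}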

Answer. Compare columns as sets and rows as sorted multisets (order-independent, duplicate-aware); handle (1) row ordering — SQL order is non-deterministic without ORDER BY, so list comparison fails logically correct queries; (2) floating-point columns — compare within ±ε; (3) timeouts — wrap execution to catch full-table scans; (4) destructive queries — run read-only or in a rolled-back transaction so a generated DROP TABLE can’t do damage. Code-based SQL eval is deterministic and free, but the comparison logic is where correctness lives.
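Edge case (2), floating-point columns, is not handled by a plain sorted comparison. One possible approach is sketched below: match rows one by one, consuming each matched expected row so duplicates still count, and compare numeric cells within a tolerance. `rows_match` and the default `eps` are assumptions for illustration. The matching is quadratic in the number of rows, which is usually acceptable for test-case-sized result sets.

```python
import math


def _cells_equal(a, b, eps):
    """Compare two cells, using a tolerance when either side is a float."""
    numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
    if numeric and (isinstance(a, float) or isinstance(b, float)):
        return math.isclose(a, b, abs_tol=eps)
    return a == b


def rows_match(actual_rows, expected_rows, eps=1e-6):
    """Order-independent, duplicate-aware row comparison with float tolerance."""
    if len(actual_rows) != len(expected_rows):
        return False
    remaining = list(expected_rows)
    for row in actual_rows:
        for i, candidate in enumerate(remaining):
            if len(row) == len(candidate) and all(
                _cells_equal(a, b, eps) for a, b in zip(row, candidate)
            ):
                del remaining[i]  # consume the match so duplicates are counted
                break
        else:
            return False  # no unmatched expected row equals this actual row
    return True
```

With this helper, a query that returns `0.1 + 0.2` where the ground truth stores `0.3` is scored as correct, while a missing or duplicated row is still caught.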

A cautionary tale

Vignette

The Router That Cheated

Your data-analysis agent has three skills: data_lookup (SQL), data_analysis (markdown), and data_visualisation (chart code). After a prompt update “to improve routing,” the router’s tool-selection score jumps from 82% to 96% and the team celebrates, until a manual audit shows the router now sends 70% of queries to data_analysis, including ones that clearly need a SQL lookup. The “improvement” appears because data_analysis produces plausible-looking output for any query, and the end-to-end eval scores that as acceptable.

Why the end-to-end eval missed it. It measured output plausibility, not process correctness. Because data_analysis can emit a believable answer to anything, the eval rewarded the router for choosing the skill with the highest surface-level pass rate — not the semantically correct one. “Plausible output” is not “correct routing.”

The fix. Evaluate routing directly: a labelled set of 100+ queries with ground-truth skill assignments, scored on tool-selection exact-match against those labels (threshold ≥ 85%). This isolates routing correctness from downstream output quality. A cheap routing-distribution alert (“any skill > 50% of queries”) catches dramatic collapse — but its threshold is arbitrary: a legitimate workload might genuinely be 60% analysis, and without ground-truth labels the metric can’t tell correct skew from over-routing.
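Both parts of the fix, exact-match against ground-truth labels and the cheap distribution alert, can be sketched in one report. `routing_report` is a hypothetical helper. The 0.85 and 0.50 defaults come from the thresholds quoted above, and as noted the share threshold is arbitrary.

```python
from collections import Counter


def routing_report(predicted, expected, min_accuracy=0.85, max_share=0.50):
    """Exact-match routing accuracy against labels, plus a skew alert on predicted skills."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("predicted and expected must be non-empty and the same length")
    accuracy = sum(p == e for p, e in zip(predicted, expected)) / len(expected)
    shares = {skill: n / len(predicted) for skill, n in Counter(predicted).items()}
    return {
        "accuracy": accuracy,
        "passes": accuracy >= min_accuracy,
        "shares": shares,
        # Skills receiving more than max_share of queries; a hint, not proof of over-routing.
        "skewed_skills": sorted(s for s, share in shares.items() if share > max_share),
    }
```

The accuracy figure is the one to trust, because it is grounded in labels. The skew list only says where to look first.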

What does suppress_tracing() do in a Phoenix evaluation, and why is it needed? EAA-3.3

It stops the evaluator’s own LLM-judge calls from being recorded as spans, so judge calls don’t intermingle with the agent’s traces. Without it, a large eval run pollutes (and roughly doubles) the trace store with evaluation artifacts, corrupting trace analysis of the agent itself. You’d only drop it when deliberately debugging the judge.

Summary — what this sets up

You can now score the components: pick the technique by output type (code-based, LLM-judge with rails, or human), evaluate the router on tool selection and parameter extraction independently (their product is the real number), and keep judge calls out of the agent’s traces. But plausible output from each component doesn’t prove the whole path was right: in the Router That Cheated, every answer looked acceptable while the routing was wrong.

Chapter 4 — trajectory & structured evals: scoring the whole sequence of steps, not just isolated components.
Chapter 5 — LLM-as-a-judge & monitoring: making the judge trustworthy and watching it in production.