Part 1 Chapter 3 Last verified 2026-06-19

Component Evaluations

The three evaluation techniques (code-based, LLM-as-a-judge, human annotation) and how to pick among them, why LLM judges need discrete rails not numeric scores, router evaluation as two independent dimensions (tool selection × parameter extraction), per-skill evaluation by output type, and Phoenix evaluation with suppress_tracing.

On this page
  1. The three evaluation techniques
  2. Router evaluation: two independent dimensions
  3. Skill evaluation by output type
  4. Implementing evaluations in Phoenix
  5. A cautionary tale
  6. Summary — what this sets up

The three evaluation techniques

Every agent evaluation reduces to one of three techniques, trading accuracy against scalability:

  • Code-based evaluation — programmatic checks (regex, JSON validation, exact match, cosine similarity) that classify output as correct or incorrect. 100% reproducible and free, but only works for codifiable criteria.
  • LLM-as-a-judge — a separate LLM prompt classifies the output. Scalable and handles nuance, but never 100% accurate.
  • Human annotation — human labellers or end-user feedback. Most accurate, but slowest to scale and subject to selection bias.

| Output type | Technique | Why |
| --- | --- | --- |
| Quantitative / codifiable | Code-based | Deterministic, free, fast |
| Qualitative / subjective | LLM-as-a-judge | Scalable, handles nuance |
| Safety-critical | Human annotation | Accuracy worth the cost |

The shortcut: if you can write an assert, use code-based; if the output is prose or judgment, use an LLM judge with discrete labels; if a wrong answer is dangerous, pay for humans.
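To make the "if you can write an assert" rule concrete, here is a minimal sketch of three code-based checks: JSON validation, a regex check, and a normalised exact match. The function names, the required keys, and the `#12345`-style order ID format are hypothetical choices for illustration, not part of any particular framework.

```python
import json
import re


def check_valid_json(output: str, required_keys: set[str]) -> bool:
    """Pass if the output parses as a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def check_order_id_format(output: str) -> bool:
    """Pass if the output mentions an order ID shaped like '#12345' (assumed format)."""
    return re.search(r"#\d{5}\b", output) is not None


def check_exact_match(output: str, expected: str) -> bool:
    """Pass if output equals the expected string, ignoring case and outer whitespace."""
    return output.strip().lower() == expected.strip().lower()
```

Each check returns a plain boolean, so the same input always gets the same verdict and a run costs nothing beyond compute. That is the property the code-based row of the table relies on.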

Key concept

Discrete rails over raw numeric scores

EAA-3.4

For classification, an LLM judge should output discrete labels (rails such as ["correct", "incorrect"]) rather than an uncalibrated numeric score. An LLM cannot reliably distinguish an 83 from a 79 on a 100-point scale, and a 1–10 scale is calibrated no better, so averaging raw scores manufactures false precision out of noise. Use 2–4 labels. The limit: even discrete labels are imperfect, so safety-critical calls still need human annotation. [V] Verified

Why must an LLM judge use discrete labels rather than a 1–10 numeric score? EAA-3.4

Because an LLM can’t calibrate a fine-grained numeric scale. It can’t reliably tell an 83 from a 79 out of 100, and a 1–10 scale has the same problem at coarser resolution, so the numbers are noise dressed as precision, and averaging them aggregates that noise. Discrete rails (2–4 labels like correct/incorrect) ask the judge only what it can reliably do: classify. Numeric scoring produces confident-looking averages that don’t track real quality.
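One way to enforce rails on the consuming side is sketched below, under the assumption that the judge returns free text. The helper names `snap_to_rails` and `label_rates` and the `"unparseable"` bucket are hypothetical. Raw judge output is snapped onto an allowed label by exact match, and results are reported as label rates rather than an averaged score.

```python
from collections import Counter

RAILS = ["correct", "incorrect"]  # illustrative label set


def snap_to_rails(raw_output: str, rails: list[str]) -> str:
    """Map a judge's raw text onto exactly one rail, or 'unparseable'."""
    cleaned = raw_output.strip().lower().strip(".\"'")
    # Exact match on purpose: a substring test would find "correct" inside "incorrect".
    return cleaned if cleaned in rails else "unparseable"


def label_rates(labels: list[str]) -> dict[str, float]:
    """Share of each label in a run; an empty run yields an empty dict."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}
```

Tracking the `"unparseable"` rate separately is a design choice: a judge that drifts off the rails then shows up as its own number instead of silently skewing the pass rate.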

Router evaluation: two independent dimensions

The router is evaluated on two dimensions that fail independently:

  1. Tool selection (function-calling choice): did the agent pick the correct tool? A string comparison: predicted_tool == expected_tool. [V] Verified
  2. Parameter extraction: even with the right tool, did it extract the right argument values? A customer asks about order #12345 and the agent passes it as a tracking_id instead of an order_id, which is the right tool with the wrong parameter.

Key concept

Router accuracy is the product of two independent tests

EAA-3.2

Tool selection and parameter extraction are separate checks, so end-to-end router correctness is their product: $P(\text{router correct}) = P(\text{tool}) \times P(\text{params} \mid \text{tool})$. Tracking them separately is what tells you where to invest, because a strong tool score can hide a weak parameter score. The hard part: for ambiguous queries, ground-truth labels need careful annotation guidelines.

A router selects the right tool but passes the order ID where a product ID was expected. Which evaluation dimension catches this, and why isn't tool-selection accuracy enough? EAA-3.2

Parameter extraction catches it — the tool was correct, but an argument value was wrong. Tool-selection accuracy only checks which tool was chosen, so it would score this case as a pass and hide the bug. Because the two dimensions fail independently and multiply, you must measure them separately or a strong tool score masks a weak parameter score.
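The two dimensions and their product can be computed from a labelled set as in the sketch below. The `RouterCase` record and the `router_scores` function are hypothetical names, and parameter accuracy is measured only on cases where the tool was right, matching the conditional term in the product.

```python
from dataclasses import dataclass


@dataclass
class RouterCase:
    expected_tool: str
    predicted_tool: str
    expected_params: dict
    predicted_params: dict


def router_scores(cases: list[RouterCase]) -> dict[str, float]:
    """Return tool accuracy, parameter accuracy given a correct tool, and their product."""
    if not cases:
        raise ValueError("need at least one labelled case")
    tool_hits = [c for c in cases if c.predicted_tool == c.expected_tool]
    param_hits = [c for c in tool_hits if c.predicted_params == c.expected_params]
    p_tool = len(tool_hits) / len(cases)
    p_params = len(param_hits) / len(tool_hits) if tool_hits else 0.0
    return {
        "tool_selection": p_tool,
        "params_given_tool": p_params,
        "router_correct": p_tool * p_params,
    }


cases = [
    RouterCase("data_lookup", "data_lookup", {"order_id": "12345"}, {"order_id": "12345"}),
    # Right tool, wrong parameter: the ID was passed under the wrong key.
    RouterCase("data_lookup", "data_lookup", {"order_id": "12345"}, {"tracking_id": "12345"}),
    RouterCase("data_analysis", "data_analysis", {"metric": "sales"}, {"metric": "sales"}),
    RouterCase("data_lookup", "data_analysis", {"order_id": "777"}, {"metric": "sales"}),
]
scores = router_scores(cases)
```

In this toy set the tool is right in three of four cases and the parameters in two of those three, so the end-to-end figure is about one half even though tool selection alone looks respectable.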

Skill evaluation by output type

The right technique follows the output type of each skill:

  • Database lookup (SQL) → code-based: execute the generated query and compare result sets to ground truth. (Or LLM-judged for query-logic assessment.)
  • Data analysis (prose) → LLM-as-a-judge: the output is qualitative, no single correct string; use rails like ["correct", "incorrect"].
  • Data visualisation (chart code) → code-based: run the generated code in a sandbox; if it executes, mark it runnable.

You must evaluate a chatbot's tone for empathy. Which technique, and why not code-based? EAA-3.6

LLM-as-a-judge with discrete rails (e.g. empathetic / neutral / dismissive). Empathy is qualitative — there’s no string to exact-match or assert on — so code-based evaluation can’t capture it, and human annotation, while most accurate, is too slow to run on every change. An LLM judge with a small label set scales while handling the nuance.
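For the database-lookup skill listed above, "execute the generated query and compare result sets" might look like the following sketch. It assumes SQLite, a disposable in-memory database, and a reference query as the ground truth. The `results_match` helper and the `sales` table are hypothetical.

```python
import sqlite3
from collections import Counter


def results_match(conn: sqlite3.Connection, generated_sql: str, reference_sql: str) -> bool:
    """Pass if both queries return the same rows, ignoring row order and column names."""
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run is scored incorrect
    reference_rows = conn.execute(reference_sql).fetchall()
    # Counter compares rows as multisets, so duplicates still have to match.
    return Counter(generated_rows) == Counter(reference_rows)


# Throwaway database: generated SQL should never run against production data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("south", 20), ("north", 5)],
)
reference = "SELECT store, SUM(amount) FROM sales GROUP BY store"
```

Comparing rows rather than SQL text means two differently written queries pass as long as they return the same data, which is usually what matters for a lookup skill. If row order is part of the requirement, compare the lists directly instead.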

Implementing evaluations in Phoenix

The Phoenix workflow: export spans via SpanQuery, run an LLM judge with llm_classify() constrained to rails, wrap judge calls in suppress_tracing() so the evaluation’s own LLM calls don’t appear in your agent traces, then upload results with log_evaluations().

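import phoenix as px
from phoenix.evals import llm_classify
from phoenix.trace import SpanEvaluations, suppress_tracing  # import path may vary by Phoenix version

# spans_df, eval_template and judge_model are assumed to be defined earlier
with suppress_tracing():                      # keep judge spans out of agent traces
    results = llm_classify(
        dataframe=spans_df,
        template=eval_template,
        model=judge_model,
        rails=["correct", "incorrect"],        # discrete labels, not a 1–10 score
    )

# log_evaluations takes Evaluations objects, not a bare DataFrame; the eval name is illustrative
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Router correctness", dataframe=results)
)

Key concept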

Suppress tracing on judge calls

EAA-3.3

Wrap LLM-as-a-judge calls in suppress_tracing() so the evaluator’s LLM calls don’t intermingle with the agent’s spans and corrupt trace analysis — a 500-span eval run would otherwise roughly double the trace store with evaluation artifacts. The exception: if you’re debugging the judge itself, temporarily remove suppression to see its calls. [V] Verified

A cautionary tale

Vignette

The Router That Cheated

Your data-analysis agent has three skills: data_lookup (SQL), data_analysis (markdown), and data_visualisation (chart code). After a prompt update “to improve routing,” the router’s score on the end-to-end eval jumps from 82% to 96% and the team celebrates, until a manual audit shows the router now sends 70% of queries to data_analysis, including ones that clearly need a SQL lookup. The “improvement” happened because data_analysis produces plausible-looking output for any query, and the end-to-end eval scored that as acceptable.

Why the end-to-end eval missed it. It measured output plausibility, not process correctness. Because data_analysis can emit a believable answer to anything, the eval rewarded the router for choosing the skill with the highest surface-level pass rate — not the semantically correct one. “Plausible output” is not “correct routing.”

The fix. Evaluate routing directly: a labelled set of 100+ queries with ground-truth skill assignments, scored on tool-selection exact-match against those labels (threshold ≥ 85%). This isolates routing correctness from downstream output quality. A cheap routing-distribution alert (“any skill > 50% of queries”) catches dramatic collapse — but its threshold is arbitrary: a legitimate workload might genuinely be 60% analysis, and without ground-truth labels the metric can’t tell correct skew from over-routing.
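Both checks from the fix can be sketched in a few lines. The 85% and 50% thresholds come from the text above. The function names and the toy data are hypothetical, and a real labelled set would hold 100+ queries rather than ten.

```python
from collections import Counter

ACCURACY_THRESHOLD = 0.85  # from the text: tool-selection exact match >= 85%
SKEW_THRESHOLD = 0.50      # from the text: alert if any skill > 50% of queries


def routing_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Exact-match accuracy of predicted skills against ground-truth labels."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and the same length")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


def skewed_skills(predicted: list[str], threshold: float = SKEW_THRESHOLD) -> list[str]:
    """Skills whose share of routed queries exceeds the threshold (needs no labels)."""
    counts = Counter(predicted)
    return sorted(skill for skill, n in counts.items() if n / len(predicted) > threshold)


# Toy reproduction of the vignette: lookups are being over-routed to data_analysis.
expected = ["data_lookup"] * 4 + ["data_analysis"] * 3 + ["data_visualisation"] * 3
predicted = ["data_analysis"] * 7 + ["data_visualisation"] * 3

accuracy = routing_accuracy(predicted, expected)
passes = accuracy >= ACCURACY_THRESHOLD
alerts = skewed_skills(predicted)
```

Here the labelled check fails outright and the distribution alert flags data_analysis. Only the first check would still be meaningful on a workload that is legitimately dominated by one skill.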

What does suppress_tracing() do in a Phoenix evaluation, and why is it needed? EAA-3.3

It stops the evaluator’s own LLM-judge calls from being recorded as spans, so judge calls don’t intermingle with the agent’s traces. Without it, a large eval run pollutes (and roughly doubles) the trace store with evaluation artifacts, corrupting trace analysis of the agent itself. You’d only drop it when deliberately debugging the judge.

Summary — what this sets up

You can now score the components: pick the technique by output type (code-based, LLM-judge with rails, or human), evaluate the router on tool selection and parameter extraction independently (their product is the real number), and keep judge calls out of the agent’s traces. But components passing individually doesn’t prove the whole path was right — the Router That Cheated passed component quality while routing wrongly.

  • Chapter 4, trajectory & structured evals: scoring the whole sequence of steps, not just isolated components.
  • Chapter 5, LLM-as-a-judge & monitoring: making the judge trustworthy and watching it in production.