Glossary

67 terms.

Agent Variant (variant)

A version of the agent differing by one of the five improvement levers — prompt, tool definitions, router logic, skill structure, or model selection. The discipline is to change one lever at a time: alter the prompt and the model together and you can’t attribute the result to either. Variants are compared head-to-head in an experiment.

See also: Phoenix Experiment, Dashboard Comparison

Agreement Rate (judge agreement)

The fraction of cases where an LLM judge produces the same label as the ground-truth evaluator. It is the headline meta-evaluation metric — target ≥ 0.90, and below about 0.85 don’t ship the judge. Its blind spot: it hides which way the judge errs, so a high agreement rate can still sit on top of a dangerous false-positive (leniency) rate. Always report it alongside FPR.

See also: Meta-Evaluation, False-Positive Rate

AI Agent (agent, agentic system)

A software system that takes actions on behalf of a user using reasoning — deciding what to do next, which tool to call, and when to stop, rather than emitting a single fixed response. It composes three capabilities: reasoning (the LLM), routing (tool selection), and action (execution), then loops on the result. What makes an agent harder to evaluate than a plain LLM application is the third, agent layer: multi-step, dynamic paths whose success depends on choices made at run time.

See also: Router, Skill, Memory and State, Three-Layer Evaluation

Alert Threshold (monitoring threshold)

The deviation that fires a production alert: commonly more than about 10% from a metric’s baseline, set per metric (evaluator scores, convergence, LLM-calls/query, p50/p99 latency, cost/query). The tuning trade-off: too tight and the alert is noisy and ignored; too loose and a silent regression slips past. Thresholds turn the monitored time-series into actionable signals.
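
A minimal sketch of the per-metric check, assuming a relative deviation against a stored baseline. The metric names and readings are hypothetical; only the 10% figure comes from the entry above.

```python
def breaches_threshold(current: float, baseline: float, threshold: float = 0.10) -> bool:
    """Return True when `current` deviates from `baseline` by more than `threshold` (relative)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative deviation")
    return abs(current - baseline) / abs(baseline) > threshold


# Hypothetical baselines and current readings, one entry per monitored metric.
baselines = {"p99_latency_s": 4.0, "llm_calls_per_query": 3.0, "clarity_score": 0.90}
current = {"p99_latency_s": 4.2, "llm_calls_per_query": 3.6, "clarity_score": 0.78}

# Names of the metrics whose deviation exceeds the threshold.
alerts = [name for name in baselines if breaches_threshold(current[name], baselines[name])]
```

Here latency moves 5% and stays quiet, while LLM calls per query (+20%) and the clarity score (about -13%) both fire. In practice the threshold would be tuned per metric rather than shared.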

See also: Production Monitoring, Silent Regression

Automatic Instrumentation (auto-instrumentation, zero-code instrumentation)

Zero-code tracing: a library hooks the LLM API client and emits a span for every call without touching your code. It delivers instant LLM-level detail — prompts, completions, tokens, latency — but cannot see your router logic, tool dispatch, or business code, since those aren’t API calls. The fast first step; pair it with manual spans to fill the gaps.

See also: Manual Instrumentation, OpenTelemetry

Centralised Routing (central router)

A routing architecture where a single router makes every decision. Because all routing flows through one decision point, it is the easiest to trace and to evaluate — but that single point becomes a bottleneck for complex, branching flows. Contrast with distributed routing, which trades evaluation simplicity for flexibility.

See also: Distributed Routing, Router

CI Shipping Gate (CI gate, shipping gate)

A golden dataset wired into CI so every PR touching agent code triggers a run, and any regression blocks the merge. It encodes institutional knowledge that new engineers inherit automatically through the gate. Because agent runs are probabilistic, the gate runs the dataset multiple times with tolerance bands rather than demanding a single deterministic pass.
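
One way such a gate could be sketched, assuming the suite returns a single pass rate per run. The function name, the number of runs, the tolerance values, and the rule that the worst run may dip twice as far as the median are all illustrative assumptions, not a prescribed policy.

```python
import statistics
from typing import Callable


def gate_passes(
    run_suite: Callable[[], float],
    baseline: float,
    runs: int = 5,
    tolerance: float = 0.03,
) -> bool:
    """Run the golden dataset `runs` times and compare against `baseline` with a tolerance band."""
    scores = [run_suite() for _ in range(runs)]
    median_score = statistics.median(scores)
    worst_score = min(scores)
    # Assumed policy: the median must stay within `tolerance` of the baseline,
    # and no single run may fall more than twice that far below it.
    return median_score >= baseline - tolerance and worst_score >= baseline - 2 * tolerance


# Stand-in for real suite runs: replay fixed pass rates instead of calling an agent.
replayed_scores = iter([0.92, 0.90, 0.95, 0.91, 0.93])
merge_allowed = gate_passes(lambda: next(replayed_scores), baseline=0.92)
```

With these replayed scores the median is 0.92 and the worst run is 0.90, so the merge is allowed even though individual runs differ.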

See also: Golden Dataset, Golden-Dataset Evolution

Code-Based Evaluation (code-based eval, programmatic evaluation)

Programmatic checks — regex, JSON validation, exact match, cosine similarity, SQL result-set diff — that classify an output as correct or incorrect. It is 100% reproducible, free, and fast, but applies only to codifiable criteria. The rule of thumb: if you can write an assert, use code-based evaluation; reach for an LLM judge only when the output is prose or judgment.
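
Three such checks, sketched with the standard library only. The function names and the dollar-amount pattern are illustrative assumptions.

```python
import json
import re


def is_valid_json(output: str) -> bool:
    """JSON validation: does the output parse at all?"""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


def is_dollar_amount(output: str) -> bool:
    """Regex check: the whole output must be an amount such as $1,234.50."""
    return re.fullmatch(r"\$\d{1,3}(,\d{3})*(\.\d{2})?", output.strip()) is not None


def normalised_exact_match(output: str, expected: str) -> bool:
    """Exact match after trimming whitespace and ignoring case."""
    return output.strip().lower() == expected.strip().lower()
```

Each returns a plain boolean, so the same input always yields the same verdict and the checks can run on every experiment at no model cost.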

See also: LLM-as-a-Judge, Runnability, Cosine Similarity

Comprehensive, Not Exhaustive (comprehensive not exhaustive)

The dataset-design principle: cover every input type with 1–2 examples rather than piling 100 near-duplicates per category. It keeps each experiment fast to run and cheap to maintain — iteration speed beats volume — while still catching regressions across the behaviour space. Resisting the urge to add bulk is what keeps the evaluation loop tight.

See also: Evaluation Dataset, Input and Output Keys

Convergence Score (convergence)

A metric in $[0, 1]$ for the fraction of completed runs that follow the optimal, minimum-step path across similar queries (≥ 0.8 stable, 0.5–0.8 investigate, below 0.5 unreliable). It measures consistency, not correctness (1.0 can mean a consistently wrong shortcut), and is blind to waste present in every run (the waste enters the minimum). Always pair it with per-step correctness evaluators.
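
A small sketch of the calculation, under the assumption that a run counts as "on the optimal path" when its step count equals the minimum observed across the group. Real implementations may compare the paths themselves rather than their lengths.

```python
def convergence_score(step_counts: list[int]) -> float:
    """Fraction of completed runs whose step count equals the observed minimum."""
    if not step_counts:
        raise ValueError("need at least one completed run")
    optimal = min(step_counts)  # the minimum is taken from the runs themselves
    on_optimal = sum(1 for steps in step_counts if steps == optimal)
    return on_optimal / len(step_counts)


mostly_stable = convergence_score([3, 3, 3, 5, 3])  # four of five runs take 3 steps
uniformly_wasteful = convergence_score([4, 4, 4])   # every run takes the same 4 steps
```

The second call illustrates the blind spot: if every run carries the same unnecessary step, the minimum absorbs it and the score is still 1.0.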

See also: Trajectory, Wasted-Work Cost

Cosine Similarity (cosine distance)

A code-based similarity metric on two embedding vectors, $\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert}$, ranging from −1 to 1; a threshold (≈ 0.85) marks two outputs as “semantically equivalent.” It lets a deterministic, free check approximate meaning-level matching, which is useful when exact string match is too strict but an LLM judge is more than you need.
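
The formula translates directly into a few lines of standard-library Python. The vectors below are toy two-dimensional stand-ins for real embeddings, and the helper names are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def semantically_equivalent(a: Sequence[float], b: Sequence[float], threshold: float = 0.85) -> bool:
    """Apply the similarity threshold to turn the score into a pass/fail label."""
    return cosine_similarity(a, b) >= threshold
```

The threshold is a tuning parameter: 0.85 is the approximate value given above, and the right cut-off depends on the embedding model in use.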

See also: Code-Based Evaluation

Dashboard Comparison (experiment dashboard, variant dashboard)

Laying agent variants out across multiple evaluator dimensions at once (routing, SQL, clarity, runnability, latency …) so trade-offs a single aggregate score would mask become visible — e.g. a variant that gains correctness while regressing clarity and latency. It is the core decision artifact for a ship call, provided the evaluators are fine-grained enough (binary pass/fail hides nuance).

See also: Agent Variant, Ship Decision

Distributed Routing (decentralised routing)

A routing architecture (LangGraph, OpenAI Swarm-style graphs) where routing logic is spread across multiple components. More flexible than centralised routing, but harder to evaluate: routing decisions occur at many points, each a separate place a wrong turn can originate, so evaluation coverage must reach every decision node rather than a single one.

See also: Centralised Routing, Router

Evaluation Dataset (eval dataset)

A structured collection of test cases with input keys (forwarded to the agent) and output keys (forwarded to evaluators for comparison). Built to be “comprehensive, not exhaustive” — 1–2 cases per input type — and fed from three sources: manually constructed (edge cases), model-generated (coverage), and production-sampled (highest ecological validity).

See also: Input and Output Keys, Comprehensive, Not Exhaustive

Evaluation Flywheel (production feedback flywheel, feedback flywheel)

The self-reinforcing loop: production traces → new test cases → updated evaluators → agent improvements → better production data, repeating. It is only as fast as its slowest stage, so trace-to-dataset promotion must be automated — teams that automate iterate ~10× faster. The failure mode: unfiltered noisy or adversarial traffic entering the dataset degrades it, so the automation needs a quality gate.

See also: Evaluation Dataset, Regression Detection

Evaluation Technique Selection (technique selection framework)

The decision framework for picking an evaluation technique by output type: code-based for quantitative/codifiable outputs (SQL, dollar amounts, runnability), LLM-as-a-judge for qualitative/subjective ones (prose, tone), and human annotation for safety-critical calls. It prevents over-spending on LLM judges where a deterministic assert would do — and under-evaluating prose where an assert can’t.

See also: Code-Based Evaluation, LLM-as-a-Judge, Human Annotation

Evaluation-Driven Development (eval-driven development, EDD)

A methodology where every code or prompt change triggers an evaluation cycle: trace execution, evaluate each component against its criteria, then iterate on the worst-scoring one. It is the LLM analogue of test-driven development. The defining shift is timing — evaluation is the primary feedback loop during development, not a gate cleared just before deployment — so regressions surface while they are still cheap to fix.

See also: LLM System Evaluation, Non-Determinism

Evaluator Composition (composed evaluators, multi-evaluator)

Running several evaluators over one experiment because no single one captures the full picture — e.g. function-calling correctness, SQL accuracy, clarity (LLM judge), entity correctness, and runnability together. Composition is what surfaces cross-dimension trade-offs on the dashboard, where one variant gains on correctness but loses on clarity or latency.

See also: Dashboard Comparison, Phoenix Experiment

False-Positive Rate (FPR, judge leniency)

For a quality-gating judge, the fraction of truly-wrong answers it labels “correct” — its leniency, and the dangerous error for a gate, because a lenient judge passes bad answers. A high agreement rate can still hide a high FPR, so cap it explicitly (e.g. FPR ≤ 0.10) alongside agreement. Its mirror is the false-negative rate (strictness) — splitting the two reveals asymmetric judge bias.
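
A sketch of how the split might be computed from paired labels, assuming the two rails are the strings "correct" and "incorrect". The function name and the sample data are hypothetical.

```python
def judge_metrics(judge_labels: list[str], truth_labels: list[str]) -> dict[str, float]:
    """Agreement rate, false-positive rate (leniency) and false-negative rate (strictness)."""
    pairs = list(zip(judge_labels, truth_labels))
    on_truly_wrong = [judge for judge, truth in pairs if truth == "incorrect"]
    on_truly_right = [judge for judge, truth in pairs if truth == "correct"]
    if not on_truly_wrong or not on_truly_right:
        raise ValueError("need at least one truly-correct and one truly-incorrect case")
    return {
        "agreement": sum(judge == truth for judge, truth in pairs) / len(pairs),
        # Truly-wrong answers the judge waved through.
        "fpr": sum(judge == "correct" for judge in on_truly_wrong) / len(on_truly_wrong),
        # Truly-right answers the judge rejected.
        "fnr": sum(judge == "incorrect" for judge in on_truly_right) / len(on_truly_right),
    }


# 20 cases: 16 truly correct, 4 truly incorrect. The judge passes 2 of the 4 bad answers.
truth = ["correct"] * 16 + ["incorrect"] * 4
judge = ["correct"] * 18 + ["incorrect"] * 2
metrics = judge_metrics(judge, truth)
```

On this sample the agreement rate is 0.90, which meets the usual target, while the FPR is 0.50: half of the bad answers get through the gate.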

See also: Agreement Rate, Meta-Evaluation

Feedback Poisoning (dataset poisoning)

When noisy or adversarial user feedback contaminates a self-improving pipeline’s dataset and few-shot examples. The trap is self-confirming: because the evaluator learns from the same poisoned data, the automated score can rise while real quality falls, so the metric endorses the degradation. Only independent signals — user surveys and the human-curated golden dataset — reveal the truth, which is why those firewalls are essential.

See also: Feedback Quality Filter, Self-Improving Agent

Feedback Quality Filter (feedback filter)

The defenses that keep noisy feedback out of a self-improving pipeline: an engagement-time threshold (discard sub-2-second clicks), cross-signal validation (a “helpful” click followed by an immediate re-query of the same topic is treated as noise), confidence weighting (trust users with a history over first-touch clicks), and a periodic human audit of a sample. Together they are the firewall between automated feedback and dataset quality.
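
The three automatable defenses could be sketched as a single weighting function. The `FeedbackEvent` fields, the trusted-history cut-off of five prior events, and the 0.5 weight for new users are illustrative assumptions; the periodic human audit sits outside the code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackEvent:
    label: str                           # "helpful" or "unhelpful"
    engagement_seconds: float            # time spent before the click
    topic: str                           # topic of the rated answer
    requery_topic: Optional[str] = None  # topic of an immediate follow-up query, if any
    prior_feedback_count: int = 0        # how much feedback history this user has


def feedback_weight(event: FeedbackEvent, min_engagement_s: float = 2.0, trusted_history: int = 5) -> float:
    """Return a weight in [0, 1]; 0.0 means the event is discarded."""
    # Engagement-time threshold: sub-2-second clicks are dropped.
    if event.engagement_seconds < min_engagement_s:
        return 0.0
    # Cross-signal validation: "helpful" followed by a re-query on the same topic is noise.
    if event.label == "helpful" and event.requery_topic == event.topic:
        return 0.0
    # Confidence weighting: users with a history count fully, first-touch clicks count less.
    return 1.0 if event.prior_feedback_count >= trusted_history else 0.5
```

Returning a weight rather than a boolean lets the downstream dataset builder either drop low-weight events or sample them less often.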

See also: Feedback Poisoning, Golden Dataset

Few-Shot Examples (few-shot judging)

3–5 labelled example judgments added to a judge prompt to fix calibration, especially on boundary cases (case-insensitive match, floating-point tolerance, polite-but-wrong responses). One of the four judge-improvement levers. In a self-improving pipeline these are drawn from collected production data — which is exactly why feedback poisoning can corrupt the judge through them.

See also: Judge Improvement, Feedback Poisoning

GenAI Semantic Conventions (gen_ai.* conventions, OTEL GenAI conventions)

OpenTelemetry’s own gen_ai.* semantic conventions for GenAI spans — a second LLM-trace attribute namespace that matured alongside Arize’s Open Inference (still experimental as of 2026). Their coexistence fragments agent traces across two vocabularies, so a team must confirm which namespace its backend and dashboards expect before standardising on attribute names that may still move.

See also: Open Inference, OpenTelemetry

Golden Dataset (golden set)

A curated set of must-pass scenarios used as a shipping gate: every agent change runs against it and a regression blocks deployment. It differs from a general eval set — curated for criticality (every row must pass), seeded with known failure modes codified from past production bugs, and grown continuously. Because agent runs are probabilistic, run it several times and take the median or worst score with tolerance bands.

See also: CI Shipping Gate, Golden-Dataset Evolution

Golden-Dataset Evolution (golden dataset maintenance)

The policy that keeps a golden dataset useful: add a regression row for every production incident and new feature, prune rows that never trigger a failure on a stable feature, never prune safety-critical rows, and stabilise the size (~40–60 rows) so it stays fast enough to run on every PR. The guiding signal: a gate that never fails is stale, not safe — it has stopped keeping pace with the system.

See also: Golden Dataset, CI Shipping Gate

Ground-Truth Label (gold label, reference label)

The known-correct answer a prediction is scored against — the expected tool for a query, the expected SQL result set, the reference summary judgment. It is the bottleneck for router and code-based evaluations on ambiguous inputs, where two annotators might disagree, so consistent labels require careful annotation guidelines (and often human annotation).

See also: Router Evaluation, Human Annotation

Human Annotation (human labelling, human evaluation)

Evaluation by human labellers or end-user feedback. The most accurate technique and the standard for safety-critical or genuinely ambiguous judgments, but the slowest to scale and subject to selection bias. It is the fallback when neither code-based checks nor an LLM judge is reliable enough, and the source of the ground-truth labels the other techniques are measured against.

See also: LLM-as-a-Judge, Ground-Truth Label

Input and Output Keys (input keys, output keys, dataset keys)

The two field types in an evaluation dataset. Input keys are forwarded to the agent (the query, expected parameters); output keys are forwarded to evaluators for comparison (expected SQL tables, expected trend direction). Including output keys wherever possible is what unlocks cheap code-based evaluators instead of paying for an LLM judge.

See also: Evaluation Dataset, Comprehensive, Not Exhaustive

Judge Improvement (improving LLM judge accuracy)

The four levers for raising a failing judge’s agreement rate, each targeting a different failure mode: prompt engineering (misunderstood criteria — ~80% of judge failures), few-shot examples (calibration on boundary cases), model selection (capability gaps), and semantic similarity (false negatives on paraphrases). Apply one at a time and re-run meta-evaluation after each — the agreement rate is the single metric that says whether it worked.

See also: Meta-Evaluation, Few-Shot Examples, Cosine Similarity

LLM Model Evaluation (model eval, benchmark evaluation)

Evaluation of an isolated LLM using standardised benchmarks — MMLU (knowledge), HumanEval (code), GSM8K (math). Model providers run it; it answers “how capable is this model in general?” It does not predict performance in a specific application, where the surrounding prompts, retrieval, and tools dominate the outcome. Use model evaluation to shortlist candidate models, never to decide whether your system actually works — that is the job of system evaluation.

See also: LLM System Evaluation, Three-Layer Evaluation

LLM System Evaluation (system eval, end-to-end evaluation)

Evaluation of your entire application’s end-to-end performance on real-world, domain-specific data. It answers the question that matters — “does my system solve the user’s actual problem?” — and must be re-run on every prompt or code change, because any of them can regress it. Its main cost is curating representative test data; the practical move is to start with a small synthetic test set and grow it as real traffic arrives.

See also: LLM Model Evaluation, Evaluation-Driven Development

llm_classify (Phoenix llm_classify)

Phoenix’s function for running an LLM judge over a dataframe of spans, constrained to a rails label set, returning one classification per row. In the standard workflow it is wrapped in suppress_tracing() (to keep judge calls out of the agent’s traces) and followed by log_evaluations() (to upload the results back onto the spans). It is the concrete form of LLM-as-a-judge in the Phoenix stack.

See also: LLM-as-a-Judge, Rails, Suppress Tracing

LLM-as-a-Judge (LLM judge)

A separate LLM prompt that classifies the agent’s output using discrete labels. It scales and handles nuance no assert can, but is never 100% accurate — and is only trustworthy when constrained to rails (a fixed label set), not asked for a numeric score. The workhorse for qualitative skill outputs; layer human annotation on top for safety-critical judgments.

See also: Rails, Human Annotation, Suppress Tracing

Manual Instrumentation (explicit instrumentation)

Explicit span creation in your own code, wrapping the agent run, router iterations, and tool calls that automatic instrumentation can’t observe. It gives full control over what gets traced and produces the structure-bearing spans; the cost is discipline — a missed with block leaves a hole in the trace tree. Follows the outside-in principle.

See also: Automatic Instrumentation, Outside-In Instrumentation, Span Hierarchy

Memory and State (agent memory, state)

The agent’s notebook — retrieved context, configuration, and the log of previous execution steps that later decisions depend on. Evaluating it asks whether the right information persists across steps: stale, missing, or corrupted state is a failure mode distinct from a wrong route or a bad skill output, and one that end-to-end scoring often misattributes to the skill that acted on the bad state.

See also: Router, Skill

Meta-Evaluation (evaluating the evaluator)

Evaluating an evaluator’s own accuracy by comparing its judgments against known ground truth — quantified as the agreement rate between an LLM judge and a deterministic (code-based) reference. Without it, a dashboard’s “92%” is meaningless if the judge itself is only 75% accurate. For subjective dimensions with no code oracle (tone, creativity), human inter-annotator agreement replaces the code-based reference.

See also: Agreement Rate, False-Positive Rate

Multiplicative Accuracy (compounding accuracy)

The rule that independent component success rates multiply into end-to-end accuracy: $P(\text{end-to-end}) = \prod_i P_i$. Two “looks-fine” components (say tool selection 0.95 and parameter extraction 0.85) yield a markedly lower whole ($0.95 \times 0.85 \approx 0.81$), and the product can never exceed the weakest factor. The practical consequence: report components separately and fix the bottleneck first.
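
The arithmetic as a short sketch, using the two rates from the entry. The function names are assumptions.

```python
import math


def end_to_end_accuracy(component_rates: dict[str, float]) -> float:
    """Multiply independent component success rates into one end-to-end rate."""
    return math.prod(component_rates.values())


def bottleneck(component_rates: dict[str, float]) -> str:
    """Name of the weakest component, which caps the product."""
    return min(component_rates, key=component_rates.get)


rates = {"tool_selection": 0.95, "parameter_extraction": 0.85}
overall = end_to_end_accuracy(rates)  # 0.95 * 0.85, roughly 0.81
weakest = bottleneck(rates)
```

Adding a third component at, say, 0.90 would pull the whole further down, which is why each stage is reported on its own line.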

See also: Router Evaluation, Tool Selection

Non-Determinism (non-deterministic behaviour, stochasticity)

The property that identical inputs can produce different outputs across runs — caused by sampling temperature, floating-point non-associativity, batching, and provider-side model updates. It is why a single passing run is a sample, not a proof: a change can raise average quality while introducing rare failures one run never surfaces. Evaluation must therefore be statistical over a representative dataset, not a binary pass/fail on one run. Setting temperature to 0 reduces but does not eliminate it.

See also: LLM System Evaluation, Evaluation-Driven Development

Observability (agent observability)

Complete visibility into every layer of an application — prompts, model responses, token usage, latency, and the function calls orchestrating behaviour. For non-deterministic agents it replaces the stack trace and print log, neither of which can capture a path that changes per run or a failure that is semantic (a wrong answer) rather than a crash.

See also: Trace, OpenTelemetry, Phoenix

Open Inference (OpenInference)

Arize’s LLM-specific extension to OpenTelemetry: it adds span kinds (Agent, Chain, Tool, LLM, and more — e.g. Retriever, Reranker) and attribute names for prompts, completions, and token counts. OTEL gives the generic trace/span machinery; Open Inference supplies the semantics that make an agent trace meaningful. As of 2026 it shares the space with OTEL’s own gen_ai.* conventions.

See also: OpenTelemetry, Span Kind, GenAI Semantic Conventions

OpenTelemetry (OTEL)

The vendor-neutral observability framework: a data model and instrumentation API for traces, spans, metrics, and logs. Its value is instrument once, export anywhere: spans emitted to the OTEL standard can be sent to any compatible backend. It tracks timing and metadata; LLM-specific semantics come from a layer on top (Open Inference).

See also: Open Inference, GenAI Semantic Conventions, Span

Outside-In Instrumentation (outside-in span nesting)

The principle for placing manual spans: start at the outermost layer (the agent run) and work inward — chain, then tool — letting nested with blocks build the correct span hierarchy automatically. Following it yields a trace whose nesting matches the architecture; ignoring it produces flat or mis-parented spans that are hard to navigate.
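
To illustrate the nesting without depending on a tracing backend, here is a toy stand-in tracer. It is not the OpenTelemetry or Phoenix API; the class, the span names, and the kind strings are hypothetical, and only the outside-in shape is the point.

```python
from contextlib import contextmanager


class ToyTracer:
    """Records (depth, kind, name) for each span in the order spans are opened."""

    def __init__(self) -> None:
        self.spans: list[tuple[int, str, str]] = []
        self._depth = 0

    @contextmanager
    def span(self, kind: str, name: str):
        self.spans.append((self._depth, kind, name))
        self._depth += 1  # anything opened inside this block becomes a child
        try:
            yield
        finally:
            self._depth -= 1  # leaving the block returns to the parent level


tracer = ToyTracer()
with tracer.span("AGENT", "run_agent"):                    # outermost layer first
    with tracer.span("CHAIN", "router_iteration_1"):       # then the router iteration
        with tracer.span("TOOL", "lookup_sales_data"):     # then the tool call
            with tracer.span("LLM", "generate_sql"):       # model call inside the tool
                pass

# Indented rendering of the recorded hierarchy.
tree = "\n".join("  " * depth + f"{kind} {name}" for depth, kind, name in tracer.spans)
```

Because each `with` block opens inside its parent, the parent-child links follow from the code layout alone. A real tracer's context-manager API is used in the same nested way.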

See also: Manual Instrumentation, Span Hierarchy

Parameter Extraction (argument extraction)

The router dimension measuring whether — having chosen the right tool — the agent supplied the right argument values (e.g. an order_id rather than a tracking_id; a date range in the correct order). It fails independently of tool selection, so a query can have the right tool and the wrong parameters. Because it’s independent, it must be evaluated as its own dimension.

See also: Tool Selection, Router Evaluation

Phoenix (Arize Phoenix)

Arize’s open-source observability tool: it receives OpenTelemetry traces, stores them by project, and provides trace-tree and timeline views, span-detail inspection, and evaluation integration — where instrumented spans become something you can read, query, and score. As of 2026 the package split tracing (arize-phoenix-otel) from evaluators (arize-phoenix-evals).

See also: OpenTelemetry, Trace

Phoenix Experiment (experiment, Experiments framework)

A structured, versioned evaluation run: execute a task function against every row of a dataset, record outputs, then apply evaluator functions. It is the agent analogue of a parametrized pytest run — dataset = test cases, task = test function, evaluators = assertions — and the unit you name, version, and compare across agent variants.
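
The dataset, task, and evaluator roles can be sketched in plain Python. This mirrors the shape of an experiment only and is not the Phoenix API; the rows, the keyword-based stand-in agent, and all function names are hypothetical.

```python
from typing import Callable

# Dataset = test cases: input keys go to the task, expected keys go to the evaluators.
dataset = [
    {"input": {"question": "total sales in March"}, "expected": {"tool": "lookup_sales_data"}},
    {"input": {"question": "plot sales by store"}, "expected": {"tool": "generate_chart"}},
]


def task(row_input: dict) -> dict:
    """Task = test function. A keyword rule stands in for the real agent call."""
    tool = "generate_chart" if "plot" in row_input["question"] else "lookup_sales_data"
    return {"tool": tool}


def tool_matches(output: dict, expected: dict) -> bool:
    """Evaluator = assertion: a code-based exact match on the chosen tool."""
    return output["tool"] == expected["tool"]


def run_experiment(rows: list[dict], task_fn: Callable, evaluators: dict[str, Callable]) -> list[dict]:
    """Run the task once per row, then apply every evaluator to its output."""
    results = []
    for row in rows:
        output = task_fn(row["input"])
        scores = {name: fn(output, row["expected"]) for name, fn in evaluators.items()}
        results.append({"input": row["input"], "output": output, "scores": scores})
    return results


results = run_experiment(dataset, task, {"tool_matches": tool_matches})
```

Swapping in a different `task` while keeping the dataset and evaluators fixed is what makes two variants comparable.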

See also: Evaluation Dataset, Task Function, Evaluator Composition

Production Monitoring (prod monitoring)

Watching a deployed agent with the same tracing and annotation tools used in development, plus time-series metrics: running evaluator scores per dimension, convergence, LLM-call counts (a spike signals a loop), latency (p50/p99 by component), and cost per query, alerting on deviations of more than roughly 10% from baseline. Its real job is catching silent regressions that surface-level metrics like latency and error rate miss.

See also: Silent Regression, Alert Threshold

Rails (discrete labels, classification rails)

The fixed set of discrete labels that constrain an LLM judge’s output — e.g. ["correct", "incorrect"] or 2–4 categories. Rails prevent free-form responses, enable automated parsing, and, most importantly, keep the judge doing what it does reliably (classify) rather than what it does not (calibrate a numeric scale). They are why LLM-as-a-judge is usable at all.
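
A sketch of the parsing side of rails, assuming the judge's raw text is available as a string. The helper name is hypothetical; libraries that implement rails handle this step internally.

```python
from typing import Optional, Sequence


def snap_to_rails(raw_output: str, rails: Sequence[str]) -> Optional[str]:
    """Return the rail the judge answered with, or None if the reply is off the rails."""
    # Tolerate surrounding whitespace, quotes, and trailing punctuation, but nothing more.
    cleaned = raw_output.strip().strip("\"'.!").lower()
    for rail in rails:
        if cleaned == rail.lower():
            return rail
    return None


rails = ["correct", "incorrect"]
parsed = snap_to_rails(" Correct.\n", rails)
unparseable = snap_to_rails("The answer is mostly correct", rails)
```

The match is on the whole reply rather than a substring on purpose: "incorrect" contains "correct", so a substring search would mislabel it. Unparseable replies are better counted separately than silently coerced into a label.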

See also: LLM-as-a-Judge, llm_classify

Regression Detection (regression monitoring)

Catching a quality drop by re-running evaluators over traces and comparing scores across versions of the agent. It depends on collected traces as a stable, replayable record: without them you’d re-gather data by hand, and a missing-span gap blinds the comparison exactly where coverage is thin. It is the production payoff of the tracing investment.

See also: Trace, Observability

Router (routing layer)

The agent component that decides which skill or function to call at each step — the agent’s “brain.” Implementations span rules-based code (deterministic, limited), an NLP classifier, and LLM function calling (flexible, non-deterministic). The router is usually the first thing to evaluate: routing errors cascade, wasting every downstream step, and are the cheapest failure to catch — build a labelled message→expected-skill dataset and measure per-category routing accuracy.

See also: Skill, Centralised Routing, Distributed Routing

Router Evaluation (router eval)

Scoring the router on its two independent dimensions — tool selection and parameter extraction — whose product is end-to-end router accuracy. Measuring them separately is what localises where to invest: a single combined number hides which dimension is the bottleneck. The hard part is ground-truth labels for ambiguous queries, which need careful annotation guidelines.
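
A sketch of scoring the two dimensions separately. It assumes parameter extraction is measured only on the cases where the tool was right, which makes the product identity hold exactly; the case records and tool names are hypothetical.

```python
def router_scores(cases: list[dict]) -> dict[str, float]:
    """Tool selection, parameter extraction (given the right tool), and end-to-end accuracy."""
    if not cases:
        raise ValueError("need at least one labelled case")
    tool_hits = [c for c in cases if c["predicted_tool"] == c["expected_tool"]]
    param_hits = [c for c in tool_hits if c["predicted_params"] == c["expected_params"]]
    return {
        "tool_selection": len(tool_hits) / len(cases),
        "parameter_extraction": len(param_hits) / len(tool_hits) if tool_hits else 0.0,
        "end_to_end": len(param_hits) / len(cases),
    }


cases = [
    {"predicted_tool": "lookup_order", "expected_tool": "lookup_order",
     "predicted_params": {"order_id": "A17"}, "expected_params": {"order_id": "A17"}},
    # Right tool, wrong argument: a tracking id was passed as the order id.
    {"predicted_tool": "lookup_order", "expected_tool": "lookup_order",
     "predicted_params": {"order_id": "T99"}, "expected_params": {"order_id": "A42"}},
    {"predicted_tool": "track_shipment", "expected_tool": "track_shipment",
     "predicted_params": {"tracking_id": "T99"}, "expected_params": {"tracking_id": "T99"}},
    # Wrong tool: parameters are not scored for this case.
    {"predicted_tool": "lookup_order", "expected_tool": "track_shipment",
     "predicted_params": {"order_id": "T55"}, "expected_params": {"tracking_id": "T55"}},
]
scores = router_scores(cases)
```

Here tool selection is 0.75, parameter extraction is 2/3, and end-to-end is 0.50. The combined figure alone would not say which of the two to work on.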

See also: Tool Selection, Parameter Extraction, Multiplicative Accuracy

Runnability (runnable check)

A code-based check that generated code executes without error in a sandbox. It is necessary but not sufficient for correctness: code that runs yet plots the wrong data still passes a runnability check. Treat “it didn’t crash” as a first gate, then layer a semantic check (an LLM judge or a result comparison) for whether the output is actually right.
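
A minimal sketch of such a check, assuming the generated code is Python. Running it in a child process with a timeout isolates crashes and hangs, but it is not a security sandbox; untrusted code would need a container or similar isolation. The function name is hypothetical.

```python
import subprocess
import sys


def is_runnable(code: str, timeout_s: float = 5.0) -> bool:
    """Return True when `code` exits cleanly in a separate Python process."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,   # keep the child's output off the console
            timeout=timeout_s,     # treat a hang as a failure
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


crashes = is_runnable("raise ValueError('boom')")
runs_but_wrong = is_runnable("total = sum([1, 2])\nprint(total * 0)")
```

The second snippet passes even though it prints 0 instead of the sum, which is the "necessary but not sufficient" gap in miniature.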

See also: Code-Based Evaluation, LLM-as-a-Judge

Self-Improving Agent (self-improving pipeline)

An automated pipeline where production feedback continuously updates the eval dataset, triggers CI/CD experiments, adds few-shot examples, and gates deployment via the golden dataset — improving the agent with minimal human intervention. Automation is the multiplier: each interaction makes the system incrementally better. The catch: without quality gates it can poison itself, so a human-curated golden dataset and feedback filters are mandatory.

See also: Feedback Poisoning, Golden Dataset

Ship Decision (shipping decision, variant promotion)

Deciding whether to promote a candidate variant by reading the whole per-evaluator row plus cost, latency, and statistical significance — never an aggregate score alone. A correctness gain that isn’t statistically significant, or that a latency SLO would reject, is not a ship. The dashboard supports the call; cost/latency/significance are the context that turns scores into a decision.

See also: Dashboard Comparison, Statistical Significance

Silent Regression (silent failure)

The most dangerous production failure: plausible-looking but wrong outputs that throw no error — for example, an agent fabricating an answer when a data source returns empty. Latency and error-rate dashboards stay green because nothing crashed, so only content-quality monitoring — evaluator scores over sampled live traces, plus provenance checks — catches it. It is the reason production monitoring needs output metrics, not just system metrics.

See also: Production Monitoring, Alert Threshold

Skill (agent skill, capability)

An individual capability within an agent — query a database, generate analysis, build a chart. A skill may contain several sub-steps (LLM calls, API calls, code), and each sub-step is its own evaluation target. An entire LLM application can itself be a single skill inside a larger agent, which is why agent decomposition recurses: the unit you evaluate depends on the level at which you draw the boundary.

See also: Router, Memory and State

Span (trace span)

One step within a trace — an LLM call, a tool invocation, or a routing decision — recording start time, end time, attributes (including token counts), and status. Spans nest hierarchically, and that nesting mirrors the agent’s architecture, so the set of spans in a trace is both the execution log and the structure map.

See also: Trace, Span Kind, Span Hierarchy

Span Hierarchy (span tree, trace tree)

The nesting of spans within a trace, which mirrors the agent’s architecture: Agent ⊃ Chain ⊃ Tool ⊃ LLM. Reading the hierarchy is reading the architecture diagram with live data, and it defines the debugging workflow — start at the symptom span and walk back through its parents. A poorly instrumented agent collapses to a flat hierarchy that hides where a failure occurred.

See also: Span Kind, Trace, Outside-In Instrumentation

Span Kind (OpenInference span kinds)

The role label on a span. Open Inference defines several kinds; the four most relevant to agents are Agent (the full run), Chain (a router iteration), Tool (a skill invocation), and LLM (a single model call) — others include Retriever, Reranker, and Embedding. These four give a trace its Agent ⊃ Chain ⊃ Tool ⊃ LLM shape, so a reader can tell a routing step from a tool call from a raw model call at a glance.

See also: Span, Span Hierarchy, Open Inference

Statistical Significance (significance)

Whether a measured score difference between variants is real or sampling noise — acutely relevant on the small “comprehensive, not exhaustive” datasets used for fast iteration. A higher aggregate score with no significance is not a reason to ship: gather more data or run an A/B test first. It is one of the contexts (with cost and latency) that an aggregate score needs before it becomes a decision.
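
One simple way to check this when both variants run on the same rows with pass/fail evaluators is an exact two-sided sign test on the rows where they disagree. This is a sketch of one possible test, not the only valid choice; the function name and sample data are hypothetical.

```python
from math import comb


def sign_test_p_value(baseline_pass: list[bool], candidate_pass: list[bool]) -> float:
    """Exact two-sided sign test over rows where the two variants disagree."""
    if len(baseline_pass) != len(candidate_pass):
        raise ValueError("both variants must be scored on the same rows")
    wins = sum(1 for b, c in zip(baseline_pass, candidate_pass) if c and not b)
    losses = sum(1 for b, c in zip(baseline_pass, candidate_pass) if b and not c)
    n = wins + losses
    if n == 0:
        return 1.0  # the variants never disagree, so there is no evidence either way
    k = min(wins, losses)
    # Probability of a split at least this lopsided if each disagreement were a coin flip.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# 20 rows: the candidate fixes 3 failures and breaks 1 previously passing row.
baseline = [True] * 14 + [False] * 3 + [True] * 1 + [False] * 2
candidate = [True] * 14 + [True] * 3 + [False] * 1 + [False] * 2
p_value = sign_test_p_value(baseline, candidate)
```

The candidate's pass rate rises from 0.75 to 0.85, yet the p-value is 0.625: with only four disagreeing rows, a 3-to-1 split is well within what chance produces.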

See also: Ship Decision, Dashboard Comparison

Suppress Tracing (suppress_tracing)

A Phoenix context manager that keeps an LLM judge’s own calls from being recorded as spans, so evaluation artifacts don’t intermingle with — and roughly double — the agent’s traces. You wrap llm_classify() (and any judge call) in it during evaluation runs, and drop it only when deliberately debugging the judge itself.

See also: LLM-as-a-Judge, llm_classify

Task Function (task)

In Phoenix Experiments, the function that takes one dataset row, runs the agent on it, and returns the output plus metadata — the agent analogue of a pytest test function. It is the bridge between the dataset (its input) and the evaluators (which score its output), and is run once per row when the experiment executes.

See also: Phoenix Experiment, Evaluation Dataset

Three-Layer Evaluation (model–application–agent layers, layer model)

A mental model that places any evaluation activity at one of three layers. Layer 1 (model) — fixed benchmarks on the isolated LLM, handled by the provider. Layer 2 (application) — your domain-specific, end-to-end prompts and retrieval. Layer 3 (agent) — multi-step routing with variable paths and state across steps. Naming the layer first is what stops you applying a model-layer benchmark to an agent-layer failure, the single most common mismatch in proposed evaluation strategies.

See also: LLM Model Evaluation, LLM System Evaluation, AI Agent

Tool Selection (function-calling choice)

The router dimension measuring whether the agent chose the correct tool for a query (function-calling choice), scored by exact match against a labelled target (predicted_tool == expected_tool). It is one of the two independent router dimensions — a high tool-selection score can still mask poor parameter extraction, so the two must be tracked separately.

See also: Parameter Extraction, Router Evaluation

Trace (agent trace)

The complete end-to-end record of a single agent run — a hierarchical tree of spans linked by one trace ID. Because it holds full inputs, outputs, and every intermediate step, a trace is the unit of replay, cost attribution, and evaluation at scale: you can run automated checks over thousands of traces to detect regressions and measure quality.

See also: Span, Span Hierarchy, Trace-Based Cost Attribution

Trace-Based Cost Attribution (per-skill cost attribution)

Computing per-skill (or per-component) LLM cost from the token counts recorded on each span: average calls × tokens-per-call, weighted by the skill’s query share × daily volume × price per token. It turns cost optimisation from guesswork into arithmetic, and routinely reveals a skill that is a small share of traffic yet a large share of spend — and even which sub-step within it to redirect to a cheaper model.
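
The formula as a sketch. The skill names, token counts, traffic shares, volume, and price below are hypothetical figures chosen to show the effect, not measurements.

```python
def daily_skill_cost(
    avg_calls: float,
    tokens_per_call: float,
    query_share: float,
    daily_volume: int,
    price_per_token: float,
) -> float:
    """Average calls x tokens per call, weighted by query share x daily volume x token price."""
    return avg_calls * tokens_per_call * query_share * daily_volume * price_per_token


DAILY_VOLUME = 10_000     # queries per day (assumed)
PRICE_PER_TOKEN = 2e-6    # dollars per token (assumed)

costs = {
    # 80% of traffic, one short LLM call per query.
    "lookup": daily_skill_cost(1, 500, 0.80, DAILY_VOLUME, PRICE_PER_TOKEN),
    # 20% of traffic, four long LLM calls per query.
    "analysis": daily_skill_cost(4, 3000, 0.20, DAILY_VOLUME, PRICE_PER_TOKEN),
}
analysis_share_of_spend = costs["analysis"] / sum(costs.values())
```

With these inputs the lookup skill costs about $8 a day and the analysis skill about $48, so a fifth of the traffic accounts for roughly 86% of the spend. In practice the averages would be aggregated from the token counts on each skill's spans.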

See also: Trace, Span

Trajectory (agent trajectory, step sequence)

The ordered sequence of all steps — router decisions, tool calls, LLM completions — from input to final response. It captures not just what the agent produced but how it got there. Two agents can reach the same correct answer by very different paths, so the trajectory carries the cost, latency, and failure-variance that the final answer alone hides — at scale, the path decides viability.

See also: Convergence Score, Wasted-Work Cost

Wasted-Work Cost (wasted work, non-optimal step cost)

The cost of non-optimal steps: $(\text{avg steps} - \text{optimal steps})$ per run, times daily volume, times price per step, expressed as a daily dollar figure. Converting trajectory inefficiency into money is what turns it from a curiosity into a shipping metric, and lets two variants be compared on efficiency, not just accuracy.
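
A short sketch of the calculation with hypothetical inputs; the step counts, volume, and per-step price are assumptions for illustration.

```python
def wasted_work_cost_per_day(
    avg_steps: float,
    optimal_steps: float,
    daily_volume: int,
    price_per_step: float,
) -> float:
    """(avg steps - optimal steps) per run x daily volume x price per step."""
    wasted_steps = max(0.0, avg_steps - optimal_steps)  # a run cannot waste negative steps
    return wasted_steps * daily_volume * price_per_step


# Two variants on the same workload: optimal path is 4 steps, 10,000 queries/day, $0.002 per step.
variant_a = wasted_work_cost_per_day(5.5, 4, 10_000, 0.002)
variant_b = wasted_work_cost_per_day(4.5, 4, 10_000, 0.002)
```

Variant A wastes about $30 a day and variant B about $10, a difference that would not show up if both reach the same final answers.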

See also: Trajectory, Convergence Score