Glossary
67 terms.
- Agent Variant (variant)
-
A version of the agent differing by one of the five improvement levers — prompt, tool definitions, router logic, skill structure, or model selection. The discipline is to change one lever at a time: alter the prompt and the model together and you can’t attribute the result to either. Variants are compared head-to-head in an experiment.
See also: Phoenix Experiment, Dashboard Comparison
- Agreement Rate (judge agreement)
-
The fraction of cases where an LLM judge produces the same label as the ground-truth evaluator. It is the headline meta-evaluation metric — target ≥ 0.90, and below about 0.85 don’t ship the judge. Its blind spot: it hides which way the judge errs, so a high agreement rate can still sit on top of a dangerous false-positive (leniency) rate. Always report it alongside FPR.
See also: Meta-Evaluation, False-Positive Rate
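The blind spot described above is easy to see on a toy example. The following is a minimal sketch, assuming binary labels "correct" and "incorrect"; the function name `judge_metrics` and the sample data are illustrative and not part of any library.

```python
from typing import Sequence


def judge_metrics(judge: Sequence[str], truth: Sequence[str],
                  positive: str = "correct") -> dict:
    """Agreement rate and false-positive rate of a judge against ground truth."""
    if len(judge) != len(truth) or not truth:
        raise ValueError("need two non-empty label lists of equal length")
    agree = sum(j == t for j, t in zip(judge, truth))
    # Judge labels on the truly-wrong cases (ground truth is not `positive`).
    on_wrong = [j for j, t in zip(judge, truth) if t != positive]
    false_pos = sum(j == positive for j in on_wrong)
    fpr = false_pos / len(on_wrong) if on_wrong else 0.0
    return {"agreement": agree / len(truth), "fpr": fpr}


# Hypothetical data: 16 truly correct answers, 4 truly incorrect ones.
truth = ["correct"] * 16 + ["incorrect"] * 4
# The judge is right on every correct answer but passes 2 of the 4 wrong ones.
judge = ["correct"] * 18 + ["incorrect"] * 2

metrics = judge_metrics(judge, truth)
print(metrics)
```

Here the judge agrees with ground truth on 18 of 20 cases, which meets the 0.90 target, yet it passes half of the truly wrong answers. Reporting both numbers exposes that.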
- AI Agent (agent, agentic system)
-
A software system that takes actions on behalf of a user using reasoning — deciding what to do next, which tool to call, and when to stop, rather than emitting a single fixed response. It composes three capabilities: reasoning (the LLM), routing (tool selection), and action (execution), then loops on the result. What makes an agent harder to evaluate than a plain LLM application is the third, agent layer: multi-step, dynamic paths whose success depends on choices made at run time.
See also: Router, Skill, Memory and State, Three-Layer Evaluation
- Alert Threshold (monitoring threshold)
-
The deviation that fires a production alert: commonly a change of more than about 10% from a metric’s baseline, set per metric (evaluator scores, convergence, LLM-calls/query, p50/p99 latency, cost/query). The tuning trade-off: too tight and the alert is noisy and ignored, too loose and a silent regression slips past. Thresholds turn the monitored time-series into actionable signals.
See also: Production Monitoring, Silent Regression
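A relative-deviation check of this kind can be sketched in a few lines. The metric names and numbers below are made up for illustration, and a single 10% threshold is assumed for every metric, whereas in practice it would be tuned per metric.

```python
def deviates(current: float, baseline: float, threshold: float = 0.10) -> bool:
    """True when `current` differs from `baseline` by more than `threshold` (relative)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative deviation")
    return abs(current - baseline) / abs(baseline) > threshold


# Hypothetical baselines and today's readings.
baselines = {"llm_calls_per_query": 3.0, "p99_latency_s": 4.0, "clarity_score": 0.90}
today = {"llm_calls_per_query": 4.2, "p99_latency_s": 4.1, "clarity_score": 0.78}

alerts = sorted(m for m in baselines if deviates(today[m], baselines[m]))
print(alerts)
```

The latency reading moved by 2.5% and stays quiet; the call count (+40%) and the clarity score (about −13%) both fire.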
- Automatic Instrumentation (auto-instrumentation, zero-code instrumentation)
-
Zero-code tracing: a library hooks the LLM API client and emits a span for every call without touching your code. It delivers instant LLM-level detail — prompts, completions, tokens, latency — but cannot see your router logic, tool dispatch, or business code, since those aren’t API calls. The fast first step; pair it with manual spans to fill the gaps.
See also: Manual Instrumentation, OpenTelemetry
- Centralised Routing (central router)
-
A routing architecture where a single router makes every decision. Because all routing flows through one decision point, it is the easiest to trace and to evaluate — but that single point becomes a bottleneck for complex, branching flows. Contrast with distributed routing, which trades evaluation simplicity for flexibility.
See also: Distributed Routing, Router
- CI Shipping Gate (CI gate, shipping gate)
-
A golden dataset wired into CI so every PR touching agent code triggers a run, and any regression blocks the merge. It encodes institutional knowledge that new engineers inherit automatically through the gate. Because agent runs are probabilistic, the gate runs the dataset multiple times with tolerance bands rather than demanding a single deterministic pass.
See also: Golden Dataset, Golden-Dataset Evolution
- Code-Based Evaluation (code-based eval, programmatic evaluation)
-
Programmatic checks (regex, JSON validation, exact match, cosine similarity, SQL result-set diff) that classify an output as correct or incorrect. It is 100% reproducible, free, and fast, but applies only to codifiable criteria. The rule of thumb: if you can write an assert, use code-based evaluation; reach for an LLM judge only when the output is prose or judgment.
See also: LLM-as-a-Judge, Runnability, Cosine Similarity
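As a rough illustration of the "if you can write an assert" rule, here are three small checks using only the standard library. The function names and the dollar-amount pattern are assumptions chosen for the example, not a prescribed evaluator set.

```python
import json
import re


def is_valid_json(output: str) -> bool:
    """Pass when the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def mentions_dollar_amount(output: str) -> bool:
    """Pass when the output contains something like $1,250 or $1,250.00."""
    return re.search(r"\$\d[\d,]*(?:\.\d{2})?", output) is not None


def exact_match(output: str, expected: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


print(is_valid_json('{"tool": "lookup_sales"}'))
print(mentions_dollar_amount("Revenue was $1,250.00 in March"))
print(exact_match(" SELECT 1 ", "select 1"))
```

Each check returns a plain boolean, so the same input always yields the same verdict and no model call is involved.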
- Comprehensive, Not Exhaustive (comprehensive not exhaustive)
-
The dataset-design principle: cover every input type with 1–2 examples rather than piling 100 near-duplicates per category. It keeps each experiment fast to run and cheap to maintain — iteration speed beats volume — while still catching regressions across the behaviour space. Resisting the urge to add bulk is what keeps the evaluation loop tight.
See also: Evaluation Dataset, Input and Output Keys
- Convergence Score (convergence)
-
A metric in [0, 1] for the fraction of completed runs that follow the optimal, minimum-step path across similar queries (≥ 0.8 stable, 0.5–0.8 investigate, below 0.5 unreliable). It measures consistency, not correctness: 1.0 can mean a consistently wrong shortcut. It is also blind to waste present in every run, because that waste is baked into the observed minimum. Always pair it with per-step correctness evaluators.
See also: Trajectory, Wasted-Work Cost
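A minimal sketch of the score as defined in this entry, assuming each completed run is summarised by its step count and that the shortest observed run stands in for the optimal path. The helper names and the sample runs are illustrative.

```python
def convergence_score(step_counts: list) -> float:
    """Fraction of completed runs that used the minimum observed number of steps."""
    if not step_counts:
        raise ValueError("need at least one completed run")
    optimal = min(step_counts)  # shortest observed path stands in for "optimal"
    return sum(n == optimal for n in step_counts) / len(step_counts)


def stability_band(score: float) -> str:
    """Map a score onto the bands quoted in the glossary entry."""
    if score >= 0.8:
        return "stable"
    if score >= 0.5:
        return "investigate"
    return "unreliable"


runs = [3, 3, 3, 5, 3, 4, 3, 3, 3, 3]  # hypothetical step counts for similar queries
score = convergence_score(runs)
print(score, stability_band(score))

# Blind spot: if every run wastes two steps, the minimum absorbs the waste.
print(convergence_score([5, 5, 5, 5]))
```

The second call returns 1.0 even though every run may contain unnecessary steps, which is why the score needs per-step correctness evaluators beside it.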
- Cosine Similarity (cosine distance)
-
A code-based similarity metric on two embedding vectors, $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, ranging from −1 to 1; a threshold (≈ 0.85) marks two outputs as “semantically equivalent.” It lets a deterministic, free check approximate meaning-level matching, which is useful when exact string match is too strict but an LLM judge is more than you need.
See also: Code-Based Evaluation
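The metric itself is a few lines of arithmetic. This sketch uses plain Python lists in place of real embedding vectors, and the 0.85 threshold is taken from the entry above; `semantically_equivalent` is a hypothetical helper name.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Dot product of a and b divided by the product of their Euclidean norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def semantically_equivalent(a: list, b: list, threshold: float = 0.85) -> bool:
    return cosine_similarity(a, b) >= threshold


# Toy vectors, not real embeddings.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal
print(semantically_equivalent([1.0, 0.0], [0.0, 1.0]))
```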
- Dashboard Comparison (experiment dashboard, variant dashboard)
-
Laying agent variants out across multiple evaluator dimensions at once (routing, SQL, clarity, runnability, latency …) so trade-offs a single aggregate score would mask become visible — e.g. a variant that gains correctness while regressing clarity and latency. It is the core decision artifact for a ship call, provided the evaluators are fine-grained enough (binary pass/fail hides nuance).
See also: Agent Variant, Ship Decision
- Distributed Routing (decentralised routing)
-
A routing architecture (LangGraph, OpenAI Swarm-style graphs) where routing logic is spread across multiple components. More flexible than centralised routing, but harder to evaluate: routing decisions occur at many points, each a separate place a wrong turn can originate, so evaluation coverage must reach every decision node rather than a single one.
See also: Centralised Routing, Router
- Evaluation Dataset (eval dataset)
-
A structured collection of test cases with input keys (forwarded to the agent) and output keys (forwarded to evaluators for comparison). Built to be “comprehensive, not exhaustive” — 1–2 cases per input type — and fed from three sources: manually constructed (edge cases), model-generated (coverage), and production-sampled (highest ecological validity).
See also: Input and Output Keys, Comprehensive, Not Exhaustive
- Evaluation Flywheel (production feedback flywheel, feedback flywheel)
-
The self-reinforcing loop: production traces → new test cases → updated evaluators → agent improvements → better production data, repeating. It is only as fast as its slowest stage, so trace-to-dataset promotion must be automated — teams that automate iterate ~10× faster. The failure mode: unfiltered noisy or adversarial traffic entering the dataset degrades it, so the automation needs a quality gate.
See also: Evaluation Dataset, Regression Detection
- Evaluation Technique Selection (technique selection framework)
-
The decision framework for picking an evaluation technique by output type: code-based for quantitative/codifiable outputs (SQL, dollar amounts, runnability), LLM-as-a-judge for qualitative/subjective ones (prose, tone), and human annotation for safety-critical calls. It prevents over-spending on LLM judges where a deterministic assert would do, and under-evaluating prose where an assert can’t.
See also: Code-Based Evaluation, LLM-as-a-Judge, Human Annotation
- Evaluation-Driven Development (eval-driven development, EDD)
-
A methodology where every code or prompt change triggers an evaluation cycle: trace execution, evaluate each component against its criteria, then iterate on the worst-scoring one. It is the LLM analogue of test-driven development. The defining shift is timing — evaluation is the primary feedback loop during development, not a gate cleared just before deployment — so regressions surface while they are still cheap to fix.
See also: LLM System Evaluation, Non-Determinism
- Evaluator Composition (composed evaluators, multi-evaluator)
-
Running several evaluators over one experiment because no single one captures the full picture — e.g. function-calling correctness, SQL accuracy, clarity (LLM judge), entity correctness, and runnability together. Composition is what surfaces cross-dimension trade-offs on the dashboard, where one variant gains on correctness but loses on clarity or latency.
See also: Dashboard Comparison, Phoenix Experiment
- False-Positive Rate (FPR, judge leniency)
-
For a quality-gating judge, the fraction of truly-wrong answers it labels “correct” — its leniency, and the dangerous error for a gate, because a lenient judge passes bad answers. A high agreement rate can still hide a high FPR, so cap it explicitly (e.g. FPR ≤ 0.10) alongside agreement. Its mirror is the false-negative rate (strictness) — splitting the two reveals asymmetric judge bias.
See also: Agreement Rate, Meta-Evaluation
- Feedback Poisoning (dataset poisoning)
-
When noisy or adversarial user feedback contaminates a self-improving pipeline’s dataset and few-shot examples. The trap is self-confirming: because the evaluator learns from the same poisoned data, the automated score can rise while real quality falls, so the metric endorses the degradation. Only independent signals — user surveys and the human-curated golden dataset — reveal the truth, which is why those firewalls are essential.
See also: Feedback Quality Filter, Self-Improving Agent
- Feedback Quality Filter (feedback filter)
-
The defenses that keep noisy feedback out of a self-improving pipeline: an engagement-time threshold (discard sub-2-second clicks), cross-signal validation (a “helpful” click followed by an immediate re-query of the same topic is treated as noise), confidence weighting (trust users with a history over first-touch clicks), and a periodic human audit of a sample. Together they are the firewall between automated feedback and dataset quality.
See also: Feedback Poisoning, Golden Dataset
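The first three defenses can be sketched as a single weighting function. The `Feedback` fields, the history cutoff of 5, and the weights 1.0 and 0.5 are all assumptions for illustration; only the 2-second threshold and the helpful-then-re-query rule come from the entry above. The periodic human audit is a process step and is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    label: str                  # "helpful" or "unhelpful"
    dwell_seconds: float        # time between seeing the answer and clicking
    requeried_same_topic: bool  # user immediately asked about the same topic again
    user_prior_feedback: int    # how many feedback events this user has given before


def feedback_weight(fb: Feedback) -> float:
    """Return 0.0 to discard a feedback event, otherwise a trust weight."""
    # Engagement-time threshold: sub-2-second clicks are discarded.
    if fb.dwell_seconds < 2.0:
        return 0.0
    # Cross-signal validation: "helpful" followed by a re-query is treated as noise.
    if fb.label == "helpful" and fb.requeried_same_topic:
        return 0.0
    # Confidence weighting: users with a history count more (cutoff is an assumption).
    return 1.0 if fb.user_prior_feedback >= 5 else 0.5


events = [
    Feedback("helpful", 0.8, False, 12),   # too fast: discarded
    Feedback("helpful", 9.0, True, 12),    # contradicted by re-query: discarded
    Feedback("helpful", 9.0, False, 12),   # trusted user
    Feedback("unhelpful", 6.0, False, 0),  # first-touch user
]
weights = [feedback_weight(e) for e in events]
print(weights)
```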
- Few-Shot Examples (few-shot judging)
-
3–5 labelled example judgments added to a judge prompt to fix calibration, especially on boundary cases (case-insensitive match, floating-point tolerance, polite-but-wrong responses). One of the four judge-improvement levers. In a self-improving pipeline these are drawn from collected production data — which is exactly why feedback poisoning can corrupt the judge through them.
See also: Judge Improvement, Feedback Poisoning
- GenAI Semantic Conventions (gen_ai.* conventions, OTEL GenAI conventions)
-
OpenTelemetry’s own gen_ai.* semantic conventions for GenAI spans: a second LLM-trace attribute namespace that matured alongside Arize’s Open Inference (still experimental as of 2026). Their coexistence fragments agent traces across two vocabularies, so a team must confirm which namespace its backend and dashboards expect before standardising on attribute names that may still move.
See also: Open Inference, OpenTelemetry
- Golden Dataset (golden set)
-
A curated set of must-pass scenarios used as a shipping gate: every agent change runs against it and a regression blocks deployment. It differs from a general eval set — curated for criticality (every row must pass), seeded with known failure modes codified from past production bugs, and grown continuously. Because agent runs are probabilistic, run it several times and take the median or worst score with tolerance bands.
See also: CI Shipping Gate, Golden-Dataset Evolution
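One way to express "run it several times and take the median or worst score with tolerance bands" is sketched below. The function name, the default of five runs, and the 0.03 tolerance are assumptions; `run_once` stands for whatever executes the golden dataset and returns a pass rate.

```python
from statistics import median
from typing import Callable


def gate_passes(run_once: Callable[[], float], baseline: float,
                runs: int = 5, tolerance: float = 0.03,
                use_worst: bool = False) -> bool:
    """Run the dataset `runs` times and compare median (or worst) score to the baseline."""
    scores = [run_once() for _ in range(runs)]
    observed = min(scores) if use_worst else median(scores)
    # The gate fails only when the score falls below the tolerance band.
    return observed >= baseline - tolerance


# Stand-in for a real golden-dataset run: replay five hypothetical pass rates.
def make_runner(values):
    it = iter(values)
    return lambda: next(it)


sample = [0.95, 0.90, 0.96, 0.94, 0.97]
median_ok = gate_passes(make_runner(sample), baseline=0.96)
worst_ok = gate_passes(make_runner(sample), baseline=0.96, use_worst=True)
print(median_ok, worst_ok)
```

With these numbers the median (0.95) sits inside the band while the worst run (0.90) does not, so the choice between median and worst score decides whether the merge is blocked.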
- Golden-Dataset Evolution (golden dataset maintenance)
-
The policy that keeps a golden dataset useful: add a regression row for every production incident and new feature, prune rows that never trigger a failure on a stable feature, never prune safety-critical rows, and stabilise the size (~40–60 rows) so it stays fast enough to run on every PR. The guiding signal: a gate that never fails is stale, not safe — it has stopped keeping pace with the system.
See also: Golden Dataset, CI Shipping Gate
- Ground-Truth Label (gold label, reference label)
-
The known-correct answer a prediction is scored against — the expected tool for a query, the expected SQL result set, the reference summary judgment. It is the bottleneck for router and code-based evaluations on ambiguous inputs, where two annotators might disagree, so consistent labels require careful annotation guidelines (and often human annotation).
See also: Router Evaluation, Human Annotation
- Human Annotation (human labelling, human evaluation)
-
Evaluation by human labellers or end-user feedback. The most accurate technique and the standard for safety-critical or genuinely ambiguous judgments, but the slowest to scale and subject to selection bias. It is the fallback when neither code-based checks nor an LLM judge are reliable enough — and the source of the ground-truth labels the other techniques are measured against.
See also: LLM-as-a-Judge, Ground-Truth Label
- Input and Output Keys (input keys, output keys, dataset keys)
-
The two field types in an evaluation dataset. Input keys are forwarded to the agent (the query, expected parameters); output keys are forwarded to evaluators for comparison (expected SQL tables, expected trend direction). Including output keys wherever possible is what unlocks cheap code-based evaluators instead of paying for an LLM judge.
See also: Evaluation Dataset, Comprehensive, Not Exhaustive
- Judge Improvement (improving LLM judge accuracy)
-
The four levers for raising a failing judge’s agreement rate, each targeting a different failure mode: prompt engineering (misunderstood criteria — ~80% of judge failures), few-shot examples (calibration on boundary cases), model selection (capability gaps), and semantic similarity (false negatives on paraphrases). Apply one at a time and re-run meta-evaluation after each — the agreement rate is the single metric that says whether it worked.
See also: Meta-Evaluation, Few-Shot Examples, Cosine Similarity
- LLM Model Evaluation (model eval, benchmark evaluation)
-
Evaluation of an isolated LLM using standardised benchmarks — MMLU (knowledge), HumanEval (code), GSM8K (math). Model providers run it; it answers “how capable is this model in general?” It does not predict performance in a specific application, where the surrounding prompts, retrieval, and tools dominate the outcome. Use model evaluation to shortlist candidate models, never to decide whether your system actually works — that is the job of system evaluation.
See also: LLM System Evaluation, Three-Layer Evaluation
- LLM System Evaluation (system eval, end-to-end evaluation)
-
Evaluation of your entire application’s end-to-end performance on real-world, domain-specific data. It answers the question that matters — “does my system solve the user’s actual problem?” — and must be re-run on every prompt or code change, because any of them can regress it. Its main cost is curating representative test data; the practical move is to start with a small synthetic test set and grow it as real traffic arrives.
See also: LLM Model Evaluation, Evaluation-Driven Development
- llm_classify (Phoenix llm_classify)
-
Phoenix’s function for running an LLM judge over a dataframe of spans, constrained to a rails label set, returning one classification per row. In the standard workflow it is wrapped in suppress_tracing() (to keep judge calls out of the agent’s traces) and followed by log_evaluations() (to upload the results back onto the spans). It is the concrete form of LLM-as-a-judge in the Phoenix stack.
See also: LLM-as-a-Judge, Rails, Suppress Tracing
- LLM-as-a-Judge (LLM judge)
-
A separate LLM prompt that classifies the agent’s output using discrete labels. It scales and handles nuance no assert can, but is never 100% accurate, and is only trustworthy when constrained to rails (a fixed label set), not asked for a numeric score. The workhorse for qualitative skill outputs; layer human annotation on top for safety-critical judgments.
See also: Rails, Human Annotation, Suppress Tracing
- Manual Instrumentation (explicit instrumentation)
-
Explicit span creation in your own code, wrapping the agent run, router iterations, and tool calls that automatic instrumentation can’t observe. It gives full control over what gets traced and produces the structure-bearing spans; the cost is discipline, since a missed with block leaves a hole in the trace tree. Follows the outside-in principle.
See also: Automatic Instrumentation, Outside-In Instrumentation, Span Hierarchy
- Memory and State (agent memory, state)
-
The agent’s notebook — retrieved context, configuration, and the log of previous execution steps that later decisions depend on. Evaluating it asks whether the right information persists across steps: stale, missing, or corrupted state is a failure mode distinct from a wrong route or a bad skill output, and one that end-to-end scoring often misattributes to the skill that acted on the bad state.
- Meta-Evaluation (evaluating the evaluator)
-
Evaluating an evaluator’s own accuracy by comparing its judgments against known ground truth — quantified as the agreement rate between an LLM judge and a deterministic (code-based) reference. Without it, a dashboard’s “92%” is meaningless if the judge itself is only 75% accurate. For subjective dimensions with no code oracle (tone, creativity), human inter-annotator agreement replaces the code-based reference.
See also: Agreement Rate, False-Positive Rate
- Multiplicative Accuracy (compounding accuracy)
-
The rule that independent component success rates multiply into end-to-end accuracy: $A_{\text{end-to-end}} = \prod_i a_i$. Two “looks-fine” components, say tool selection 0.95 and parameter extraction 0.85, yield a markedly lower whole ($0.95 \times 0.85 \approx 0.81$), and the product can never exceed the weakest factor. The practical consequence: report components separately and fix the bottleneck first.
See also: Router Evaluation, Tool Selection
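The arithmetic in this entry can be checked directly. The component names below mirror the example in the text; the function name is illustrative.

```python
import math


def end_to_end_accuracy(components: dict) -> float:
    """Product of independent component success rates."""
    return math.prod(components.values())


components = {"tool_selection": 0.95, "parameter_extraction": 0.85}
overall = end_to_end_accuracy(components)
bottleneck = min(components, key=components.get)

print(round(overall, 4))  # 0.95 * 0.85
print(bottleneck)         # the component to fix first
```

Adding a third component at, say, 0.90 would pull the product down further, which is why the separate per-component numbers are more useful than the single combined one.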
- Non-Determinism (non-deterministic behaviour, stochasticity)
-
The property that identical inputs can produce different outputs across runs — caused by sampling temperature, floating-point non-associativity, batching, and provider-side model updates. It is why a single passing run is a sample, not a proof: a change can raise average quality while introducing rare failures one run never surfaces. Evaluation must therefore be statistical over a representative dataset, not a binary pass/fail on one run. Setting temperature to 0 reduces but does not eliminate it.
See also: LLM System Evaluation, Evaluation-Driven Development
- Observability (agent observability)
-
Complete visibility into every layer of an application: prompts, model responses, token usage, latency, and the function calls orchestrating behaviour. For non-deterministic agents it replaces the stack trace and print log, neither of which can capture a path that changes per run or a failure that is semantic (a wrong answer) rather than a crash.
See also: Trace, OpenTelemetry, Phoenix
- Open Inference (OpenInference)
-
Arize’s LLM-specific extension to OpenTelemetry: it adds span kinds (Agent, Chain, Tool, LLM, and more, e.g. Retriever, Reranker) and attribute names for prompts, completions, and token counts. OTEL gives the generic trace/span machinery; Open Inference supplies the semantics that make an agent trace meaningful. As of 2026 it shares the space with OTEL’s own gen_ai.* conventions.
See also: OpenTelemetry, Span Kind, GenAI Semantic Conventions
- OpenTelemetry (OTEL)
-
The vendor-neutral observability framework: a data model and instrumentation API for traces, spans, metrics, and logs. Its value is “instrument once, export anywhere”: spans emitted to the OTEL standard can be sent to any compatible backend. It tracks timing and metadata; LLM-specific semantics come from a layer on top (Open Inference).
See also: Open Inference, GenAI Semantic Conventions, Span
- Outside-In Instrumentation (outside-in span nesting)
-
The principle for placing manual spans: start at the outermost layer (the agent run) and work inward (chain, then tool), letting nested with blocks build the correct span hierarchy automatically. Following it yields a trace whose nesting matches the architecture; ignoring it produces flat or mis-parented spans that are hard to navigate.
See also: Manual Instrumentation, Span Hierarchy
- Parameter Extraction (argument extraction)
-
The router dimension measuring whether, having chosen the right tool, the agent supplied the right argument values (e.g. an order_id rather than a tracking_id; a date range in the correct order). It fails independently of tool selection, so a query can have the right tool and the wrong parameters. Because it’s independent, it must be evaluated as its own dimension.
See also: Tool Selection, Router Evaluation
- Phoenix (Arize Phoenix)
-
Arize’s open-source observability tool: it receives OpenTelemetry traces, stores them by project, and provides trace-tree and timeline views, span-detail inspection, and evaluation integration. It is where instrumented spans become something you can read, query, and score. As of 2026 the package split tracing (arize-phoenix-otel) from evaluators (arize-phoenix-evals).
See also: OpenTelemetry, Trace
- Phoenix Experiment (experiment, Experiments framework)
-
A structured, versioned evaluation run: execute a task function against every row of a dataset, record outputs, then apply evaluator functions. It is the agent analogue of a parametrized pytest run (dataset = test cases, task = test function, evaluators = assertions) and the unit you name, version, and compare across agent variants.
See also: Evaluation Dataset, Task Function, Evaluator Composition
- Production Monitoring (prod monitoring)
-
Watching a deployed agent with the same tracing and annotation tools used in development, plus time-series metrics: running evaluator scores per dimension, convergence, LLM-call counts (a spike signals a loop), latency (p50/p99 by component), and cost per query, alerting when a metric deviates by more than roughly 10% from baseline. Its real job is catching silent regressions that surface-level metrics such as latency and error rate miss.
See also: Silent Regression, Alert Threshold
- Rails (discrete labels, classification rails)
-
The fixed set of discrete labels that constrain an LLM judge’s output, e.g. ["correct", "incorrect"] or 2–4 categories. Rails prevent free-form responses, enable automated parsing, and, most importantly, keep the judge doing what it does reliably (classify) rather than what it does not (calibrate a numeric scale). They are why LLM-as-a-judge is usable at all.
See also: LLM-as-a-Judge, llm_classify
- Regression Detection (regression monitoring)
-
Catching a quality drop by re-running evaluators over traces and comparing scores across versions of the agent. It depends on collected traces as a stable, replayable record: without them you’d re-gather data by hand, and a missing-span gap blinds the comparison exactly where coverage is thin. It is the production payoff of the tracing investment.
See also: Trace, Observability
- Router (routing layer)
-
The agent component that decides which skill or function to call at each step — the agent’s “brain.” Implementations span rules-based code (deterministic, limited), an NLP classifier, and LLM function calling (flexible, non-deterministic). The router is usually the first thing to evaluate: routing errors cascade, wasting every downstream step, and are the cheapest failure to catch — build a labelled message→expected-skill dataset and measure per-category routing accuracy.
See also: Skill, Centralised Routing, Distributed Routing
- Router Evaluation (router eval)
-
Scoring the router on its two independent dimensions — tool selection and parameter extraction — whose product is end-to-end router accuracy. Measuring them separately is what localises where to invest: a single combined number hides which dimension is the bottleneck. The hard part is ground-truth labels for ambiguous queries, which need careful annotation guidelines.
See also: Tool Selection, Parameter Extraction, Multiplicative Accuracy
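A small sketch of scoring the two dimensions separately. The case schema (`expected_tool`, `predicted_args`, and so on) and the tool names are hypothetical. One design choice is assumed here: parameter extraction is scored only on cases where the tool was chosen correctly, so that the product of the two rates equals the share of fully correct cases.

```python
def router_scores(cases: list) -> dict:
    """Tool-selection accuracy, parameter-extraction accuracy, and their product."""
    if not cases:
        raise ValueError("need at least one labelled case")
    tool_hits = [c["predicted_tool"] == c["expected_tool"] for c in cases]
    # Arguments are only comparable when the right tool was chosen.
    right_tool = [c for c, hit in zip(cases, tool_hits) if hit]
    param_hits = [c["predicted_args"] == c["expected_args"] for c in right_tool]
    tool_acc = sum(tool_hits) / len(cases)
    param_acc = sum(param_hits) / len(right_tool) if right_tool else 0.0
    return {"tool_selection": tool_acc,
            "parameter_extraction": param_acc,
            "end_to_end": tool_acc * param_acc}


# Hypothetical labelled cases.
cases = [
    {"expected_tool": "lookup_order", "predicted_tool": "lookup_order",
     "expected_args": {"order_id": "A1"}, "predicted_args": {"order_id": "A1"}},
    {"expected_tool": "lookup_order", "predicted_tool": "lookup_order",
     "expected_args": {"order_id": "B2"}, "predicted_args": {"tracking_id": "B2"}},
    {"expected_tool": "sales_report", "predicted_tool": "sales_report",
     "expected_args": {"start": "2024-01-01", "end": "2024-01-31"},
     "predicted_args": {"start": "2024-01-01", "end": "2024-01-31"}},
    {"expected_tool": "sales_report", "predicted_tool": "lookup_order",
     "expected_args": {"start": "2024-02-01", "end": "2024-02-29"},
     "predicted_args": {"order_id": "2024-02"}},
]

scores = router_scores(cases)
print(scores)
```

The combined figure of 0.5 says little on its own; the split shows one tool-selection miss and one wrong-argument miss, each pointing at a different fix.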
- Runnability (runnable check)
-
A code-based check that generated code executes without error in a sandbox. It is necessary but not sufficient for correctness: code that runs yet plots the wrong data still passes a runnability check. Treat “it didn’t crash” as a first gate, then layer a semantic check (an LLM judge or a result comparison) for whether the output is actually right.
See also: Code-Based Evaluation, LLM-as-a-Judge
- Self-Improving Agent (self-improving pipeline)
-
An automated pipeline where production feedback continuously updates the eval dataset, triggers CI/CD experiments, adds few-shot examples, and gates deployment via the golden dataset — improving the agent with minimal human intervention. Automation is the multiplier: each interaction makes the system incrementally better. The catch: without quality gates it can poison itself, so a human-curated golden dataset and feedback filters are mandatory.
See also: Feedback Poisoning, Golden Dataset
- Ship Decision (shipping decision, variant promotion)
-
Deciding whether to promote a candidate variant by reading the whole per-evaluator row plus cost, latency, and statistical significance — never an aggregate score alone. A correctness gain that isn’t statistically significant, or that a latency SLO would reject, is not a ship. The dashboard supports the call; cost/latency/significance are the context that turns scores into a decision.
See also: Dashboard Comparison, Statistical Significance
- Silent Regression (silent failure)
-
The most dangerous production failure: plausible-looking but wrong outputs that throw no error — for example, an agent fabricating an answer when a data source returns empty. Latency and error-rate dashboards stay green because nothing crashed, so only content-quality monitoring — evaluator scores over sampled live traces, plus provenance checks — catches it. It is the reason production monitoring needs output metrics, not just system metrics.
See also: Production Monitoring, Alert Threshold
- Skill (agent skill, capability)
-
An individual capability within an agent — query a database, generate analysis, build a chart. A skill may contain several sub-steps (LLM calls, API calls, code), and each sub-step is its own evaluation target. An entire LLM application can itself be a single skill inside a larger agent, which is why agent decomposition recurses: the unit you evaluate depends on the level at which you draw the boundary.
See also: Router, Memory and State
- Span (trace span)
-
One step within a trace — an LLM call, a tool invocation, or a routing decision — recording start time, end time, attributes (including token counts), and status. Spans nest hierarchically, and that nesting mirrors the agent’s architecture, so the set of spans in a trace is both the execution log and the structure map.
See also: Trace, Span Kind, Span Hierarchy
- Span Hierarchy (span tree, trace tree)
-
The nesting of spans within a trace, which mirrors the agent’s architecture: Agent ⊃ Chain ⊃ Tool ⊃ LLM. Reading the hierarchy is reading the architecture diagram with live data, and it defines the debugging workflow — start at the symptom span and walk back through its parents. A poorly instrumented agent collapses to a flat hierarchy that hides where a failure occurred.
See also: Span Kind, Trace, Outside-In Instrumentation
- Span Kind (OpenInference span kinds)
-
The role label on a span. Open Inference defines several kinds; the four most relevant to agents are Agent (the full run), Chain (a router iteration), Tool (a skill invocation), and LLM (a single model call) — others include Retriever, Reranker, and Embedding. These four give a trace its Agent ⊃ Chain ⊃ Tool ⊃ LLM shape, so a reader can tell a routing step from a tool call from a raw model call at a glance.
See also: Span, Span Hierarchy, Open Inference
- Statistical Significance (significance)
-
Whether a measured score difference between variants is real or sampling noise — acutely relevant on the small “comprehensive, not exhaustive” datasets used for fast iteration. A higher aggregate score with no significance is not a reason to ship: gather more data or run an A/B test first. It is one of the contexts (with cost and latency) that an aggregate score needs before it becomes a decision.
See also: Ship Decision, Dashboard Comparison
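The entry does not name a specific test. As one possible choice, the sketch below uses an exact McNemar test (a two-sided sign test on the rows where two variants disagree), which suits paired pass/fail results on a small dataset. The function name and the sample results are illustrative.

```python
from math import comb


def exact_mcnemar_p(a_pass: list, b_pass: list) -> float:
    """Two-sided exact p-value for paired pass/fail results of variants A and B."""
    if len(a_pass) != len(b_pass):
        raise ValueError("both variants must be scored on the same rows")
    only_a = sum(a and not b for a, b in zip(a_pass, b_pass))  # A passes, B fails
    only_b = sum(b and not a for a, b in zip(a_pass, b_pass))  # B passes, A fails
    n = only_a + only_b  # rows where the variants disagree
    if n == 0:
        return 1.0
    k = min(only_a, only_b)
    # Probability of a split at least this lopsided under a fair coin, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical 20-row dataset: A passes 12 rows, B passes 18.
a_pass = [True] * 10 + [False] * 8 + [True] * 2
b_pass = [True] * 10 + [True] * 8 + [False] * 2

p_value = exact_mcnemar_p(a_pass, b_pass)
print(sum(a_pass) / 20, sum(b_pass) / 20, p_value)
```

A jump from 0.60 to 0.90 looks decisive, yet with only ten disagreeing rows the p-value is about 0.11, above the conventional 0.05 cut-off. That is the case the entry describes: gather more data before shipping.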
- Suppress Tracing (suppress_tracing)
-
A Phoenix context manager that keeps an LLM judge’s own calls from being recorded as spans, so evaluation artifacts don’t intermingle with, and roughly double, the agent’s traces. You wrap llm_classify() (and any judge call) in it during evaluation runs, and drop it only when deliberately debugging the judge itself.
See also: LLM-as-a-Judge, llm_classify
- Task Function (task)
-
In Phoenix Experiments, the function that takes one dataset row, runs the agent on it, and returns the output plus metadata: the agent analogue of a pytest test function. It is the bridge between the dataset (its input) and the evaluators (which score its output), and is run once per row when the experiment executes.
See also: Phoenix Experiment, Evaluation Dataset
- Three-Layer Evaluation (model–application–agent layers, layer model)
-
A mental model that places any evaluation activity at one of three layers. Layer 1 (model) — fixed benchmarks on the isolated LLM, handled by the provider. Layer 2 (application) — your domain-specific, end-to-end prompts and retrieval. Layer 3 (agent) — multi-step routing with variable paths and state across steps. Naming the layer first is what stops you applying a model-layer benchmark to an agent-layer failure, the single most common mismatch in proposed evaluation strategies.
See also: LLM Model Evaluation, LLM System Evaluation, AI Agent
- Tool Selection (function-calling choice)
-
The router dimension measuring whether the agent chose the correct tool for a query (function-calling choice), scored by exact match against a labelled target (predicted_tool == expected_tool). It is one of the two independent router dimensions: a high tool-selection score can still mask poor parameter extraction, so the two must be tracked separately.
See also: Parameter Extraction, Router Evaluation
- Trace (agent trace)
-
The complete end-to-end record of a single agent run — a hierarchical tree of spans linked by one trace ID. Because it holds full inputs, outputs, and every intermediate step, a trace is the unit of replay, cost attribution, and evaluation at scale: you can run automated checks over thousands of traces to detect regressions and measure quality.
See also: Span, Span Hierarchy, Trace-Based Cost Attribution
- Trace-Based Cost Attribution (per-skill cost attribution)
-
Computing per-skill (or per-component) LLM cost from the token counts recorded on each span: average calls × tokens-per-call, weighted by the skill’s query share × daily volume × price per token. It turns cost optimisation from guesswork into arithmetic, and routinely reveals a skill that is a small share of traffic yet a large share of spend — and even which sub-step within it to redirect to a cheaper model.
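The arithmetic described above can be sketched as follows. All figures (skill names, call counts, token counts, traffic shares, volume, and price) are invented for illustration; in practice they would be read from the token counts recorded on spans.

```python
def daily_skill_cost(avg_calls: float, tokens_per_call: float, query_share: float,
                     daily_volume: int, price_per_token: float) -> float:
    """Average calls x tokens per call x query share x daily volume x price per token."""
    return avg_calls * tokens_per_call * query_share * daily_volume * price_per_token


DAILY_VOLUME = 10_000        # hypothetical queries per day
PRICE_PER_TOKEN = 0.000002   # hypothetical: $2 per million tokens

skills = {
    # skill name: (average LLM calls, tokens per call, share of queries)
    "lookup": (1, 500, 0.8),
    "analysis": (4, 2000, 0.2),
}

costs = {
    name: daily_skill_cost(calls, tokens, share, DAILY_VOLUME, PRICE_PER_TOKEN)
    for name, (calls, tokens, share) in skills.items()
}
total = sum(costs.values())
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}/day, {cost / total:.0%} of spend")
```

In this made-up example the analysis skill handles 20% of queries but accounts for 80% of spend, the kind of imbalance the entry says this calculation routinely reveals.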
- Trajectory (agent trajectory, step sequence)
-
The ordered sequence of all steps — router decisions, tool calls, LLM completions — from input to final response. It captures not just what the agent produced but how it got there. Two agents can reach the same correct answer by very different paths, so the trajectory carries the cost, latency, and failure-variance that the final answer alone hides — at scale, the path decides viability.
See also: Convergence Score, Wasted-Work Cost
- Wasted-Work Cost (wasted work, non-optimal step cost)
-
The cost of non-optimal steps: the number of non-optimal steps per run, times daily volume, times price per step, expressed as a daily dollar figure. Converting trajectory inefficiency into money is what turns it from a curiosity into a shipping metric, and lets two variants be compared on efficiency, not just accuracy.
See also: Trajectory, Convergence Score
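A minimal sketch of that calculation, assuming the wasted steps per run are the average step count minus the optimal step count. The variant figures, volume, and price per step are hypothetical.

```python
def wasted_work_cost_per_day(avg_steps: float, optimal_steps: float,
                             daily_volume: int, price_per_step: float) -> float:
    """Wasted steps per run x daily volume x price per step, in dollars per day."""
    wasted_per_run = max(avg_steps - optimal_steps, 0.0)
    return wasted_per_run * daily_volume * price_per_step


DAILY_VOLUME = 10_000     # hypothetical queries per day
PRICE_PER_STEP = 0.002    # hypothetical dollars per step
OPTIMAL_STEPS = 3

variant_a = wasted_work_cost_per_day(4.5, OPTIMAL_STEPS, DAILY_VOLUME, PRICE_PER_STEP)
variant_b = wasted_work_cost_per_day(3.2, OPTIMAL_STEPS, DAILY_VOLUME, PRICE_PER_STEP)
print(f"A: ${variant_a:.2f}/day  B: ${variant_b:.2f}/day")
```

Two variants with the same accuracy can then be separated on efficiency: here variant A wastes roughly $30 a day against about $4 for variant B.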