Flashcards
Study the glossary as flashcards: shuffled, one term at a time — recall the definition, flip to check, and sort each card into "knew it" / "still learning".
Agent Variant
A version of the agent differing by one of the five improvement levers — prompt, tool definitions, router logic, skill structure, or model selection. The discipline is to change one lever at a time: alter the prompt and the model together and you can’t attribute the result to either. Variants are compared head-to-head in an experiment.
Agreement Rate
The fraction of cases where an LLM judge produces the same label as the ground-truth evaluator. It is the headline meta-evaluation metric — target ≥ 0.90, and below about 0.85 don’t ship the judge. Its blind spot: it hides which way the judge errs, so a high agreement rate can still sit on top of a dangerous false-positive (leniency) rate. Always report it alongside FPR.
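A minimal sketch of reporting agreement rate together with FPR, assuming labels are plain strings and that "correct" is the passing label. The function names and the sample labels are illustrative and not taken from any library.

```python
def agreement_rate(judge_labels, truth_labels):
    """Fraction of cases where the judge's label equals the ground-truth label."""
    if len(judge_labels) != len(truth_labels) or not truth_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == t for j, t in zip(judge_labels, truth_labels))
    return matches / len(truth_labels)


def false_positive_rate(judge_labels, truth_labels, passing="correct"):
    """Fraction of truly-wrong answers that the judge nevertheless passed."""
    judged_on_wrong = [j for j, t in zip(judge_labels, truth_labels) if t != passing]
    if not judged_on_wrong:
        return 0.0  # no truly-wrong answers in the sample
    return sum(j == passing for j in judged_on_wrong) / len(judged_on_wrong)


# Hypothetical sample: 18 right answers, 2 wrong ones, and a judge that passes everything.
truth = ["correct"] * 18 + ["incorrect"] * 2
judge = ["correct"] * 20

print(agreement_rate(judge, truth))       # 0.9
print(false_positive_rate(judge, truth))  # 1.0
```

In this made-up sample the judge meets the 0.90 agreement target while passing every wrong answer, which is the blind spot the definition warns about.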
AI Agent
A software system that takes actions on behalf of a user using reasoning — deciding what to do next, which tool to call, and when to stop, rather than emitting a single fixed response. It composes three capabilities: reasoning (the LLM), routing (tool selection), and action (execution), then loops on the result. What makes an agent harder to evaluate than a plain LLM application is the third, agent layer: multi-step, dynamic paths whose success depends on choices made at run time.
Alert Threshold
The deviation that fires a production alert: commonly a move of more than about 10% from a metric’s baseline, set per metric (evaluator scores, convergence, LLM-calls/query, p50/p99 latency, cost/query). The tuning trade-off: too tight and the alert is noisy and ignored; too loose and a silent regression slips past. Thresholds turn the monitored time-series into actionable signals.
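One way to express a per-metric threshold check, as a sketch. The metric names, baselines, and the 10% default below are assumptions chosen for illustration; a real setup would read them from the monitoring backend.

```python
def relative_deviation(current, baseline):
    """Absolute relative change of a metric from its baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return abs(current - baseline) / abs(baseline)


def fired_alerts(current, baselines, thresholds=None, default_threshold=0.10):
    """Return {metric: deviation} for every metric beyond its threshold."""
    thresholds = thresholds or {}
    alerts = {}
    for name, baseline in baselines.items():
        deviation = relative_deviation(current[name], baseline)
        if deviation > thresholds.get(name, default_threshold):
            alerts[name] = deviation
    return alerts


# Hypothetical numbers: latency drifts 5%, LLM calls per query jump 30%.
baselines = {"p99_latency_s": 4.0, "llm_calls_per_query": 3.0}
current = {"p99_latency_s": 4.2, "llm_calls_per_query": 3.9}

alerts = fired_alerts(current, baselines)
print(sorted(alerts))  # ['llm_calls_per_query']
```

Passing a tighter or looser value in `thresholds` for a single metric is where the noisy-versus-blind trade-off gets tuned.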
Automatic Instrumentation
Zero-code tracing: a library hooks the LLM API client and emits a span for every call without touching your code. It delivers instant LLM-level detail — prompts, completions, tokens, latency — but cannot see your router logic, tool dispatch, or business code, since those aren’t API calls. The fast first step; pair it with manual spans to fill the gaps.
Centralised Routing
A routing architecture where a single router makes every decision. Because all routing flows through one decision point, it is the easiest to trace and to evaluate — but that single point becomes a bottleneck for complex, branching flows. Contrast with distributed routing, which trades evaluation simplicity for flexibility.
CI Shipping Gate
A golden dataset wired into CI so every PR touching agent code triggers a run, and any regression blocks the merge. It encodes institutional knowledge that new engineers inherit automatically through the gate. Because agent runs are probabilistic, the gate runs the dataset multiple times with tolerance bands rather than demanding a single deterministic pass.
Code-Based Evaluation
Programmatic checks — regex, JSON validation, exact match, cosine similarity, SQL result-set
diff — that classify an output as correct or incorrect. It is 100% reproducible, free, and
fast, but applies only to codifiable criteria. The rule of thumb: if you can write an
assert, use code-based evaluation; reach for an LLM judge only when the output is prose or
judgment.
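A few code-based checks of the kinds listed above, as a sketch using only the standard library. The function names, the dollar-amount pattern, and the sample values are illustrative assumptions.

```python
import json
import re
from collections import Counter


def is_valid_json(output: str) -> bool:
    """JSON validation: does the output parse at all?"""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def exact_match(output: str, expected: str) -> bool:
    """Exact match, ignoring leading and trailing whitespace."""
    return output.strip() == expected.strip()


# Regex check: a dollar amount such as $1,250.00 (pattern is an assumption).
DOLLAR_AMOUNT = re.compile(r"\$\d{1,3}(,\d{3})*(\.\d{2})?")


def mentions_dollar_amount(output: str) -> bool:
    return DOLLAR_AMOUNT.search(output) is not None


def same_result_set(rows_a, rows_b) -> bool:
    """SQL result-set diff: same rows with the same multiplicity, order ignored."""
    return Counter(rows_a) == Counter(rows_b)


print(is_valid_json('{"store": 12}'))                             # True
print(mentions_dollar_amount("Revenue was $1,250.00 last week"))  # True
print(same_result_set([(1, "a"), (2, "b")], [(2, "b"), (1, "a")]))  # True
```

Each check returns a plain boolean, so it can sit directly inside an assert or be logged as a pass/fail evaluator score.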
Comprehensive, Not Exhaustive
The dataset-design principle: cover every input type with 1–2 examples rather than piling 100 near-duplicates per category. It keeps each experiment fast to run and cheap to maintain — iteration speed beats volume — while still catching regressions across the behaviour space. Resisting the urge to add bulk is what keeps the evaluation loop tight.
Convergence Score
A metric in [0, 1] for the fraction of completed runs that follow the optimal, minimum-step path across similar queries (≥ 0.8 stable, 0.5–0.8 investigate, below 0.5 unreliable). It measures consistency, not correctness: a score of 1.0 can mean a consistently wrong shortcut. It is also blind to waste present in every run, because that waste is baked into the observed minimum. Always pair it with per-step correctness evaluators.
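A small sketch of the score as defined here: the share of completed runs whose step count equals the minimum. Treating the smallest observed step count as the optimal path is an assumption of this sketch, and it is exactly what makes the metric blind to waste shared by every run.

```python
def convergence_score(step_counts):
    """Fraction of completed runs that took the minimum observed number of steps."""
    if not step_counts:
        raise ValueError("need at least one completed run")
    optimal = min(step_counts)  # assumption: the shortest observed path is the optimal one
    return sum(n == optimal for n in step_counts) / len(step_counts)


# Hypothetical step counts for similar queries.
print(convergence_score([3, 3, 3, 5, 3]))  # 0.8
# Every run takes 4 steps, so the score is perfect even if one step is always wasted.
print(convergence_score([4, 4, 4]))        # 1.0
```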
Cosine Similarity
A code-based similarity metric on two embedding vectors, $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, ranging from −1 to 1; a threshold (≈ 0.85) marks two outputs as “semantically equivalent.” It lets a deterministic, free check approximate meaning-level matching, which is useful when exact string match is too strict but an LLM judge is more than you need.
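The standard cosine formula (dot product divided by the product of the two vector norms) can be sketched in plain Python. The tiny vectors stand in for real embeddings, and the 0.85 default mirrors the approximate threshold mentioned above; both are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def semantically_equivalent(a, b, threshold=0.85):
    """Threshold check on cosine similarity (threshold value is an assumption)."""
    return cosine_similarity(a, b) >= threshold


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
print(semantically_equivalent([1.0, 0.0], [1.0, 0.1]))  # True
```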
Dashboard Comparison
Laying agent variants out across multiple evaluator dimensions at once (routing, SQL, clarity, runnability, latency …) so trade-offs a single aggregate score would mask become visible — e.g. a variant that gains correctness while regressing clarity and latency. It is the core decision artifact for a ship call, provided the evaluators are fine-grained enough (binary pass/fail hides nuance).
Distributed Routing
A routing architecture (LangGraph, OpenAI Swarm-style graphs) where routing logic is spread across multiple components. More flexible than centralised routing, but harder to evaluate: routing decisions occur at many points, each a separate place a wrong turn can originate, so evaluation coverage must reach every decision node rather than a single one.
Evaluation Dataset
A structured collection of test cases with input keys (forwarded to the agent) and output keys (forwarded to evaluators for comparison). Built to be “comprehensive, not exhaustive” — 1–2 cases per input type — and fed from three sources: manually constructed (edge cases), model-generated (coverage), and production-sampled (highest ecological validity).
Evaluation Flywheel
The self-reinforcing loop: production traces → new test cases → updated evaluators → agent improvements → better production data, repeating. It is only as fast as its slowest stage, so trace-to-dataset promotion must be automated — teams that automate iterate ~10× faster. The failure mode: unfiltered noisy or adversarial traffic entering the dataset degrades it, so the automation needs a quality gate.
Evaluation Technique Selection
The decision framework for picking an evaluation technique by output type: code-based for
quantitative/codifiable outputs (SQL, dollar amounts, runnability), LLM-as-a-judge for
qualitative/subjective ones (prose, tone), and human annotation for safety-critical calls. It
prevents over-spending on LLM judges where a deterministic assert would do — and
under-evaluating prose where an assert can’t.
Evaluation-Driven Development
A methodology where every code or prompt change triggers an evaluation cycle: trace execution, evaluate each component against its criteria, then iterate on the worst-scoring one. It is the LLM analogue of test-driven development. The defining shift is timing — evaluation is the primary feedback loop during development, not a gate cleared just before deployment — so regressions surface while they are still cheap to fix.
Evaluator Composition
Running several evaluators over one experiment because no single one captures the full picture — e.g. function-calling correctness, SQL accuracy, clarity (LLM judge), entity correctness, and runnability together. Composition is what surfaces cross-dimension trade-offs on the dashboard, where one variant gains on correctness but loses on clarity or latency.
False-Positive Rate
For a quality-gating judge, the fraction of truly-wrong answers it labels “correct” — its leniency, and the dangerous error for a gate, because a lenient judge passes bad answers. A high agreement rate can still hide a high FPR, so cap it explicitly (e.g. FPR ≤ 0.10) alongside agreement. Its mirror is the false-negative rate (strictness) — splitting the two reveals asymmetric judge bias.
Feedback Poisoning
When noisy or adversarial user feedback contaminates a self-improving pipeline’s dataset and few-shot examples. The trap is self-confirming: because the evaluator learns from the same poisoned data, the automated score can rise while real quality falls, so the metric endorses the degradation. Only independent signals — user surveys and the human-curated golden dataset — reveal the truth, which is why those firewalls are essential.
Feedback Quality Filter
The defenses that keep noisy feedback out of a self-improving pipeline: an engagement-time threshold (discard sub-2-second clicks), cross-signal validation (a “helpful” click followed by an immediate re-query of the same topic is treated as noise), confidence weighting (trust users with a history over first-touch clicks), and a periodic human audit of a sample. Together they are the firewall between automated feedback and dataset quality.
Few-Shot Examples
3–5 labelled example judgments added to a judge prompt to fix calibration, especially on boundary cases (case-insensitive match, floating-point tolerance, polite-but-wrong responses). One of the four judge-improvement levers. In a self-improving pipeline these are drawn from collected production data — which is exactly why feedback poisoning can corrupt the judge through them.
GenAI Semantic Conventions
OpenTelemetry’s own gen_ai.* semantic conventions for GenAI spans: a second
LLM-trace attribute namespace, still experimental as of 2026, that matured alongside
Arize’s Open Inference. Their coexistence fragments agent traces across two vocabularies,
so a team must confirm which namespace its backend and dashboards expect before
standardising on attribute names that may still move.
Golden Dataset
A curated set of must-pass scenarios used as a shipping gate: every agent change runs against it and a regression blocks deployment. It differs from a general eval set — curated for criticality (every row must pass), seeded with known failure modes codified from past production bugs, and grown continuously. Because agent runs are probabilistic, run it several times and take the median or worst score with tolerance bands.
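A sketch of a gate decision over repeated runs, assuming each run yields one aggregate score between 0 and 1. The baseline, the 0.02 tolerance band, and the function name are hypothetical choices for illustration.

```python
import statistics


def gate_passes(run_scores, baseline, tolerance=0.02, use_worst=False):
    """Pass if the median (or worst) score stays within tolerance of the baseline."""
    if not run_scores:
        raise ValueError("need at least one run")
    observed = min(run_scores) if use_worst else statistics.median(run_scores)
    return observed >= baseline - tolerance


# Hypothetical scores from three runs of the same golden dataset.
scores = [0.96, 0.90, 0.95]
print(gate_passes(scores, baseline=0.95))                  # True  (median is 0.95)
print(gate_passes(scores, baseline=0.95, use_worst=True))  # False (worst is 0.90)
```

The median tolerates one unlucky run, while the worst-score variant is the stricter choice for rows where a single failure matters.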
Golden-Dataset Evolution
The policy that keeps a golden dataset useful: add a regression row for every production incident and new feature, prune rows that never trigger a failure on a stable feature, never prune safety-critical rows, and stabilise the size (~40–60 rows) so it stays fast enough to run on every PR. The guiding signal: a gate that never fails is stale, not safe — it has stopped keeping pace with the system.
Ground-Truth Label
The known-correct answer a prediction is scored against — the expected tool for a query, the expected SQL result set, the reference summary judgment. It is the bottleneck for router and code-based evaluations on ambiguous inputs, where two annotators might disagree, so consistent labels require careful annotation guidelines (and often human annotation).
Human Annotation
Evaluation by human labellers or end-user feedback. The most accurate technique and the standard for safety-critical or genuinely ambiguous judgments, but the slowest to scale and subject to selection bias. It is the fallback when neither code-based checks nor an LLM judge are reliable enough — and the source of the ground-truth labels the other techniques are measured against.
Input and Output Keys
The two field types in an evaluation dataset. Input keys are forwarded to the agent (the query, expected parameters); output keys are forwarded to evaluators for comparison (expected SQL tables, expected trend direction). Including output keys wherever possible is what unlocks cheap code-based evaluators instead of paying for an LLM judge.
Judge Improvement
The four levers for raising a failing judge’s agreement rate, each targeting a different failure mode: prompt engineering (misunderstood criteria — ~80% of judge failures), few-shot examples (calibration on boundary cases), model selection (capability gaps), and semantic similarity (false negatives on paraphrases). Apply one at a time and re-run meta-evaluation after each — the agreement rate is the single metric that says whether it worked.
LLM Model Evaluation
Evaluation of an isolated LLM using standardised benchmarks — MMLU (knowledge), HumanEval (code), GSM8K (math). Model providers run it; it answers “how capable is this model in general?” It does not predict performance in a specific application, where the surrounding prompts, retrieval, and tools dominate the outcome. Use model evaluation to shortlist candidate models, never to decide whether your system actually works — that is the job of system evaluation.
LLM System Evaluation
Evaluation of your entire application’s end-to-end performance on real-world, domain-specific data. It answers the question that matters — “does my system solve the user’s actual problem?” — and must be re-run on every prompt or code change, because any of them can regress it. Its main cost is curating representative test data; the practical move is to start with a small synthetic test set and grow it as real traffic arrives.
llm_classify
Phoenix’s function for running an LLM judge over a dataframe of spans, constrained to a rails
label set, returning one classification per row. In the standard workflow it is wrapped in
suppress_tracing() (to keep judge calls out of the agent’s traces) and followed by
log_evaluations() (to upload the results back onto the spans). It is the concrete form of
LLM-as-a-judge in the Phoenix stack.
LLM-as-a-Judge
A separate LLM prompt that classifies the agent’s output using discrete labels. It scales
and handles nuance no assert can, but is never 100% accurate — and is only trustworthy
when constrained to rails (a fixed label set), not asked for a numeric score.
The workhorse for qualitative skill outputs; layer human annotation on top for
safety-critical judgments.
Manual Instrumentation
Explicit span creation in your own code, wrapping the agent run, router iterations, and
tool calls that automatic instrumentation can’t observe. It gives full control over what gets
traced and produces the structure-bearing spans; the cost is discipline: a missed “with”
block leaves a hole in the trace tree. It follows the outside-in principle.
Memory and State
The agent’s notebook — retrieved context, configuration, and the log of previous execution steps that later decisions depend on. Evaluating it asks whether the right information persists across steps: stale, missing, or corrupted state is a failure mode distinct from a wrong route or a bad skill output, and one that end-to-end scoring often misattributes to the skill that acted on the bad state.
Meta-Evaluation
Evaluating an evaluator’s own accuracy by comparing its judgments against known ground truth — quantified as the agreement rate between an LLM judge and a deterministic (code-based) reference. Without it, a dashboard’s “92%” is meaningless if the judge itself is only 75% accurate. For subjective dimensions with no code oracle (tone, creativity), human inter-annotator agreement replaces the code-based reference.
Multiplicative Accuracy
The rule that independent component success rates multiply into end-to-end accuracy: $A_{\text{end-to-end}} = \prod_i a_i$. Two “looks-fine” components, say tool selection 0.95 and parameter extraction 0.85, yield a markedly lower whole ($0.95 \times 0.85 \approx 0.81$), and the product is bounded by the weakest factor. The practical consequence: report components separately and fix the bottleneck first.
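The arithmetic behind the rule, as a short sketch using the two example rates from the definition. The dictionary keys and function names are illustrative.

```python
import math


def end_to_end_accuracy(component_rates):
    """Product of independent component success rates."""
    return math.prod(component_rates.values())


def bottleneck(component_rates):
    """Name of the weakest component, the one to fix first."""
    return min(component_rates, key=component_rates.get)


rates = {"tool_selection": 0.95, "parameter_extraction": 0.85}

print(round(end_to_end_accuracy(rates), 4))  # 0.8075
print(bottleneck(rates))                     # parameter_extraction
```

Because every rate is at most 1, the product can never exceed the smallest factor, which is why the weakest component sets the ceiling.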
Non-Determinism
The property that identical inputs can produce different outputs across runs — caused by sampling temperature, floating-point non-associativity, batching, and provider-side model updates. It is why a single passing run is a sample, not a proof: a change can raise average quality while introducing rare failures one run never surfaces. Evaluation must therefore be statistical over a representative dataset, not a binary pass/fail on one run. Setting temperature to 0 reduces but does not eliminate it.
Observability
Complete visibility into every layer of an application — prompts, model responses, token
usage, latency, and the function calls orchestrating behaviour. For non-deterministic agents
it replaces the stack trace and print log, neither of which can capture a path that changes
per run or a failure that is semantic (a wrong answer) rather than a crash.
Open Inference
Arize’s LLM-specific extension to OpenTelemetry: it adds span kinds (Agent, Chain, Tool,
LLM, and more — e.g. Retriever, Reranker) and attribute names for prompts, completions, and
token counts. OTEL gives the generic
trace/span machinery; Open Inference supplies the semantics that make an agent trace
meaningful. As of 2026 it shares the space with OTEL’s own gen_ai.* conventions.
OpenTelemetry
(OTEL) The vendor-neutral observability framework — a data model and instrumentation API for traces, spans, metrics, and logs. Its value is instrument once, export anywhere: spans emitted to the OTEL standard can be sent to any compatible backend. It tracks timing and metadata; LLM-specific semantics come from a layer on top (Open Inference).
Outside-In Instrumentation
The principle for placing manual spans: start at the outermost layer (the agent run) and
work inward (chain, then tool), letting nested “with” blocks build the correct span
hierarchy automatically. Following it yields a trace whose nesting matches the architecture;
ignoring it produces flat or mis-parented spans that are hard to navigate.
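How nested blocks produce the parent/child structure can be shown with a toy tracer. `ToyTracer` is a hypothetical stand-in written for this sketch, not the OpenTelemetry or Phoenix API; the span names are also made up.

```python
from contextlib import contextmanager


class ToyTracer:
    """Records (name, kind, parent_name) for each span, in start order."""

    def __init__(self):
        self.spans = []
        self._open = []  # stack of currently open span names, outermost first

    @contextmanager
    def span(self, name, kind):
        parent = self._open[-1] if self._open else None
        self.spans.append((name, kind, parent))
        self._open.append(name)
        try:
            yield
        finally:
            self._open.pop()  # closing the block restores the previous parent


tracer = ToyTracer()

# Outside-in: agent run first, then the router iteration, then the tool, then the model call.
with tracer.span("agent_run", "AGENT"):
    with tracer.span("router_iteration_1", "CHAIN"):
        with tracer.span("lookup_sales_data", "TOOL"):
            with tracer.span("generate_sql", "LLM"):
                pass

for name, kind, parent in tracer.spans:
    print(f"{kind:<5} {name} (parent: {parent})")
```

Each span picks up whichever span is open at the moment it starts, so the nesting of the code becomes the nesting of the trace without any explicit parent bookkeeping.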
Parameter Extraction
The router dimension measuring whether — having chosen the right tool — the agent supplied the
right argument values (e.g. an order_id rather than a tracking_id; a date range in the
correct order). It fails independently of tool selection, so a query can have the right tool
and the wrong parameters. Because it’s independent, it must be evaluated as its own dimension.
Phoenix
Arize’s open-source observability tool: it receives OpenTelemetry traces, stores them by
project, and provides trace-tree and timeline views, span-detail inspection, and evaluation
integration — where instrumented spans become something you can read, query, and score. As of
2026 the package split tracing (arize-phoenix-otel) from evaluators (arize-phoenix-evals).
Phoenix Experiment
A structured, versioned evaluation run: execute a task function against every row of a
dataset, record outputs, then apply evaluator functions. It is the agent analogue of a
parametrized pytest run — dataset = test cases, task = test function, evaluators =
assertions — and the unit you name, version, and compare across agent variants.
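The dataset / task / evaluator split can be sketched in plain Python. `run_experiment` and `stub_agent` below are hypothetical helpers written for this illustration and are not the Phoenix API; the rows are made up.

```python
def run_experiment(dataset, task, evaluators):
    """Run the task on every row, then score each output with every evaluator."""
    results = []
    for row in dataset:
        output = task(row["input"])
        scores = {name: evaluate(output, row["expected"])
                  for name, evaluate in evaluators.items()}
        results.append({"input": row["input"], "output": output, "scores": scores})
    return results


def stub_agent(query):
    """Stand-in for the real agent: returns the name of the tool it would call."""
    return "lookup_sales_data" if "sales" in query else "generate_chart"


# "input" is forwarded to the task; "expected" is forwarded to the evaluators.
dataset = [
    {"input": "What were sales in May?", "expected": "lookup_sales_data"},
    {"input": "Plot revenue by store", "expected": "generate_chart"},
    {"input": "Summarise the sales trend", "expected": "analyze_trend"},
]
evaluators = {"tool_match": lambda output, expected: output == expected}

results = run_experiment(dataset, stub_agent, evaluators)
print([r["scores"]["tool_match"] for r in results])  # [True, True, False]
```

Swapping `stub_agent` for a different variant while keeping the dataset and evaluators fixed is what makes two runs comparable.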
Production Monitoring
Watching a deployed agent with the same tracing and annotation tools used in development, plus time-series metrics: running evaluator scores per dimension, convergence, LLM-call counts (a spike signals a loop), latency (p50/p99 by component), and cost per query, with alerts on a deviation of more than roughly 10% from baseline. Its real job is catching silent regressions that surface-level metrics such as latency and error rate miss.
Rails
The fixed set of discrete labels that constrain an LLM judge’s output — e.g.
["correct", "incorrect"] or 2–4 categories. Rails prevent free-form responses, enable
automated parsing, and, most importantly, keep the judge doing what it does reliably
(classify) rather than what it does not (calibrate a numeric scale). They are why
LLM-as-a-judge is usable at all.
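A sketch of the parsing side of rails: the judge's raw text is snapped to one of the allowed labels or rejected. The helper name and the cleanup rules are assumptions for illustration. Matching is exact on purpose, since a substring test would find "correct" inside "incorrect".

```python
RAILS = ["correct", "incorrect"]


def snap_to_rails(raw_output, rails):
    """Return the matching rail label, or None if the output is off the rails."""
    cleaned = raw_output.strip().strip(".\"'").lower()
    for label in rails:
        if cleaned == label.lower():
            return label
    return None  # free-form or numeric answers are treated as unparseable


print(snap_to_rails(" Correct.\n", RAILS))                    # correct
print(snap_to_rails("incorrect", RAILS))                      # incorrect
print(snap_to_rails("7/10", RAILS))                           # None
print(snap_to_rails("The answer is mostly correct", RAILS))   # None
```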
Regression Detection
Catching a quality drop by re-running evaluators over traces and comparing scores across versions of the agent. It depends on collected traces as a stable, replayable record: without them you’d re-gather data by hand, and a missing-span gap blinds the comparison exactly where coverage is thin. It is the production payoff of the tracing investment.
Router
The agent component that decides which skill or function to call at each step — the agent’s “brain.” Implementations span rules-based code (deterministic, limited), an NLP classifier, and LLM function calling (flexible, non-deterministic). The router is usually the first thing to evaluate: routing errors cascade, wasting every downstream step, and are the cheapest failure to catch — build a labelled message→expected-skill dataset and measure per-category routing accuracy.
Router Evaluation
Scoring the router on its two independent dimensions — tool selection and parameter extraction — whose product is end-to-end router accuracy. Measuring them separately is what localises where to invest: a single combined number hides which dimension is the bottleneck. The hard part is ground-truth labels for ambiguous queries, which need careful annotation guidelines.
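A sketch of scoring the two dimensions separately. One assumption to note: parameter extraction is measured only on cases where the tool was right, which makes the product of the two scores equal the end-to-end figure exactly. The field names and sample cases are hypothetical.

```python
def router_scores(cases):
    """Tool-selection accuracy, parameter accuracy given the right tool, and their product."""
    if not cases:
        raise ValueError("need at least one labelled case")
    tool_hits = [c["predicted_tool"] == c["expected_tool"] for c in cases]
    right_tool = [c for c, hit in zip(cases, tool_hits) if hit]
    param_hits = sum(c["predicted_params"] == c["expected_params"] for c in right_tool)
    return {
        "tool_selection": sum(tool_hits) / len(cases),
        "parameter_extraction": param_hits / len(right_tool) if right_tool else 0.0,
        "end_to_end": param_hits / len(cases),
    }


cases = [
    {"predicted_tool": "order_status", "expected_tool": "order_status",
     "predicted_params": {"order_id": "A1"}, "expected_params": {"order_id": "A1"}},
    {"predicted_tool": "order_status", "expected_tool": "order_status",
     "predicted_params": {"tracking_id": "A2"}, "expected_params": {"order_id": "A2"}},
    {"predicted_tool": "refund", "expected_tool": "refund",
     "predicted_params": {"order_id": "A3"}, "expected_params": {"order_id": "A3"}},
    {"predicted_tool": "order_status", "expected_tool": "refund",
     "predicted_params": {"order_id": "A4"}, "expected_params": {"order_id": "A4"}},
]

scores = router_scores(cases)
print(scores["tool_selection"], scores["end_to_end"])  # 0.75 0.5
```

Here tool selection is 0.75 and parameter extraction is two out of three, so the weaker dimension is visible even though a single end-to-end number of 0.5 would not say which one it is.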
Runnability
A code-based check that generated code executes without error in a sandbox. It is necessary but not sufficient for correctness: code that runs yet plots the wrong data still passes a runnability check. Treat “it didn’t crash” as a first gate, then layer a semantic check (an LLM judge or a result comparison) for whether the output is actually right.
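A minimal runnability check, as a sketch: run the generated code in a separate Python process and report whether it exited cleanly. A child process with a timeout is an assumption standing in for a real sandbox and gives no isolation from the file system or network.

```python
import subprocess
import sys


def is_runnable(code, timeout_s=5.0):
    """True if the code runs to completion with exit status 0 within the timeout."""
    try:
        completed = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hang as not runnable
    return completed.returncode == 0


print(is_runnable("print(1 + 1)"))              # True
print(is_runnable("raise ValueError('boom')"))  # False
# Runs fine but may still compute the wrong thing: runnability cannot tell.
print(is_runnable("total = sum([1, 2, 3]) - 1\nprint(total)"))  # True
```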
Self-Improving Agent
An automated pipeline where production feedback continuously updates the eval dataset, triggers CI/CD experiments, adds few-shot examples, and gates deployment via the golden dataset — improving the agent with minimal human intervention. Automation is the multiplier: each interaction makes the system incrementally better. The catch: without quality gates it can poison itself, so a human-curated golden dataset and feedback filters are mandatory.
Ship Decision
Deciding whether to promote a candidate variant by reading the whole per-evaluator row plus cost, latency, and statistical significance — never an aggregate score alone. A correctness gain that isn’t statistically significant, or that a latency SLO would reject, is not a ship. The dashboard supports the call; cost/latency/significance are the context that turns scores into a decision.
Silent Regression
The most dangerous production failure: plausible-looking but wrong outputs that throw no error — for example, an agent fabricating an answer when a data source returns empty. Latency and error-rate dashboards stay green because nothing crashed, so only content-quality monitoring — evaluator scores over sampled live traces, plus provenance checks — catches it. It is the reason production monitoring needs output metrics, not just system metrics.
Skill
An individual capability within an agent — query a database, generate analysis, build a chart. A skill may contain several sub-steps (LLM calls, API calls, code), and each sub-step is its own evaluation target. An entire LLM application can itself be a single skill inside a larger agent, which is why agent decomposition recurses: the unit you evaluate depends on the level at which you draw the boundary.
Span
One step within a trace — an LLM call, a tool invocation, or a routing decision — recording start time, end time, attributes (including token counts), and status. Spans nest hierarchically, and that nesting mirrors the agent’s architecture, so the set of spans in a trace is both the execution log and the structure map.
Span Hierarchy
The nesting of spans within a trace, which mirrors the agent’s architecture: Agent ⊃ Chain ⊃ Tool ⊃ LLM. Reading the hierarchy is reading the architecture diagram with live data, and it defines the debugging workflow — start at the symptom span and walk back through its parents. A poorly instrumented agent collapses to a flat hierarchy that hides where a failure occurred.
Span Kind
The role label on a span. Open Inference defines several kinds; the four most relevant to agents are Agent (the full run), Chain (a router iteration), Tool (a skill invocation), and LLM (a single model call) — others include Retriever, Reranker, and Embedding. These four give a trace its Agent ⊃ Chain ⊃ Tool ⊃ LLM shape, so a reader can tell a routing step from a tool call from a raw model call at a glance.
Statistical Significance
Whether a measured score difference between variants is real or sampling noise — acutely relevant on the small “comprehensive, not exhaustive” datasets used for fast iteration. A higher aggregate score with no significance is not a reason to ship: gather more data or run an A/B test first. It is one of the contexts (with cost and latency) that an aggregate score needs before it becomes a decision.
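One simple way to check whether a pass-rate gain on a small dataset is more than noise is an exact two-sided sign test over the rows where the two variants disagree. The choice of test and the sample data are assumptions of this sketch; other paired tests would also fit.

```python
from math import comb


def sign_test_p_value(baseline_pass, candidate_pass):
    """Exact two-sided sign test on paired pass/fail results for the same rows."""
    if len(baseline_pass) != len(candidate_pass):
        raise ValueError("both variants must be scored on the same rows")
    wins = sum(c and not b for b, c in zip(baseline_pass, candidate_pass))
    losses = sum(b and not c for b, c in zip(baseline_pass, candidate_pass))
    changed = wins + losses
    if changed == 0:
        return 1.0  # the variants never disagree
    smaller = min(wins, losses)
    tail = sum(comb(changed, i) for i in range(smaller + 1)) / 2 ** changed
    return min(1.0, 2 * tail)


# Hypothetical 20-row dataset: the candidate fixes 3 rows and breaks 1 (0.75 -> 0.85).
baseline = [True] * 15 + [False] * 5
candidate = [False] + [True] * 14 + [True] * 3 + [False] * 2

print(sign_test_p_value(baseline, candidate))  # 0.625
```

A ten-point gain in the aggregate score comes with a p-value of 0.625 here, so on this sample it is indistinguishable from noise.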
Suppress Tracing
A Phoenix context manager that keeps an LLM judge’s own calls from being recorded as spans,
so evaluation artifacts don’t intermingle with — and roughly double — the agent’s traces. You
wrap llm_classify() (and any judge call) in it during evaluation runs, and drop it only when
deliberately debugging the judge itself.
Task Function
In Phoenix Experiments, the function that takes one dataset row, runs the agent on it, and
returns the output plus metadata — the agent analogue of a pytest test function. It is the
bridge between the dataset (its input) and the evaluators (which score its output), and is run
once per row when the experiment executes.
Three-Layer Evaluation
A mental model that places any evaluation activity at one of three layers. Layer 1 (model) — fixed benchmarks on the isolated LLM, handled by the provider. Layer 2 (application) — your domain-specific, end-to-end prompts and retrieval. Layer 3 (agent) — multi-step routing with variable paths and state across steps. Naming the layer first is what stops you applying a model-layer benchmark to an agent-layer failure, the single most common mismatch in proposed evaluation strategies.
Tool Selection
The router dimension measuring whether the agent chose the correct tool for a query
(function-calling choice), scored by exact match against a labelled target
(predicted_tool == expected_tool). It is one of the two independent router dimensions — a
high tool-selection score can still mask poor parameter
extraction, so the two must be tracked separately.
Trace
The complete end-to-end record of a single agent run — a hierarchical tree of spans linked by one trace ID. Because it holds full inputs, outputs, and every intermediate step, a trace is the unit of replay, cost attribution, and evaluation at scale: you can run automated checks over thousands of traces to detect regressions and measure quality.
Trace-Based Cost Attribution
Computing per-skill (or per-component) LLM cost from the token counts recorded on each span: average calls × tokens-per-call, weighted by the skill’s query share × daily volume × price per token. It turns cost optimisation from guesswork into arithmetic, and routinely reveals a skill that is a small share of traffic yet a large share of spend — and even which sub-step within it to redirect to a cheaper model.
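The arithmetic described above, as a sketch. Every number below (calls, tokens, query shares, volume, and the per-token price) is made up for illustration, and a single blended token price is a simplifying assumption.

```python
def daily_skill_cost(avg_calls, tokens_per_call, query_share, daily_queries, price_per_token):
    """Average calls x tokens per call, weighted by query share x daily volume x token price."""
    return avg_calls * tokens_per_call * query_share * daily_queries * price_per_token


DAILY_QUERIES = 10_000
PRICE_PER_TOKEN = 2e-6  # hypothetical blended price in dollars per token

# In practice these averages would be read from token counts on the spans.
skills = {
    "lookup": {"avg_calls": 1, "tokens_per_call": 500, "query_share": 0.8},
    "analysis": {"avg_calls": 4, "tokens_per_call": 3000, "query_share": 0.2},
}

costs = {
    name: daily_skill_cost(daily_queries=DAILY_QUERIES, price_per_token=PRICE_PER_TOKEN, **s)
    for name, s in skills.items()
}
total = sum(costs.values())

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}/day, {cost / total:.0%} of spend")
```

With these made-up inputs the analysis skill handles 20% of queries but accounts for roughly 86% of daily spend, the pattern the definition describes.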
Trajectory
The ordered sequence of all steps — router decisions, tool calls, LLM completions — from input to final response. It captures not just what the agent produced but how it got there. Two agents can reach the same correct answer by very different paths, so the trajectory carries the cost, latency, and failure-variance that the final answer alone hides — at scale, the path decides viability.
Wasted-Work Cost
The cost of non-optimal steps: the excess steps per run (actual minus optimal), times daily volume, times price per step, expressed as a daily dollar figure. Converting trajectory inefficiency into money is what turns it from a curiosity into a shipping metric, and lets two variants be compared on efficiency, not just accuracy.
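A sketch of the calculation, assuming the non-optimal steps in a run are the steps beyond a known optimal count and that each step has a flat price. The step counts, volume, and price are hypothetical.

```python
def wasted_work_cost(step_counts, optimal_steps, daily_runs, price_per_step):
    """Average excess steps per run x daily volume x price per step, in dollars per day."""
    if not step_counts:
        raise ValueError("need at least one sampled run")
    excess = [max(0, n - optimal_steps) for n in step_counts]
    avg_excess_per_run = sum(excess) / len(excess)
    return avg_excess_per_run * daily_runs * price_per_step


# Hypothetical sample of runs whose optimal path is 3 steps.
sampled_step_counts = [3, 3, 5, 4]

cost = wasted_work_cost(sampled_step_counts, optimal_steps=3,
                        daily_runs=10_000, price_per_step=0.002)
print(f"${cost:.2f} per day")  # $15.00 per day
```

Running the same function over two variants' sampled trajectories gives a like-for-like efficiency comparison in dollars.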