Practice questions

80 questions across 5 domains. Try each before revealing the answer.

Evaluation Foundations

eaa-1-agent-vs-app mcq apply ◆◆◇◇

You have four LLM features. Which one is an agent rather than a fixed LLM application?

Why

The distinguishing mark of an agent is run-time reasoning about what to do next. A system that chooses among searching, querying, or answering, and then picks its next step from the result, makes routing decisions that a fixed application never makes. The single-call summariser and the email classifier are one-shot LLM calls with no routing. The RAG pipeline runs a predetermined retrieve-then-generate sequence: useful, but the path is fixed in advance, so it is an application, not an agent.

Options
a. A prompt that summarises a pasted document in a single LLM call
b. A RAG pipeline that retrieves, then generates an answer, in a fixed two-step sequence
c. A classifier that labels incoming emails by topic with one LLM call
d. A system that decides per request which tool to call, then picks its next step from the result

Show answer

Correct: A system that decides per request which tool to call, then picks its next step from the result

eaa-1-apply-layers free apply ◆◆◇◇

For a RAG-based customer-support agent, give one concrete evaluation activity at each of the three layers (model, application, agent), naming a metric for each.

Show answer

Model layer: measure the base LLM's factual accuracy on a curated QA set (metric: exact-match or accuracy) — e.g. does it state the correct refund window for a policy question. Application layer: evaluate the RAG retrieval pipeline (metric: NDCG@k or recall@k over query–document pairs) — do the correct knowledge-base articles come back for a query? Agent layer: measure end-to-end multi-turn task completion (metric: resolution rate within N turns) — does the agent gather information, retrieve, draft, and escalate correctly across a conversation? The point is that each layer catches a different failure: a perfect model can still fail at the agent layer if routing or escalation is broken. Common wrong answer: giving three variants of end-to-end accuracy — that only exercises the agent layer and cannot localise a retrieval or model fault.
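The application-layer metric named above, recall@k, can be sketched in a few lines. This is a minimal illustration only; the article IDs and the labelled query are hypothetical, not taken from any real knowledge base.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    hits = sum(1 for doc_id in relevant_ids if doc_id in top_k)
    return hits / len(relevant_ids)


# Hypothetical labelled example: the ranked article IDs the retriever
# returned for one query, and the articles a reviewer marked as relevant.
retrieved = ["kb-12", "kb-07", "kb-33", "kb-02"]
relevant = {"kb-07", "kb-02"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only kb-07 is in the top 3
```

Averaging this value over the whole query set gives the retrieval score for the application layer, independent of how the model or the agent loop behaves.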

eaa-1-benchmark-vs-domain free apply ◆◆◆◇

Your manager wants to choose a model for a legal-document assistant purely by leaderboard ranking. Make the case for what additional evaluation is required and why the leaderboard alone is insufficient.

Show answer

Leaderboards measure isolated-model capability on general benchmarks (model evaluation); they say nothing about how the model performs inside your prompts, retrieval, and tools on *legal* data (system evaluation). Benchmark scores also suffer saturation, possible training contamination, and distribution mismatch with a specialised legal domain. The right process: use the leaderboard to shortlist two or three candidates, then run system evaluation on a curated set of representative legal queries with domain-appropriate criteria — citation correctness, hallucination rate, refusal on out-of-scope questions — and pick the winner on that. Common wrong answer: "the top-ranked model is the safe choice" — ranking predicts general capability, not domain performance; the only decision-grade signal is evaluation on your own data.

eaa-1-classify-layer mcq apply ◆◆◇◇

You measure whether your agent selects the correct tool for different query types. In the model / application / agent layer model, which layer does this activity belong to?

Why

Measuring tool selection is routing accuracy, and routing is the defining behaviour of the agent layer (Layer 3) — multi-step systems that choose what to do next. It is not the model layer, which tests the isolated LLM on fixed benchmarks regardless of your tools. It is not the application layer, which evaluates domain prompts and retrieval end to end without a routing decision. And routing is squarely inside the layer model, not outside it — the agent layer exists precisely to capture decisions like this one.

Options
a. Model layer — it tests the LLM's raw capability
b. Agent layer — tool selection is a routing decision specific to multi-step agents
c. Application layer — it tests your domain-specific prompts end to end
d. It belongs to no single layer, because routing sits outside the layer model

Show answer

Correct: Agent layer — tool selection is a routing decision specific to multi-step agents

eaa-1-decompose-agent free analyze ◆◆◇◇

A travel-booking agent can search flights, check a loyalty balance, hold a reservation, and draft a confirmation email, routed by LLM function calling. Decompose it into router / skills / memory, and say which you would evaluate first and how.

Show answer

Router: the LLM function-calling step that chooses among the four skills. Skills: flight search, loyalty-balance check, reservation hold, and email drafting (each possibly multi-step). Memory/state: the trip parameters and conversation history carried between steps. Evaluate the router first — if it picks the wrong skill, all downstream work is wasted, and routing is the most common origin of agent failure. Build a labelled dataset of user messages paired with the expected skill and measure per-category routing accuracy, then move to the lowest-scoring skill. Common wrong answer: "evaluate the final email quality" — that is an end-to-end measure that tells you *that* the agent failed, not *where*; a routing error would masquerade as a bad skill output and send you debugging the wrong component.

eaa-1-define-agent free understand ◆◇◇◇

Define an AI agent and name its three core capabilities. In one sentence, what distinguishes it from a fixed prompt chain?

Show answer

An AI agent is a software system that takes actions on a user's behalf using reasoning. Its three core capabilities are reasoning (the LLM decides what to do), routing (selecting which tool or skill to call), and action (executing the tool), looping on the result. What distinguishes it from a fixed prompt chain is that the agent decides *what to do next* at run time, rather than following a predetermined sequence of steps. Common wrong answer: "an agent is any application that calls an LLM" — that conflates an application with an agent; without run-time routing and decision-making it is a fixed application, not an agent.

eaa-1-edd-applied free apply ◆◆◇◇

A team runs evaluations only the night before each release and keeps shipping regressions they trace days later. Rework their workflow using evaluation-driven development, and say concretely what changes.

Show answer

Move evaluation from a once-per-release gate to a per-change feedback loop. Concretely: maintain a versioned evaluation dataset and a one-command eval run; on every prompt or code change, trace the agent, run the component and end-to-end evals, and read the scores before merging — not the night before release; then iterate immediately on the worst-scoring component. Wire the eval run into CI so it cannot be skipped. The change is timing and granularity: regressions surface at the commit that caused them (cheap to localise) instead of accumulating until a pre-release batch run (expensive to trace back days later). Common wrong answer: "run the full eval suite twice before release instead of once" — that is still a gate; it does nothing to tell you *which change* introduced a regression.

eaa-1-edd-loop free understand ◆◇◇◇

Describe the three steps of evaluation-driven development and explain how it differs from running evaluations as a pre-deployment gate.

Show answer

The three steps are: trace (capture each step's inputs and outputs), evaluate (score each component against its criteria — routing accuracy, skill output quality, and so on), and iterate (fix the worst-scoring component) — repeated on every code or prompt change. The difference from a pre-deployment gate is timing: evaluation-driven development makes evaluation the primary feedback loop *during* development, so regressions surface immediately and are cheap to localise, whereas a gate catches them only at the end when they are expensive to trace back. Common wrong answer: "it's the same as running the test suite before release" — that is exactly the gate model EDD replaces; EDD evaluates continuously, not once at the end.

eaa-1-eval-strategy-critique free analyze ◆◆◆◇

A team proposes to catch every agent failure by tracking a single metric: end-to-end task-resolution rate. Evaluate this strategy — what will it catch, what will it miss, and what would you add?

Show answer

End-to-end task-resolution rate is necessary but not sufficient. What it catches: whether the agent succeeded overall, reflecting real user outcomes — keep it. What it misses: *where* and *why* it failed. A single number aggregates router, skill, and state failures, so a drop can't tell you which component broke; the multiplicative-accuracy effect means several mediocre components can sink the rate invisibly; and rare failure modes hide inside the average. What to add: component-level evals (routing accuracy, per-skill output quality) and trajectory/trace inspection so a regression localises to a layer, plus slicing by query type to expose rare failures. Layer the metric to the failure mode — end-to-end for outcome, component for diagnosis. Common wrong answer: "resolution rate is ground truth, so it's enough" — it is the right *outcome* metric but cannot localise failures, which is exactly what you need to fix them.

eaa-1-model-vs-system mcq understand ◆◆◇◇

A team reports their agent “passed evaluation” because the underlying model ranks in the top tier on MMLU and HumanEval. What is the most important limitation of this claim?

Why

A high benchmark score is model evaluation — it measures the isolated model’s general capability, not how that model behaves inside the team’s prompts, retrieval, and tools on their own data, which is system evaluation. The age of the benchmarks isn’t the core issue; they remain useful as a capability signal. Swapping to GSM8K just substitutes one general benchmark for another — still not their domain. And benchmark evaluation isn’t something you re-run per prompt change; that frequency describes system evaluation. The gap is that “the model is capable” was mistaken for “our system works.”

Options
a. Benchmark scores measure isolated-model capability, not performance inside their prompts, retrieval, and tools
b. MMLU and HumanEval are outdated benchmarks that no longer correlate with model capability
c. The team should have used GSM8K, which better matches agent tasks
d. Benchmark evaluation is valid only if it is re-run on every prompt change

Show answer

Correct: Benchmark scores measure isolated-model capability, not performance inside their prompts, retrieval, and tools

eaa-1-multiplicative-accuracy mcq analyze ◆◆◇◇

An agent routes correctly 90% of the time, and its single skill produces a correct output 90% of the time, independently. What end-to-end success rate should you expect, and what does it imply?

Why

End-to-end success requires both independent steps to succeed, so the probabilities multiply: $0.9 \times 0.9 = 0.81$. Independence does not prevent compounding; it is the condition under which the rates multiply exactly, so the claim that independent steps don't compound is the error. Success rates are not averaged (averaging 90% and 90% would give 90% in any case), and chaining components lowers reliability rather than raising it toward 99%. The takeaway: two components that each look fine at 90% yield a system that fails roughly one in five times, which is the quantitative case for evaluating components, not just the endpoint.

Options
a. 90% — the steps don't compound because they are independent
b. 95% — independent steps average their success rates
c. 81% — independent step successes multiply (0.9 × 0.9)
d. 99% — combining two strong components raises overall reliability

Show answer

Correct: 81% — independent step successes multiply (0.9 × 0.9)
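The same arithmetic extends to longer chains. A small sketch, assuming every step must succeed and the steps are independent (the function name is illustrative):

```python
from math import prod


def end_to_end_success(step_success_rates):
    """Expected end-to-end success when every independent step must succeed."""
    return prod(step_success_rates)


# Router at 90% followed by one skill at 90%.
print(round(end_to_end_success([0.9, 0.9]), 2))  # 0.81

# Five chained steps at 90% each: reliability keeps falling as the chain grows.
print(round(end_to_end_success([0.9] * 5), 2))  # 0.59
```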

eaa-1-nondeterminism-testing free understand ◆◆◇◇

Explain why non-determinism makes LLM system testing fundamentally different from traditional software testing. Why is setting temperature to 0 not a complete fix?

Show answer

Traditional testing assumes identical inputs yield identical outputs, so a single passing run proves correctness. LLM systems violate this: the same input can produce different outputs across runs (sampling temperature, floating-point non-associativity, batching, provider-side model updates). A single run is therefore a *sample*, not a proof — a change can raise average quality while introducing rare failures one run never surfaces. So testing must be statistical over a representative dataset, re-run on every change, comparing score distributions rather than asserting one pass/fail. Temperature 0 reduces variation but does not eliminate it (floating-point, batching, and provider drift remain), so it does not restore determinism. Common wrong answer: "set temperature to 0 and it becomes deterministic" — it lowers variance but production outputs still drift.

eaa-1-routing-coverage mcq analyze ◆◆◇◇

Your agent uses distributed routing across a LangGraph graph. Compared with a single centralised router, what most directly raises your evaluation burden?

Why

With routing spread across the graph, a wrong turn can originate at any node, so you must build coverage for every decision point instead of one — that is the core cost. Distributed routing increases flexibility, not decreases it, so the “more rigid paths” claim inverts the trade-off. It does not eliminate the routing dataset; if anything you need routing checks at several nodes. And distributing the routing does not make it deterministic, so end-to-end tests alone still miss where a misroute happened.

Options
a. Distributed routing reduces flexibility, so you must test more rigid paths
b. Routing decisions occur at many nodes, so coverage must reach every one
c. Distributed routing removes the need for a routing dataset, shifting all effort to skills
d. Distributed graphs are deterministic, so end-to-end tests alone suffice

Show answer

Correct: Routing decisions occur at many nodes, so coverage must reach every one

eaa-1-routing-tradeoff free analyze ◆◆◇◇

Compare centralised and distributed routing, focusing on the trade-off for evaluation coverage. When would the harder-to-evaluate option still be the right choice?

Show answer

Centralised routing puts every decision in a single router: easy to trace and evaluate (one decision point, one routing dataset to build), but a bottleneck for complex branching flows. Distributed routing (LangGraph, Swarm-style graphs) spreads routing across the graph: more flexible and scalable for complex or parallel multi-agent flows, but harder to evaluate because routing decisions occur at many nodes, each a separate place a wrong turn can originate — so evaluation coverage must reach every node, not just one. The distributed option is the right choice when the workflow is genuinely too complex or parallel for a single router and you accept the cost of instrumenting and evaluating each decision point. Common wrong answer: "distributed is better because it is more flexible" — flexibility trades directly against evaluation coverage; for many agents a centralised router is far easier to make reliable.

eaa-1-statistical-eval free apply ◆◆◇◇

You changed a summarisation prompt and one demo looked great. Design a check that would actually tell you whether the change is safe to ship.

Show answer

Run the new prompt against a fixed, representative evaluation dataset — dozens to hundreds of inputs covering the real query distribution — not one demo. Generate several samples per input to estimate variance, score outputs against defined criteria (a quality metric and/or LLM-as-a-judge), and compare the *distribution* of scores to the previous prompt: check the average AND the tail (rare failures), plus any regression in downstream components that consume the summary. Ship only if the aggregate improves with no tail regression. Common wrong answer: "re-run the demo a few times and eyeball it" — too few samples on unrepresentative inputs; you need a curated dataset and a statistical comparison to detect the rare failures that hide behind a good average.
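One way to make "check the average AND the tail" concrete is sketched below. The 10% tail fraction, the score scale, and the ship rule are assumptions chosen for illustration; a real check would also want enough samples to separate signal from run-to-run noise.

```python
from statistics import mean


def summarise(scores, tail_fraction=0.1):
    """Return (mean score, mean of the worst tail_fraction of scores)."""
    ordered = sorted(scores)
    tail_size = max(1, int(len(ordered) * tail_fraction))
    return mean(ordered), mean(ordered[:tail_size])


def safe_to_ship(old_scores, new_scores, tail_fraction=0.1):
    """Ship only if the average improves and the worst-case tail does not regress."""
    old_mean, old_tail = summarise(old_scores, tail_fraction)
    new_mean, new_tail = summarise(new_scores, tail_fraction)
    return new_mean > old_mean and new_tail >= old_tail


# Hypothetical per-input quality scores on the same evaluation dataset.
old_prompt = [0.7] * 10
new_prompt = [0.9] * 9 + [0.1]  # better on average, but one severe failure

print(safe_to_ship(old_prompt, new_prompt))  # False: the tail regressed
```

The new prompt wins on the mean yet is rejected, which is the case a single good demo would never reveal.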

eaa-1-wrong-layer mcq analyze ◆◆◆◇

To catch a tool-misrouting bug seen in production, a colleague proposes benchmarking the base model on MMLU. Why does this target the wrong layer?

Why

Misrouting is a routing failure, which lives at the agent layer; MMLU measures the isolated model’s knowledge at the model layer, so it never runs the routing step where the bug occurs. The benchmark’s difficulty is irrelevant — no knowledge benchmark observes routing, so making it harder or swapping it changes nothing. A per-prompt unit test also misses it: routing depends on run-time, non-deterministic decisions across the whole input distribution, not a single deterministic prompt. The failure mode and the measurement must live at the same layer.

Options
a. MMLU is too easy for modern models, so it would not surface the bug
b. The bug should instead be caught with a unit test on each prompt
c. MMLU should be swapped for a harder knowledge benchmark to surface the bug
d. Misrouting is an agent-layer failure; MMLU measures model-layer knowledge, not routing

Show answer

Correct: Misrouting is an agent-layer failure; MMLU measures model-layer knowledge, not routing

Observability & Tracing

eaa-2-auto-vs-manual mcq analyze ◆◆◇◇

You add automatic instrumentation to your OpenAI client and see every LLM call, but the router’s decisions stay invisible. Why?

Why

Automatic instrumentation hooks the model API client, so it captures model calls but not your router, which is ordinary orchestration code rather than an API call — that gap is exactly what manual spans fill. It does not hide LLM calls (it surfaces them), the router isn’t dropped for running elsewhere, and routers aren’t excluded for being deterministic — an LLM-based router isn’t deterministic in any case.

Options
a. Automatic instrumentation hooks the LLM client only; the router needs a manual span
b. Automatic instrumentation traces routers but hides LLM calls by default
c. The router runs on a different machine, so its spans are dropped in transit
d. Routers are deterministic and are therefore excluded from tracing

Show answer

Correct: Automatic instrumentation hooks the LLM client only; the router needs a manual span

eaa-2-cost-calc free apply ◆◆◇◇

An agent runs 2,000 queries/day. The router uses 1 call of 200 tokens per query. One skill, used on 40% of queries, makes 1 LLM call of 1,000 tokens. At $0.02 per 1,000 tokens, what does the router cost per day, and what does that one skill cost per day?

Show answer

Router: 2,000 × 1 × 200 = 400,000 tokens/day → 400 × $0.02 = $8.00/day. Skill: 2,000 × 0.40 × 1 × 1,000 = 800,000 tokens/day → 800 × $0.02 = $16.00/day. So the skill, despite running on only 40% of queries, costs twice the router. Method: tokens = queries × share × calls × tokens-per-call, then tokens ÷ 1,000 × price. Common wrong answer: forgetting the 40% share and charging the skill on all 2,000 queries ($40/day) — the per-skill frequency taken from the traces is exactly what prevents that error.
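The method in the answer translates directly into a small helper. The function and parameter names are illustrative; the numbers are the ones given in the question.

```python
def daily_cost(queries_per_day, share, calls_per_query, tokens_per_call, price_per_1k_tokens):
    """tokens = queries x share x calls x tokens-per-call, then tokens / 1,000 x price."""
    tokens = queries_per_day * share * calls_per_query * tokens_per_call
    return tokens / 1000 * price_per_1k_tokens


router = daily_cost(2000, 1.0, 1, 200, 0.02)   # runs on every query
skill = daily_cost(2000, 0.40, 1, 1000, 0.02)  # runs on 40% of queries

print(f"router ${router:.2f}/day, skill ${skill:.2f}/day")
# router $8.00/day, skill $16.00/day
```

Passing `share=1.0` for the skill reproduces the common wrong answer of $40/day, which shows how much the per-skill frequency from the traces matters.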

eaa-2-cost-dominant mcq analyze ◆◆◇◇

A skill is 20% of queries but 42% of LLM spend. From the trace data, what is the most defensible first optimisation?

Why

The per-span token counts point to the specific sub-step burning tokens, so moving just that high-token step to a cheaper model is the targeted fix the trace data justifies. Cutting the skill’s query share changes the product rather than its efficiency; switching the whole agent to a cheaper model risks quality everywhere to fix one skill; and caching to stop model calls breaks any query that needs a fresh result. The trace is what lets you optimise precisely.

Options
a. Reduce that skill's query share by routing fewer queries to it
b. Switch the whole agent to a cheaper model to cut total cost
c. Move that skill's highest-token sub-step to a cheaper model
d. Cache the skill's outputs so it stops calling the model

Show answer

Correct: Move that skill's highest-token sub-step to a cheaper model

eaa-2-debug-flat-trace free analyze ◆◆◇◇

A teammate says “tracing is on but useless — it’s just one big span.” Diagnose what is missing and give the concrete fix that makes router and tool boundaries visible.

Show answer

Tracing is on but only the outermost (or a single) span is being created — there is no manual instrumentation of the orchestration, so the router/tool structure is collapsed into one node. Concrete fix: instrument outside-in — start_as_current_span for an Agent span around the run, a Chain span inside each router-loop iteration, and a Tool span around each skill dispatch. With current-context nesting the auto-traced LLM calls drop into place and the tree becomes Agent contains Chain contains Tool contains LLM. Now a misroute localises to a Chain span and a bad tool output to its Tool span. Common wrong answer: "increase log verbosity" — more flat logs add no structure; you need nested spans at the boundaries.

eaa-2-define-trace-span free understand ◆◇◇◇

Define a trace and a span, give an example span kind for each level, and state how they relate.

Show answer

A trace is the end-to-end record of one agent run; a span is one step inside it. Example span kinds: an Agent span for the whole run, a Chain span for a router iteration, a Tool span for a skill invocation, an LLM span for a single model call. They relate as a tree: the trace is the whole tree, its root is the outermost (Agent) span, and every other span nests beneath that root sharing one trace ID, so the nesting mirrors the agent's structure (Agent contains Chains, which contain Tools, which contain LLM calls). Common wrong answer: "a trace is just a list of log lines". It is a hierarchical tree, and that hierarchy is what makes it navigable and evaluable.

eaa-2-eval-workflow-from-traces free apply ◆◆◇◇

You have a week of production traces in Phoenix. Design a workflow that uses them to catch a routing regression before your next release.

Show answer

Build a labelled evaluation dataset from the traces: sample real queries and their correct skill (the router target), read from the Chain/Tool spans. On each candidate release, re-run the agent over that fixed dataset, extract the router's chosen skill from the new traces, and compute per-category routing accuracy. Compare against the previous release's score and gate the release if accuracy drops on any category. Optionally layer an LLM-as-judge over skill outputs for quality. The point: traces supply both the dataset and the per-version comparison, so a regression appears as a score delta rather than a user complaint. Common wrong answer: "eyeball a few recent traces before release" — unrepresentative and not comparable across versions; you need a fixed dataset and a numeric comparison.
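The comparison step of that workflow might look like the sketch below. It assumes the expected and chosen skills have already been extracted from the traces into plain tuples; the row format, category names, and zero-tolerance gate are hypothetical choices, and no Phoenix API is used here.

```python
from collections import defaultdict


def routing_accuracy_by_category(rows):
    """rows: iterable of (category, expected_skill, chosen_skill) tuples."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, expected, chosen in rows:
        totals[category] += 1
        correct[category] += int(expected == chosen)
    return {category: correct[category] / totals[category] for category in totals}


def regressed_categories(baseline, candidate, tolerance=0.0):
    """Categories whose accuracy dropped by more than tolerance versus the baseline."""
    return sorted(
        category
        for category, base_acc in baseline.items()
        if candidate.get(category, 0.0) < base_acc - tolerance
    )


# Hypothetical rows extracted from traces of the previous and candidate releases.
previous_rows = [
    ("billing", "refund", "refund"),
    ("billing", "refund", "refund"),
    ("search", "search_kb", "search_kb"),
    ("search", "search_kb", "search_kb"),
]
candidate_rows = [
    ("billing", "refund", "refund"),
    ("billing", "refund", "search_kb"),  # new misroute
    ("search", "search_kb", "search_kb"),
    ("search", "search_kb", "search_kb"),
]

baseline = routing_accuracy_by_category(previous_rows)
candidate = routing_accuracy_by_category(candidate_rows)
print(regressed_categories(baseline, candidate))  # ['billing']
```

A non-empty result is the signal to block the release; a CI job could fail on it so the gate cannot be skipped.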

eaa-2-fix-nesting free apply ◆◆◆◇

An engineer creates a span for every LLM call but starts each as a new root span. The Phoenix trace is a flat list of LLM spans. What went wrong, and how do you fix it?

Show answer

They created spans bottom-up and as roots, so nothing establishes parent–child links — each LLM span becomes its own tiny trace instead of a nested step, and Phoenix shows a flat list. Fix: instrument outside-in using current-span context — open an Agent span for the run, a Chain span per router iteration, and a Tool span per skill with start_as_current_span (or the equivalent context manager), so each inner span attaches to the currently-active parent. The LLM-call spans then nest under their Tool/Chain parents automatically. Common wrong answer: "add a parent_id attribute to each span by hand" — brittle and easy to get wrong; let the tracer's current-context nesting build the tree.

eaa-2-flat-trace mcq analyze ◆◆◇◇

Your trace is one flat span for the whole run. To localise a misrouted tool call, which manual spans most directly restore the needed structure?

Why

Adding Agent, Chain (per router iteration), and Tool (per skill) spans rebuilds the router and tool boundaries, so a misroute localises to a Chain span and a bad tool result to its Tool span. A lone LLM span around the answer, or one Tool span wrapping everything, leaves the trace effectively flat; metric counters give totals but no structure to navigate, which is exactly what debugging a misroute needs.

Options
a. A single LLM span around the final answer
b. One Tool span around the entire agent, and nothing else
c. Metric counters for total tokens and latency
d. Agent, Chain-per-iteration, and Tool-per-skill spans

Show answer

Correct: Agent, Chain-per-iteration, and Tool-per-skill spans

eaa-2-instrument-agent free apply ◆◆◇◇

You have an agent with a router loop that dispatches skills, each of which makes one or more LLM calls. Describe how you would instrument it so the trace shows the full structure.

Show answer

Apply outside-in. Wrap run_agent() in an Agent span. Inside the router loop, wrap each iteration in a Chain span. Inside each iteration, wrap every skill/tool call in a Tool span named for the function. Leave the model calls to automatic instrumentation — they appear as LLM spans nested correctly because the enclosing with-blocks are the current span. The nesting falls out of the block structure, giving Agent contains Chain contains Tool contains LLM. Common wrong answer: "wrap each LLM call in its own top-level span" — that produces a flat list of LLM spans with no router or tool structure; start from the outermost layer and let inner spans attach to the current parent.

eaa-2-minimal-rollout free apply ◆◇◇◇

You need observability for a demo tomorrow on a 200-line OpenAI agent. Give the minimal rollout and name two things that stay invisible until you add manual instrumentation.

Show answer

Three lines, zero agent edits: register() to point at the Phoenix collector, OpenAIInstrumentor().instrument() to auto-trace every model call, and launch Phoenix. Visible immediately: every LLM call's prompt, completion, token count, and latency. Invisible until manual spans are added: the router's iterations and decisions, and the tool dispatch (which skill ran, with what inputs/outputs) — plus things like retry logic. They are invisible because automatic instrumentation only hooks the LLM client, not your orchestration code. Common wrong answer: "manually instrument everything first" — too slow for a demo deadline; automatic gets you from zero to useful in minutes.

eaa-2-span-kind-map mcq understand ◆◆◇◇

A single model call runs inside a tool that the router selected. From outermost to innermost, how do the OpenInference span kinds nest?

Why

The outermost unit is the whole run (Agent), then the router iteration that chose the tool (Chain), then the tool invocation (Tool), then the model call inside it (LLM). The orderings that place LLM or Chain outermost invert containment — the model call is the innermost unit and the agent run is the root. The variant that buries Chain beneath Tool misorders routing and execution: the router decides before the tool runs.

Options
a. LLM ⊃ Tool ⊃ Chain ⊃ Agent
b. Chain ⊃ Agent ⊃ Tool ⊃ LLM
c. Agent ⊃ Chain ⊃ Tool ⊃ LLM
d. Agent ⊃ Tool ⊃ LLM ⊃ Chain

Show answer

Correct: Agent ⊃ Chain ⊃ Tool ⊃ LLM

eaa-2-trace-tree-debug free analyze ◆◆◇◇

A trace shows the agent route to a search_kb tool (which makes a query-generation LLM call then a knowledge-base API call), then an answer tool. The knowledge base clearly contains the answer, yet the agent replied “I don’t know.” Which span do you inspect first, and what two causes does that localise to?

Show answer

Inspect the search_kb Tool span first — specifically its two children: the "Generate Query" LLM span and the "KB Search" API span. Two causes it localises: (1) the generated search query was poor or empty (visible in the Generate Query span), so retrieval missed the article; or (2) the query was fine but the API span returned no or irrelevant results (a retrieval/index problem). If both look correct, move up to the answer Tool span to see whether good context was retrieved but ignored. Common wrong answer: "inspect the final answer LLM span" — that is the symptom; the cause is upstream in retrieval, which the span tree lets you reach directly.

eaa-2-trace-vs-span mcq understand ◆◇◇◇

In an OpenTelemetry-instrumented agent, what is the relationship between a trace and a span?

Why

A trace is the end-to-end record of one run; a span is a single nested step, and spans form a tree under the trace sharing one trace ID. Swapping the two inverts the containment. They are not interchangeable names for one model call — a single call is one LLM span, not a whole trace. And both record timing; a span captures inputs, attributes, and status, not merely a final output.

Options
a. A trace is the whole run; a span is one nested step, and spans form a tree
b. A span is the whole run; a trace is one step within the span
c. Trace and span are interchangeable names for a single LLM call
d. A trace records timing while a span records only the final output

Show answer

Correct: A trace is the whole run; a span is one nested step, and spans form a tree

eaa-2-traces-enable-eval mcq understand ◆◆◇◇

Why are collected traces called the “bedrock for evaluation at scale”?

Why

Each trace stores full inputs, outputs, and intermediate steps, so automated evaluators can score thousands of real runs and compare across versions — that is what “at scale” means. The value isn’t compression or storage savings; traces don’t make a non-deterministic agent deterministic; and they complement, rather than replace, a curated test dataset (you still need known-answer cases for some checks).

Options
a. They compress the agent's logs so storage costs drop
b. Each trace holds full inputs/outputs/steps, so evaluators can score thousands of runs
c. They make the agent deterministic, which evaluation requires
d. They remove the need for a curated test dataset

Show answer

Correct: Each trace holds full inputs/outputs/steps, so evaluators can score thousands of runs

eaa-2-what-to-add-next free analyze ◆◆◇◇

After the demo, your auto-instrumented agent shows a flat list of LLM spans. Prioritise which manual spans to add next, and justify the order.

Show answer

Add the Agent span first (wrap run_agent) so every later span has a root and traces group per run. Next the Chain span per router iteration — routing bugs live here, so it is the highest-value debugging boundary. Then Tool spans per skill, so per-skill cost/latency attribution and skill-level failures become visible. Lowest priority: wrapping individual sub-steps and retry logic, added where a specific skill needs a finer breakdown. The order follows debugging value: structure (Agent) → routing (Chain) → skills (Tool) → details. Common wrong answer: "wrap individual sub-steps first" — fine-grained spans are useless without the enclosing Agent/Chain structure to hang them on.

eaa-2-when-auto-vs-manual free apply ◆◆◇◇

When is automatic instrumentation the right choice, and when do you need manual instrumentation? Give the rule you would actually follow.

Show answer

Use automatic instrumentation when you want immediate LLM-level visibility with no code changes — prototypes, demos, or any case where you mainly care about prompts, completions, tokens, and latency. Use manual instrumentation when you need to see your own logic: router iterations, tool dispatch, retries, multi-step skills — anything that is not an LLM API call. In practice you combine them: auto first for instant coverage, then manual spans wherever the trace shows gaps. Lean manual-heavy when the failures you debug are routing/orchestration bugs; auto-only is fine when the only thing that varies is the model call. Common wrong answer: "manual is always better because it is more complete" — it is also more work and error-prone (a missed span is a hole); start auto and add manual only where needed.

Component Evaluations

eaa-3-compare-techniques free understand ◆◇◇◇

Compare the three evaluation techniques on accuracy and scalability, and state when you would use each.

Show answer

Code-based evaluation: programmatic checks (exact match, JSON validation, SQL result-set diff, cosine similarity) — 100% reproducible, free, fast, but only for codifiable criteria. LLM-as-a-judge: a separate LLM classifies the output into discrete rails — scales to subjective/qualitative output and handles nuance, but is never fully accurate. Human annotation: human labellers or end-user feedback — the most accurate, the standard for safety-critical or ambiguous judgments, but slowest to scale and prone to selection bias. Use code-based when you can write an assert, an LLM judge for prose/judgment, and humans where a wrong answer is dangerous. Common wrong answer: "LLM-as-a-judge is strictly best because it scales and is accurate" — it scales but is never fully accurate, and it's wasteful where a deterministic check would do.

eaa-3-design-router-eval free analyze ◆◆◇◇

Design an evaluation for an agent’s router that tests both function-calling choice and parameter extraction. Specify the dataset and the metrics.

Show answer

Build a labelled dataset of representative queries, each tagged with the ground-truth tool and the ground-truth parameter values. Then score two independent dimensions: (1) tool-selection accuracy — exact match of predicted vs expected tool; (2) parameter-extraction accuracy — given the correct tool, whether the extracted argument values match the expected ones. Report them separately (and the product as end-to-end router accuracy), with a per-category breakdown so you can see which query types misroute. This localises the bottleneck: a strong tool score with a weak parameter score tells you where to invest. Common wrong answer: "measure one combined router-correct/incorrect score" — it hides which dimension fails, and the two fail independently, so you can't tell tool errors from parameter errors.
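
A minimal sketch of the two-dimension scoring described above. The row schema (expected_tool, predicted_tool, and so on) and the tool names are assumptions made for illustration; parameter accuracy is computed only over rows where the tool was correct, so the product is the end-to-end rate.

```python
from collections import defaultdict


def score_router(cases):
    """Score tool selection and parameter extraction as separate dimensions."""
    tool_hits = 0
    param_hits = 0
    by_category = defaultdict(lambda: {"n": 0, "tool_ok": 0})
    for case in cases:
        tool_ok = case["predicted_tool"] == case["expected_tool"]
        by_category[case["category"]]["n"] += 1
        by_category[case["category"]]["tool_ok"] += tool_ok
        if tool_ok:
            tool_hits += 1
            # Parameters are only judged once the tool is right.
            param_hits += case["predicted_params"] == case["expected_params"]
    tool_acc = tool_hits / len(cases)
    param_acc = param_hits / tool_hits if tool_hits else 0.0
    return {
        "tool_accuracy": tool_acc,
        "param_accuracy": param_acc,
        "end_to_end": tool_acc * param_acc,
        "by_category": dict(by_category),
    }


# Hypothetical labelled rows.
cases = [
    {"category": "shipping", "expected_tool": "shipping_status",
     "expected_params": {"order_id": "12345"},
     "predicted_tool": "shipping_status",
     "predicted_params": {"order_id": "12345"}},
    {"category": "shipping", "expected_tool": "shipping_status",
     "expected_params": {"order_id": "777"},
     "predicted_tool": "shipping_status",
     "predicted_params": {"tracking_id": "777"}},
    {"category": "sales", "expected_tool": "sales_report",
     "expected_params": {"start": "2024-01", "end": "2024-03"},
     "predicted_tool": "sales_report",
     "predicted_params": {"start": "2024-01", "end": "2024-03"}},
    {"category": "sales", "expected_tool": "sales_report",
     "expected_params": {"start": "2024-01", "end": "2024-03"},
     "predicted_tool": "shipping_status",
     "predicted_params": {}},
]
report = score_router(cases)
```

On these four rows the tool score is 3/4 and the parameter score is 2/3, so the report shows at a glance that parameter extraction is the weaker dimension.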

eaa-3-diagnose-dimension free analyze ◆◆◇◇

A router-handled query produced a wrong result. Describe how you determine whether the cause was tool selection or parameter extraction, and how you confirm it.

Show answer

Compare the trace against the ground-truth label on both dimensions separately. First check the chosen tool against the expected tool: if it differs, it's a tool-selection failure. If the tool matches, inspect the extracted arguments against the expected parameter values: a mismatch there is a parameter-extraction failure. Confirm by re-running a few near-identical queries where only the tool-ambiguity (or only the parameter difficulty) varies — if failures track the parameter variation while the tool stays correct, it's parameter extraction. The router's two-dimension breakdown is exactly what makes this attribution possible from the trace. Common wrong answer: "the answer was wrong, so the router failed" — that doesn't say *which* dimension; you must check tool and parameters separately.
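
The attribution check above could be written as a small function. This is a sketch under assumed shapes: the expected label and the traced call are both plain dicts with "tool" and "params" keys, which is a hypothetical format rather than any tracing library's schema.

```python
def diagnose(expected, actual):
    """Attribute a router failure to one dimension, checking the tool first."""
    if actual["tool"] != expected["tool"]:
        return "tool_selection"
    wrong = [k for k, v in expected["params"].items()
             if actual["params"].get(k) != v]
    extra = set(actual["params"]) - set(expected["params"])
    if wrong or extra:
        return "parameter_extraction"
    return "router_ok"


expected = {"tool": "shipping_status", "params": {"order_id": "12345"}}
traced = {"tool": "shipping_status", "params": {"tracking_id": "12345"}}
verdict = diagnose(expected, traced)
```

Here the tool matches but the order ID landed in tracking_id, so the verdict is a parameter-extraction failure.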

eaa-3-isolate-tests free analyze ◆◆◇◇

Design two test cases that isolate parameter-extraction failures from tool-selection failures, so each can be debugged independently. Explain why each isolates the intended dimension.

Show answer

To isolate parameter extraction, pick queries whose correct tool is unambiguous so tool selection can't be what fails, but whose parameters are easy to mis-extract. Example 1: "Check the shipping status for order 12345" — only the shipping tool plausibly applies, but the agent may pass 12345 as a tracking_id instead of an order_id. Example 2: "Show sales from January to March" — clearly the sales-report tool, but the agent may reverse the start and end dates. In both, a failure can only be parameter extraction, not tool choice. (To isolate tool selection instead, you'd use queries with trivial/no parameters but an ambiguous tool.) Common wrong answer: a query that is ambiguous about both the tool and the parameters — a failure there can't be attributed to one dimension.

eaa-3-judge-summaries free apply ◆◆◇◇

You need to evaluate a summarisation skill’s output quality at scale. Describe how you would apply LLM-as-a-judge, including the output format and how you keep the judge out of the agent’s traces.

Show answer

Use LLM-as-a-judge constrained to discrete rails — e.g. faithful/unfaithful, or complete/incomplete — never a numeric score. Write a judge template that states the criterion and asks for one rail label given the source and the summary; run it over the summary spans with llm_classify(rails=[...]); wrap the run in suppress_tracing() so the judge's own LLM calls don't pollute the agent's traces; then log_evaluations() to attach the labels back to the spans. Validate the judge against a small human-labelled sample to estimate its agreement before trusting it. Common wrong answer: "have the judge rate each summary 1–10 and average" — uncalibrated numeric scores are noise; rails are what make the judge reliable.
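
The rail constraint and the human-agreement check above can be sketched independently of any judging library. In this illustration the judge's raw outputs are hard-coded strings (a hypothetical stand-in for real judge responses), and an output that is not exactly one of the rails is kept as unparseable rather than coerced into a label.

```python
RAILS = ("faithful", "unfaithful")


def to_rail(raw_output, rails=RAILS):
    """Map a raw judge response onto a rail, or None if it is off-rail."""
    label = raw_output.strip().lower()
    return label if label in rails else None


def agreement(judge_labels, human_labels):
    """Fraction of rows where the judge matches the human label.

    Off-rail (None) judge outputs count as disagreements.
    """
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


raw_judge_outputs = ["Faithful", "unfaithful ", "7/10", "faithful"]
human_labels = ["faithful", "unfaithful", "faithful", "unfaithful"]

judge_labels = [to_rail(raw) for raw in raw_judge_outputs]
rate = agreement(judge_labels, human_labels)
```

Counting off-rail outputs as disagreements is a deliberate choice in this sketch: it keeps a judge that drifts back to numeric scores from looking better than it is.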

eaa-3-multiplicative mcq analyze ◆◆◇◇

Tool selection is 90% and parameter extraction is 80%, on independent dimensions. What is the end-to-end router accuracy, and why is it below both?

Why

Both independent steps must succeed, so the rates multiply: $0.90 \times 0.80 = 0.72$. The product falls below either factor because each step discards some of the other’s successes: it is neither their average (85%), nor capped at the lower factor (80%), nor dominated by the higher one (90%). This is the multiplicative-accuracy effect, and it’s why you fix the weaker dimension first.

Options a. 85% — the two dimensions average b. 80% — the lower dimension caps the result c. 72% — independent dimensions multiply (0.90 × 0.80) d. 90% — the higher dimension dominates

Show answer

Correct: 72% — independent dimensions multiply (0.90 × 0.80)

eaa-3-multiplicative-calc free apply ◆◆◇◇

A router has 92% tool-selection accuracy and 88% parameter-extraction accuracy, independent. Compute end-to-end router accuracy, explain why it is below both, and state what parameter accuracy you’d need (holding tool selection fixed) to reach 90% end-to-end.

Show answer

End-to-end router accuracy = 0.92 × 0.88 = 0.8096 ≈ 81%. It is below both factors because both independent dimensions must succeed, so each one discards a fraction of the other's successes: a product of two rates below 1 can never exceed the smaller factor (0.88). To reach 90% end-to-end you need the product ≥ 0.90; holding tool selection at 0.92, parameter extraction must reach 0.90 / 0.92 ≈ 0.978, i.e. ~97.8%. That is a demanding target for one dimension, so you'd realistically lift both dimensions, not one. Common wrong answer: "average them to ~90%". Independent component rates multiply, they don't average.
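
The arithmetic above as a short check. The function names are illustrative only, and the calculation assumes the two dimensions are independent, as the question states.

```python
def end_to_end(tool_acc, param_acc):
    """Both independent steps must succeed, so the rates multiply."""
    return tool_acc * param_acc


def required_param_acc(target, tool_acc):
    """Parameter accuracy needed to hit a target with tool accuracy fixed."""
    return target / tool_acc


current = end_to_end(0.92, 0.88)
needed = required_param_acc(0.90, 0.92)
```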

eaa-3-numeric-fails free understand ◆◇◇◇

Explain why numeric scoring scales fail for LLM judges, and what you use instead.

Show answer

An LLM cannot calibrate a fine numeric scale — it cannot reliably distinguish 79 from 83 — so the numbers are noise with the appearance of precision, and averaging them aggregates that noise into a confident-looking figure that doesn't track real quality. The fix is to use 2–4 discrete rails (e.g. correct/incorrect, or faithful/partially/unfaithful) so the judge does only what it does reliably: classify. If you need a finer signal, add more *labels* with clear definitions, not a wider numeric range. Common wrong answer: "use a larger judge model to calibrate 1–10" — the problem is the task (uncalibrated numeric scoring), not model size.

eaa-3-rails-why mcq understand ◆◆◇◇

An LLM judge is asked to “score answer quality from 1 to 10.” What is the core problem, and the fix?

Why

An LLM cannot reliably distinguish 83 from 79 — or 7 from 8 — so a numeric score is noise with the appearance of precision, and averaging such scores aggregates the noise. The fix is to ask only what the model does reliably: classify into 2–4 discrete rails. A finer scale (1–100) makes calibration worse, not better; the problem is calibration, not model size; and rounding to the nearest 5 doesn’t fix an uncalibrated scale.

Options a. The scale should be 1 to 100 for finer resolution b. LLMs are poor at calibrating a numeric scale; use 2–4 discrete rails instead c. The judge model is too small; a larger model calibrates 1–10 reliably d. Numeric scores are fine as long as you round them to the nearest 5

Show answer

Correct: LLMs are poor at calibrating a numeric scale; use 2–4 discrete rails instead

eaa-3-router-dimension mcq analyze ◆◆◇◇

An agent picks the correct tool but passes the customer’s order ID into a tracking_id parameter. Which statement is true?

Why

The tool was right, so tool-selection accuracy scores a pass and misses the bug entirely; only parameter-extraction accuracy — checking the argument values — catches it, which is exactly why the two dimensions are evaluated independently. It is not a tool-selection failure (the tool was correct). And end-to-end output quality is an unreliable detector: a wrong parameter sometimes yields a plausible-looking answer, so it can pass the end-to-end check while being wrong.

Options a. Tool-selection accuracy will catch this as a failure b. Only parameter-extraction accuracy catches it — tool selection scores it a pass c. End-to-end output quality reliably flags parameter errors on its own d. This is a tool-selection failure, not a parameter failure

Show answer

Correct: Only parameter-extraction accuracy catches it — tool selection scores it a pass

eaa-3-runnability-eval free apply ◆◆◇◇

Design a code-based evaluation for a visualisation skill that generates chart code. What does it check, and why is “it runs” not enough?

Show answer

Execute the generated visualisation code in a sandboxed subprocess with a timeout and no filesystem/network access; if it exits without an exception, mark it "runnable," else capture the error as the failure reason. That is a code-based check and catches syntax errors, bad API calls, and missing columns. But runnability is necessary, not sufficient: code that runs can still plot the wrong data or the wrong chart type. Layer a semantic check on top — compare the rendered output (or the data it plotted) against expectations, or use an LLM judge on a description of the chart — to catch "runs but wrong." Common wrong answer: "if it runs, it's correct" — runnability and correctness are different gates.
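
A minimal sketch of the runnability gate, assuming the generated code is Python. It runs the code in a separate interpreter with a timeout, but a plain subprocess is not a sandbox: the filesystem and network restrictions the answer calls for would need a container or OS-level isolation, which this sketch does not provide.

```python
import subprocess
import sys


def check_runnable(code, timeout_s=5.0):
    """Return (runnable, reason) for a snippet of generated Python code."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False, "timeout"
    if proc.returncode != 0:
        # Keep the last stderr line (usually the exception) as the reason.
        lines = proc.stderr.strip().splitlines()
        return False, lines[-1] if lines else "non-zero exit"
    return True, ""


ok, _ = check_runnable("x = 1 + 1")
failed, reason = check_runnable("raise ValueError('missing column')")
```

A passing result only clears the first gate; the semantic check on what was plotted still has to run afterwards.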

eaa-3-sql-edge-cases free analyze ◆◆◆◇

A teammate’s code-based SQL evaluator simply compares the generated query’s output list to the expected list. What edge cases will make it wrong, and how do you handle each — especially row ordering?

Show answer

Row ordering: SQL result order is non-deterministic without ORDER BY, so comparing result sets as ordered lists fails logically-correct queries — compare as sets (or sorted tuples). Floating-point columns: exact equality breaks on rounding, so compare numerics within ±ε. Timeouts: a bad query can full-scan or loop, so wrap execution in a timeout and treat it as a failure mode. Destructive queries: a generated DROP/DELETE could mutate the test DB, so run on a read-only connection or in a rolled-back transaction. Also distinguish failure modes (execution error vs wrong columns vs wrong rows) so the score carries a reason. Common wrong answer: "diff the two result lists directly" — that flags correct queries whose rows came back in a different order, the most common false negative.
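
Several of these cases can be sketched with sqlite3 from the standard library. The sales table and its rows are made up for illustration, and two simplifications are assumed: a query timeout is not implemented here, and sorting rows by their repr is a simple approximation of order-insensitive comparison that could misalign rows differing only by a tiny float error in an early column.

```python
import math
import sqlite3


def _cell_equal(x, y, eps):
    # Numeric cells are compared within a tolerance, everything else exactly.
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return math.isclose(x, y, abs_tol=eps)
    return x == y


def results_match(conn, generated_sql, expected_rows, eps=1e-6):
    """Return (matched, reason) so every failure carries a cause."""
    try:
        got = conn.execute(generated_sql).fetchall()
    except sqlite3.Error as exc:
        return False, f"execution_error: {exc}"
    if len(got) != len(expected_rows):
        return False, "wrong_row_count"
    got_sorted = sorted(got, key=repr)
    exp_sorted = sorted((tuple(r) for r in expected_rows), key=repr)
    for g, e in zip(got_sorted, exp_sorted):
        if len(g) != len(e):
            return False, "wrong_columns"
        if not all(_cell_equal(x, y, eps) for x, y in zip(g, e)):
            return False, "wrong_rows"
    return True, "match"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, total REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 0.1 + 0.2), ("west", 5.0)])
conn.commit()
conn.execute("PRAGMA query_only = ON")  # block destructive statements

expected = [("west", 5.0), ("east", 0.3)]  # different order, rounded float
same = results_match(conn, "SELECT region, total FROM sales", expected)
dropped = results_match(conn, "DROP TABLE sales", expected)
```

The expected rows are listed in a different order and with 0.3 instead of 0.1 + 0.2, yet the first query still matches; the DROP is rejected by the read-only pragma and reported as an execution error.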

eaa-3-suppress-tracing mcq understand ◆◆◇◇

Why wrap LLM-as-a-judge calls in suppress_tracing() during a Phoenix evaluation run?

Why

The judge is itself an LLM call, so without suppression its spans land in the same trace store and intermingle with the agent’s spans — a large eval run can roughly double the store with evaluation artifacts and corrupt trace analysis. It isn’t a performance trick, it doesn’t control the output format (rails do that, set separately), and it doesn’t change what the judge reads — it only governs whether the judge’s calls are traced.

Options a. So the judge's own LLM calls aren't recorded as spans in the agent's traces b. To speed up the judge by skipping all network logging c. To force the judge to return discrete rails instead of numeric scores d. To stop the judge from seeing the agent's prompts

Show answer

Correct: So the judge's own LLM calls aren't recorded as spans in the agent's traces

eaa-3-technique-by-output mcq apply ◆◆◇◇

A refund-calculator skill outputs a dollar amount with a known correct value. Which evaluation technique fits best?

Why

The output is quantitative with a known answer, so a deterministic numeric comparison (within a small tolerance for floating point) is exact, free, and reproducible — the textbook case for code-based evaluation. An LLM judge adds cost and error for something an assert settles; routing every dollar amount to a human doesn’t scale and isn’t needed when the correct value is known; and cosine similarity on text is the wrong tool for a precise number.

Options a. LLM-as-a-judge with rails, because money is sensitive b. Human annotation, because any financial output is safety-critical c. Cosine similarity on the text of the response d. Code-based — compare the number to the expected amount within tolerance

Show answer

Correct: Code-based — compare the number to the expected amount within tolerance

eaa-3-technique-cost free apply ◆◆◇◇

A support agent has three skills — an order-lookup that emits SQL, a complaint-summary that emits free text, and a refund-calculator that emits a dollar amount. Recommend an evaluation technique for each, then compute the daily cost at 1,000 evals per skill if an LLM judge costs $0.01/eval and a human costs $0.50/eval.

Show answer

Order-lookup (SQL) → code-based: execute and compare result sets, $0. Complaint-summary (free text) → LLM-as-a-judge with rails: qualitative, no ground-truth string, 1,000 × $0.01 = $10/day. Refund-calculator (dollar amount, known value) → code-based: numeric compare within tolerance, $0. Daily total = $10, because two of the three skills are code-evaluable — routing all three to an LLM judge would cost $30/day for no accuracy gain on the deterministic ones. Common wrong answer: "use an LLM judge for all three to be safe" — it triples cost and adds error to outputs a deterministic check settles exactly.
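
The cost comparison above as a small calculation. The prices come from the question; the skill names used as dictionary keys and the plan structure are illustrative.

```python
PRICE_PER_EVAL = {"code": 0.00, "llm_judge": 0.01, "human": 0.50}


def daily_cost(plan, evals_per_skill=1000):
    """Total daily cost for a mapping of skill -> evaluation technique."""
    return sum(PRICE_PER_EVAL[technique] * evals_per_skill
               for technique in plan.values())


recommended = {
    "order_lookup": "code",
    "complaint_summary": "llm_judge",
    "refund_calculator": "code",
}
judge_everything = {skill: "llm_judge" for skill in recommended}

recommended_cost = daily_cost(recommended)
judge_cost = daily_cost(judge_everything)
```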

eaa-3-three-techniques mcq understand ◆◇◇◇

Which statement correctly characterises the three evaluation techniques on the accuracy–scalability trade-off?

Why

The three sit on a trade-off: code-based is deterministic, free, and reproducible but limited to codifiable criteria; LLM-as-a-judge scales to subjective outputs yet is never fully accurate; human annotation is the most accurate and the least scalable. LLM-as-a-judge is not the most accurate (humans are) nor free of error. Code-based evaluation is precisely what can’t judge subjective prose — that’s the judge’s job. And human annotation is the slowest to scale, not the fastest, even with end-user feedback.

Options a. LLM-as-a-judge is both the most accurate and the cheapest, so it dominates the other two b. Code-based evaluation handles subjective prose quality better than an LLM judge c. Code-based: reproducible but codifiable-only; LLM-judge: scalable, imperfect; human: most accurate, slowest d. Human annotation scales best because end users supply feedback for free

Show answer

Correct: Code-based: reproducible but codifiable-only; LLM-judge: scalable, imperfect; human: most accurate, slowest

Trajectory & Structured Evals

eaa-4-compare-variants free analyze ◆◆◇◇

A dashboard shows V1 (baseline), V2 (new prompt: small gains on routing/SQL/clarity, +0.1s latency), and V3 (new model: best routing/SQL, worse clarity/runnability, 2.8s latency vs 1.2s). Which would you ship, and what else do you need before committing?

Show answer

Read the whole row, not an aggregate. V2 (new prompt) improves routing (+2pp) and SQL (+3pp), improves clarity, holds runnability, and adds only ~0.1s latency: broad gains, no meaningful regression. V3 (new model) has the best routing and SQL but regresses clarity and runnability and more than doubles latency (2.8s vs 1.2s). Ship V2: it's a clear improvement with no dimension clearly worse, whereas V3 buys its correctness gains with clarity, runnability, and latency regressions that a latency SLO might reject. Before committing, get statistical significance, a per-category breakdown, cost/query, and a small A/B on user satisfaction. Common wrong answer: "ship V3 because its routing and SQL are highest". That ignores the clarity, runnability, and latency regressions the dashboard exists to surface.

eaa-4-comprehensive free understand ◆◇◇◇

Why should an evaluation dataset be “comprehensive, not exhaustive,” and what are the two key types it contains?

Show answer

"Comprehensive, not exhaustive" means cover every input *type* with 1–2 examples rather than piling up 100 near-duplicates per category. The reason is iteration speed: a small, varied dataset runs fast and stays cheap to maintain, so you can re-run it on every change — while still spanning the behaviour space enough to catch regressions. A bloated dataset slows each experiment and adds little new signal once a type is represented. Input keys are forwarded to the agent; output keys are forwarded to the evaluators for comparison. Common wrong answer: "more cases are always better" — past one or two per type you mostly buy slower runs, not more coverage.

eaa-4-convergence-calc free apply ◆◆◇◇

Given the step counts [3, 3, 5, 3, 7, 3] across six completed runs, compute the convergence score and state one limitation of the metric.

Show answer

Convergence = (runs at the minimum step count) / (completed runs). For [3, 3, 5, 3, 7, 3]: the minimum is 3, and 4 of the 6 runs hit it, so convergence = 4/6 ≈ 0.67. A key limitation: it measures consistency, not correctness — if every run shared the same unnecessary step, that waste would sit inside the minimum and convergence could still read 1.0, so it must be paired with per-step correctness evaluators. (Also filter to completed runs first: a crash after 2 steps would falsely lower the minimum.) Common wrong answer: "0.67 means the agent is 67% correct" — convergence is about path consistency, not answer correctness.
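
The convergence formula above as a short function. It assumes the list has already been filtered to completed runs, as the answer recommends.

```python
def convergence(step_counts):
    """Fraction of completed runs that took the minimum observed step count."""
    if not step_counts:
        raise ValueError("need at least one completed run")
    shortest = min(step_counts)
    return sum(1 for s in step_counts if s == shortest) / len(step_counts)


score = convergence([3, 3, 5, 3, 7, 3])
# A uniformly wasteful agent still scores 1.0, which is the limitation above.
uniform = convergence([6, 6, 6])
```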

eaa-4-convergence-meaning mcq analyze ◆◆◇◇

A convergence score of 1.0 tells you that…

Why

Convergence measures consistency — the fraction of runs on the minimum-step path — so 1.0 means the agent always takes the same shortest path it found, which could be a consistently wrong one. It says nothing about correctness or answer quality, so neither the “correct on every query” nor the “maximal quality” reading holds. And the minimum is relative to the runs observed, not a theoretical optimum — universal waste sits inside that minimum undetected.

Options a. The agent reaches the correct answer on every query b. The agent's answer quality is maximal c. The agent uses the fewest steps that are theoretically possible d. The agent takes the same minimum-step path every run — consistency, not correctness

Show answer

Correct: The agent takes the same minimum-step path every run — consistency, not correctness

eaa-4-dashboard-tradeoff mcq analyze ◆◆◇◇

V3 has the best routing and SQL scores but the worst clarity, runnability, and latency. What does the multi-dimension dashboard reveal that a single aggregate score would hide?

Why

The dashboard exposes a trade-off: V3’s correctness gains come with clarity, runnability, and latency regressions — an averaged score could net out positive and hide exactly that. So V3 is not strictly best, and the regressed dimensions clearly matter (a latency SLO alone might reject it). The evaluators aren’t miscalibrated for showing the tension — surfacing it is the dashboard’s whole purpose.

Options a. That V3 is strictly the best variant and should ship b. A cross-dimension trade-off — correctness gains bought with clarity/latency regressions c. That clarity and runnability don't matter for a shipping decision d. That the evaluators are miscalibrated and should be removed

Show answer

Correct: A cross-dimension trade-off — correctness gains bought with clarity/latency regressions

eaa-4-dataset-design free apply ◆◆◆◇

A sales agent is adding a forecast_revenue skill (takes a date range and category, generates SQL, produces a markdown forecast). Design its evaluation dataset: name the input and output keys (and why each), the case distribution, and three evaluators.

Show answer

Schema — input keys: query (the natural-language request, forwarded to the agent), date_range (expected extracted range, for parameter-extraction eval), category (for grouping error analysis). Output keys: expected_sql_tables (which tables the SQL should hit → enables a free code-based eval), expected_trend_direction (growth/decline/flat → an LLM-judge correctness eval), and a ground-truth data snapshot (to verify the right data was queried). Distribution (comprehensive, not exhaustive): ~1–2 cases per input type — a couple of normal categories, one sparse-data category, one invalid/future range, one ambiguous request — ~8 total, not 100 per category. Three evaluators: SQL-table check (code-based), trend-direction (LLM judge), confidence-interval-present (code-based regex). Common wrong answer: "100 cases per category to be safe" — that violates comprehensive-not-exhaustive and slows iteration without adding signal.

eaa-4-define-trajectory free understand ◆◇◇◇

Define trajectory evaluation and explain why the path taken matters beyond just the final answer.

Show answer

A trajectory is the ordered sequence of all steps — router decisions, tool calls, LLM completions — from input to final response; it records how the agent got to the answer, not just the answer. Trajectory evaluation matters because two agents can produce the same correct output via very different paths, and the longer path costs more LLM calls, adds latency, and exposes the run to more stochastic-failure opportunities. At scale that difference decides viability, so "correct" is not the same as "efficient." Common wrong answer: "only the final answer matters" — that ignores the cost and reliability the path determines.

eaa-4-experiments-apply free apply ◆◆◇◇

You want to evaluate your agent’s data-lookup skill repeatably as you change prompts. Describe how you would use the Phoenix Experiments framework, naming what goes in each of its four stages.

Show answer

Map it onto the four stages. Dataset: assemble test cases with input keys (the user query, expected parameters) and output keys (expected SQL tables, expected answer features). Task: a function that takes one row, runs the agent on the query, and returns its output plus metadata (the trajectory, tool calls). Run experiment: execute the task over every row, with managed parallelism and error handling, recording results as a named, versioned run. Evaluators: attach scoring functions — code-based where you have output keys (SQL table match), an LLM judge for prose — to score each result. Then compare runs across agent variants on the dashboard. Common wrong answer: "write one evaluator and run it once" — the framework's value is repeatability and versioned comparison across variants, not a single check.
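
The four stages can be sketched in plain Python to show how they fit together. This is not the Phoenix API: stub_agent, run_experiment, and the row schema are hypothetical stand-ins, and the real framework adds parallelism, persistence, and versioning that this sketch omits.

```python
# Stage 1: dataset, with input keys for the agent and expected keys for evaluators.
dataset = [
    {"input": {"query": "total sales for store 7"},
     "expected": {"tables": {"sales"}}},
    {"input": {"query": "top products last month"},
     "expected": {"tables": {"sales", "products"}}},
]


def stub_agent(query):
    """Hypothetical agent: reports which tables its SQL touched."""
    tables = {"sales", "products"} if "products" in query else {"sales"}
    return {"tables": tables, "answer": f"answer to: {query}"}


# Stage 2: task, which runs the agent on one row.
def task(row):
    return stub_agent(row["input"]["query"])


# Stage 4: evaluator, a code-based check against an expected key.
def tables_match(output, expected):
    return float(output["tables"] == expected["tables"])


# Stage 3: run the task over every row and score each result.
def run_experiment(name, rows, task_fn, evaluators):
    results = []
    for row in rows:
        try:
            output = task_fn(row)
            scores = {ev.__name__: ev(output, row["expected"])
                      for ev in evaluators}
        except Exception as exc:  # one bad row should not abort the run
            scores = {"error": str(exc)}
        results.append(scores)
    return {"name": name, "results": results}


run = run_experiment("prompt-v1", dataset, task, [tables_match])
```

Re-running with a different name after a prompt change gives a second result set to compare against this one, which is the repeatability the answer emphasises.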

eaa-4-experiments-pipeline mcq understand ◆◇◇◇

What is the correct order of the Phoenix Experiments pipeline?

Why

You first build the dataset of test cases, define a task that runs the agent on one row, run the experiment to execute that task over the whole dataset, then apply evaluators to score the results — the agent analogue of cases → test function → run → assertions. The other orders invert that dependency: you can’t write evaluators or run an experiment before the dataset and task exist, and evaluators score results, so they come last.

Options a. Dataset → task → run experiment → evaluators b. Task → dataset → evaluators → experiment c. Evaluators → dataset → task → experiment d. Experiment → evaluators → dataset → task

Show answer

Correct: Dataset → task → run experiment → evaluators

eaa-4-flywheel free understand ◆◆◇◇

Explain the production feedback flywheel, its most common bottleneck, and the risk that comes with automating it.

Show answer

The evaluation flywheel is the self-reinforcing loop: production traces → new test cases → updated evaluators → agent improvements → better production data → repeat. Chapter 2's tracing supplies the raw production data. Its most common bottleneck is the speed of promoting a production trace into the dataset: if that needs slow manual review, the loop stalls and the eval set lags reality. The fix is to automate trace-to-dataset promotion, annotation, and evaluator updates. The risk of automating is contamination: without a quality gate, unfiltered bot or adversarial traffic enters the dataset and degrades evaluation quality. Common wrong answer: "just collect more production data". Volume doesn't help if promotion is manual and slow, or if noisy traffic isn't filtered.

eaa-4-flywheel-bottleneck mcq understand ◆◆◇◇

What is the most common bottleneck in the production feedback flywheel?

Why

The flywheel is only as fast as its slowest stage, and that stage is usually turning a production trace into a dataset case: if it takes a week of manual review, the loop stalls and improvements lag reality. Production run cost and judge model size affect spend or latency, not the loop’s iteration speed, and more evaluators add scoring work but aren’t the structural chokepoint — promotion speed is, which is why teams automate it (with a quality gate).

Options a. The dollar cost of running the agent in production b. The number of evaluators composed per experiment c. The speed of promoting production traces into the dataset d. The parameter size of the model used as a judge

Show answer

Correct: The speed of promoting production traces into the dataset

eaa-4-not-significant free analyze ◆◆◆◇

Variant B scores higher on correctness than the incumbent but costs more, and the difference is not statistically significant. Walk through how you decide whether to ship it.

Show answer

Don't ship on the score alone. A higher mean with no statistical significance may be sampling noise — likely on the small "comprehensive, not exhaustive" dataset — so first reduce the uncertainty: enlarge the eval set or run a proper A/B in production to test whether the gain is real. Meanwhile weigh the *cost*: the variant is more expensive, so even a real gain has to clear that cost to be worth it. Decision: if a larger sample / A/B confirms a significant gain that justifies the added cost (and no dimension regressed), ship; otherwise keep the cheaper incumbent. Common wrong answer: "it scored higher, so ship it" — an insignificant difference isn't evidence of improvement, and the extra cost makes a non-gain a net loss.
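
One way to put a number on "not statistically significant" is a two-proportion z-test on the correctness counts. This is a sketch, not the only valid test: the counts below are hypothetical, and the normal approximation is rough on very small samples (and undefined when every case passes or every case fails).

```python
import math


def two_proportion_z_test(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for a difference between two pass rates."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical: incumbent 15/20 correct, variant B 17/20 correct.
z_small, p_small = two_proportion_z_test(15, 20, 17, 20)

# Same rates on a much larger sample.
z_large, p_large = two_proportion_z_test(750, 1000, 850, 1000)
```

With 20 cases per variant the 10-point gap is well within noise, while the same rates on 1,000 cases are clearly significant, which is why the answer says to enlarge the sample before deciding.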

eaa-4-ship-decision free analyze ◆◆◇◇

Describe how you decide whether to ship a candidate agent variant, combining per-evaluator scores with cost, latency, and statistical significance.

Show answer

Combine the per-evaluator scores with context, never an aggregate alone. Steps: (1) read each evaluator dimension for gains and regressions, not the mean; (2) check statistical significance — on small datasets a 2-point gain may be noise; (3) factor cost per query and latency against any SLO; (4) get a per-category breakdown to catch a hidden regression on an important segment; (5) if it still looks good, confirm with a small A/B on real user satisfaction. Ship only if a real, significant gain justifies the cost/latency and carries no unacceptable regression. Common wrong answer: "the aggregate score rose, so ship" — aggregates hide dimension-level regressions and say nothing about significance or cost.
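
The decision steps above could be encoded as a simple gate. This is an illustrative sketch: the dimension names, score deltas, and the 2.0s latency SLO are hypothetical, and cost and the per-category breakdown are left out for brevity.

```python
def ship_decision(deltas, significant, latency_s, latency_slo_s,
                  tolerance=0.0):
    """Return (ship, reason), checking regressions before anything else."""
    regressions = sorted(dim for dim, delta in deltas.items()
                         if delta < -tolerance)
    if regressions:
        return False, "regressed: " + ", ".join(regressions)
    if latency_s > latency_slo_s:
        return False, "latency over SLO"
    if not significant:
        return False, "gain not significant"
    return True, "ship"


# Hypothetical per-dimension deltas against the baseline.
broad_gain = {"routing": 0.02, "sql": 0.03, "clarity": 0.01,
              "runnability": 0.0}
trade_off = {"routing": 0.05, "sql": 0.06, "clarity": -0.04,
             "runnability": -0.03}

first = ship_decision(broad_gain, significant=True,
                      latency_s=1.3, latency_slo_s=2.0)
second = ship_decision(trade_off, significant=True,
                       latency_s=2.8, latency_slo_s=2.0)
```

The second variant is rejected on its regressed dimensions even though its two headline scores are the highest, which an averaged score would not show.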

eaa-4-wasted-cost-calc free apply ◆◆◇◇

An agent runs 15,000 queries/day with an average of 7 steps; the optimal path is 4 steps and each step costs $0.025. Compute the daily wasted-work cost, then the saving if a new variant cuts the average to 5 steps.

Show answer

Wasted steps per run = avg − optimal = 7 − 4 = 3. Daily wasted steps = 15,000 × 3 = 45,000. Daily wasted cost = 45,000 × $0.025 = $1,125/day (~$410K/year). A variant that cuts the average to 5 steps wastes 1 step/run → 15,000 × 1 × $0.025 = $375/day, a $750/day saving. Method: (avg − optimal) × volume × price-per-step, as a daily dollar figure, so two variants can be compared on efficiency. Common wrong answer: comparing average step counts alone (7 vs 5) without converting to cost — the dollar figure is what makes the efficiency gain comparable to accuracy and latency on the ship dashboard.
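
The method above as a one-line function, using the figures from the question. The function name is illustrative.

```python
def daily_wasted_cost(queries_per_day, avg_steps, optimal_steps,
                      cost_per_step):
    """(avg - optimal) x volume x price-per-step, floored at zero waste."""
    wasted_steps = max(avg_steps - optimal_steps, 0)
    return queries_per_day * wasted_steps * cost_per_step


baseline = daily_wasted_cost(15_000, 7, 4, 0.025)
variant = daily_wasted_cost(15_000, 5, 4, 0.025)
saving = baseline - variant
```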

eaa-4-wasted-fraction mcq analyze ◆◆◇◇

Across 12 runs an agent took 58 steps in total; the optimal path is 4 steps. What fraction of step cost was wasted?

Why

Optimal total is $12 \times 4 = 48$ steps; the agent used 58, so $58 - 48 = 10$ steps were wasted, i.e. $10/58 \approx 17\%$. Completion has nothing to do with waste: a run can finish and still be inefficient. The 33% figure counts non-optimal runs rather than wasted steps, the quantity that actually drives cost; the wasted-step fraction is what converts to a dollar figure.

Options a. 0% — every run completed successfully, so nothing was wasted b. 8% — only the single longest run counts as wasted c. 33% — a third of the runs were non-optimal d. About 17% — 10 of 58 steps were wasted (optimal is 12 × 4 = 48)

Show answer

Correct: About 17% — 10 of 58 steps were wasted (optimal is 12 × 4 = 48)
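
The same calculation as a short function, with the denominator being the steps actually taken. The function name is illustrative.

```python
def wasted_fraction(total_steps, runs, optimal_steps):
    """Share of all steps taken that exceeded the optimal path."""
    optimal_total = runs * optimal_steps
    return (total_steps - optimal_total) / total_steps


fraction = wasted_fraction(total_steps=58, runs=12, optimal_steps=4)
```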

eaa-4-why-trajectory mcq understand ◆◇◇◇

Two agents return the same correct answer — one in 3 steps, the other in 9. Why does trajectory evaluation care about the difference?

Why

Same answer, different path: the 9-step route burns roughly 3× the LLM calls, adds latency, and gives non-determinism more chances to derail a run — at scale that decides whether the agent is viable. Extra steps don’t make it more thorough or trustworthy; the final answer is identical. And length isn’t irrelevant once correct (that’s the whole point), nor is it a routing bug by definition — a longer path can be legitimately required for a harder query.

Options a. The 9-step path costs more LLM calls and latency, with more chances to fail b. The 9-step agent is more thorough, so its answer is more trustworthy c. Trajectory length is irrelevant once the final answer is correct d. A longer trajectory indicates a routing bug by definition

Show answer

Correct: The 9-step path costs more LLM calls and latency, with more chances to fail

LLM-Judge & Monitoring

eaa-5-agreement-vs-fpr mcq analyze ◆◆◇◇

A judge has 95% agreement with ground truth but a 30% false-positive rate. Is it safe to gate releases with, and why?

Why

Agreement and FPR are different indicators: agreement is the overall match rate, while FPR isolates leniency — wrong answers passed as correct. A 30% FPR means the judge waves through nearly a third of bad answers, which is disqualifying for a gate even at 95% agreement (the errors are concentrated in the dangerous direction). Raising the agreement target doesn’t address the leniency, and the false-negative rate (strictness) isn’t the problem here — the false-positive rate is.

Options a. Yes — 95% agreement clears the usual bar b. Yes, provided you raise the agreement target to 97% c. No — its false-negative rate is clearly too high d. No — a 30% FPR passes nearly a third of wrong answers as correct

Show answer

Correct: No — a 30% FPR passes nearly a third of wrong answers as correct

eaa-5-debug-judge free apply ◆◆◇◇

Your LLM judge scores only 78% agreement in meta-evaluation. Walk through how you would improve it, in priority order, and how you confirm each change helped.

Show answer

Start with the prompt — about 80% of judge failures are prompt failures. Make the criteria explicit (define what "correct" requires), tighten the output format to the rails, and add chain-of-thought if needed; re-run meta-evaluation. If agreement is still low, add 3–5 few-shot examples covering the boundary cases it gets wrong (calibration). If specific capability gaps remain, try a stronger judge model. For paraphrase false-negatives on non-binary outputs, swap exact match for semantic (embedding) similarity. Change one lever at a time so the agreement delta is attributable. Common wrong answer: "immediately switch to a bigger model" — it's the expensive lever for the least common cause; the prompt is cheaper and usually the culprit.

eaa-5-fpr-threshold free analyze ◆◆◇◇

Explain why agreement rate and false-positive rate measure different things for an LLM judge, and how you would set an acceptance threshold for each.

Show answer

Agreement rate is the overall match with ground truth. The false-positive rate isolates one direction of error, wrong answers the judge passes as correct (leniency), and the false-negative rate isolates the other (strictness). They differ because high agreement can still hide many false positives if errors cluster on the rarer class, and for a *gate*, leniency is the dangerous error: a lenient judge ships bad answers. Set both thresholds: require overall agreement ≥ 85% AND FPR ≤ 10% (tighten the FPR cap further for safety-critical gates). Reject the judge if either fails. Common wrong answer: "agreement ≥ 90% is enough". It can coexist with a high FPR, so a single threshold lets a lenient judge through.
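
The two-threshold rule can be sketched as a small acceptance check. This is a minimal illustration, not a standard API: the function names `judge_rates` and `accept_judge` are hypothetical, labels are booleans where `True` means "answer is correct", and the defaults mirror the 85% / 10% figures above.

```python
from typing import Sequence


def judge_rates(judge: Sequence[bool], truth: Sequence[bool]) -> dict:
    """Return agreement, FPR and FNR for judge labels against ground truth."""
    if len(judge) != len(truth) or not truth:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = list(zip(judge, truth))
    agree = sum(j == t for j, t in pairs)
    fp = sum(j and not t for j, t in pairs)  # judge passed a wrong answer
    fn = sum(t and not j for j, t in pairs)  # judge rejected a correct answer
    wrong = sum(not t for t in truth)
    right = len(truth) - wrong
    return {
        "agreement": agree / len(truth),
        "fpr": fp / wrong if wrong else 0.0,
        "fnr": fn / right if right else 0.0,
    }


def accept_judge(rates: dict, min_agreement: float = 0.85, max_fpr: float = 0.10) -> bool:
    """Accept only if BOTH thresholds hold; either failure rejects the judge."""
    return rates["agreement"] >= min_agreement and rates["fpr"] <= max_fpr


# Ten hypothetical cases: the judge is right on 9 of them but passes 1 of the 2 wrong answers.
truth = [True] * 8 + [False] * 2
judge = [True] * 8 + [True, False]
rates = judge_rates(judge, truth)
print(rates, accept_judge(rates))
```

In this toy set the judge clears a 90% agreement bar while half of the wrong answers get through, which is the case a single threshold would miss.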

eaa-5-golden-evolution free analyze ◆◆◇◇

Sketch a golden-dataset evolution policy for the first 90 days — when to add rows, when to prune, when to stabilise — so the gate keeps catching real regressions without growing unboundedly.

Show answer

Month 1 (grow): add a regression row for every production bug as it's found (~5–10 rows), so the gate starts encoding real failures. Month 2 (prune + extend): remove rows that have never triggered a failure on a stable feature, and add rows for every new feature/workflow shipped. Month 3 (stabilise): settle at ~40–60 rows with a quarterly review, keeping run time low enough to gate every PR. Throughout: add on every incident, new feature, and testing-discovered edge case; prune only stable-feature rows that never fire; never prune safety-critical rows. The guiding signal is that a gate which never fails is stale, not safe. Common wrong answer: "keep adding every case forever" — past ~100 rows the gate is too slow to run on each PR, so pruning stale non-safety rows is part of the policy.
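
The pruning rule in that policy can be written down as a filter. The sketch below assumes a hypothetical `GoldenRow` record with a `times_failed` counter and a `safety_critical` flag; neither field name comes from a specific tool.

```python
from dataclasses import dataclass


@dataclass
class GoldenRow:
    row_id: str
    feature: str
    safety_critical: bool = False
    times_failed: int = 0  # how often this row has caught a regression


def prune_candidates(rows, stable_features):
    """Rows eligible for removal: stable feature, never fired, not safety-critical."""
    return [
        r for r in rows
        if r.feature in stable_features
        and r.times_failed == 0
        and not r.safety_critical
    ]


rows = [
    GoldenRow("r1", "refund", times_failed=3),             # has caught bugs: keep
    GoldenRow("r2", "refund"),                             # stable and never fired: candidate
    GoldenRow("r3", "pii_redaction", safety_critical=True),  # never pruned
    GoldenRow("r4", "new_export"),                         # feature not yet stable: keep
]
candidates = prune_candidates(rows, stable_features={"refund", "pii_redaction"})
print([r.row_id for r in candidates])
```

The function only proposes candidates; whether a row is actually dropped is still a human call in the review.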

eaa-5-golden-gate mcq understand ◆◇◇◇

What distinguishes a golden dataset from a general evaluation dataset?

Why

A golden dataset is the shipping gate: every row is a must-pass scenario, it codifies known failure modes from past bugs, it grows as new failures appear, and a regression on it blocks the merge. It is deliberately small and critical, not an attempt to cover every input (that would be too slow to gate on). It composes code-based and LLM-judge evaluators as appropriate, and far from frozen, it must evolve — a golden set that never changes goes stale and stops catching new regressions.

Options a. It is simply larger and tries to cover every possible input b. Curated for criticality, seeded with known failures, and wired as a CI gate c. It uses only LLM-judge evaluators, not code-based ones d. It is frozen at creation and does not change

Show answer

Correct: Curated for criticality, seeded with known failures, and wired as a CI gate

eaa-5-golden-living free understand ◆◇◇◇

What is a golden dataset and how is it a “living document”? Why is a CI gate that has passed 100% for six months a warning sign rather than reassurance?

Show answer

A golden dataset is a curated set of must-pass scenarios used as a CI shipping gate: every agent-code change runs against it and a regression blocks the merge. It's a *living document* — it grows with every production failure (each becomes a regression row) and new use case, so it encodes institutional knowledge that new engineers inherit through the gate. A gate that passes 100% for months is a red flag, not a success: it usually means the set has gone stale and isn't covering newly added features, so it only tests what used to break. The fix is an evolution policy that adds rows on incidents/features and prunes stale ones. Common wrong answer: "100% pass means the agent is solid" — it more often means the dataset stopped keeping pace.

eaa-5-golden-rot free analyze ◆◆◆◇

A payment agent’s 45-row golden dataset passed 100% on every PR for a quarter, yet three production incidents occurred — all on features added after the set was frozen. Diagnose why, and design a process that prevents this staleness.

Show answer

The golden dataset tested only pre-existing features; the new features shipped that quarter had zero coverage, so a 100% pass rate on stale rows was false confidence — the gate tested what used to break, not what might break now. Fixes: (1) require every feature-adding PR to include ≥2 golden rows covering the new feature, enforced in code review; (2) run a monthly coverage review comparing golden rows against the current feature list and flag uncovered features; (3) add a regression row within 48 hours of every production incident. Add the 3 incident cases now as regression rows. Inclusion criteria: incidents, new features/workflows, testing-discovered edge cases; removal criteria: rows untriggered for 6+ months on a stable feature, but never remove safety-critical rows. Common wrong answer: "the gate works, the incidents were bad luck" — the gate silently stopped covering the system, which is the actual defect.
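
The monthly coverage review in fix (2) amounts to a join between the feature list and the golden rows. A minimal sketch, assuming each golden row carries a `feature` tag (a hypothetical field) and using the two-rows-per-feature minimum from fix (1):

```python
from collections import Counter


def uncovered_features(feature_list, golden_rows, min_rows=2):
    """Return features with fewer than min_rows golden rows, with their counts."""
    counts = Counter(row["feature"] for row in golden_rows)
    return {f: counts[f] for f in feature_list if counts[f] < min_rows}


# Hypothetical feature list and golden rows for a payment agent.
features = ["card_payment", "refund", "recurring_billing"]
golden_rows = [
    {"id": "g1", "feature": "card_payment"},
    {"id": "g2", "feature": "card_payment"},
    {"id": "g3", "feature": "refund"},
]
gaps = uncovered_features(features, golden_rows)
print(gaps)
```

A feature that shipped after the set was last touched shows up here with a count of zero, which is the signal the 100% pass rate was hiding.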

eaa-5-judge-levers mcq understand ◆◆◇◇

Your LLM judge fails meta-evaluation. Which improvement lever should you reach for first, and why?

Why

Roughly 80% of judge failures are prompt failures — vague criteria or a loose output format — so reworking the prompt is the cheapest, highest-yield first move. A larger model targets capability gaps, which are the less common cause; semantic similarity fixes a specific failure (paraphrase false-negatives), not a general low score; and averaging runs addresses variance, not the systematic mis-judgment a failed meta-evaluation reveals. Apply one lever at a time and re-meta-evaluate.

Options a. Swap to a larger judge model, since capability is usually the issue b. Add semantic-similarity scoring to fix paraphrase mismatches c. Average more judge runs to reduce variance d. Fix the prompt first — most judge failures are prompt failures

Show answer

Correct: Fix the prompt first — most judge failures are prompt failures

eaa-5-meta-eval mcq understand ◆◆◇◇

Your dashboard reports the agent is 92% accurate — but that number comes from an LLM judge. What must you do before trusting it?

Why

The 92% is the judge’s verdict with nothing independent behind it: if the judge is only 75% accurate, the number is noise. Meta-evaluation compares the judge’s labels against a code-based ground-truth reference and reports the agreement rate, which is the independent check the dashboard lacks. A bigger dataset reduces variance but not the judge’s bias; numeric scoring makes calibration worse; and averaging several uncalibrated judges still gives you no ground truth.

Options a. Increase the eval dataset size to reduce variance in the number b. Meta-evaluate the judge against code-based ground truth before trusting its score c. Switch the judge to numeric 1–10 scoring for more precision d. Average several judges' scores so their errors cancel out

Show answer

Correct: Meta-evaluate the judge against code-based ground truth before trusting its score

eaa-5-meta-eval-design free apply ◆◆◇◇

Your LLM judge scores code correctness and you suspect it is too lenient. Design a meta-evaluation: the ground-truth evaluator, the dataset, the metric(s), and the acceptance threshold.

Show answer

Ground-truth evaluator: execute the generated code in a sandbox and compare actual vs expected output — 100% accurate by construction. Dataset: ~40–50 cases mixing correct code, syntax errors, subtle logic bugs (at least ~30% of the set, since that's where leniency shows), and edge-case failures. Metrics: binary agreement rate AND false-positive rate (judge says "correct" when the code fails) — FPR is the leniency signal you specifically suspect. Acceptance: ship the judge only if agreement ≥ 85% AND FPR ≤ 10%. Common wrong answer: "measure overall agreement only" — a single rate hides the asymmetric leniency; you must break out FPR (and FNR) to see which way it errs.
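
A ground-truth evaluator of this kind can be sketched in a few lines. Note the assumption: the version below runs the code in a child Python process with a timeout, which is NOT a real sandbox; an actual setup would add isolation (container, restricted user, no network). The function name `runs_correctly` is hypothetical.

```python
import subprocess
import sys


def runs_correctly(code: str, expected_stdout: str, timeout_s: float = 5.0) -> bool:
    """Ground truth: run the code in a child interpreter and compare stdout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hang counts as a failure
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


# Three hypothetical cases: correct code, a subtle logic bug, a syntax error.
cases = [
    ("print(sum(range(5)))", "10"),
    ("print(sum(range(4)))", "10"),  # off-by-one: runs cleanly but prints the wrong value
    ("print(", "10"),
]
ground_truth = [runs_correctly(code, expected) for code, expected in cases]
print(ground_truth)
```

The middle case is the kind a lenient judge tends to pass: it looks plausible and raises no error, so only execution against the expected output exposes it.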

eaa-5-monitoring-metrics free understand ◆◆◇◇

Name the production-monitoring metrics you would track with alert thresholds, and explain which failure mode the content-quality metrics catch that latency/error dashboards miss.

Show answer

Track, with alert thresholds (roughly a >10% deviation from baseline): running evaluator scores per dimension (quality), convergence / steps-to-terminal (efficiency and loop detection), LLM-call counts per query (a spike signals a runaway loop), latency p50/p99 broken down by component, and cost per query/model/skill. The failure these exist to catch is the silent regression: plausible-looking but wrong answers that throw no error, so latency and error-rate dashboards stay green; only content-quality evaluator scores over sampled live traces (plus provenance checks) reveal it. Common wrong answer: monitoring latency and error rate alone. Those miss the silent regression entirely, which is the most dangerous production failure.
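
The deviation-from-baseline alert can be sketched as follows. The metric names and values are invented for illustration, and the 10% default mirrors the rough threshold above; a real system would tune the threshold per metric.

```python
def deviation_alerts(baseline: dict, current: dict, threshold: float = 0.10) -> dict:
    """Return metrics whose relative deviation from baseline exceeds threshold."""
    alerts = {}
    for name, base in baseline.items():
        if name not in current or base == 0:
            continue  # no reading, or no meaningful baseline to compare against
        deviation = (current[name] - base) / base
        if abs(deviation) > threshold:
            alerts[name] = deviation
    return alerts


# Hypothetical readings: system metrics look healthy, content quality has dropped.
baseline = {"quality_score": 0.90, "llm_calls_per_query": 4.0, "latency_p99_s": 2.0}
current = {"quality_score": 0.78, "llm_calls_per_query": 4.2, "latency_p99_s": 2.1}
alerts = deviation_alerts(baseline, current)
print(alerts)
```

In this made-up snapshot only the evaluator score fires: latency and call counts moved about 5%, which is the shape of a silent regression.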

eaa-5-pipeline-design free apply ◆◆◇◇

Design a self-improving agent pipeline that turns production feedback into continuous improvement. Name its stages and where the data comes from.

Show answer

Five stages: (1) collect user feedback from production; (2) augment the evaluation dataset with both successful and failed interactions; (3) run CI/CD experiments on every dataset update; (4) incorporate few-shot examples drawn from the collected data; (5) gate deployment via the golden dataset. Tracing (from the observability layer) supplies the raw data, and automation is the multiplier — each interaction makes the system incrementally better. Crucially, insert quality gates: a feedback filter before stage 2 and the human-curated golden dataset at stage 5, so noisy data can't silently degrade the agent. Common wrong answer: "retrain on all feedback automatically" — without filters and the golden gate, the pipeline poisons itself.

eaa-5-pipeline-gate free analyze ◆◆◆◇

In a self-improving pipeline, where is the human safety valve, and why must the deployment gate be the human-curated golden dataset rather than the automated eval score?

Show answer

The human safety valve and the golden dataset are the two firewalls. The golden dataset is human-curated and changes only through deliberate review, so it can't be moved by noisy feedback: a pipeline update that degrades a must-pass scenario fails the gate and can't ship, even if the automated eval (which *can* drift) looks fine. The human review sample (auditing a fraction of feedback and of promoted dataset rows) catches contamination the automatic filters miss. So the pipeline automates the high-volume work but keeps two human-anchored checkpoints: the curated golden gate at deploy time and periodic human audit of the feedback stream. Common wrong answer: "trust the automated eval as the gate". It learns from the same data the pipeline ingests, so it can be poisoned along with the agent; the gate must be independent and human-curated.
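
As a minimal sketch of that gate, assuming golden results arrive as a mapping from row name to pass/fail (the row names and the `can_deploy` function are hypothetical): the automated score is reported for context but deliberately plays no part in the decision.

```python
def can_deploy(golden_results: dict, automated_eval_score: float) -> bool:
    """Gate on golden rows only; the automated score is logged, not trusted."""
    failed = [row for row, passed in golden_results.items() if not passed]
    if failed:
        print(f"blocked by golden rows {failed} (automated eval = {automated_eval_score:.2f})")
        return False
    return True


# A high automated score does not rescue a failed must-pass row.
blocked = can_deploy({"refund_window": True, "escalate_fraud": False}, automated_eval_score=0.97)
allowed = can_deploy({"refund_window": True, "escalate_fraud": True}, automated_eval_score=0.81)
print(blocked, allowed)
```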

eaa-5-poisoning mcq analyze ◆◆◇◇

In a self-improving pipeline, the automated eval score rises while user-survey satisfaction falls. What most likely happened?

Why

The eval dataset and few-shot examples are built from user feedback; when that feedback is noisy, the evaluator learns from the same poison, so its score climbs as quality drops — the metric self-confirms the degradation. The independent survey is the trustworthy signal here, not the biased one. The agent didn’t genuinely improve (the survey says otherwise), and golden dataset size affects run time, not whether the automated score diverges from reality.

Options a. The survey is biased; trust the automated eval score b. The agent genuinely improved on metrics but users dislike the change c. Feedback poisoning — noisy feedback contaminated the eval dataset, so the score self-confirms d. The golden dataset grew too large to run

Show answer

Correct: Feedback poisoning — noisy feedback contaminated the eval dataset, so the score self-confirms

eaa-5-poisoning-filter free apply ◆◆◇◇

A self-improving pipeline is being poisoned by users who click “helpful” without reading. Design a feedback quality filter that keeps the noise out, and name the independent anchor that catches drift the filter misses.

Show answer

Design filters that screen feedback before it enters the dataset: (1) engagement-time threshold — discard "helpful" clicks submitted within ~2 seconds of the response (too fast to have read it); (2) cross-signal validation — a positive click followed by an immediate re-query of the same topic is treated as noise, not approval; (3) confidence weighting — weight feedback from users with a history (e.g. >10 interactions) above first-touch clicks; (4) a weekly human audit of a ~10% random sample. And keep the golden dataset as an independent anchor: even if some noise slips through, human-curated must-pass rows catch the resulting drift. Common wrong answer: "collect more feedback to drown out the noise" — more unfiltered feedback adds more poison; you need filters plus the golden-dataset firewall, not volume.
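
The first three filters can be sketched as a single weighting function. The `Feedback` fields are hypothetical, the 2-second and 10-interaction cut-offs come from the answer above, and the 1.0 / 0.5 weights are placeholder assumptions rather than recommended values.

```python
from dataclasses import dataclass


@dataclass
class Feedback:
    helpful: bool
    seconds_to_click: float     # time between the response being shown and the click
    requeried_same_topic: bool  # user immediately asked about the same topic again
    user_interactions: int      # prior interactions by this user


def feedback_weight(fb: Feedback, min_seconds: float = 2.0, history_cutoff: int = 10) -> float:
    """Return 0.0 to discard the feedback, otherwise a weight for the dataset."""
    if fb.helpful and fb.seconds_to_click <= min_seconds:
        return 0.0  # too fast to have read the response
    if fb.helpful and fb.requeried_same_topic:
        return 0.0  # positive click contradicted by an immediate re-query
    # Confidence weighting: users with a history count more than first-touch clicks.
    return 1.0 if fb.user_interactions > history_cutoff else 0.5


samples = [
    Feedback(True, 1.2, False, 40),   # instant click: discarded
    Feedback(True, 15.0, True, 40),   # re-queried straight away: discarded
    Feedback(True, 15.0, False, 40),  # established user: full weight
    Feedback(True, 15.0, False, 1),   # first-touch user: reduced weight
]
weights = [feedback_weight(fb) for fb in samples]
print(weights)
```

The weekly human audit and the golden dataset sit outside this function on purpose: they are the checks that still work if these heuristics turn out to be wrong.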

eaa-5-silent-regression mcq analyze ◆◆◇◇

Which production failure will latency and error-rate dashboards fail to catch?

Why

A silent regression produces wrong-but-plausible output without any error or slowdown — so latency and error-rate stay green and nothing fires; only content-quality monitoring (evaluator scores over live traces, provenance checks) detects it. The other three all move a system metric a standard dashboard already watches: timeouts and crash loops spike the error rate, and a model-swap slowdown spikes latency.

Options a. Wrong-but-plausible answers returned with no error or slowdown b. An API timeout spike c. A latency increase after a model swap d. A crash loop that exhausts retries

Show answer

Correct: Wrong-but-plausible answers returned with no error or slowdown