LLM-Judge & Monitoring
Meta-evaluation (measuring the judge against code-based ground truth), agreement rate vs false-positive rate as leniency indicators, four levers to improve a judge, golden datasets as CI shipping gates and how they evolve, production monitoring for silent regressions, and the self-improving pipeline with its feedback-poisoning risk.
Meta-evaluation: evaluating the evaluator
If a dashboard says “92% accuracy” but the judge that produced it is itself only 75% accurate, the number is noise. Meta-evaluation fixes this by scoring the judge against a deterministic reference.
The protocol: (1) select cases with deterministic expected outputs, (2) run a code-based evaluator on them to get ground truth (a deterministic oracle is only as correct as its comparison logic, but it is exact on these cases), (3) run the LLM judge on the same cases, (4) compute the agreement rate: the fraction of cases where the judge’s label matches ground truth. Target ≥ 0.90 (the aspiration); treat ~0.85 as the hard floor below which you don’t ship the judge.
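The four steps can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `code_oracle`, `agreement_rate`, `verdict`, the sample cases, and the judge labels are all hypothetical, and only the 0.90 target and 0.85 floor come from the text.

```python
from typing import Sequence

TARGET = 0.90  # aspiration from the text
FLOOR = 0.85   # hard floor from the text


def code_oracle(expected: str, actual: str) -> bool:
    """Hypothetical deterministic check: exact match, ignoring case and outer whitespace."""
    return expected.strip().lower() == actual.strip().lower()


def agreement_rate(truth: Sequence[bool], judged: Sequence[bool]) -> float:
    """Fraction of cases where the judge's label equals the oracle's label."""
    if not truth or len(truth) != len(judged):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(t == j for t, j in zip(truth, judged)) / len(truth)


def verdict(rate: float) -> str:
    """Map an agreement rate onto the target/floor bands."""
    if rate >= TARGET:
        return "meets target"
    if rate >= FLOOR:
        return "above floor, below target"
    return "below floor: do not ship"


# Steps 1-2: deterministic cases (expected, actual) scored by the oracle.
cases = [("Paris", "paris"), ("42", "41"), ("blue", " Blue "), ("7", "seven")]
truth = [code_oracle(expected, actual) for expected, actual in cases]

# Step 3: labels an LLM judge might return (made up for illustration).
judged = [True, False, True, True]

# Step 4: agreement rate and the resulting decision.
rate = agreement_rate(truth, judged)
print(rate, verdict(rate))
```

Here the judge wrongly accepts "seven" for "7", so it agrees on three of four cases and falls below the floor.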
You can't measure a judge without ground truth
EAA-5.1 LLM judge accuracy is unknowable without a deterministic reference, and the code-based evaluator on ground-truth data is that reference. A 70% agreement rate means 30% of evaluations are wrong, enough to mask real regressions or flag false improvements. The exception: for purely subjective dimensions (tone, creativity) where no code oracle exists, inter-annotator agreement with humans replaces the code-based reference.
What is meta-evaluation, and why does it require ground truth? EAA-5.1
Meta-evaluation measures an evaluator’s own accuracy by comparing its judgments against a known-correct reference, reported as the agreement rate. It needs ground truth because a judge can’t validate itself: its self-reported score is circular. A code-based evaluator on deterministic cases supplies an exact reference (as correct as its comparison logic); where no such oracle exists, as for subjective dimensions, you fall back to human inter-annotator agreement.
Agreement rate is not enough: watch the asymmetry
A single agreement rate hides which way the judge errs. Split its mistakes into a false-positive rate (judge says “correct” when the answer is wrong — leniency) and a false-negative rate (judge says “wrong” when it’s right — strictness). For a quality gate, leniency is the dangerous one: a lenient judge passes bad answers.
A judge agrees with ground truth 95% of the time but has a 30% false-positive rate. Why are these different leniency indicators, and is it safe to gate with? EAA-5.7
Agreement is the overall match rate; FPR isolates one direction of error: passing wrong answers as correct. High agreement can still hide a high FPR when wrong answers are the rarer class, because the many correct answers the judge labels properly dominate the average. A 30% FPR means nearly a third of bad answers slip through, so the judge is too lenient to gate releases regardless of the 95% headline. Set both thresholds (e.g. agreement ≥ 0.85 and FPR ≤ 0.10) and reject the judge if it misses either.
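The scenario in the question can be reproduced with a small confusion-matrix sketch. The function names and the 100-case split (90 correct answers, 10 wrong ones) are assumptions chosen so the numbers land on 95% agreement and 30% FPR; the two thresholds are the example values from the answer above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JudgeReport:
    agreement: float
    false_positive_rate: float  # judge says "correct" on wrong answers (leniency)
    false_negative_rate: float  # judge says "wrong" on correct answers (strictness)


def judge_report(truth: Sequence[bool], judged: Sequence[bool]) -> JudgeReport:
    """Split judge errors by direction. True means 'the answer is correct'."""
    if not truth or len(truth) != len(judged):
        raise ValueError("need two non-empty label lists of equal length")
    tp = fp = tn = fn = 0
    for t, j in zip(truth, judged):
        if t and j:
            tp += 1
        elif t and not j:
            fn += 1
        elif not t and j:
            fp += 1
        else:
            tn += 1
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return JudgeReport((tp + tn) / len(truth), fpr, fnr)


def passes_gate(report: JudgeReport, min_agreement: float = 0.85,
                max_fpr: float = 0.10) -> bool:
    """The judge is usable as a gate only if it clears both thresholds."""
    return report.agreement >= min_agreement and report.false_positive_rate <= max_fpr


# 90 correct answers (judge rejects 2), 10 wrong answers (judge passes 3).
truth = [True] * 90 + [False] * 10
judged = [True] * 88 + [False] * 2 + [True] * 3 + [False] * 7

report = judge_report(truth, judged)
print(report, passes_gate(report))
```

The headline agreement clears 0.85 comfortably, yet the gate rejects the judge because 3 of the 10 wrong answers were passed.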
Improving a judge: four levers, one metric
When meta-evaluation fails the bar, four techniques each target a different failure mode — and the judge-improvement work is measured by the same agreement rate:
- Prompt engineering — make criteria explicit, fix output format, add chain-of-thought. Simplest and often most effective: ~80% of judge failures are prompt failures.
- Few-shot examples — 3–5 labelled judgments covering boundary cases (case-insensitive match, float tolerance) to fix calibration.
- Model selection — a stronger judge model closes capability gaps (GPT-4o might agree 95% where a smaller model agrees 82%).
- Semantic similarity — for non-binary outputs, embedding cosine similarity instead of exact match fixes false negatives on paraphrases.
Apply one at a time, or you can’t attribute the gain. Start with the prompt.
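As an illustration of the fourth lever, the sketch below swaps exact match for cosine similarity over embedding vectors. It assumes the embeddings come from some model you already run; the toy 3-dimensional vectors and the 0.8 threshold are placeholders, not recommended values.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_match(reference_vec: Sequence[float], candidate_vec: Sequence[float],
                   threshold: float = 0.8) -> bool:
    """Accept a candidate whose embedding is close enough to the reference."""
    return cosine_similarity(reference_vec, candidate_vec) >= threshold


# Toy vectors standing in for real embeddings (assumption for illustration).
reference = [1.0, 0.0, 0.0]
paraphrase = [0.9, 0.1, 0.0]   # nearly the same direction as the reference
unrelated = [0.0, 0.0, 1.0]    # orthogonal to the reference

print(semantic_match(reference, paraphrase))  # accepted despite not being an exact match
print(semantic_match(reference, unrelated))   # rejected
```

Whatever threshold you pick is itself a judge parameter, so re-run meta-evaluation after changing it.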
Name the four levers for improving an LLM judge and the single metric that tells you whether any worked. EAA-5.2
Prompt engineering (fix misunderstood criteria), few-shot examples (fix calibration on boundary cases), model selection (fix capability gaps), and semantic similarity (fix false negatives on paraphrases). The single metric is the agreement rate from meta-evaluation — re-run it after each change, applied one lever at a time so the improvement is attributable.
Golden datasets: the shipping gate
A golden dataset is a curated set of must-pass scenarios used as a CI shipping gate: every agent change runs against it, and a regression blocks the merge. It differs from a general eval dataset on three counts — curated for criticality (every row is must-pass), seeded with known failure modes codified from past production bugs, and grown continuously as new failures appear. Because agent runs are probabilistic, run it several times and take the median or worst score with tolerance bands.
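One way to express "several runs, median or worst score, tolerance band" is a small gate function. This is a sketch under assumptions: `gate_passes`, the pass-rate scores, the baseline of 1.0, and the 0.02 tolerance are all illustrative rather than values from the guide.

```python
from statistics import median
from typing import Sequence


def gate_passes(run_scores: Sequence[float], baseline: float,
                tolerance: float = 0.02, use_worst: bool = False) -> bool:
    """Compare an aggregate of repeated runs against baseline minus a tolerance band.

    use_worst=False aggregates with the median (robust to one unlucky run);
    use_worst=True aggregates with the minimum (the strictest reading).
    """
    if not run_scores:
        raise ValueError("need at least one run")
    observed = min(run_scores) if use_worst else median(run_scores)
    return observed >= baseline - tolerance


# Three runs of the golden dataset; each score is the fraction of rows passed.
scores = [1.0, 0.95, 1.0]

print(gate_passes(scores, baseline=1.0))                  # median policy
print(gate_passes(scores, baseline=1.0, use_worst=True))  # worst-case policy
```

With these numbers the median policy lets the change merge while the worst-case policy blocks it, which is the trade-off between flakiness tolerance and strictness.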
Golden datasets are living documents
EAA-5.3 A golden dataset encodes institutional knowledge (new team members inherit it through the CI gate), and its value compounds as it grows with every production failure. But it must be maintained: it tends to stabilise around 40–60 rows, and past ~100 rows run times turn prohibitive, so periodically prune rows that have never caught a regression. The red flag: a golden dataset that never fails isn’t a success; it’s a sign the set has gone stale and no longer tests what might break now.
How does a golden dataset differ from a general evaluation dataset, and why is a 100%-pass-for-months gate a warning sign? EAA-5.3
A golden dataset is curated for criticality (every row must pass), seeded with known failure modes from past bugs, and grown continuously — and it’s wired as a CI gate that blocks merges. A general eval dataset is for broad iteration, not a hard gate. A gate that passes 100% for months usually means it’s gone stale: new features have zero coverage, so it tests only what used to break. The fix is an evolution policy (add on every incident/feature, prune stale rows).
Production monitoring
Production reveals failures no dataset anticipated — unseen queries, new entities, API degradations, distribution shifts. The dev tools carry straight over: Phoenix works the same in prod, enriched by real traffic.
Track production metrics with alert thresholds: running evaluator scores per dimension, convergence (steps to terminal state), LLM-call counts (a spike signals a loop), latency (p50/p99, by component), and cost per query. A common rule: alert on more than ~10% deviation from baseline.
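The ~10% rule can be sketched as a relative-deviation check over a dictionary of metrics. The metric names, baseline values, and helper functions are hypothetical; a real setup would read them from your tracing or metrics store.

```python
from typing import Dict, List


def relative_deviation(current: float, baseline: float) -> float:
    """Absolute change relative to the baseline, so drops and spikes both count."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero for a relative comparison")
    return abs(current - baseline) / abs(baseline)


def breached_metrics(current: Dict[str, float], baseline: Dict[str, float],
                     max_deviation: float = 0.10) -> List[str]:
    """Names of metrics that moved more than max_deviation from baseline."""
    return sorted(
        name
        for name, base in baseline.items()
        if name in current and relative_deviation(current[name], base) > max_deviation
    )


# Hypothetical baseline and latest window.
baseline = {"correctness_score": 0.92, "llm_calls_per_query": 4.0, "p99_latency_s": 2.0}
latest = {"correctness_score": 0.91, "llm_calls_per_query": 6.0, "p99_latency_s": 2.1}

print(breached_metrics(latest, baseline))  # the LLM-call spike is the only breach
```

A single global threshold is a starting point; noisy metrics such as p99 latency may need a wider band than evaluator scores.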
What is a silent regression, and why do latency/error-rate dashboards miss it? EAA-5.4
A silent regression is when the agent returns plausible-looking but wrong answers without erroring — e.g. it gracefully falls back to fabricated content when a data source fails. Latency and error-rate dashboards stay green because nothing crashed and responses are fast, so only content-quality monitoring (evaluator scores over sampled live traces, provenance checks) catches the drop. It’s the failure mode production monitoring exists for.
The self-improving pipeline — and its firewall
A self-improving agent pipeline runs five stages: (1) collect user feedback, (2) augment the eval dataset with successful and failed interactions, (3) run CI/CD experiments on every dataset update, (4) add few-shot examples from collected data, (5) gate deployment via the golden dataset. Automation is the multiplier: each interaction makes the system incrementally better.
But automation without a quality gate is dangerous: feedback poisoning lets noisy or adversarial feedback contaminate the dataset, and — because the evaluator learns from the same poisoned data — the automated score can rise while real quality falls.
The Weekend Outage Nobody Detected
Your production agent’s search_kb skill calls an external knowledge-base API. On
Saturday at 2 AM the provider ships a breaking change: 40% of queries now return
empty results. The agent “gracefully” handles the emptiness by answering from its
training data instead. By Monday, weekend users got plausible but fabricated
answers. Friday’s golden-dataset CI gate passed, and no alert fired all weekend.
Why it slipped through — two gaps. (1) The golden-dataset gate runs only on PR merges, not continuously in production. (2) Monitoring tracked surface metrics (latency, error rate) rather than output provenance — since the agent fell back gracefully instead of erroring, nothing tripped. This is a textbook silent regression.
The monitor that would have caught it in an hour. From the existing traces,
track the search_kb empty-result rate; alert when the 1-hour rolling rate exceeds
baseline (e.g. > 15% when the norm is < 5%). The data is already collected — it just
wasn’t watched.
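A rolling-rate monitor of this kind might look like the sketch below. The class name, the `min_calls` guard against tiny samples, and the event format are assumptions; only the 1-hour window and the 15% alert level come from the example above. It also assumes events arrive in timestamp order.

```python
from collections import deque
from typing import Deque, Tuple


class EmptyResultMonitor:
    """Tracks the share of empty search_kb results over a sliding time window."""

    def __init__(self, window_seconds: float = 3600.0, alert_rate: float = 0.15,
                 min_calls: int = 20) -> None:
        self.window_seconds = window_seconds
        self.alert_rate = alert_rate
        self.min_calls = min_calls  # avoid alerting on a handful of calls
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, was_empty: bool) -> bool:
        """Add one call, drop calls older than the window, return True if alerting."""
        self._events.append((timestamp, was_empty))
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()
        return self.should_alert()

    def rate(self) -> float:
        """Empty-result rate within the current window (0.0 if no calls)."""
        if not self._events:
            return 0.0
        return sum(1 for _, empty in self._events if empty) / len(self._events)

    def should_alert(self) -> bool:
        return len(self._events) >= self.min_calls and self.rate() > self.alert_rate


# Simulate the outage: one call per minute, 4 in 10 returning empty results.
monitor = EmptyResultMonitor(min_calls=10)
alerts = [monitor.record(i * 60.0, was_empty=(i % 5 < 2)) for i in range(10)]
print(alerts[-1], monitor.rate())
```

In this simulation the alert fires on the tenth call, as soon as the sample is large enough to trust the 40% rate.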
Folding it into the pipeline. Add the Saturday cases (empty-KB queries + their
fabricated responses) to the golden dataset as negative examples; create a
source-attribution evaluator that checks the response cites KB documents when
the query needs a lookup; auto-fail responses that skip search_kb on
lookup-requiring queries. Each new failure mode becomes a permanent regression test.
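A source-attribution evaluator along these lines could be sketched as follows. Everything about the trace shape is an assumption made for illustration: the `Trace` fields, the `[kb:ID]` citation format, and the reason strings are hypothetical, and only the `search_kb` skill name comes from the scenario.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical citation format: the response marks KB sources as [kb:DOC_ID].
CITATION = re.compile(r"\[kb:([\w-]+)\]")


@dataclass
class Trace:
    needs_lookup: bool       # does the query require a KB lookup?
    tool_calls: List[str]    # names of skills invoked during the run
    kb_doc_ids: List[str]    # document ids that search_kb returned
    response: str            # final answer shown to the user


def source_attribution(trace: Trace) -> Tuple[bool, str]:
    """Pass only if a lookup-requiring answer cites a document search_kb returned."""
    if not trace.needs_lookup:
        return True, "no lookup required"
    if "search_kb" not in trace.tool_calls:
        return False, "skipped search_kb"
    cited = set(CITATION.findall(trace.response))
    if not cited & set(trace.kb_doc_ids):
        return False, "no retrieved KB document cited"
    return True, "cites retrieved KB document"


# A Saturday-style case: search_kb ran but returned nothing, and the agent answered anyway.
saturday = Trace(True, ["search_kb"], [], "Your plan includes five seats.")
grounded = Trace(True, ["search_kb"], ["doc-17"], "Your plan includes five seats [kb:doc-17].")

print(source_attribution(saturday))
print(source_attribution(grounded))
```

Checking cited ids against the ids actually returned also catches responses that invent a citation.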
Summary — the guide, end to end
You can now trust (or distrust) the evaluators themselves, and run all of this in production: meta-evaluation scores the judge against ground truth; agreement and false-positive rate together reveal leniency; four levers improve a failing judge; golden datasets gate shipping and must evolve; monitoring catches the silent regressions surface metrics miss; and the self-improving pipeline compounds quality — guarded by feedback filters and the golden-dataset firewall.
Across five chapters the arc closes: decompose the agent (ch1), observe it with traces (ch2), evaluate its components (ch3), score its trajectory and compare variants (ch4), and trust the evaluators and watch production (ch5). Paired with the Fine-tuning & RL guide, that’s the full loop — you can train a model and prove it works.