Part 1, Chapter 2. Last verified 2026-06-19

Observability & Tracing

Traces and spans as the units of agent observability, the span hierarchy that mirrors agent architecture (Agent ⊃ Chain ⊃ Tool ⊃ LLM), automatic vs manual OpenTelemetry instrumentation, collecting traces in Phoenix, and turning traces into the bedrock for evaluation, cost attribution, and regression detection.

On this page

Why observability matters
Traces and spans
OpenTelemetry and OpenInference
Automatic vs manual instrumentation
Phoenix: collecting and visualising traces
From traces to evaluation
Attributing cost from traces
Debugging with the trace tree
Summary — what this sets up

Do I know this already? Diagnostic

Answer these confidently and you can skim ahead to Traces and spans; if any is shaky, read closely — each is developed below.

Predict before reading — you’ll check each against the chapter.

Your agent gives a wrong answer (no crash) and took a different tool path than last run. Predict why a stack trace and print statements fail you here.
You add two lines and suddenly see every LLM call’s prompt and token count — but the router’s decisions are still invisible. Guess what kind of instrumentation those two lines were, and what’s missing.
A trace renders as one flat span for the whole run. Estimate how useful that is for finding which step misrouted.
Visualisation is 20% of your agent’s queries but you suspect it dominates cost. Predict what single artifact lets you settle this with data instead of guessing.

Check your answers

The failure is semantic (wrong, not crashed) and the path changes per run — a stack trace needs a crash and a fixed path; print logs can’t be navigated across thousands of variable runs.
Automatic instrumentation (library hooks on the LLM client). Missing: your router logic, tool dispatch, and business code — anything that isn’t an LLM API call.
Nearly useless — a flat trace hides the router→tool→LLM structure, so you can’t localise the misroute. You need manual spans at the boundaries.
A trace: per-span token counts let you attribute cost per skill and confirm (or refute) that visualisation dominates.

Why observability matters

Traditional software is deterministic: a stack trace points at the exact line that failed. Agents break both halves of that promise — the same query can trigger a different tool sequence on a different run, and failures are often semantic (a wrong answer, not a crash). print statements don’t scale to thousands of variable, multi-step runs.

Observability means complete visibility into every layer: prompts, model responses, token usage, latency, and the function calls orchestrating behaviour.

The stack is three layers: (1) instrument with OpenTelemetry, (2) collect in Phoenix, (3) evaluate over the collected traces.

Traces and spans

A trace is the complete end-to-end record of a single agent run — one or more spans in a hierarchical tree, linked by a unique trace ID. A span is one step within it: an LLM call, a tool invocation, or a routing decision, recording start/end time, attributes, and status. Spans nest.
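To make the trace/span relationship concrete, here is a minimal sketch of the data model using only the standard library. The `Span` class and `render_tree` helper are hypothetical illustrations, not OpenTelemetry types; real spans carry more fields (attributes, events, links).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical minimal span record; real OTEL spans carry more fields."""
    span_id: str
    trace_id: str              # shared by every span in the same run
    parent_id: Optional[str]   # None marks the root span of the trace
    name: str
    start_ms: int
    end_ms: int
    status: str = "OK"


def render_tree(spans: list[Span]) -> list[str]:
    """Return one indented line per span, children listed under their parent."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        # Order siblings by start time so the tree reads chronologically.
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            lines.append(f"{'  ' * depth}{span.name} ({duration} ms)")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Illustrative data: one trace ("t1") made of four nested spans.
spans = [
    Span("a", "t1", None, "Agent Run", 0, 900),
    Span("b", "t1", "a", "Router Iteration", 10, 880),
    Span("c", "t1", "b", "LLM: Route Decision", 20, 300),
    Span("d", "t1", "b", "search_kb", 310, 870),
]
tree = render_tree(spans)
print("\n".join(tree))
```

The trace itself is never stored as a separate object here: it is simply the set of spans that share a trace ID, and the parent links are what turn that set into a tree.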

Key concept

The span hierarchy mirrors the agent architecture

EAA-2.4

Span nesting directly reflects the agent’s structure: an Agent span wraps the run, Chain spans are router iterations, Tool spans are skill invocations, LLM spans are model calls — Agent ⊃ Chain ⊃ Tool ⊃ LLM. Reading a trace is reading the architecture diagram with real data. When a trace shows the wrong tool called, you open that Chain span’s LLM call to see what the model received. The failure mode: a poorly instrumented agent produces a flat trace that is harder to navigate than print statements. [V] Verified

Define a trace and a span, and state how they relate. EAA-2.1

A trace is the end-to-end record of one agent run; a span is a single step within it (LLM call, tool invocation, routing decision). A trace is a hierarchical tree of spans sharing one trace ID — the trace is the whole run, each span a nested step inside it.

OpenTelemetry and OpenInference

OpenTelemetry (OTEL) is the vendor-neutral observability framework: a data model and instrumentation API for traces, spans, metrics, and logs, so you instrument once and export to any compatible backend. [V] Verified For LLM systems, OpenInference (from Arize) adds LLM-specific span kinds and attribute names for prompts, completions, and token counts. The four span kinds most relevant to agents are Agent (full run), Chain (router iteration), Tool (skill invocation), and LLM (single model call); they are the vocabulary the hierarchy above is built from. (OpenInference defines more, including Retriever, Reranker, and Embedding, but these four carry an agent trace.)
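As a rough picture of what those conventions look like on a span, the sketch below shows an attribute dictionary for one LLM span. The attribute keys follow the OpenInference semantic conventions as best understood here, so check them against the current specification before relying on them; the values, the `LAYER_TO_KIND` mapping, and the `total_tokens` helper are illustrative assumptions.

```python
# Key that records which kind of step a span represents.
SPAN_KIND_KEY = "openinference.span.kind"

# How the four agent layers in this chapter map onto span-kind values.
LAYER_TO_KIND = {
    "agent run": "AGENT",
    "router iteration": "CHAIN",
    "skill invocation": "TOOL",
    "model call": "LLM",
}

# Illustrative attributes for a single LLM span (all values are made up).
llm_span_attributes = {
    SPAN_KIND_KEY: LAYER_TO_KIND["model call"],
    "llm.model_name": "example-model",
    "input.value": "Which tool should handle this query?",
    "output.value": "search_kb",
    "llm.token_count.prompt": 212,
    "llm.token_count.completion": 38,
    "llm.token_count.total": 250,
}


def total_tokens(attributes: dict) -> int:
    """Read a span's total token count, falling back to prompt + completion."""
    total = attributes.get("llm.token_count.total")
    if total is not None:
        return total
    prompt = attributes.get("llm.token_count.prompt", 0)
    completion = attributes.get("llm.token_count.completion", 0)
    return prompt + completion


print(llm_span_attributes[SPAN_KIND_KEY], total_tokens(llm_span_attributes))
```

Because every instrumented library writes the same keys, a backend can compute token totals or filter by span kind without knowing which client produced the span.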

Automatic vs manual instrumentation

The fastest path is automatic instrumentation: a library hooks your API client and creates a span for every call, with no changes to the calling code. For OpenAI it is two calls plus their imports:

from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

register(project_name="sales-agent")      # point at the Phoenix collector
OpenAIInstrumentor().instrument()          # every chat.completions.create() is now traced

That captures LLM-level detail — prompts, completions, tokens, latency — but cannot see your router logic, tool dispatch, or business code.

Manual instrumentation fills the gap. The key principle is outside-in: start at the outermost layer (the agent run) and work inward; nested with blocks (Python context managers) produce the correct hierarchy automatically.

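from opentelemetry import trace

# Assumption: register() above installed Phoenix's tracer provider globally,
# so tracers obtained here accept the openinference_span_kind keyword.
# client and dispatch_tool are defined elsewhere in the agent.
tracer = trace.get_tracer(__name__)

def run_agent(user_message: str) -> str:
    # Outermost span: one per agent run.
    with tracer.start_as_current_span("Agent Run", openinference_span_kind="agent"):
        while True:
            # One Chain span per router-loop iteration.
            with tracer.start_as_current_span("Router Iteration", openinference_span_kind="chain"):
                response = client.chat.completions.create(...)   # auto-traced as an LLM span
                msg = response.choices[0].message
                if not msg.tool_calls:
                    return msg.content
                for call in msg.tool_calls:
                    # One Tool span per skill invocation, named after the tool.
                    with tracer.start_as_current_span(call.function.name, openinference_span_kind="tool"):
                        result = dispatch_tool(call)  # passing result back to the model is elided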

Key concept

Automatic + manual = full observability

EAA-2.2

Automatic instrumentation covers every LLM API call with zero code changes; manual instrumentation adds the business-logic spans (agent, router, tool dispatch). The decision rule: start automatic for immediate visibility, then add manual spans where the trace has gaps. Each has a failure mode — automatic can flood batch jobs with noise; manual demands discipline, since a missed with block leaves a hole in the tree.

Contrast what automatic and manual instrumentation each make visible, and give the decision rule for combining them. EAA-2.2

Automatic hooks the LLM client → every model call’s prompt/completion/tokens/latency, zero code changes, but nothing about routing or tool dispatch. Manual wraps your own code in spans → router iterations, tool calls, business logic. Rule: start automatic for instant LLM visibility, then add manual spans exactly where the trace shows gaps (agent/router/tool layers).

Phoenix: collecting and visualising traces

Phoenix is Arize’s open-source observability tool: it receives OTEL traces, stores them by project, and gives trace-tree views, timelines, span-detail inspection, and evaluation integration. [V] Verified It is where the instrumented spans become something you can read, query, and evaluate.

From traces to evaluation

Traces are the bedrock for evaluation at scale. Every collected trace holds full inputs, outputs, and intermediate steps, so you can: build evaluation datasets from real runs, run automated evaluators over traces, detect regressions by comparing scores across versions, and monitor production quality by sampling live traces.
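As one example of these workflows, regression detection can be sketched as a comparison of mean evaluator scores between two agent versions. The trace records, the 0/1 scores, and the 0.05 threshold below are illustrative assumptions, not Phoenix output or a recommended setting.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_version(traces):
    """Group evaluator scores by agent version and average them."""
    grouped = defaultdict(list)
    for trace in traces:
        grouped[trace["version"]].append(trace["score"])
    return {version: mean(scores) for version, scores in grouped.items()}


def regressed(traces, baseline, candidate, max_drop=0.05):
    """Flag a regression when the candidate's mean score drops by more than max_drop."""
    means = mean_score_by_version(traces)
    return (means[baseline] - means[candidate]) > max_drop


# Hypothetical scored traces: 1 = evaluator judged the run correct, 0 = incorrect.
traces = (
    [{"version": "v1", "score": s} for s in (1, 1, 0, 1, 1, 1, 1, 0, 1, 1)]
    + [{"version": "v2", "score": s} for s in (1, 0, 0, 1, 1, 0, 1, 0, 1, 1)]
)

means = mean_score_by_version(traces)
print(means, "regressed:", regressed(traces, "v1", "v2"))
```

A fixed threshold on a small sample is noisy; in practice the comparison would usually run over a larger trace sample or use a significance test.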

Key concept

Traces are evaluation infrastructure

EAA-2.5

Every later workflow rides on traces: router evaluations read trace spans, trajectory evaluations score step sequences, LLM-as-a-judge grades span outputs. Without traces, evaluation needs manual data collection that doesn’t scale — so the tracing investment pays off multiplicatively. The catch: traces with missing spans yield incomplete evaluations, so instrument thoroughly before building eval workflows. [I] Inference

Name two evaluation workflows that depend on collected traces, and state what breaks if spans are missing. EAA-2.5

Two of: building eval datasets from real runs, running automated evaluators over traces, regression detection across versions, production quality monitoring by sampling live traces. If spans are missing, the corresponding evaluations are incomplete — e.g. a router eval can’t score a routing decision that was never captured — so coverage of the eval mirrors coverage of the instrumentation.

Attributing cost from traces

Because LLM spans carry token counts (and parent spans aggregate them), traces turn cost questions from guesswork into arithmetic. The method: from a sample of traces, get each skill’s average LLM calls × tokens/call, weight by how often the skill runs, multiply by the daily volume and the price per token.

Per-skill cost attribution Worked example

Problem. A support agent serves 4,000 queries/day. From 200 sampled traces you extract per-skill averages. The router uses 1 LLM call of 250 tokens on every query. Price is $0.01 per 1,000 tokens. Find the daily cost per skill and the total, then the cost per query of the most expensive skill.

| Skill | Avg LLM calls | Tokens/call | Share of queries |
| --- | --- | --- | --- |
| Knowledge lookup | 1.0 | 700 | 50% |
| Answer drafting | 1.0 | 1,400 | 30% |
| Report builder | 2.0 | 2,000 | 20% |

Reasoning. Tokens = queries × share × calls × tokens/call; cost = tokens ÷ 1000 × $0.01.

Router: $4000 \times 1 \times 250 = 1{,}000{,}000$ tok → $10.00
Lookup: $4000 \times 0.50 \times 1.0 \times 700 = 1{,}400{,}000$ tok → $14.00
Drafting: $4000 \times 0.30 \times 1.0 \times 1{,}400 = 1{,}680{,}000$ tok → $16.80
Report: $4000 \times 0.20 \times 2.0 \times 2{,}000 = 3{,}200{,}000$ tok → $32.00

Answer. Total = $72.80/day. The report builder serves only 20% of queries but costs $32.00: its per-query cost is $32.00 / 800 = $0.040, versus lookup at $14.00 / 2,000 = $0.007, so it is ~5.7× more expensive per query. The trace data localises where to optimise (here, the report builder’s 2 LLM calls), which no end-to-end cost number could.
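The arithmetic of this worked example can be reproduced in a few lines. The numbers come straight from the problem statement; the variable and function names are illustrative.

```python
QUERIES_PER_DAY = 4_000
PRICE_PER_1K_TOKENS = 0.01  # dollars

# name -> (avg LLM calls, tokens per call, share of queries)
skills = {
    "Router": (1.0, 250, 1.00),            # runs on every query
    "Knowledge lookup": (1.0, 700, 0.50),
    "Answer drafting": (1.0, 1_400, 0.30),
    "Report builder": (2.0, 2_000, 0.20),
}


def daily_cost(calls, tokens_per_call, share):
    """Daily dollars = queries x share x calls x tokens/call, priced per 1,000 tokens."""
    tokens = QUERIES_PER_DAY * share * calls * tokens_per_call
    return tokens / 1_000 * PRICE_PER_1K_TOKENS


costs = {name: daily_cost(*row) for name, row in skills.items()}
total = sum(costs.values())

# Cost per query served: divide a skill's daily cost by the queries it handles.
per_query = {
    name: costs[name] / (QUERIES_PER_DAY * skills[name][2]) for name in skills
}

for name, cost in costs.items():
    print(f"{name:<18} ${cost:6.2f}/day  ${per_query[name]:.4f}/query")
print(f"{'Total':<18} ${total:6.2f}/day")
```

Keeping the per-skill averages in one table makes it easy to re-run the estimate when a new trace sample changes the call counts or token sizes.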

You have per-span token counts for each skill. Outline how you turn them into a daily cost broken down by skill. EAA-2.6

For each skill: average LLM-calls × tokens-per-call from sampled traces, times the skill’s share of queries, times daily query volume = daily tokens; multiply by price per token. Add the router’s per-query tokens. Summing gives the total and the per-skill breakdown, and dividing a skill’s cost by the queries it serves gives cost-per-query — exposing skills that are cheap overall but expensive per use.

Debugging with the trace tree

A trace’s shape is the debugging workflow: start at the symptom span and walk back through its parents.

Read the span tree, find the bug Worked example

Problem. A support agent routes to search_kb (one LLM call + one API call) then draft_response (two LLM calls: draft, then policy check). The final response violated policy. Sketch the span tree and name the span to inspect first.

Reasoning. Outside-in, the tree is:

Agent Run                        [Agent]
  Router Iteration #1            [Chain]
    LLM: Route Decision          [LLM]
    search_kb                    [Tool]
      LLM: Generate Query        [LLM]
      API: KB Search             [Tool]
  Router Iteration #2            [Chain]
    LLM: Route Decision          [LLM]
    draft_response               [Tool]
      LLM: Generate Response     [LLM]
      LLM: Policy Compliance     [LLM]

Answer. Inspect LLM: Policy Compliance first (child of draft_response). Two cases: if it approved a violating response, the compliance prompt is the bug; if it flagged the violation but the response shipped anyway, the tool’s post-check gating logic is broken. The span hierarchy tells you exactly where to look — symptom span, then up through its parents.
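The "symptom span, then up through its parents" walk can be sketched directly over the tree above. The integer span IDs and the `path_to_root` helper are hypothetical; a real backend would supply its own IDs and parent links.

```python
# span id -> (name, parent id); mirrors the span tree in this worked example.
SPANS = {
    1: ("Agent Run", None),
    2: ("Router Iteration #1", 1),
    3: ("LLM: Route Decision", 2),
    4: ("search_kb", 2),
    5: ("LLM: Generate Query", 4),
    6: ("API: KB Search", 4),
    7: ("Router Iteration #2", 1),
    8: ("LLM: Route Decision", 7),
    9: ("draft_response", 7),
    10: ("LLM: Generate Response", 9),
    11: ("LLM: Policy Compliance", 9),
}


def path_to_root(span_id):
    """Return span names from the symptom span up to the root, in inspection order."""
    path = []
    current = span_id
    while current is not None:
        name, parent = SPANS[current]
        path.append(name)
        current = parent
    return path


# Start at the policy check, the span closest to the symptom.
inspection_order = path_to_root(11)
print(" -> ".join(inspection_order))
```

Each step up the path widens the scope of the question: first the compliance prompt, then the tool's gating logic, then the routing decision that chose the tool.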

A trace shows a single flat span for the whole agent run. Which spans would you add to expose the router and tools, and why does the flat trace block debugging? EAA-2.8

Wrap the whole run in an Agent span, each router-loop iteration in a Chain span, and each skill in a Tool span (the LLM calls auto-trace inside them). A flat trace blocks debugging because it collapses the router→tool→LLM structure into one node, so you can’t see which decision misfired — there are no boundaries to localise the failure to.

Summary — what this sets up

You can now see the agent: traces capture each run, spans nest to mirror the architecture, automatic instrumentation gives instant LLM visibility, and manual outside-in spans expose the router and tools. Collected in Phoenix, these traces are the substrate for everything next:

Chapter 3 — component evaluations: scoring the router and skills the spans expose.
Chapter 4 — trajectory & structured evals: scoring the whole span sequence.
Chapter 5 — LLM-as-a-judge & monitoring: automated grading over live traces.