Observability & Tracing
Traces and spans as the units of agent observability, the span hierarchy that mirrors agent architecture (Agent ⊃ Chain ⊃ Tool ⊃ LLM), automatic vs manual OpenTelemetry instrumentation, collecting traces in Phoenix, and turning traces into the bedrock for evaluation, cost attribution, and regression detection.
Why observability matters
Traditional software is deterministic: the same input takes the same path, and a stack trace points at the exact line that failed. Agents break both halves of that promise: the same query can trigger a different tool sequence on a different run, and failures are often semantic (a wrong answer, not a crash). Print statements don’t scale to thousands of variable, multi-step runs.
Observability means complete visibility into every layer: prompts, model responses, token usage, latency, and the function calls orchestrating behaviour.
The stack is three layers: (1) instrument with OpenTelemetry, (2) collect in Phoenix, (3) evaluate over the collected traces.
Traces and spans
A trace is the complete end-to-end record of a single agent run — one or more spans in a hierarchical tree, linked by a unique trace ID. A span is one step within it: an LLM call, a tool invocation, or a routing decision, recording start/end time, attributes, and status. Spans nest.
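The relationship between a trace and its spans can be sketched as a small data structure. This is an illustration only: the field names (span_id, parent_id) and the tool name lookup_sales_data are assumptions made for the example, not the OpenTelemetry schema or part of the agent described here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One step of a run. Field names are illustrative, not a real OTel schema."""
    span_id: str
    trace_id: str                    # shared by every span in the same run
    name: str
    parent_id: Optional[str] = None  # None marks the root span
    children: list["Span"] = field(default_factory=list)


def build_tree(spans: list[Span]) -> Span:
    """Link the spans of one trace into a tree via parent_id and return the root."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> list[str]:
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + span.name]
    for child in span.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical spans from a single run (trace ID "t1").
spans = [
    Span("a", "t1", "Agent Run"),
    Span("b", "t1", "Router Iteration", parent_id="a"),
    Span("c", "t1", "LLM call", parent_id="b"),
    Span("d", "t1", "lookup_sales_data", parent_id="b"),
]
tree_lines = render(build_tree(spans))
print("\n".join(tree_lines))
```

A flat list of spans becomes a readable tree purely from the parent links, which is why every span needs both the shared trace ID and a pointer to its parent.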
The span hierarchy mirrors the agent architecture
EAA-2.4 Span nesting directly reflects the agent’s structure: an Agent span wraps the run, Chain spans are router iterations, Tool spans are skill invocations, and LLM spans are model calls, nested under whichever span issued them (Agent ⊃ Chain ⊃ Tool ⊃ LLM). Reading a trace is reading the architecture diagram with real data. When a trace shows the wrong tool called, you open that Chain span’s LLM call to see what the model received. The failure mode: a poorly instrumented agent produces a flat trace that is harder to navigate than print statements. [V] Verified
Define a trace and a span, and state how they relate. EAA-2.1
A trace is the end-to-end record of one agent run; a span is a single step within it (LLM call, tool invocation, routing decision). A trace is a hierarchical tree of spans sharing one trace ID — the trace is the whole run, each span a nested step inside it.
OpenTelemetry and OpenInference
OpenTelemetry (OTEL) is the vendor-neutral observability framework, a data model and instrumentation API for traces, spans, metrics, and logs: instrument once, export to any compatible backend. [V] Verified For LLM systems, OpenInference (from Arize) adds LLM-specific span kinds and attribute names for prompts, completions, and token counts. The four span kinds most relevant to agents are Agent (full run), Chain (router iteration), Tool (skill invocation), and LLM (single model call); they are the vocabulary the hierarchy above is built from. (OpenInference defines more, including Retriever, Reranker, and Embedding, but these four carry an agent trace.)
Automatic vs manual instrumentation
The fastest path is automatic instrumentation: a library hooks your API client and creates a span for every call with no changes to your call sites. For OpenAI it comes down to two imports and two calls:
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
register(project_name="sales-agent") # point at the Phoenix collector
OpenAIInstrumentor().instrument() # every chat.completions.create() is now traced
That captures LLM-level detail — prompts, completions, tokens, latency — but cannot see your router logic, tool dispatch, or business code.
Manual instrumentation fills the gap. The key principle is outside-in: start at the outermost layer (the agent run) and work inward; nesting Python with statements produces the correct hierarchy automatically.
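from opentelemetry import trace

# Assumes register() has already run, so this tracer accepts openinference_span_kind.
# client and dispatch_tool are defined elsewhere in the agent.
tracer = trace.get_tracer(__name__)

def run_agent(user_message: str) -> str:
    with tracer.start_as_current_span("Agent Run", openinference_span_kind="agent"):
        while True:
            with tracer.start_as_current_span("Router Iteration", openinference_span_kind="chain"):
                response = client.chat.completions.create(...)  # auto-traced as an LLM span
                msg = response.choices[0].message
                if not msg.tool_calls:
                    return msg.content  # no tool requested: the run is finished
                for call in msg.tool_calls:
                    # One Tool span per skill invocation, named after the tool
                    with tracer.start_as_current_span(call.function.name, openinference_span_kind="tool"):
                        result = dispatch_tool(call)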
Automatic + manual = full observability
EAA-2.2 Automatic instrumentation covers every LLM API call with zero code changes; manual instrumentation adds the business-logic spans (agent, router, tool dispatch). The decision rule: start automatic for immediate visibility, then add manual spans where the trace has gaps. Each has a failure mode: automatic can flood batch jobs with noise; manual demands discipline, since a missed with statement leaves a hole in the tree.
Contrast what automatic and manual instrumentation each make visible, and give the decision rule for combining them. EAA-2.2
Automatic hooks the LLM client → every model call’s prompt/completion/tokens/latency, zero code changes, but nothing about routing or tool dispatch. Manual wraps your own code in spans → router iterations, tool calls, business logic. Rule: start automatic for instant LLM visibility, then add manual spans exactly where the trace shows gaps (agent/router/tool layers).
Phoenix: collecting and visualising traces
Phoenix is Arize’s open-source observability tool: it receives OTEL traces, stores them by project, and gives trace-tree views, timelines, span-detail inspection, and evaluation integration. [V] Verified It is where the instrumented spans become something you can read, query, and evaluate.
From traces to evaluation
Traces are the bedrock for evaluation at scale. Every collected trace holds full inputs, outputs, and intermediate steps, so you can: build evaluation datasets from real runs, run automated evaluators over traces, detect regressions by comparing scores across versions, and monitor production quality by sampling live traces.
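Regression detection by comparing scores across versions can be sketched in a few lines. This is a minimal illustration under stated assumptions: the evaluator names, the scores, and the 0.05 tolerance are invented for the example, and the scores are assumed to have already been produced by evaluators run over each version's traces.

```python
from statistics import mean


def detect_regressions(baseline, candidate, tolerance=0.05):
    """Return evaluators whose mean score dropped by more than `tolerance`."""
    regressions = {}
    for name, base_scores in baseline.items():
        if name not in candidate:
            continue  # evaluator was not run on the candidate version
        drop = mean(base_scores) - mean(candidate[name])
        if drop > tolerance:
            regressions[name] = round(drop, 3)
    return regressions


# Hypothetical per-trace scores for two agent versions.
baseline = {"router_correct": [1, 1, 1, 0], "answer_quality": [0.8, 0.9, 0.7]}
candidate = {"router_correct": [1, 0, 1, 0], "answer_quality": [0.8, 0.9, 0.75]}

regressions = detect_regressions(baseline, candidate)
print(regressions)
```

A fixed tolerance is the simplest possible rule; with small samples a real comparison would likely need more traces or a statistical test before calling a drop a regression.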
Traces are evaluation infrastructure
EAA-2.5 Every later workflow rides on traces: router evaluations read trace spans, trajectory evaluations score step sequences, LLM-as-a-judge grades span outputs. Without traces, evaluation needs manual data collection that doesn’t scale, so the tracing investment pays off multiplicatively. The catch: traces with missing spans yield incomplete evaluations, so instrument thoroughly before building eval workflows. [I] Inference
Name two evaluation workflows that depend on collected traces, and state what breaks if spans are missing. EAA-2.5
Two of: building eval datasets from real runs, running automated evaluators over traces, regression detection across versions, production quality monitoring by sampling live traces. If spans are missing, the corresponding evaluations are incomplete — e.g. a router eval can’t score a routing decision that was never captured — so coverage of the eval mirrors coverage of the instrumentation.
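One way to check instrumentation coverage before building evals is to look for traces that lack an expected span kind. The sketch below assumes each trace has already been reduced to a list of span-kind strings; the trace IDs and the input shape are hypothetical. A run that legitimately needed no tool will also be flagged, so the result is a list to review, not a list of bugs.

```python
EXPECTED_KINDS = {"AGENT", "CHAIN", "TOOL", "LLM"}


def missing_kinds(traces, expected=EXPECTED_KINDS):
    """Map trace_id -> expected span kinds that never appear in that trace."""
    gaps = {}
    for trace_id, kinds in traces.items():
        missing = set(expected) - set(kinds)
        if missing:
            gaps[trace_id] = sorted(missing)
    return gaps


# Hypothetical traces, each reduced to the kinds of its spans.
traces = {
    "t1": ["AGENT", "CHAIN", "LLM", "TOOL"],  # fully instrumented
    "t2": ["LLM", "LLM"],                     # automatic instrumentation only
    "t3": ["AGENT", "CHAIN", "LLM"],          # no tool span recorded
}
gaps = missing_kinds(traces)
print(gaps)
```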
Attributing cost from traces
Because LLM spans carry token counts (and parent spans aggregate them), traces turn cost questions from guesswork into arithmetic. The method: from a sample of traces, get each skill’s average LLM calls × tokens/call, weight by how often the skill runs, multiply by the daily volume and the price per token.
You have per-span token counts for each skill. Outline how you turn them into a daily cost broken down by skill. EAA-2.6
For each skill: average LLM-calls × tokens-per-call from sampled traces, times the skill’s share of queries, times daily query volume = daily tokens; multiply by price per token. Add the router’s per-query tokens. Summing gives the total and the per-skill breakdown, and dividing a skill’s cost by the queries it serves gives cost-per-query — exposing skills that are cheap overall but expensive per use.
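The arithmetic above can be written out as a short function. Every number here is made up for illustration (skill names, shares, token averages, query volume, and the price), and the sketch assumes a single blended price per token instead of separate input and output prices.

```python
def daily_cost_by_skill(skills, router_tokens_per_query, daily_queries, price_per_million_tokens):
    """Turn sampled per-skill token averages into a daily cost breakdown."""
    price_per_token = price_per_million_tokens / 1_000_000
    breakdown = {}
    for name, stats in skills.items():
        queries = daily_queries * stats["share"]  # queries routed to this skill
        tokens = queries * stats["llm_calls"] * stats["tokens_per_call"]
        cost = tokens * price_per_token
        breakdown[name] = {"daily_cost": cost, "cost_per_query": cost / queries}
    # The router runs on every query, so its tokens are counted once per query.
    router_cost = daily_queries * router_tokens_per_query * price_per_token
    breakdown["router"] = {
        "daily_cost": router_cost,
        "cost_per_query": router_cost / daily_queries,
    }
    return breakdown


# Hypothetical averages taken from a sample of traces.
skills = {
    "lookup_sales_data": {"share": 0.7, "llm_calls": 1, "tokens_per_call": 500},
    "analyze_sales_data": {"share": 0.3, "llm_calls": 3, "tokens_per_call": 2000},
}
breakdown = daily_cost_by_skill(
    skills,
    router_tokens_per_query=300,
    daily_queries=10_000,
    price_per_million_tokens=2.0,
)
total = sum(item["daily_cost"] for item in breakdown.values())

for name, item in breakdown.items():
    print(f"{name}: ${item['daily_cost']:.2f}/day, ${item['cost_per_query']:.4f}/query")
print(f"total: ${total:.2f}/day")
```

Keeping daily cost and cost per query side by side is what lets the two rankings be compared: a skill can sit low on one list and high on the other.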
Debugging with the trace tree
A trace’s shape is the debugging workflow: start at the symptom span and walk back through its parents.
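The walk from symptom to root can be sketched over exported span records. The dictionary keys (span_id, parent_id, name) and the span names are assumptions for the example; a real export will use its own field names.

```python
def path_to_root(spans, span_id):
    """Walk from a symptom span up through its parents; returns names, symptom first."""
    by_id = {s["span_id"]: s for s in spans}
    path = []
    current = by_id.get(span_id)
    while current is not None:
        path.append(current["name"])
        current = by_id.get(current["parent_id"])  # None at the root ends the walk
    return path


# Hypothetical spans from one trace.
spans = [
    {"span_id": "a", "parent_id": None, "name": "Agent Run"},
    {"span_id": "b", "parent_id": "a", "name": "Router Iteration"},
    {"span_id": "c", "parent_id": "b", "name": "LLM call"},
    {"span_id": "d", "parent_id": "b", "name": "lookup_sales_data"},
]
path = path_to_root(spans, "d")
print(" -> ".join(path))
```

Starting from the tool span that returned the wrong result, the path leads to the router iteration that chose it, and that iteration's LLM call is the next place to look.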
A trace shows a single flat span for the whole agent run. Which spans would you add to expose the router and tools, and why does the flat trace block debugging? EAA-2.8
Wrap the whole run in an Agent span, each router-loop iteration in a Chain span, and each skill in a Tool span (the LLM calls auto-trace inside them). A flat trace blocks debugging because it collapses the router→tool→LLM structure into one node, so you can’t see which decision misfired — there are no boundaries to localise the failure to.
Summary — what this sets up
You can now see the agent: traces capture each run, spans nest to mirror the architecture, automatic instrumentation gives instant LLM visibility, and manual outside-in spans expose the router and tools. Collected in Phoenix, these traces are the substrate for everything next:
- Chapter 3 — component evaluations: scoring the router and skills the spans expose.
- Chapter 4 — trajectory & structured evals: scoring the whole span sequence.
- Chapter 5 — LLM-as-a-judge & monitoring: automated grading over live traces.