The first wave of agent demos taught everyone to ask: “Can it do the task?”
The second wave is teaching a nastier question: “What the hell did it do while trying?”
That question matters now because agents are not cute autocomplete boxes anymore. They read repos, call tools, touch tickets, browse docs, run commands, store memory, ask for approvals, and sometimes push changes into systems that cost money or break customers.
If all you have after that is a friendly model summary saying “done,” congratulations. You built a black box with commit access.
Agent observability is the new logs. Not optional. Not enterprise garnish. The baseline.
Source freshness check: this post was checked on 2026-06-11. The OpenAI Agents SDK repository showed activity on 2026-06-11 and describes agents with tools, guardrails, handoffs, and built-in tracing concepts. The OpenTelemetry GenAI semantic conventions repository showed activity on 2026-06-11, a current signal that GenAI telemetry is still being standardized. Arize Phoenix showed activity on 2026-06-11 and describes itself in terms of AI observability and evaluation. The LangSmith SDK repository showed activity on 2026-06-10 and positions LangSmith around debugging, evaluating, and monitoring language model apps and agents. The tools will churn, but the live trend is clear: serious agent stacks are moving from “chat transcript” toward traces, spans, evals, and receipts.
A chat transcript is not observability
A transcript tells you what the agent said.
That is not the same as what happened.
The useful record is uglier and more mechanical:
- the prompt and policy that shaped the run
- which context was retrieved
- which memory entries were read or written
- which tools were offered
- which tools were actually called
- arguments passed into those tools
- outputs returned by those tools
- retries, failures, fallbacks, and timeouts
- approvals requested and granted
- files changed, commands run, tests executed
- final artifacts, diffs, links, and receipts
That is the difference between a bedtime story and an incident timeline.
The model’s own final answer is not enough because models compress, omit, rationalize, and occasionally explain the universe like a confident intern who deleted the logs.
Agents create distributed traces now
A modern agent run is basically a tiny distributed system wearing a chatbot mask.
One request fans out into retrieval, tool calls, browser actions, shell commands, API writes, eval checks, and human approvals. Some of that happens synchronously. Some of it happens in background jobs. Some of it fails halfway through and gets retried with different context.
That is not a “message.”
That is a trace.
And once you see agent runs as traces, the observability requirements get obvious:
- every run needs a run id
- every meaningful operation needs a span
- every span needs timing, status, inputs, outputs, and error state
- sensitive data needs redaction before storage
- tool calls need structured receipts
- human approvals need immutable evidence
- costs and token usage need to attach to the work, not float in a billing fog
- final answers need links back to the actual execution record
This is boring engineering. Which means it is exactly where the product gets real.
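To make the list above concrete, here is a minimal sketch of a run with spans, kept entirely in memory. It is not any particular SDK's API: `Run`, `Span`, and the `span()` helper are hypothetical names made up for illustration, and a real stack would more likely export to a tracing backend than append to a Python list.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Span:
    run_id: str
    name: str
    inputs: dict[str, Any]
    started_at: float
    ended_at: Optional[float] = None
    status: str = "running"  # becomes "ok" or "error" when the span closes
    outputs: dict[str, Any] = field(default_factory=dict)
    error: Optional[str] = None
    tokens: int = 0  # token usage attached to this unit of work


@dataclass
class Run:
    # Every run gets a stable id that all of its spans carry.
    run_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str, **inputs: Any):
        s = Span(self.run_id, name, inputs, started_at=time.time())
        self.spans.append(s)
        try:
            yield s
            s.status = "ok"
        except Exception as exc:
            # Record the failure on the span, then let it propagate.
            s.status = "error"
            s.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            s.ended_at = time.time()

    def total_tokens(self) -> int:
        return sum(s.tokens for s in self.spans)


run = Run()

with run.span("retrieval", query="refund policy") as s:
    s.outputs["doc_ids"] = ["doc-1", "doc-2"]
    s.tokens = 120

try:
    with run.span("tool.call", tool="create_ticket"):
        raise TimeoutError("upstream timed out")
except TimeoutError:
    pass  # the caller may retry; the failed attempt stays on the record

for s in run.spans:
    print(s.name, s.status, s.error)
```

The point of the context manager is that a failed operation still closes its span with an error state, so the retry that follows shows up as a second span instead of silently replacing the first.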
“The agent said it checked” is not a check
Agent summaries are useful UX. They are terrible evidence.
If an agent says it ran tests, the trace should show the command, exit code, timestamp, output summary, working directory, and commit hash.
If it says it read a customer ticket, the trace should show which ticket, which fields, and whether private content was redacted.
If it says it used memory, the trace should show which memory records influenced the answer.
If it says it deployed, the trace should show the artifact, environment, deploy id, and rollback path.
Not because you distrust every model.
Because “trust me bro” is not an operations strategy.
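For the "it ran tests" case, a receipt can be captured at the moment the command runs instead of being reconstructed from model prose. The sketch below assumes the agent executes commands through a wrapper you control; `run_with_receipt` and `current_commit` are hypothetical helper names, and the commit hash is read with `git rev-parse HEAD`, falling back to `None` when git or a repository is not available.

```python
import os
import subprocess
import sys
from datetime import datetime, timezone
from typing import Optional


def current_commit(cwd: str) -> Optional[str]:
    """Return the HEAD commit hash, or None if git or a repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=cwd, capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def run_with_receipt(command: list[str], cwd: str = ".", tail_lines: int = 20) -> dict:
    """Run a command and return a structured receipt of what happened."""
    started = datetime.now(timezone.utc)
    proc = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    output = (proc.stdout + proc.stderr).splitlines()
    return {
        "command": command,
        "cwd": os.path.abspath(cwd),
        "exit_code": proc.returncode,
        "started_at": started.isoformat(),
        "output_tail": output[-tail_lines:],  # a summary, not the full log
        "commit": current_commit(cwd),
    }


# Stand-in for a real test command such as a project's test runner.
receipt = run_with_receipt([sys.executable, "-c", "print('2 passed')"])
print(receipt["exit_code"], receipt["output_tail"])
```

The receipt is produced by the wrapper, not by the model, which is the whole idea: the agent can summarize it however it likes, but the exit code is not up for interpretation.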
The observability stack needs agent-native fields
Traditional app logs help, but they miss the agent-shaped parts.
You need to know more than “POST /tool/call returned 200.” You need to know why the tool was called, what authority the model had, what context it saw, what it was trying to accomplish, and whether the result changed the next decision.
Minimum useful fields look like this:
- Run identity — task id, user, workspace, model, version, policy, and starting snapshot.
- Context lineage — documents, repo files, memories, tickets, and search results used.
- Decision spans — planning steps, model calls, routing choices, and confidence signals where available.
- Tool receipts — tool name, arguments, outputs, duration, permissions, errors, and redaction status.
- State changes — file diffs, database writes, messages sent, branches pushed, tickets edited.
- Evaluation gates — tests, linters, policy checks, human review, and production safety gates.
- Cost and latency — token spend, API cost, queue time, tool time, and retry waste.
- Final accountability — who or what approved the result, what shipped, and how to roll it back.
Without that, debugging becomes archaeology.
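One possible shape for those fields is a single record per run with one attribute per group. This is a sketch under assumptions, not a standard schema: `RunRecord`, its attribute names, and the example values are all invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class RunRecord:
    # One attribute per field group from the list above.
    identity: dict[str, Any] = field(default_factory=dict)
    context_lineage: list[str] = field(default_factory=list)
    decision_spans: list[dict[str, Any]] = field(default_factory=list)
    tool_receipts: list[dict[str, Any]] = field(default_factory=list)
    state_changes: list[dict[str, Any]] = field(default_factory=list)
    evaluation_gates: list[dict[str, Any]] = field(default_factory=list)
    cost_and_latency: dict[str, float] = field(default_factory=dict)
    accountability: dict[str, Any] = field(default_factory=dict)

    def missing_groups(self) -> list[str]:
        """Names of field groups that are still empty for this run."""
        return [name for name, value in asdict(self).items() if not value]


record = RunRecord(
    identity={"task_id": "task-123", "model": "example-model", "policy": "v1"},
    context_lineage=["repo:README.md", "ticket:EXAMPLE-1"],
    tool_receipts=[{
        "tool": "run_tests",
        "arguments": {"path": "tests/"},
        "duration_ms": 5400,
        "error": None,
        "redacted": False,
    }],
    cost_and_latency={"tokens": 1830, "tool_time_s": 5.4},
)

print(json.dumps(asdict(record), indent=2))
print("missing:", record.missing_groups())
```

A check like `missing_groups()` is a cheap way to notice that a run mutated something but recorded no state changes, or shipped with no accountability entry at all.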
Privacy is part of observability, not an excuse to skip it
Yes, traces can leak sensitive data if you do them like a clown.
That is not an argument against observability. It is an argument for designing it properly.
Agent telemetry needs redaction, retention limits, access controls, tenant boundaries, and separate handling for prompts, tool outputs, secrets, and customer data. The trace should tell you enough to debug and audit without turning your observability backend into a second data breach waiting room.
The rule is simple: collect the shape of the work, the evidence of the work, and the safety-critical details. Do not casually warehouse every private token the agent tripped over.
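Redaction before storage can start as a small function that walks the payload on its way out of the execution boundary. The sketch below is deliberately naive: the key names in `SENSITIVE_KEYS` and the two regex patterns are illustrative assumptions, and a real deployment would need its own list plus testing against its own data.

```python
import re
from typing import Any

# Illustrative only; extend for the secrets and PII your system actually sees.
SENSITIVE_KEYS = {"password", "api_key", "authorization", "token", "secret"}
SECRET_PATTERNS = [
    re.compile(r"Bearer\s+[A-Za-z0-9._\-]+"),   # bearer tokens in text
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),    # email addresses
]
MASK = "[REDACTED]"


def redact(value: Any) -> Any:
    """Return a redacted copy of a nested payload; the input is not modified."""
    if isinstance(value, dict):
        return {
            k: MASK if str(k).lower() in SENSITIVE_KEYS else redact(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(v) for v in value]
    if isinstance(value, str):
        for pattern in SECRET_PATTERNS:
            value = pattern.sub(MASK, value)
        return value
    return value


payload = {
    "tool": "http_get",
    "arguments": {"url": "https://example.com/orders", "api_key": "abc123"},
    "output": "Contact jane@example.com, header was Bearer eyJhbGciOi.xyz",
}
safe = redact(payload)
print(safe)
```

Note what survives: the tool name, the URL, and the shape of the output. That is usually enough to debug the run without storing the credential or the customer's address.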
Evals and traces belong together
Evals without traces tell you a score.
Traces tell you why you got the score.
When an agent fails a task, you need to see whether the model misunderstood the instruction, retrieved bad context, called the wrong tool, hit a flaky API, got blocked by permissions, or passed a bad intermediate result into the next step.
That is how teams improve systems instead of just swapping models and praying.
The eval says: failed.
The trace says: failed because retrieval pulled stale docs, the agent trusted them, the tool call succeeded, and the final patch changed the wrong config.
That second sentence is where engineering happens.
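Joining the two can be as simple as sharing a run id. The sketch below assumes spans are stored as plain dicts with `run_id`, `name`, `status`, and an optional `warnings` list; those field names and the `explain_failure` helper are hypothetical, and the staleness warning is assumed to have been attached by the retrieval step itself.

```python
from typing import Any


def explain_failure(eval_result: dict[str, Any], spans: list[dict[str, Any]]) -> list[str]:
    """List the spans from the failing run that carry an error or a warning."""
    if eval_result["passed"]:
        return []
    findings = []
    for span in spans:
        if span["run_id"] != eval_result["run_id"]:
            continue  # only look at the run this eval scored
        if span["status"] == "error":
            findings.append(f'{span["name"]}: error: {span["error"]}')
        for warning in span.get("warnings", []):
            findings.append(f'{span["name"]}: warning: {warning}')
    return findings


spans = [
    {"run_id": "run-1", "name": "retrieval", "status": "ok",
     "warnings": ["doc last updated 400 days ago"]},
    {"run_id": "run-1", "name": "tool.apply_patch", "status": "ok"},
    {"run_id": "run-2", "name": "tool.http_get", "status": "error",
     "error": "timeout"},
]
eval_result = {"run_id": "run-1", "task": "fix-config", "passed": False}

for line in explain_failure(eval_result, spans):
    print(line)
```

In this toy data every span in the failing run reports `ok`, which is exactly the stale-docs scenario: nothing crashed, so only the warning on the retrieval span points at the cause.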
The minimum serious setup
If you are shipping agentic workflows in 2026, the floor is not complicated:
- assign a stable id to every agent run
- trace model calls, retrieval, memory, tool calls, approvals, and state changes
- store tool receipts separately from model prose
- redact secrets before telemetry leaves the execution boundary
- attach tests, evals, and human review to the same run record
- expose a human-readable run timeline in the product
- make final answers link to evidence
- keep enough retention for incidents, debugging, billing disputes, and product learning
That is the boring version.
The spicy version: if your agent can mutate real systems and you cannot replay its path through context, tools, and approvals, you are flying blind with a very expensive hallucination engine.
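For the approval trail specifically, one common technique is a hash-chained, append-only log, where each entry commits to the one before it. This is a swapped-in illustration, not something the tools mentioned earlier are claimed to do, and it makes edits detectable, not impossible. `append_entry` and `verify` are hypothetical names.

```python
import hashlib
import json
from typing import Any

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, event: dict[str, Any]) -> str:
    body = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log: list[dict[str, Any]], event: dict[str, Any]) -> dict[str, Any]:
    """Append an event whose hash covers both the event and the previous hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)}
    log.append(entry)
    return entry


def verify(log: list[dict[str, Any]]) -> bool:
    """Recompute every hash in order; any edited or reordered entry fails."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log: list[dict[str, Any]] = []
append_entry(log, {"type": "approval_requested", "run_id": "run-1", "action": "deploy"})
append_entry(log, {"type": "approval_granted", "run_id": "run-1", "approver": "alice"})
print(verify(log))  # True for an untouched log

log[0]["event"]["action"] = "delete_database"  # simulate tampering
print(verify(log))  # False: the stored hash no longer matches the event
```

In practice the latest hash would also need to be anchored somewhere the agent cannot write, otherwise someone who can rewrite the whole log can rebuild the chain too.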
Ship receipts or keep it in the demo room
Agents are becoming operational software.
Operational software needs logs, metrics, traces, alerts, permissions, reviews, and rollback. Agents do not get a magic exemption because the UI has a chat box and the demo made everyone clap.
The better agent products will not just feel smart. They will be inspectable. They will show their work. They will make mistakes diagnosable instead of mystical.
That is the line between a toy agent and a production agent.
If it acts, trace it.
If it changes state, receipt it.
If it ships, make the evidence boring enough that a tired human can audit it at 2 a.m.
That is not bureaucratic drag.
That is how you let the damn thing do more work without turning every incident into ghost hunting.