Agent Evals Need Shadow Traffic, Not Trophy Boards

Figure: real production work mirrored into a shadow evaluation lane for AI agents before shipping.

Agent evals are getting treated like trophy cabinets.

Nice charts. Cute win rates. A benchmark screenshot in the launch post. A heroic claim about “state of the art” on a task set nobody on your team has ever seen in production.

That is fine for marketing.

It is not how you decide whether an agent should touch real work.

If an agent is going to read tickets, edit repos, call MCP tools, triage incidents, open PRs, summarize customer data, or queue deploy plans, the eval that matters is not a leaderboard. It is shadow traffic: mirror real work into the candidate agent, keep it away from the controls, and compare what it would have done against what actually happened.

No shadow lane? Then you are not evaluating an agent. You are admiring it in a showroom.

Source freshness check: this post was checked on 2026-07-01 against currently active agent/devtool sources. The agent stack is moving this week, not in some stale hype cycle: OpenAI Agents SDK had activity on 2026-07-01, Gemini CLI on 2026-06-30, Claude Code on 2026-06-30, LangGraph on 2026-06-30, Model Context Protocol on 2026-06-30, and GitHub MCP Server on 2026-06-27. The current direction is obvious: agents are becoming tool-using execution systems. That makes production-shaped evaluation a live product requirement, not academic garnish.

Benchmarks are not your blast radius

Benchmarks are useful signal. Do not throw them out.

But they answer the wrong first question.

A benchmark asks: “Can this system solve a known task under known scoring rules?”

Your product needs to ask: “What happens when this agent sees our actual mess?”

Your actual mess includes:

ambiguous tickets
half-updated docs
flaky tests
permissions it should not have
stale memory
weird customer data
tool timeouts
repo conventions nobody wrote down
humans overriding the workflow halfway through
tasks where the right answer is “do nothing”

That last one matters. A lot of agents look good when every prompt expects action. Production has plenty of moments where action is the bug.

An agent that confidently changes code when it should ask for clarification is not “proactive.” It is a liability with autocomplete.

Shadow mode is how you catch the weird failures

Shadow traffic means the candidate agent watches the same class of work your current system handles, but it cannot affect users or production state.

For a coding agent, that might mean:

mirror incoming issues or internal tickets
let the candidate produce a plan, diff, test command list, and PR description
do not open the real PR automatically
compare against the human or current-agent outcome
score the delta, not just the final answer
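The coding-agent steps above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `Proposal`, `ShadowResult`, `files_in_diff`, and `run_shadow` are hypothetical names, the candidate is assumed to be any callable that takes ticket text and returns a proposal, and "score the delta" is reduced to one crude signal (file overlap between the candidate's diff and the real fix).

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Proposal:
    """What the candidate would have done. Nothing here is executed."""
    plan: str
    diff: str
    test_commands: list[str] = field(default_factory=list)
    pr_description: str = ""


@dataclass
class ShadowResult:
    ticket_id: str
    proposal: Proposal
    files_candidate: set[str]
    files_actual: set[str]

    @property
    def file_overlap(self) -> float:
        """Jaccard overlap between files the candidate and the real fix touched."""
        union = self.files_candidate | self.files_actual
        if not union:
            return 1.0  # both changed nothing, which counts as agreement
        return len(self.files_candidate & self.files_actual) / len(union)


def files_in_diff(diff: str) -> set[str]:
    """Collect paths from the '+++ b/<path>' header lines of a unified diff."""
    prefix = "+++ b/"
    return {
        line[len(prefix):].strip()
        for line in diff.splitlines()
        if line.startswith(prefix)
    }


def run_shadow(
    ticket_id: str,
    ticket_text: str,
    actual_diff: str,
    candidate: Callable[[str], Proposal],
) -> ShadowResult:
    """Run the candidate on a mirrored ticket and compare it to what shipped.

    The proposal is only stored and scored. No PR is opened here.
    """
    proposal = candidate(ticket_text)
    return ShadowResult(
        ticket_id=ticket_id,
        proposal=proposal,
        files_candidate=files_in_diff(proposal.diff),
        files_actual=files_in_diff(actual_diff),
    )
```

File overlap is a deliberately blunt proxy. It will not tell you whether the patch is correct, only whether the candidate was looking in the same place as the human. Note that a "do nothing" proposal scored against a "do nothing" outcome comes out as full agreement.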

For a support or ops agent, it might mean:

replay real conversations with sensitive data stripped or access-scoped
let the candidate choose tools it would call
block writes and external sends
compare recommendations to actual resolutions
flag dangerous, expensive, slow, or policy-breaking choices
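One way to "block writes and external sends" while still learning which tools the candidate would have called is a gateway that executes reads, records writes, and refuses everything else. The sketch below assumes tools are plain Python callables registered by name. `ShadowToolGateway` and the tool names in the usage note are hypothetical and do not correspond to any particular agent framework or MCP server.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ShadowToolGateway:
    """Routes tool calls in shadow mode: reads run, writes are logged and blocked."""
    read_tools: dict[str, Callable[..., Any]]
    write_tool_names: set[str]
    log: list[dict[str, Any]] = field(default_factory=list)

    def call(self, name: str, /, **kwargs: Any) -> Any:
        if name in self.read_tools:
            self.log.append({"tool": name, "args": kwargs, "outcome": "executed"})
            return self.read_tools[name](**kwargs)
        if name in self.write_tool_names:
            # Record the intent, but never touch production state.
            self.log.append({"tool": name, "args": kwargs, "outcome": "blocked"})
            return {"status": "shadow_blocked", "tool": name}
        # Default deny: a tool nobody registered is a finding in itself.
        self.log.append({"tool": name, "args": kwargs, "outcome": "denied"})
        raise PermissionError(f"tool {name!r} is not registered for shadow runs")

    def blocked_writes(self) -> list[dict[str, Any]]:
        """Write attempts the candidate made, for comparison with the real resolution."""
        return [entry for entry in self.log if entry["outcome"] == "blocked"]
```

In use, you would hand the candidate `gateway.call` instead of the real tool clients, for example with a read tool like `lookup_order` and write names like `send_email` or `issue_refund`. After the run, `blocked_writes()` is the list of things it would have done to a customer.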

The point is not to create a fake exam.

The point is to observe the agent under production-shaped pressure before it gets production-shaped power.

Score the run, not just the answer

Agent evals that only score final output are missing the damn plot.

The path matters.

Did the agent read the right files? Did it call the right tools? Did it avoid secrets? Did it ask for approval at the right point? Did it stop when permissions were missing? Did it cite current sources? Did it notice a stale artifact? Did it spend twenty dollars to save five minutes? Did it create a beautiful patch with no rollback plan?

A useful shadow eval should track:

task success
harmful action attempts
unnecessary tool calls
missing tool calls
latency
token and tool cost
human override rate
stale-source usage
policy violations
test or verification quality
rollback readiness
confidence calibration
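A per-run record covering part of that list might look like the sketch below. The field names and the `summarize` helper are illustrative assumptions. Only a subset of the metrics is modeled, and confidence calibration is approximated with a Brier score (mean squared gap between the agent's stated confidence and whether the task actually succeeded; lower is better), which is one common choice rather than the only one.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    task_success: bool
    harmful_action_attempts: int
    unnecessary_tool_calls: int
    missing_tool_calls: int
    latency_s: float
    cost_usd: float
    human_override: bool
    policy_violations: int
    confidence: float  # agent's stated probability of success, in [0, 1]


def summarize(runs: list[RunRecord]) -> dict[str, float]:
    """Aggregate shadow runs into the numbers a promotion review would look at."""
    if not runs:
        raise ValueError("need at least one run to summarize")
    n = len(runs)
    return {
        "success_rate": sum(r.task_success for r in runs) / n,
        "override_rate": sum(r.human_override for r in runs) / n,
        "harmful_attempts_per_run": sum(r.harmful_action_attempts for r in runs) / n,
        "unnecessary_calls_per_run": sum(r.unnecessary_tool_calls for r in runs) / n,
        "missing_calls_per_run": sum(r.missing_tool_calls for r in runs) / n,
        "policy_violations_per_run": sum(r.policy_violations for r in runs) / n,
        "mean_latency_s": sum(r.latency_s for r in runs) / n,
        "mean_cost_usd": sum(r.cost_usd for r in runs) / n,
        # Brier score: 0.0 is perfectly calibrated and correct.
        "brier_score": sum(
            (r.confidence - float(r.task_success)) ** 2 for r in runs
        ) / n,
    }
```

Keeping path metrics (tool calls, violations, overrides) in the same record as success is the point: a run that succeeds after a harmful attempt should not look identical to a clean one.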

This is where agent observability and evals become the same product surface.

A trace is not just for debugging after the fire. It is evidence for deciding whether the next agent version is allowed near the stove.

Promote agents like you promote services

A serious agent rollout should look boringly familiar to anyone who has shipped production software:

offline tests
replay tests
shadow traffic
limited internal writes
guarded user-facing actions
canary rollout
kill switch
rollback review

The agent should earn permissions in stages.

First it can suggest.

Then it can draft.

Then it can open a PR behind review.

Then it can handle low-risk classes of work.

Then maybe, after enough clean shadow data and human override history, it gets broader autonomy.

Not before.
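The permission ladder can be written down as an explicit gate. The sketch below is an assumption-heavy illustration: the `Stage` names mirror the steps above, but `MIN_SHADOW_RUNS`, `MAX_OVERRIDE_RATE`, and the rule that any harmful attempt blocks promotion are placeholder policy choices, not recommended values.

```python
from enum import IntEnum


class Stage(IntEnum):
    SUGGEST = 0
    DRAFT = 1
    PR_BEHIND_REVIEW = 2
    LOW_RISK_WORK = 3
    BROADER_AUTONOMY = 4


# Hypothetical thresholds. Pick your own from your own shadow data.
MIN_SHADOW_RUNS = 200
MAX_OVERRIDE_RATE = 0.05


def next_stage(
    current: Stage,
    shadow_runs: int,
    override_rate: float,
    harmful_attempts: int,
) -> Stage:
    """Promote by at most one stage, and only on clean shadow evidence."""
    if harmful_attempts > 0:
        return current  # no promotion while harmful attempts are on record
    if shadow_runs < MIN_SHADOW_RUNS or override_rate > MAX_OVERRIDE_RATE:
        return current  # not enough evidence, or humans keep correcting it
    if current is Stage.BROADER_AUTONOMY:
        return current  # already at the top of the ladder
    return Stage(current + 1)
```

The function never skips a stage and takes no benchmark score as input. That omission is intentional: promotion here depends only on observed shadow behavior.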

“Model got better” is not a deployment plan. “The vendor launched a new agent mode” is not a safety case. “It passed the benchmark” is not permission to touch the money printer.

The eval set should keep rotating on purpose

Static eval sets go stale. Agents overfit. Teams learn the shape of the test. Vendors optimize for the scoreboard. Everyone pretends this is fine because the number went up.

For production agents, your eval corpus should keep changing because production keeps changing.

Keep a rolling sample of:

recently solved tickets
reverted PRs
incident follow-ups
confusing user asks
flaky tool sessions
policy-denied attempts
expensive runs
stale-source failures
cases where humans said “nope”

Then replay them against candidate versions.
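A rolling sample can be as simple as a bounded queue with an age cutoff. The sketch below assumes cases are identified by an ID and a free-form `kind` label. `EvalCase`, `RollingCorpus`, and the window sizes used with them are hypothetical.

```python
import random
from collections import deque
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    kind: str  # e.g. "reverted_pr", "policy_denied", "human_nope"
    added_at: datetime


class RollingCorpus:
    """Keeps only recent cases: bounded in size and filtered by age."""

    def __init__(self, max_age: timedelta, max_size: int) -> None:
        self.max_age = max_age
        # deque(maxlen=...) silently evicts the oldest entry when full.
        self._cases: deque[EvalCase] = deque(maxlen=max_size)

    def add(self, case: EvalCase) -> None:
        self._cases.append(case)

    def fresh(self, now: datetime) -> list[EvalCase]:
        """Cases still inside the age window, in insertion order."""
        return [c for c in self._cases if now - c.added_at <= self.max_age]

    def sample(self, now: datetime, k: int, seed: Optional[int] = None) -> list[EvalCase]:
        """Draw up to k fresh cases to replay against a candidate version."""
        pool = self.fresh(now)
        return random.Random(seed).sample(pool, min(k, len(pool)))
```

Passing a fixed `seed` lets two candidate versions be replayed against the same draw, so their results are comparable, while the next evaluation cycle still gets a different slice of production.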

If a new agent version improves its benchmark score but increases human overrides on your actual ticket stream, believe the override metric, not the marketing slide.
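That rule is easy to encode so nobody has to argue about it in a launch meeting. The sketch below is a hedged illustration: `VersionStats`, `ship_decision`, and the `min_replayed` default are hypothetical, and the comparison is a plain rate check with no statistical test, which a real gate would probably want.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    name: str
    benchmark_score: float  # recorded for the report, ignored by the gate
    replayed: int           # tickets replayed from the rolling corpus
    overrides: int          # replays where a human would have overridden

    @property
    def override_rate(self) -> float:
        if self.replayed == 0:
            raise ValueError(f"{self.name}: no replayed tickets")
        return self.overrides / self.replayed


def ship_decision(
    baseline: VersionStats,
    candidate: VersionStats,
    min_replayed: int = 100,
) -> str:
    """Decide on replay evidence only. A higher benchmark score buys nothing."""
    if candidate.replayed < min_replayed:
        return "hold: not enough replayed tickets"
    if candidate.override_rate > baseline.override_rate:
        return "block: human overrides went up on real tickets"
    return "eligible"
```

A candidate with a better benchmark score and a worse override rate comes back as a block, which is exactly the case the leaderboard would have waved through.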

The bottom line

Agents are crossing from answer machines into work machines.

Work machines need production-shaped evaluation.

Leaderboards can tell you where to look. Shadow traffic tells you whether to ship.

So build the shadow lane:

mirror real work before granting write power
score tool behavior, not just output prettiness
track cost, latency, policy violations, and human overrides
keep a fresh rolling eval set from actual failures
promote agent permissions gradually
make rollback and kill switches part of the eval story

If your agent only wins in a benchmark harness, congratulate it politely.

If it survives shadow traffic with fewer mistakes, lower cost, clearer receipts, and less human cleanup, now we are talking.

Everything else is trophy-board theater.