Agent evals are getting treated like trophy cabinets.
Nice charts. Cute win rates. A benchmark screenshot in the launch post. A heroic claim about “state of the art” on a task set nobody on your team has ever seen in production.
That is fine for marketing.
It is not how you decide whether an agent should touch real work.
If an agent is going to read tickets, edit repos, call MCP tools, triage incidents, open PRs, summarize customer data, or queue deploy plans, the eval that matters is not a leaderboard. It is shadow traffic: mirror real work into the candidate agent, keep it away from the controls, and compare what it would have done against what actually happened.
No shadow lane? Then you are not evaluating an agent. You are admiring it in a showroom.
Source freshness check: this post was checked on 2026-07-01 against currently active agent/devtool sources. The agent stack is moving this week, not in some stale hype cycle: OpenAI Agents SDK had activity on 2026-07-01, Gemini CLI on 2026-06-30, Claude Code on 2026-06-30, LangGraph on 2026-06-30, Model Context Protocol on 2026-06-30, and GitHub MCP Server on 2026-06-27. The current direction is obvious: agents are becoming tool-using execution systems. That makes production-shaped evaluation a live product requirement, not academic garnish.
Benchmarks are not your blast radius
Benchmarks are useful signal. Do not throw them out.
But they answer the wrong first question.
A benchmark asks: “Can this system solve a known task under known scoring rules?”
Your product needs to ask: “What happens when this agent sees our actual mess?”
Your actual mess includes:
- ambiguous tickets
- half-updated docs
- flaky tests
- permissions it should not have
- stale memory
- weird customer data
- tool timeouts
- repo conventions nobody wrote down
- humans overriding the workflow halfway through
- tasks where the right answer is “do nothing”
That last one matters. A lot of agents look good when every prompt expects action. Production has plenty of moments where action is the bug.
An agent that confidently changes code when it should ask for clarification is not “proactive.” It is a liability with autocomplete.
Shadow mode is how you catch the weird failures
Shadow traffic means the candidate agent watches the same class of work your current system handles, but it cannot affect users or production state.
For a coding agent, that might mean:
- mirror incoming issues or internal tickets
- let the candidate produce a plan, diff, test command list, and PR description
- do not open the real PR automatically
- compare against the human or current-agent outcome
- score the delta between the candidate's output and the actual outcome, not just the final answer
For a support or ops agent, it might mean:
- replay real conversations with sensitive data stripped or access-scoped
- let the candidate choose the tools it would call
- block writes and external sends
- compare recommendations to actual resolutions
- flag dangerous, expensive, slow, or policy-breaking choices
The point is not to create a fake exam.
The point is to observe the agent under production-shaped pressure before it gets production-shaped power.
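Here is a minimal sketch of what "block writes, record everything" could look like for tool calls. Everything in it is an assumption for illustration: the `ShadowGateway` class, the tool names, and the read-only allowlist are hypothetical, not part of any particular agent SDK.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical allowlist: only tools with no side effects run for real.
READ_ONLY_TOOLS = {"read_file", "search_tickets"}


@dataclass
class ShadowGateway:
    """Wraps tool calls so a shadow agent can read but never write."""

    tools: dict[str, Callable[..., Any]]
    attempted: list[dict] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Any:
        allowed = name in READ_ONLY_TOOLS
        # Record every attempt, including the ones that get blocked.
        self.attempted.append({"tool": name, "args": kwargs, "blocked": not allowed})
        if not allowed:
            # Return a stub instead of executing the side effect.
            return {"status": "shadow_blocked", "tool": name}
        return self.tools[name](**kwargs)


def compare_to_actual(attempted: list[dict], actual_tools: list[str]) -> dict:
    """Diff the candidate's tool choices against what actually happened."""
    candidate = [a["tool"] for a in attempted]
    return {
        "missing": sorted(set(actual_tools) - set(candidate)),
        "extra": sorted(set(candidate) - set(actual_tools)),
        "blocked_writes": [a["tool"] for a in attempted if a["blocked"]],
    }
```

In this sketch, a candidate that tries `open_pr` gets a stub response back, and the attempt still lands in `attempted`, so the comparison step can flag it. An allowlist is a deliberate choice here: a tool nobody classified is treated as a write.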
Score the run, not just the answer
Agent evals that only score final output are missing the damn plot.
The path matters.
Did the agent read the right files? Did it call the right tools? Did it avoid secrets? Did it ask for approval at the right point? Did it stop when permissions were missing? Did it cite current sources? Did it notice a stale artifact? Did it spend twenty dollars to save five minutes? Did it create a beautiful patch with no rollback plan?
A useful shadow eval should track:
- task success
- harmful action attempts
- unnecessary tool calls
- missing tool calls
- latency
- token and tool cost
- human override rate
- stale-source usage
- policy violations
- test or verification quality
- rollback readiness
- confidence calibration
This is where agent observability and evals become the same product surface.
A trace is not just for debugging after the fire. It is evidence for deciding whether the next agent version is allowed near the stove.
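To make that concrete, here is a rough sketch of a per-run record and a roll-up over many runs. The `RunTrace` fields are an illustrative subset of the list above, not a standard schema, and the names are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RunTrace:
    """One shadow run. Field names are hypothetical, chosen for illustration."""

    task_succeeded: bool
    harmful_attempts: int   # e.g. blocked writes or attempted secret reads
    unnecessary_calls: int
    missing_calls: int
    latency_s: float
    cost_usd: float
    human_overrode: bool
    policy_violations: int


def summarize(traces: list[RunTrace]) -> dict[str, float]:
    """Aggregate per-run traces into the rates a promotion review would read."""
    n = len(traces)
    if n == 0:
        raise ValueError("no traces to summarize")
    return {
        "success_rate": sum(t.task_succeeded for t in traces) / n,
        "override_rate": sum(t.human_overrode for t in traces) / n,
        "harmful_attempts_per_run": sum(t.harmful_attempts for t in traces) / n,
        "policy_violations_per_run": sum(t.policy_violations for t in traces) / n,
        "mean_latency_s": sum(t.latency_s for t in traces) / n,
        "mean_cost_usd": sum(t.cost_usd for t in traces) / n,
    }
```

Note that success rate is only one of six numbers. A version that raises it while also raising harmful attempts or cost per run shows up as a trade-off, not a win.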
Promote agents like you promote services
A serious agent rollout should look boringly familiar to anyone who has shipped production software:
- offline tests
- replay tests
- shadow traffic
- limited internal writes
- guarded user-facing actions
- canary rollout
- kill switch
- rollback review
The agent should earn permissions in stages.
First it can suggest.
Then it can draft.
Then it can open a PR behind review.
Then it can handle low-risk classes of work.
Then maybe, after enough clean shadow data and human override history, it gets broader autonomy.
Not before.
“Model got better” is not a deployment plan. “The vendor launched a new agent mode” is not a safety case. “It passed the benchmark” is not permission to touch the money printer.
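One way to encode "earn permissions in stages" is a ladder plus an explicit gate. This is a sketch under stated assumptions: the stage names mirror the steps above, while the thresholds (`MIN_SHADOW_RUNS`, `MAX_OVERRIDE_RATE`) are made-up placeholders you would set from your own data.

```python
from enum import IntEnum


class Stage(IntEnum):
    SUGGEST = 0
    DRAFT = 1
    PR_BEHIND_REVIEW = 2
    LOW_RISK_AUTONOMY = 3
    BROAD_AUTONOMY = 4


# Hypothetical thresholds, not recommendations.
MIN_SHADOW_RUNS = 200
MAX_OVERRIDE_RATE = 0.05


def next_stage(
    current: Stage,
    shadow_runs: int,
    override_rate: float,
    harmful_attempts: int,
) -> Stage:
    """Promote by at most one stage, and only when every gate condition holds."""
    gate_passed = (
        shadow_runs >= MIN_SHADOW_RUNS
        and override_rate <= MAX_OVERRIDE_RATE
        and harmful_attempts == 0
    )
    if not gate_passed or current is Stage.BROAD_AUTONOMY:
        return current
    return Stage(current + 1)
```

The inputs are shadow evidence only. There is no parameter for benchmark score or vendor release notes, which is the point.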
The eval set should keep rotating on purpose
Static eval sets go stale. Agents overfit. Teams learn the shape of the test. Vendors optimize for the scoreboard. Everyone pretends this is fine because the number went up.
For production agents, your eval corpus should keep changing because production keeps changing.
Keep a rolling sample of:
- recently solved tickets
- reverted PRs
- incident follow-ups
- confusing user asks
- flaky tool sessions
- policy-denied attempts
- expensive runs
- stale-source failures
- cases where humans said “nope”
Then replay them against candidate versions.
If a new agent version improves its benchmark score but increases human overrides on your actual ticket stream, trust the override metric, not the marketing slide.
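A rolling corpus and an override comparison can be sketched in a few lines. The `EvalCase` shape, the `source` labels, and the `should_promote` rule are assumptions for illustration. A real gate would likely also want a minimum sample size and some tolerance for noise.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    source: str  # e.g. "reverted_pr", "incident_followup", "human_said_nope"


class RollingCorpus:
    """Keeps only the most recent cases so the eval set tracks production."""

    def __init__(self, max_cases: int) -> None:
        # A deque with maxlen silently drops the oldest case on overflow.
        self._cases = deque(maxlen=max_cases)

    def add(self, case: EvalCase) -> None:
        self._cases.append(case)

    def cases(self) -> list[EvalCase]:
        return list(self._cases)


def should_promote(
    baseline_overrides: dict[str, bool],
    candidate_overrides: dict[str, bool],
) -> bool:
    """Pass only if the candidate's override rate on shared cases is no worse."""
    shared = baseline_overrides.keys() & candidate_overrides.keys()
    if not shared:
        return False  # no overlapping evidence, so no promotion
    base = sum(baseline_overrides[c] for c in shared) / len(shared)
    cand = sum(candidate_overrides[c] for c in shared) / len(shared)
    return cand <= base
```

Each dict maps a case ID to whether a human had to override that version's output on that case. Benchmark score is not an input.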
The bottom line
Agents are crossing from answer machines into work machines.
Work machines need production-shaped evaluation.
Leaderboards can tell you where to look. Shadow traffic tells you whether to ship.
So build the shadow lane:
- mirror real work before granting write power
- score tool behavior, not just output prettiness
- track cost, latency, policy violations, and human overrides
- keep a fresh rolling eval set from actual failures
- promote agent permissions gradually
- make rollback and kill switches part of the eval story
If your agent only wins in a benchmark harness, congratulate it politely.
If it survives shadow traffic with fewer mistakes, lower cost, clearer receipts, and less human cleanup, now we are talking.
Everything else is trophy-board theater.