Agent PRs Need Flight Recorders, Not Just Green Checks

Diagram showing an agent pull request flight recorder connecting plan, tool calls, tests, diff, human approval, and a receipt log

The next annoying phrase in software is going to be “the agent opened a PR.”

Good.

That means coding agents are leaving toy chat windows and entering the same workflow humans use: branches, diffs, tests, reviews, comments, and merges.

But if your review process only looks at the final patch, you are reviewing the corpse, not the crime scene.

A serious agent pull request needs a flight recorder: what it was asked to do, what it planned, what files it read, what tools it called, what tests it ran, what failed, what it ignored, what secrets or credentials were nearby, and where a human approved the sharp parts.

Source freshness check: this post was checked on 2026-06-17. GitHub’s current Copilot cloud agent documentation says the agent can research a repository, create an implementation plan, make changes on a branch, and let humans review the diff before a pull request. The active agent tooling repos also show this is moving now, not stale 2024 hype: OpenAI Codex was pushed on 2026-06-17, Claude Code on 2026-06-16, Gemini CLI on 2026-06-16, GitHub’s MCP server on 2026-06-16, and Playwright MCP on 2026-06-10. The concrete direction is fresh: agents are becoming repo-aware workers with tools, branches, and review surfaces.

The diff is not the whole story

Human code review already misses context.

A reviewer sees the patch, maybe the ticket, maybe a test run, maybe a tired comment like “fix auth edge case.” They do not see every false start, every command, every doc page, every flaky test, every assumption, or every file the author skimmed before deciding what to touch.

With humans, we tolerate some of that because humans carry accountability. You can ask them why. You can learn their habits. You can build trust over time.

Agents are different.

An agent can generate a clean-looking diff after wandering through a repo, misreading a convention, calling tools with broad authority, skipping a failing test, or cargo-culting a pattern from the wrong directory.

The final patch may be tidy.

The process may be cursed.

Confused John Travolta reaction GIF representing a reviewer staring at an agent-created pull request without knowing what the agent actually did

Agent review needs receipts

A useful agent PR should include more than generated code.

It should include receipts a reviewer can scan quickly:

Task input — the original prompt, issue, or delegated request.
Plan — the agent’s intended steps before editing.
Scope touched — files read, files changed, commands run, network calls made.
Tool calls — exact tool names, arguments, outputs, errors, retries, and approvals.
Assumptions — what the agent inferred instead of knowing.
Test evidence — commands run, pass/fail results, skipped tests, and why.
Risk notes — migrations, auth logic, permissions, data handling, dependency changes.
Human gates — who approved destructive tools, deploys, external writes, or broad refactors.

Not a 40-page novel. A useful black box.

If an aircraft has a cockpit recorder because crashes are expensive, an agent changing production code deserves a damn JSONL file.

Green checks can lie by omission

CI tells you whether selected commands passed.

It does not tell you whether the agent selected the right commands.

That distinction matters.

An agent can run the cheap unit test and skip the integration test that actually covers the change. It can update a snapshot without understanding why it changed. It can avoid a slow command because the prompt budget is running hot. It can make a risky auth change and still pass lint like a smug little gremlin.

Green checks are necessary.

They are not sufficient.

The review question shifts from “did CI pass?” to:

did the agent inspect the right part of the system?
did it choose appropriate tests for the risk?
did it notice failing tests or just route around them?
did it modify generated files directly?
did it call tools that can mutate external state?
did it use credentials, browser sessions, or local files?
did it explain uncertainty, or just sound confident?

That is process review, not just patch review.

The plan belongs in the PR

Agent plans are not decorative.

They are reviewable artifacts.

A plan tells you what the agent thought the task was. If the plan is wrong, the diff may be wrong even when the code compiles. If the plan changed halfway through, the reviewer should see that. If the agent discovered a bigger problem and narrowed scope, that is useful. If it ignored the risky part of the ticket and fixed the easy part, also useful.

The plan should live near the pull request because that is where the accountability happens.

Not buried in a vendor console. Not lost in a local terminal scrollback. Not summarized as “implemented requested changes.”

Put the working memory where the reviewer can challenge it.

Tool calls are security-relevant

Once agents can call MCP servers, browser tools, repo tools, shell commands, cloud APIs, package managers, and deployment systems, the tool log becomes part of the security story.

A PR that changes one file may still have involved reading twenty secrets-adjacent configs, opening a browser session, installing packages, or asking an MCP server for repository metadata.

That does not mean the agent did something wrong.

It means the reviewer needs to know what authority was exercised.

Treat tool calls like mini audit events:

what capability was invoked?
what input did the model provide?
what output came back?
was the action read-only or mutating?
did policy allow it automatically?
did a human approve it?
can the team reproduce or revoke it later?

If that sounds like boring compliance plumbing, congratulations, you found the thing that keeps the magic from setting the curtains on fire.

This Is Fine dog meme GIF representing a team merging an agent pull request while ignoring missing process receipts

Build the agent PR template now

Do not wait for the first weird incident.

Add an agent-specific PR template before agents become normal in your repo:

## Agent session
- agent/client:
- model:
- task source:
- session receipt:

## Plan summary
- goal:
- non-goals:
- assumptions:

## Tool and data access
- files read:
- commands run:
- external tools/services:
- credentials/scopes used:
- human approvals:

## Verification
- tests run:
- tests skipped:
- manual checks:
- known risks:

Then make your agent fill it automatically.

If the agent cannot fill it, that is not a paperwork problem. That is an observability problem.

Reviewers need a different muscle

Agent PR review is not about distrusting every generated line.

That gets old fast and does not scale.

The better posture is structured skepticism:

read the task and plan first
check the diff against the plan
inspect high-risk files manually
verify the test selection matches the change
scan tool calls for surprising authority
ask the agent to explain uncertainty, not just summarize success
require humans for destructive actions and production-facing changes

The agent can do grunt work. The human should own judgment.

That is the bargain.

The winning devtools will make receipts boring

This should not require every team to duct-tape shell history into PR comments.

The agent platforms and devtools that win will make receipts automatic:

session timelines
plan diffs
file-read summaries
tool-call logs
approval records
test evidence
policy decisions
reproducible environments
compact PR summaries linked to full traces

The best version is not noisy. It is layered.

Reviewers get a clean summary. Security gets the full log. Platform teams get policy hooks. Developers get enough context to trust the work without spelunking through a haunted transcript.

If the agent changed code, show the flight path

Coding agents are going to keep moving deeper into the development workflow because the value is real. They can research, plan, edit, test, and hand off work faster than a human wants to type boilerplate.

Fine.

But once an agent opens a pull request, it is no longer “just chat.” It is part of the software supply chain.

So stop treating the final diff as the whole artifact.

The artifact is the diff plus the path that produced it.

Plans. Tool calls. Tests. Approvals. Receipts.

That is how agent PRs become reviewable instead of mystical.

Ship the flight recorder before the crash report writes itself.