Agent output verification

One-line summary: Agents that can see their own output — by running tests, starting a server, or using a browser — produce dramatically better results than agents working blind. Giving the agent a verification loop is one of the three biggest levers on Claude Code performance.

The insight

Intelligence alone is not enough. An engineer who can't run code or see a browser is a bad engineer no matter how smart they are. The same is true for an agent. The quality ceiling of an agentic coding session is set not by the model but by what the agent can observe about the result of its actions.

Evidence

From 2026-04-21-boris-claude-techniques, Boris names this as tip #13, and ranks it as one of three things he recommends "almost every time" someone asks how to get better performance from Claude Code:

  1. Use claude-opus-4-5 with thinking, always.
  2. Maintain a good claude-md-team-knowledge-base.
  3. Give Claude a way to verify its output.

Key quotes:

"If I'm building an app, I always use the Chrome extension to have Claude test its own work. And if Claude can verify its own output, the result is going to be way, way better."

"Imagine you're a painter … and you have to wear a blindfold. You're just not going to be that good … Same thing for an engineer. If you have to write code but you can never run the code, or you can never see the output, or you can never see the website — it's just not going to be good."

"As the model gets more intelligent, that first shot is going to get better and better. But really, you want to give it a way to verify the output."

Example verification loops Boris names explicitly:

  • Running tests (for engineering work).
  • Starting a server (for engineering work).
  • Seeing output in a simulator or browser (for engineering or frontend work).
  • The Chrome extension — has Claude test its own work by driving the browser directly.
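
As a rough illustration of the first loop (running tests), here is a minimal Python sketch of a check whose result an agent could read back. The names `CheckResult` and `run_check` are hypothetical and not from the source; only the standard-library `subprocess` module is used. The point is that the agent gets both a pass/fail signal and the raw output, rather than working blind.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CheckResult:
    """What the agent gets to observe after acting (hypothetical type)."""
    passed: bool
    output: str


def run_check(cmd: Sequence[str], timeout: float = 60.0) -> CheckResult:
    """Run one verification command and capture its exit status and output."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout
        )
    except subprocess.TimeoutExpired:
        return CheckResult(False, f"timed out after {timeout}s")
    except FileNotFoundError:
        return CheckResult(False, f"command not found: {cmd[0]}")
    # A zero exit code is treated as "verified"; stdout and stderr are both
    # kept so that failure details can be fed back to the agent.
    return CheckResult(proc.returncode == 0, proc.stdout + proc.stderr)
```

Assuming the project uses pytest, a call might look like `run_check([sys.executable, "-m", "pytest", "-q"])`. Any command that exits non-zero on failure would work the same way.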

Design implications

  • Every agent task should be scoped to include a verification step. A task like "fix this bug" should be expanded to "fix this bug and demonstrate the fix runs."
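  • Invest in the agent's tools before investing in the model. A weaker model with a browser may well outperform a stronger model without one. This is an inference from the weight Boris gives verification, not a claim he makes directly; his own list puts model choice first.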
  • The Chrome extension is load-bearing for frontend/app work. It is the UI counterpart of "run the tests": the feedback loop that lets the agent see what it built.
  • This couples tightly to plan-then-execute-coding. A good plan identifies what the verification step will be before execution starts.
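
One way to picture the first and last implications together is the Python sketch below: the task carries its verification step from planning time, and each failed check is handed to the next attempt as feedback. It is only an illustration under stated assumptions. `Task`, `run_task`, and the `act` callback (a stand-in for whatever actually drives the agent) are hypothetical names, not anything described in the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Task:
    """A task scoped to include its own verification step (hypothetical)."""
    goal: str
    # Chosen during planning, before execution starts.
    # Returns (passed, observed_output).
    verify: Callable[[], Tuple[bool, str]]


def run_task(
    task: Task,
    act: Callable[[str, Optional[str]], None],
    max_attempts: int = 3,
) -> bool:
    """Alternate acting and verifying until the check passes or attempts run out."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        # The agent acts on the goal plus whatever the last check showed.
        act(task.goal, feedback)
        passed, output = task.verify()
        if passed:
            return True
        # On failure, the observed output becomes input to the next attempt.
        feedback = output
    return False
```

Under this framing, "fix this bug" becomes a `Task` whose `verify` runs the failing test, so the fix is only reported as done once the check has actually passed.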

Contradictions / tensions

  • None surfaced in source. Boris presents verification as straightforwardly necessary.

Open questions

  • What does verification look like for agent work outside coding (research, writing, analysis)? Boris's examples are all engineering-flavored.
  • How much does "self-verification" actually catch? An agent that writes both the code and the test can still fool itself; at what point do you need an independent verifier (a second agent, a human)?
