Two failure modes drive most production agent quality problems. The first is context anxiety: as the context window fills, models rush toward closure, wrapping up prematurely and declaring tasks done while they are not. The second is self-evaluation bias: when you ask an agent to grade its own work, it praises it. Anthropic observed the exact failure in the lab: the model “identified legitimate issues, then talked itself out of them, and approved the work anyway” [1]. These are not prompt engineering problems. They are structural.
The solution Anthropic published in March 2026 is a structural response to a structural problem: separate the creation function from the evaluation function, and put them in permanent tension. This is the adversarial evaluator pattern, and it borrows the intuition directly from machine learning research. As Prithvi Rajasekaran wrote in the Anthropic engineering blog: “Taking inspiration from Generative Adversarial Networks (GANs), I designed a multi-agent structure with a generator and evaluator agent” [1].
The pattern has a production-grade form from Anthropic and a consumer-grade form from independent practitioners that you can wire up this week. And there is a deceptively simple explanation of why it works that comes from a TypeScript educator, not an AI lab.

Figure 1 - The adversarial evaluator pattern: A generator agent produces output; a dedicated evaluator agent tests, grades, and returns structured feedback. The loop runs 5 to 15 rounds until the evaluator approves. The two agents are in structural tension: the generator’s job is to build, the evaluator’s job is to find every reason the build is not good enough. Per Anthropic engineering blog, March 24, 2026.
Part 1: Anthropic’s Production-Grade Pattern
Two Failure Modes, One Structural Fix
Context anxiety and self-evaluation bias share a common root: both are problems with how a single-agent run degrades over time. A solo agent starts a complex task with a full context window and clean attention. As the window fills, performance degrades. The agent becomes less careful, more inclined to declare completion, and less capable of identifying its own errors. By the time it is reviewing its own work, the review happens with a degraded context and a baseline of sunk effort.
Anthropic’s first long-running-agents harness, documented in November 2025, attacked the context-degradation half of this problem with periodic context resets and an initializer-plus-coding-agent split, but it carried no evaluator, so the generator still graded and approved its own work. The March 2026 evaluator pattern is the direct response to that gap.
The generator plus evaluator structure cuts through both failure modes at once. Context anxiety is addressed because each agent runs in its own context. The evaluator starts every review fresh: no accumulated effort, no context degradation, no sunk-cost rationalization. Self-evaluation bias is addressed because the evaluator’s entire purpose is skepticism. Its system prompt is oriented entirely toward finding problems in the work handed to it.
KEY INSIGHT: It is far easier to make a dedicated evaluator agent skeptical than to make a generator agent skeptical about its own output. The structural separation is the fix, not the prompt.
Three Requirements for an Effective Evaluator Agent
The evaluator pattern sounds simple in the abstract. In practice, building an evaluator that actually drives output quality requires meeting three specific requirements that Anthropic documented in the March 2026 engineering post [1].
Requirement 1: Make subjective quality gradable. Open-ended criteria like “is this design good?” cannot be evaluated reliably by a model. The criteria need to be translated into specific, concrete, verifiable rubrics. Anthropic’s front-end design rubric, for example, included four explicit criteria: design quality (coherent visual identity with colors, typography, and layout working together), originality (evidence of custom decisions versus AI slop patterns), craft (hierarchy, spacing, color harmony, contrast), and functionality (does it actually work for what a user needs). These are specific enough for the evaluator to make a defensible call.
The wording of criteria matters more than you might expect. Anthropic found that phrases like “museum quality” in the rubric actively influenced the aesthetic direction of the generator’s outputs in ways the team had not fully anticipated. The criteria do not just grade the output; they steer the generator’s behavior in the next round.
Requirement 2: Weight criteria toward the model’s capabilities. Not all evaluation criteria are equally within the current model’s reach. Anthropic found that Opus 4.5 scored well on two of their four design criteria but struggled on the other two [1]. Rather than treating all criteria as equally weighted, the team weighted the criteria the model did well on more heavily and added explicit penalization for the failure patterns the model exhibited most frequently. This calibration step is a measurement exercise: run the evaluator, observe which criteria produce the most disagreement between evaluator and generator, and adjust weights accordingly.
Requirement 3: Let the evaluator interact with the output. A passive reviewer reading a static screenshot misses an enormous class of bugs. The Anthropic evaluator used the Playwright MCP [3] so it could navigate the running application the way a user would: clicking through UI features, hitting API endpoints, inspecting database states, and identifying specific bugs with code locations and suggested fixes. This shifts the evaluator from passive judge to active tester.

Figure 2 - Three requirements for an effective evaluator: Gradable criteria replace vague quality judgments. Capability-weighted scoring calibrates to what the model can actually reach. Playwright MCP access turns the evaluator from passive grader into active tester. All three are required for the adversarial pattern to drive output quality rather than just add latency.
The Full Three-Agent Architecture
For complete builds rather than isolated front-end evaluation, Anthropic used three agents rather than two. The third agent is a planner, and its addition reflects a subtle problem with a pure generator-evaluator pair: if the generator receives only a brief prompt, it will interpret that prompt narrowly and under-scope the work. The planner addresses this by receiving the brief, expanding it dramatically into a full product specification with features, development phases, and explicit success criteria, and handing that expanded spec to the generator.
The full architecture runs as: planner receives brief and produces spec, generator implements features from spec, evaluator tests outputs and feeds structured results back to generator, generator iterates. The cycle runs until the evaluator approves or a maximum round count is reached. Anthropic saw typical convergence between 5 and 15 rounds depending on task complexity [1].

Figure 3 - The three-agent planner/generator/evaluator architecture: The planner forces full scope expansion before any implementation begins. Without it, the generator interprets the brief narrowly. Without the evaluator, the generator over-approves its own work. The three-agent structure addresses both the under-scoping and the over-approval failure modes.
The DAW Build: Verified Cost and Time Numbers
The March 2026 engineering post includes a detailed cost breakdown for a demanding test case: building a fully featured DAW (Digital Audio Workstation) in the browser using the Web Audio API, using the Opus 4.6 era harness [1]. The numbers are unusually precise and worth showing in full.
| Agent and Phase | Duration | Cost |
|---|---|---|
| Planner | 4.7 minutes | $0.46 |
| Build Round 1 | 2 hours 7 minutes | $71.08 |
| QA Round 1 | 8.8 minutes | $3.24 |
| Build Round 2 | 1 hour 2 minutes | $36.89 |
| QA Round 2 | 6.8 minutes | $3.09 |
| Build Round 3 | 10.9 minutes | $5.88 |
| QA Round 3 | 9.6 minutes | $4.06 |
| Total | 3 hours 50 minutes | $124.70 |
The planner cost $0.46 and ran for under five minutes. The three QA rounds combined cost $10.39. The three build rounds consumed $113.85 and nearly all the wall-clock time. Quality assurance via the evaluator is cheap compared to generation; the generator is where work happens.
For comparison, an earlier experiment used a solo Sonnet 4.5 agent to build a 2D retro game. That run spent $9 over 20 minutes and produced a game where “entities were on screen but did not respond to input” [1]. The full three-agent harness spent $200 over 6 hours on the same task and produced a functional, playable game [1]. The 22x cost premium purchased a result that actually worked.

Figure 4 - DAW build cost and time breakdown: Three build rounds accounted for $113.85 of the $124.70 total. Three QA rounds cost $10.39 combined. The planner at $0.46 forced full scope before implementation. This is the cost profile of a harness that works: generation is expensive, evaluation is cheap, and the planner is nearly free. Per Anthropic engineering blog, March 24, 2026 [1].
KEY INSIGHT: The QA rounds in this build cost $10.39 combined against $113.85 in build costs. Adding structured adversarial evaluation to a 3h 50min agentic run cost roughly 9 cents on the dollar. If those evaluator rounds prevented even one significant round of rework, they paid for themselves several times over.
Part 2: The Consumer-Grade Version You Can Wire Up Today
The Builder and Verifier Agents Over a Unix Socket
Anthropic’s architecture requires the full three-agent setup and integrates Playwright MCP, which adds tooling overhead. IndyDevDan (Dan Disler) published a lighter-weight implementation of the same structural principle that runs on the Pi agent platform: the verifier-agent [4].
The architecture is a two-agent system running over a Unix domain socket. The builder agent owns a Unix domain socket at /tmp/pi-verifier/<sessionId>.sock. The verifier agent connects over this socket, listens for builder lifecycle ticks, and pulls from the builder’s session JSONL. When the builder completes its work and signals completion, the verifier automatically kicks off without any human prompt [5].
The verifier’s operating principle is decomposition rather than holistic judgment. The README’s description is precise: the verifier’s job is to “break every claim into the smallest atomic unit that can be independently proven or disproven” [4]. A claim like “the authentication flow works correctly” is broken into: does the login form accept valid credentials, does it reject invalid credentials, does it handle expired sessions, does it redirect correctly after success. Each sub-claim is tested independently. Holistic judgment passes things that sub-claim testing catches.

Figure 5 - Builder and verifier agents over a Unix domain socket: The builder signals completion via lifecycle ticks on a socket it owns. The verifier listens, pulls the session JSONL, decomposes the builder’s claims into atomic units, and verifies each independently. No human prompt required between build and verify. From IndyDevDan’s verifier-agent repo on GitHub [4].
The Pi platform itself is intentionally minimal by design. It ships without built-in MCP support, sub-agents, permission prompts, or plan mode. These are extension-layer concerns. The verifier-agent pattern is an IndyDevDan extension, not a built-in Pi capability, which is exactly the right place for opinionated workflow patterns to live.
The Six-Stage Workflow Extension
The verifier-agent addresses the build-then-verify loop. IndyDevDan’s Pi Free Course demonstrates a more complete version that encodes the entire development workflow as a deterministic state machine inside the agent session.
The six stages, built live in the course in approximately 20 minutes [8]:
- Read spec: agent reads a specification file
- Write code: agent implements to spec
- Run tests: deterministic test execution as code, not as a prompt
- Code review: agent reviews with a fresh context window (context is flushed by the extension before this stage)
- Fix issues: agent addresses the review findings
- Verify: tests rerun; all must pass before the extension signals completion
The fresh context at stage 4 is architecturally significant and connects directly to the adversarial evaluator principle. The extension requests a context flush before the code review stage begins. The reviewer starts the review with zero accumulated context from the implementation. It sees the code the way a new engineer would. It cannot rationalize away errors that made sense during implementation.
The state machine advances between stages via the built-in workflow.next tool. As demonstrated in the Pi Free Course: “we are calling the workflow.next. And so, this is a built-in tool where the agent can kind of manage its own state throughout the workflow” [8]. The agent’s position in the pipeline is held by the extension, not by the model’s memory. The model calls workflow.next to advance; the harness holds the phase list. The model cannot lose its place.
Note: workflow.next is demonstrated in the Pi Free Course at approximately the 16:58 mark and is not documented by that name in the official pi.dev package docs as of June 2026 [8].

Figure 6 - The six-stage Pi workflow extension: Each stage transitions via workflow.next, which is held by the extension harness rather than the model’s memory. The critical architectural feature is the context flush before stage 4 (code review). The reviewer starts fresh. From the IndyDevDan Pi Coding Agent Free Course, May 4, 2026 [8].
KEY INSIGHT: Encoding the workflow as a deterministic state machine in an extension harness is not about automation for its own sake. It eliminates the failure mode where the model “loses its place” in a multi-step process, which is a real failure mode in purely prompt-based workflows with no external state.
Part 3: Why It Works
Pocock’s “Smart Zone” Explanation
The Anthropic post and the IndyDevDan demos establish the what and the how. A TypeScript educator named Matt Pocock, best known for his software fundamentals keynote at AI Engineer Europe 2026, provides the clearest explanation of the why.
In a workshop video published April 24 2026, Pocock is walking through his full AI coding workflow when he arrives at the code review step. He makes an observation that cuts through the entire architectural rationale:
“if you get it to sort of try to do its reviewing, it’s going to be doing the reviewing in the dumb zone. And so, the reviewer will be dumber than the thing that actually implemented it… Whereas if you clear the context, then you’re essentially going to be able to just review in the smart zone, which is where you want to be” [10].
The “smart zone” and “dumb zone” refer to the quality regions of a context window. Early in a session, the model has full attention capacity: it reads carefully, reasons clearly, and catches errors. This is the smart zone. As the session accumulates tokens, quality degrades. Attention spreads thin, errors slip past, and the model rationalizes. This is the dumb zone.
When an agent implements a feature and then immediately reviews its own implementation in the same session, it is doing that review in the dumb zone. It has accumulated all the context from implementation, the review stage is operating with degraded attention, and the shared context creates a bias toward approving the work it just produced. The reviewer “will be dumber than the thing that actually implemented it” [10].
Clear the context. Start the review session fresh. Now the reviewer is operating in the smart zone, with full attention, no accumulated implementation bias, and no sunk-cost rationalization to rationalize past.

Figure 7 - Smart zone versus dumb zone in context windows: A same-context reviewer operates in the degraded dumb zone, after all the implementation tokens have filled the window. A fresh-context reviewer starts at the beginning of its context window: full attention, no implementation bias, no sunk-cost rationalization. Per Matt Pocock’s workshop, April 24, 2026 [10].
What makes this observation significant beyond Anthropic’s work is the source. Pocock arrived at the fresh-context reviewer principle independently, without the GAN framing and without the Anthropic engineering context. He was building his own AI coding workflow, noticed that same-context self-review degraded output quality, and his fix was identical: clear the context, start the review fresh.
The context window has a real quality distribution. Clearing it before review reflects the actual mechanics of how model attention works, not a preference or style choice.
KEY INSIGHT: Pocock’s “smart zone” framing explains why the adversarial evaluator pattern works even when the “evaluator” is just a fresh context reviewing the same model’s work. The structural separation produces better results because the context is fresh, not because the evaluator is smarter.

Figure 8 - Same-context versus fresh-context review: Same-context review is reviewing in the dumb zone. The reviewer inherits all the accumulated context from implementation and operates with degraded attention. Fresh-context review starts in the smart zone. Pocock’s observation from his own coding workflow converges on the same structural fix Anthropic encoded in the adversarial evaluator.
The Convergence Signal
Three independent practitioners converged on the same structural principle from different angles. Anthropic arrived there through empirical study of harness failures. IndyDevDan arrived there through building production Pi agent workflows: the verifier-agent and the fresh-context code review stage both implement context separation as the quality mechanism. Pocock arrived there through his own AI coding workflow and named the underlying mechanism: smart zone versus dumb zone.
None of these sources cites the others on this point. The convergence is independent, which is about as strong an empirical signal as practitioners get for a structural pattern.
Conclusion
The adversarial evaluator pattern is a solution to two structural failure modes in long-running agent tasks: context anxiety, which drives premature completion, and self-evaluation bias, which drives false approval. The fix is structural separation between generation and evaluation.
Anthropic’s production implementation uses a planner, generator, and evaluator with the Playwright MCP giving the evaluator active testing access to the running application. The three evaluator requirements are: make subjective criteria gradable, weight criteria toward what the model can actually reach, and let the evaluator interact rather than just observe. The DAW build case study shows the cost profile in production: $10.39 for three QA rounds against $113.85 for three build rounds, for a total of $124.70 over 3 hours 50 minutes [1].
The consumer-grade version runs on Pi with IndyDevDan’s verifier-agent handling the builder-to-verifier handoff over a Unix domain socket, and the six-stage workflow extension encoding the full spec-to-verify pipeline as a deterministic state machine where the model cannot lose its place [4][8].
The intuition behind both implementations is the same insight Pocock articulated: same-context self-review happens in the dumb zone. Fresh-context review happens in the smart zone. Clearing the context before the review stage is not overhead. It is the mechanism by which the adversarial structure actually improves quality.
For the broader harness engineering context, the companion series covers the nine harness components [11], the three eras of AI engineering [12], and the pruning principle for mature harnesses [13].
The next article in this series looks at the other side of the quality problem: what happens when the evaluation criteria are not subjective at all, and deterministic test suites replace adversarial evaluation entirely.
The Series
This is Part 2 of the three-part Production Reliability sub-series:
- The March of Nines: Why Agent Skills Alone Won’t Reach Production Reliability
- The Adversarial Evaluator: GAN-Inspired Harness Architecture from Anthropic (this article)
- Stripe Minions and the Hybrid Secret: Deterministic Rails Around AI
References
[1] P. Rajasekaran, “Harness design for long-running application development,” Anthropic Engineering Blog, March 24, 2026. https://www.anthropic.com/engineering/harness-design-long-running-apps
[2] J. Young, “Effective harnesses for long-running agents,” Anthropic Engineering Blog, November 26, 2025. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
[3] Microsoft, “playwright-mcp: A Model Context Protocol server that provides browser automation capabilities using Playwright,” GitHub, Apache-2.0, 2025-2026. https://github.com/microsoft/playwright-mcp
[4] IndyDevDan (Dan Disler), “the-verifier-agent,” GitHub, 2026. https://github.com/disler/the-verifier-agent
[5] IndyDevDan, “GPT-5.5 VERIFIED, Opus 4.7: A Pi Coding Agent That REVIEWS Like YOU,” YouTube, 2026. https://www.youtube.com/watch?v=EnXKysJNz_8
[6] M. Zechner, “What I learned building an opinionated and minimal coding agent,” mariozechner.at, November 30, 2025. https://mariozechner.at/posts/2025-11-30-pi-coding-agent/
[7] earendil-works, “Pi: AI agent toolkit (coding agent CLI, unified LLM API, TUI and web UI libraries),” GitHub, 2025-2026. https://github.com/earendil-works/pi
[8] IndyDevDan, “Pi Coding Agent (Free Course) (2026),” YouTube, May 4, 2026. https://www.youtube.com/watch?v=6T46BslVzAc
[9] M. Pocock, “Software Fundamentals Matter More Than Ever,” AI Engineer Europe 2026 keynote, April 2026. https://www.youtube.com/watch?v=v4F1gFy-hqg
[10] M. Pocock, “Full Walkthrough: Workflow for AI Coding,” workshop video, April 24, 2026. https://www.youtube.com/watch?v=-QFHIoCo-Ko
[11] Dotzlaw Consulting, “What Is an Agent Harness, Really? Nine Components Most Builders Miss,” dotzlaw.com, June 2, 2026. https://dotzlaw.com/insights/claude-code-01-agent-harness/
[12] Dotzlaw Consulting, “Three Eras of AI Engineering: Prompt, Context, and Harness,” dotzlaw.com, June 3, 2026. https://dotzlaw.com/insights/claude-code-02-three-eras/
[13] Dotzlaw Consulting, “The Harness Evolution Principle: Why Mature Harnesses Look Like Pruning,” dotzlaw.com, June 4, 2026. https://dotzlaw.com/insights/claude-code-05-harness-evolution/
Building production AI, or modernizing a legacy system?
That is the kind of work we do at Dotzlaw Consulting. Book a free 20-minute intro call and tell us what you are trying to build, or what is slowing you down.