Same model. Same benchmark tasks. 45.9% with minimal scaffolding [1][4]. 55.4% with the Claude Code harness [1][4]. That 9.5-point gap has nothing to do with the model. It is entirely scaffolding. If you are optimizing your AI engineering work by chasing the next model release, you are optimizing the wrong thing.
In this article, we’ll cover the three eras of AI engineering, why 2026 is the year the discipline moved decisively into the third one, and what that means for every developer building on top of language models.

Figure 1 - Three eras, four years: The discipline of building useful AI systems passed through three distinct phases in roughly four years. Each era absorbed the prior one and added a layer the model could not provide on its own. By 2026, what you tune is the orchestration code, not the model weights.
Era 1: Prompt Engineering (2022-2024)
The first era was about the single best output from a single call.
The model was treated as a complete system. The engineer’s job was to craft the input so that the one-shot output was as useful as possible. If the output was bad, the prompt was wrong. Fix the prompt, get a better answer. The mental model: the model is the product. Your entire competitive surface was the quality of your instructions.
Characteristic techniques from this era read like a field guide to a different discipline: chain-of-thought reasoning, where you told the model to show its work before answering. Few-shot prompting, where you demonstrated the output format you wanted with two or three examples in the prompt itself. Persona prompting, where you told the model it was a senior engineer or a careful editor and it actually behaved differently as a result. System prompt design, where you stacked instructions at the top of the context to constrain the output shape.
Characteristic tools: the OpenAI Playground and the original ChatGPT web interface. You typed into a box and got back text. The interaction was stateless. The model answered and stopped.
That last sentence is where Era 1 ran into its ceiling. The model could not take action. It could not observe whether its answer had worked. It had no memory of what it had said two minutes earlier in a different conversation. You could write the most elegant chain-of-thought prompt in history and the model would still walk away from the problem the moment it finished generating an answer.
For single-question, single-output tasks, prompt engineering is still the right tool. But for anything requiring multiple steps, state across calls, or interaction with the outside world, the paradigm had to change.

Figure 2 - Era 1 mental model, prompt in, answer out: Prompt engineering treated the model as the complete system. The engineer’s entire leverage was the quality of the input. The model answered once and stopped, which worked well for single-shot tasks and hit a hard ceiling everywhere else.
Era 2: Context Engineering (2024-2025)
The second era was about fitting the right information into the context window.
The model got tools, which changed things considerably. Function calling meant the model could reach out and retrieve a document, run a database query, or search a codebase. The raw capability was there. But capability without structure produced chaos: models that called tools in loops without making progress, retrieved irrelevant documents, burned through context window tokens on noise, and ran out of room before they ran out of work.
The engineer’s job shifted from “craft the input” to “curate the context.” The context window is the product. If you could decide precisely what information went in, in what order, with what framing, the model could handle a much larger range of work than a one-shot prompt ever could.
Retrieval-augmented generation (RAG) was the signature technique of this era. Rather than putting an entire knowledge base into the context, you retrieved only the relevant chunks at query time. Vector stores indexed embeddings so that semantic search could surface the right documents for the question at hand. Conversation memory systems tracked what the model had said and done across a session so it was not starting from scratch each turn.
Characteristic tools: LangChain v1, LlamaIndex, the first generation of retrieval pipelines. You could build agents with these tools in the sense that you could give a model a loop and some tools. Many teams did exactly that and shipped things that worked in demos and fell apart in production.
The ceiling of Era 2 was a single context window. A single agent session has finite capacity. Large tasks, long-horizon tasks, tasks that require coordinating across multiple domains or multiple specialized capabilities, they exceed what one agent in one context window can handle reliably. You could extend the window and add better retrieval, but you were still fundamentally managing a single agent and a single context. The shape of the problem had outgrown the shape of the tool.
KEY INSIGHT: Context engineering is not just “put more in the prompt.” It is an entire discipline of deciding what information to retrieve, when, in what order, and what to exclude when the window is full. That discipline is still necessary in Era 3; it just lives inside the harness rather than being the whole job.

Figure 3 - Era 2 mental model, curate the context window: Context engineering shifted the engineer’s job to deciding what goes into the context window. RAG pipelines, vector stores, and token budgeting all served this goal. The ceiling was the single agent session: tasks that exceeded one context window had no graceful solution.
Era 3: Harness Engineering (2025-present)
The third era is about the orchestration code that wraps the model.
The model is no longer treated as the complete system, and the context window is no longer the whole product surface. Both are components inside a larger structure: the harness. A harness is the fixed architecture that turns a language model into an agent. It handles everything outside the model weights: system prompt assembly, tool definitions, orchestration logic, memory management across sessions, verification loops, and safety guardrails.
The mental model: the orchestration around the model is the product.
A harness is not a framework you configure. It is a working agent you can point at a task. Claude Code, Cursor, Codex CLI, these are harnesses. LangChain and LangGraph are frameworks: you wire the pieces together yourself. The distinction matters because first-party harnesses (Claude Code, Codex CLI) are post-trained in the context of their own tool formats, which means they inherit a performance head start that no third-party wrapper can replicate.
What goes inside a harness? Nine components cover the full set: the while loop that drives iteration, context management and compaction, the tool and skill registry, sub-agent management for parallel or large-horizon work, built-in skills, session persistence for crash recovery, system prompt assembly, lifecycle hooks for pre- and post-tool injection, and the permissions layer. Any production agent uses all nine; the design decisions are which ones to build leanly and which to leave to the platform.
The Era 3 shift that catches most developers off guard is the pruning principle. In Era 2, building meant adding: more context, more retrieval, more structure. In Era 3, mature harness work looks as much like subtraction as addition. Anthropic’s own engineering blog states the principle directly: “Every harness component encodes assumptions about model limitations, and those assumptions merit stress testing since they can quickly become outdated as models improve.” [2]
The practical translation: every component you add to compensate for something the model cannot do has a half-life. When the model gets better, that component becomes dead weight. Anthropic dropped context resets from their reference harness when Opus 4.6 arrived. Opus 4.5 showed context anxiety as the window filled, which required sprint-based context resets between sub-sessions. Opus 4.6 does not. The context reset component, which took real engineering effort to build correctly, was removed entirely because the model no longer needed compensating.
Manus rewrote their harness five times in six months as model capabilities shifted. Vercel removed 80% of an agent’s tools and got better results. More structure is not always better. The question in Era 3 is not “what can we add” but “what can we now remove.”
One particularly useful caution against the reflex to add: the natural-language agent harnesses paper ran module ablations across their NLAH system on two benchmarks [6]. The verifier module, the component that checks whether an agent’s output actually meets the spec before accepting it, is the classic “add structure” instinct. On SWE-Bench Verified, the verifier cost 0.8 percentage points. On OS World, it cost 8.4 points [6]. The verifier was actively hurting performance because the local acceptance layer diverged from the actual benchmark evaluator’s behavior. The lesson is not that verifiers are bad; it is that adding a component without measuring its effect on your specific workload is how you build harnesses that perform beautifully in theory and poorly in production.
KEY INSIGHT: Every harness component encodes an assumption about what the model cannot do. When you add a component, write down the assumption it compensates for. When the model improves, that assumption is the first thing to test for expiry.

Figure 4 - Era 3 mental model, the harness wraps the model: Harness engineering moves the engineer’s leverage to the orchestration code surrounding the model. Sub-agent sessions handle work that exceeds one context window. The harness coordinates them, manages state handoffs, and enforces pruning discipline as model capabilities improve.
The Evidence: Four Data Points
The thesis is not just conceptual. Four measured results from 2026 make the argument in numbers.
SWE-bench Pro: Harness Isolation
SWE-bench Pro forces every model into the same minimal scaffold: a batch-only tool, a fixed turn limit, and no custom optimization. This isolates raw model capability from scaffolding craft, which the original SWE-bench Verified cannot do because every agent brings its own harness to that benchmark.
The Anthropic engineering blog (March 24, 2026) and the Cursor blog reported the same numbers, corroborating independently [1][4]: Opus 4.5 on SWE-bench Pro scores 45.9% under minimal scaffolding. Under the Cursor harness, that same model scores 50.2% [1][4]. Under the Claude Code harness, 55.4% [1][4].
That is a 9.5-point spread, entirely from scaffolding choice, same model, same tasks. The model did not change. The orchestration code around it did.

Figure 5 - Opus 4.5 on SWE-bench Pro, three harnesses: The same model, the same benchmark tasks, and three different scaffolding choices produce a 9.5-point spread. Per the Anthropic engineering blog (March 24, 2026) [1] and the Cursor blog [4], the gain from minimal scaffolding to Claude Code’s harness is driven entirely by orchestration differences. The model is a constant.
LangChain Deep Agents: Rank Climb on Terminal Bench 2
LangChain’s engineering team published results in February 2026 showing their Deep Agents CLI went from outside the Top 30 to Top 5 on Terminal Bench 2.0 [5]. The score improved from 52.8% to 66.5%, a gain of 13.7 percentage points [5]. The model was fixed at gpt-5.2-codex throughout. No model swap. No fine-tuning. Only the harness changed.
Three harness additions drove the result. First, PreCompletionChecklistMiddleware: before the agent exits, this intercepts it and runs a verification pass against the task spec. Second, LocalContextMiddleware: this maps directory structures, available tools, and coding best practices upfront, eliminating wasted planning cycles at the start of each session. Third, LoopDetectionMiddleware: this detects repeated edits to the same file and breaks the loop. The LangChain team paired these middlewares with a reasoning sandwich pattern: allocate extended reasoning to planning and verification phases, standard reasoning to implementation.
The result is worth sitting with. A reasoning model, one of the best coding models available in early 2026, moved from outside the Top 30 to Top 5 on a competitive benchmark with no model change at all. That is the harness era in a single result.
NLAH: OS Symphony Migration
The natural-language agent harnesses paper gives the most striking numbers for what happens when you rewrite the harness representation itself [6].
OS Symphony was a native code harness for desktop automation. The NLAH team migrated it to a natural-language harness representation: same strategy, different encoding. No model change. Table 5 from the paper shows the results from this migration: task completion rate went from 30.4% to 47.2%, a gain of 16.8 points [6]. Runtime went from 361.5 minutes to 140.8 minutes, a 2.6x reduction. LLM calls dropped from 1,200 to 34, a 35x reduction [6].
The mechanism behind the LLM call reduction is notable. The native code harness required brittle GUI repair loops: when an interface element moved or changed, the harness called the model repeatedly to recover. The natural-language harness replaced those loops with durable runtime state and artifact-backed completion criteria. The model calls dropped because the harness stopped asking the model to compensate for its own brittleness.
Meta-Harness: Transfer Across Models
The Meta-Harness paper tests an automated harness discovery system on Terminal Bench 2 [7]. The most striking result is the transfer experiment.
On the benchmark, the Meta-Harness system running Opus 4.6 achieved 76.4% and ranked second overall [7]. More interesting is the Haiku 4.5 result: 37.6%, the top-ranked Haiku-class agent on the benchmark [7]. A smaller, cheaper model class won its bracket through harness craft alone, while the same approach run on Opus 4.6 reached 76.4% and took the #2 slot among Opus-class agents [7].
The transfer result extends this further. The team took the harness discovered and optimized on one model and applied it to five held-out models that had not participated in discovery: GPT-5.4-nano, GPT-5.4-mini, GPT-OSS-20B, and at least one Gemini variant. All five improved [7]. The harness is a transferable asset. When you invest in harness craft, you are not optimizing for one model. You are building something that travels.
On a text classification task, the Meta-Harness approach gained 7.7 points while using 4x fewer tokens [7]. The efficiency side of the harness argument does not get enough attention: a well-designed harness does not just perform better, it performs better while calling the model less. Era 2 instincts say “more calls, more tokens, more context.” Era 3 evidence says the opposite.
KEY INSIGHT: A harness discovered on one model and transferred to five held-out models improved all five. The transferable asset is not model access; it is the orchestration craft.

Figure 6 - Four data points, model fixed throughout: LangChain’s +13.7-point rank climb [5], the OS Symphony migration’s +16.8-point completion gain and 35x call reduction [6], and the Meta-Harness Haiku result topping the Haiku class while the same approach took Opus 4.6 to 76.4% as the #2 Opus-class agent [7] are all harness-only results. The model is a constant in every case.
The Practitioner Takeaway
The Cursor engineering blog (May 2026) makes the case directly [4]: your moat as a builder is not model access, because everyone has access to the same models. The moat is harness craft, which Cursor frames as the orchestration logic, the context strategy, the error handling, and the model-specific tool format choices.
In 2025, competitive advantage in AI engineering came from getting early access to the best model, building quickly on top of it, and shipping before the next release cycle disrupted your stack. That race is over. Model access is essentially commoditized. Every serious builder has access to roughly the same frontier capabilities through the same APIs.
In 2026, the question is not “which model is best.” It is “which harness around this model is best.”
The Cursor and Anthropic convergence on this point is not coincidental. Both teams spent 2025 and early 2026 doing the same thing: measuring which harness engineering decisions produced the SWE-bench Pro spread, identifying the tool format choices and verification patterns that drove the gains, and publishing their findings. They arrived at the same principle from different starting points.
The pruning discipline is the hardest part to internalize. Developers building on LLMs have spent two years adding: adding retrieval, adding context, adding verification, adding structure. The Era 3 reflex is to audit. When Anthropic moved from Opus 4.5 to Opus 4.6, they ran the V2 harness without context resets because Opus 4.6 does not exhibit the context anxiety that made them necessary [1]. The sprint decomposition mechanism had a half-life of approximately one model generation.
The Anthropic principle bears repeating because it is the hardest to act on: “Every harness component encodes assumptions about model limitations” [2]. Those assumptions expire when the model improves. Components that made a harness reliable under Opus 4.5 can become overhead under Opus 4.6, friction under Opus 4.7, and obstacles under whatever ships next quarter.
Build the harness well. Document what each component compensates for. When the next model arrives, stress-test those assumptions before you assume they still hold.

Figure 7 - The pruning discipline cycle: Every harness component encodes an assumption about model limitations. Each model release is an opportunity to test those assumptions for expiry. Components that compensated for limitations the model has since absorbed become dead weight. Anthropic dropped context resets when Opus 4.6 arrived; Vercel removed 80% of an agent’s tools and got better results. Pruning is a maintenance practice, not a failure.
Conclusion
This article is the second in the Harness Fundamentals sub-series. It places the harness in historical context, because understanding where the discipline came from is the fastest way to understand why 2026 decisions look the way they do.
The three-era arc is not just a taxonomy. It is a useful diagnostic. When a team is chasing model releases as their primary competitive lever, they are in Era 1 thinking. When they are focused primarily on retrieval quality and context window management, they are in Era 2 thinking. When they are measuring which harness components have expired and which are earning their keep, they are in Era 3. Most teams are behind the era they think they are in.
The evidence we have surveyed is unusually clear. Four independent results, two academic papers and two engineering blog posts from teams with the largest production deployments in the field, all point the same direction: the orchestration code around the model is where the performance lives now. That is not a theoretical claim. It is measured across SWE-bench Pro, Terminal Bench 2, OS World desktop automation, and text classification, with the model held constant in every case.
The thing we keep coming back to is the pruning principle. Era 3 mature harness work looks as much like subtraction as addition. Every component you carry forward from the last model generation is a candidate for removal. The discipline is not building a bigger harness; it is building a more accurate one, calibrated to what the current model actually needs. The articles ahead will walk through how to do that in practice.

Figure 8 - Harness Fundamentals sub-series roadmap: cc-01 covered the vocabulary and components of a harness. cc-02 (this article) placed the harness in historical context via the three-era arc. Upcoming articles cover harness evolution discipline, the build-from-scratch walkthrough, and the skills hierarchy. The companion ai-05 article covers building your own harness from the nine-component foundation.
The Series
This is Part 2 of the five-part Harness Fundamentals sub-series on Claude Code engineering:
- What Is an Agent Harness, Really? Nine Components Most Builders Miss — a working definition and the nine components every modern harness needs
- Three Eras of AI Engineering: Prompt to Context to Harness (this article) — historical context and four data points that settle the Era 3 argument
- The Harness Evolution Principle: Why Mature Harnesses Look Like Pruning — the V1/V2 case study, the Boris anchor, and a practitioner’s pruning playbook
- Building Your First Specialized Harness in Python: 9 Components, 12 Design Decisions (coming soon) — hands-on construction of a minimal harness with all nine components mapped to working code
- Skills, Slash Commands, and Harnesses: A Discipline Hierarchy (coming soon) — where individual skills fit inside the broader harness and how the three layers interact
References
[1] P. Rajasekaran, “Continuous reinvention: A brief history of Claude Code’s development,” Anthropic Engineering Blog, Mar 2026. https://www.anthropic.com/engineering/continuous-reinvention-a-brief-history-of-claude-code
[2] Anthropic, “Effective harnesses for long-running agents,” Anthropic Engineering Blog, Nov 2025. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
[3] E. Schluntz and B. Zhang, “Building effective agents,” Anthropic Engineering Blog, Dec 2024. https://www.anthropic.com/engineering/building-effective-agents
[4] Cursor, “Continually Improving Our Agent Harness,” Cursor Engineering Blog, May 2026. https://cursor.com/blog/continually-improving-agent-harness
[5] LangChain, “Deep Agents on Terminal Bench 2,” LangChain Blog, Feb 2026. https://blog.langchain.com/deep-agents-terminal-bench-2/
[6] Y. Pan et al., “Natural Language Agent Harnesses,” arXiv preprint arXiv:2603.25723, 2026. https://arxiv.org/abs/2603.25723
[7] D. Lee, N. Nair, P. Zhang, J. Lee, O. Khattab, and C. Finn, “Meta-Harness: Automated Harness Discovery for Coding Agents,” arXiv preprint arXiv:2603.28052, 2026. https://arxiv.org/abs/2603.28052