Building Your First Specialized Harness in Python: 9 Components, 12 Design Decisions

Saturday morning. You have the Anthropic SDK installed, a task you want to automate, and a blank editor. You write a loop. It works once. You tweak the system prompt, it works again. By noon you have about 200 lines. By Saturday evening you have 700 lines, the prompt is doing things you didn’t intend, and the whole thing pins to a specific model version. Eight weeks later that version ships with a new tokenizer and your harness breaks in ways you can’t trace.

That is the typical arc of a first harness. Not because the builder did anything wrong but because no one told them what they were actually building.

This article tells you what you’re building: a specialized agent harness with nine components mapped to nine Python files, shaped by twelve design decisions. Four of those decisions you make before writing code. The remaining eight unfold as you build and operate the harness. The code blocks below use the verified Anthropic SDK patterns (v0.102.0, May 2026) [4]. Every claim about cost, context windows, and API behavior traces back to vendor-direct sources [1][2][3].

Figure 1 - Architecture diagram showing the harness layer sitting between the Anthropic API at the top and the domain-specific task at the bottom, with 9 labeled component boxes in between

Figure 1 - The Harness Stack: A specialized harness sits between the raw model API and your domain. It is not the model and it is not the task. It is the fixed architecture that coordinates how the model solves the task. Everything you will build lives in this middle layer.

Why a harness and not just a script

You have probably seen the simpler alternatives: a one-shot prompt function, a LangChain chain, or a sequence of API calls glued together with if statements. These all work for demos. They break under production conditions for the same reason: they have no architecture.

A harness is different. It is a fixed architecture that turns a model into an agent. If we use an OS analogy, the raw language model is a CPU, powerful but inert. The context window is RAM. External databases and files are disk. Tool integrations are device drivers. The harness is the operating system that decides what the CPU sees, when it sees it, and what it is allowed to do with that information.

This article is the fourth in the Harness Fundamentals sub-series. The first three articles made the conceptual case: what a harness is and why it outperforms frameworks, why skills alone don’t reach production reliability, and what good harness measurement looks like. This one is the practical payoff: you build it.

Figure 2 - Diagram comparing a raw script (linear top-to-bottom flow with no feedback) versus a harness (circular while-loop with tool execution, context management, and state persistence)

Figure 2 - Script vs Harness: A script runs once and exits. A harness loops: model calls tool, tool result feeds back into context, model calls the next tool, loop repeats until the task is done or the iteration cap fires. This feedback architecture is what makes the agent able to recover from mistakes instead of failing silently.

KEY INSIGHT: The harness is not code that wraps the model. It is code that decides what the model sees and when, enforces what it is allowed to do, and measures whether the result was worth keeping.

The four questions you must answer before writing code

Most first harnesses fail because the builder starts coding before answering four foundational questions. These four questions are not implementation details. They determine the shape of everything else. Get them wrong in the while-loop file and you will have to refactor the whole project later.

Question 1: Architecture type

Five patterns exist for how you can structure a harness. For a first harness, one answer is almost always correct.

A single-threaded supervisor is a specialized harness with a fixed sequence of phases: the same phases every run, with the model doing work inside each phase but not deciding the sequence. This is what you should build first. It is predictable and debuggable, and it is the pattern where you learn the most about what the other components do.

The other four patterns (general-purpose, autonomous/event-triggered, hierarchical multi-agent, DAG with branching) are not wrong. They are just not where you start. Building a DAG harness as your first project is like learning to drive by starting with a manual transmission in a parking lot on a hill.

First-harness answer: a single-threaded supervisor.

Question 2: Fixed versus dynamic planning

A fixed plan means the harness enforces the sequence. The model does not decide what comes next. Every run follows the same phases in the same order.

A dynamic plan means the model generates its own task list at runtime and reorders it as it works. This sounds more powerful. It is also where most beginners lose control. The moment the model is deciding the plan, you have given up the reliability guarantee that specialization is supposed to provide.

For compliance workflows, legal workflows, and any domain where process fidelity matters more than flexibility, use a fixed plan. Dynamic planning is for exploratory tasks like research synthesis or birthday party planning, where flexibility is the whole point.

First-harness answer: a fixed plan with explicitly coded phases.

Question 3: Model selection

Three models, three price tiers, two context-window sizes. As of May 2026 (vendor-direct) [1]:

Model	Context window	Max output	Input price	Output price
claude-opus-4-7	1M tokens	128K tokens	$5/MTok	$25/MTok
claude-sonnet-4-6	1M tokens	64K tokens	$3/MTok	$15/MTok
claude-haiku-4-5	200K tokens	64K tokens	$1/MTok	$5/MTok

Anthropic’s own docs recommend claude-opus-4-7 for complex agentic tasks. For a cost-conscious first harness, claude-sonnet-4-6 handles orchestration well, and claude-haiku-4-5 is the right choice for sub-agents doing narrow tasks. The price differences in the table above are why the choice matters: a sub-agent that fans out across hundreds of calls is the place to spend Haiku money rather than Opus money.

One more thing about model selection: pick one model per session and stay with it. This is not a preference. It is a hard rule. Switching models mid-conversation causes out-of-distribution tool calls (the new model inherits a conversation history it wasn’t trained on), kills your prompt cache (provider-specific, model-specific, zero cache hits after a switch), and brings different behavioral profiles to prompts tuned for the old model.

First-harness answer: use claude-sonnet-4-6 for orchestration, claude-haiku-4-5 for sub-agents if you need them. Stay with your choices for the full session.

Question 4: Tool format

Cursor’s engineering blog reports that Anthropic models are trained to edit files using string replacement (“find this exact text in the file, replace it with this other text”) while OpenAI models are trained on patch-based format (git-diff style). Either model can technically use either format, but the cost is real: per the same blog, “giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes” [5]. Anthropic has not published its own training-distribution statement on this point, so treat the claim as Cursor’s empirical observation, not an Anthropic-direct fact.

For a first harness on Anthropic models: string replacement. Define your tools in string-replacement terms from the beginning. If you later add OpenAI model support, that’s when you build per-provider tool configurations. Not before.

First-harness answer: use string replacement format for all file-editing tools.
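As a rough illustration of what a string-replacement tool can look like, here is a minimal sketch. The function name edit_file, its parameter names, and the descriptor below are hypothetical choices for this example, not an Anthropic-defined tool. The one design rule it shows is that the old text must match exactly once, so the model gets a clear error instead of a silent wrong edit.

```python
from pathlib import Path


def edit_file(path: str, old_text: str, new_text: str) -> str:
    """Replace exactly one occurrence of old_text with new_text in the file."""
    target = Path(path)
    try:
        content = target.read_text(encoding="utf-8")
    except OSError as exc:
        return f"Error: could not read {path}: {exc}"

    matches = content.count(old_text) if old_text else 0
    if matches == 0:
        return "Error: old_text not found; re-read the file and copy the exact text."
    if matches > 1:
        return f"Error: old_text matches {matches} places; include more surrounding context."

    target.write_text(content.replace(old_text, new_text, 1), encoding="utf-8")
    return f"OK: replaced 1 occurrence in {path}"


# Hypothetical descriptor in the name / description / input_schema shape
# that the Messages API expects for a tool definition.
EDIT_FILE_DESCRIPTOR = {
    "name": "edit_file",
    "description": "Replace one exact, unique snippet of text in a file.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "old_text": {"type": "string"},
            "new_text": {"type": "string"},
        },
        "required": ["path", "old_text", "new_text"],
    },
}
```

Returning error strings rather than raising keeps the failure inside the loop, where the model can read it and retry with a corrected snippet.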

Figure 3 - Four quadrant diagram showing the two axes (fixed/dynamic plan, single-agent/multi-agent) with first-harness recommendation highlighted in the fixed plan, single-agent quadrant

Figure 3 - Four Questions Before Coding: The intersection of these four choices determines your harness’s architecture class before you write a single function. The upper-left quadrant (fixed plan, single-threaded supervisor, Sonnet for orchestration, string replacement tools) is where first harnesses belong.

The 9 components mapped to 9 files

Every harness needs these components. Starting from scratch, each one maps to one Python file. Here they are in the order you build them.

engine.py: the while loop

This is the whole harness in one paragraph. The model makes a call. If it wants to use a tool, you run the tool and feed the result back. If it doesn’t want a tool, you’re done. If you hit the iteration cap, you’re also done.

The verified SDK skeleton:

import anthropic

# Assumed wiring: these names are provided by the other harness files.
from hooks import hooks                     # a HookRegistry instance (hooks.py)
from prompts import assemble_system_prompt  # system prompt assembly (prompts.py)
from registry import registry               # tool registry (registry.py)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment


def dispatch_tool(name: str, tool_input: dict) -> str:
    """Route tool name to handler. Returns string result."""
    # In a full harness, registry.get(name) handles routing.
    # Stub here for clarity.
    return f"Error: unknown tool {name}"


def call_model(messages: list):
    """One API call with the harness's fixed configuration."""
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=4096,
        system=assemble_system_prompt(),  # see prompts.py
        tools=registry.descriptors(),     # see registry.py
        tool_choice={"type": "auto"},
        messages=messages,
    )


def run_harness(user_task: str, max_iterations: int = 10) -> str:
    """
    The while loop. Exits when the model stops requesting tools or the
    iteration cap is reached. A cap of 10-25 is a reasonable starting point;
    there is no official Anthropic recommendation for a general harness.
    """
    messages = [{"role": "user", "content": user_task}]
    iterations = 0
    response = call_model(messages)

    while response.stop_reason == "tool_use" and iterations < max_iterations:
        iterations += 1

        # Scan the content array: tool_use blocks are not always at index 0,
        # and one response can contain several. Each needs its own tool_result.
        tool_results = []
        for tool_use in (b for b in response.content if b.type == "tool_use"):
            # Pre-tool hook: may block or modify the call (see hooks.py)
            allow, tool_input = hooks.run_pre(tool_use.name, tool_use.input)
            if not allow:
                result = "Tool call blocked by permission check."
            else:
                result = dispatch_tool(tool_use.name, tool_input)

            # Post-tool hook: logging, telemetry, cannot block
            hooks.run_post(tool_use.name, tool_input, result)

            tool_results.append({
                "type": "tool_result",
                "tool_use_id": tool_use.id,  # must match
                "content": result,
            })

        # Append assistant turn, then all tool results as one user turn
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})

        response = call_model(messages)

    # Return the final text block, or empty string if none
    return next((b.text for b in response.content if b.type == "text"), "")

Two rules Anthropic documents explicitly: tool_result blocks must immediately follow the assistant tool_use message (no messages between them), and if a user message contains both tool results and text, the tool results must come first. The iteration cap (10 in this example) is a design decision you make. Anthropic’s computer-use example uses 10, and 10 to 25 is a reasonable starting point for most tasks.

prompts.py: system prompt assembly with caching

The system prompt is not a static string. It is a pipeline. And the order of blocks in that pipeline is load-bearing for cost.

Prompt caching in the Anthropic API works on token prefixes: if the prefix is identical across calls, the second call is a cache hit. Cache reads cost $0.50/MTok (0.1x the base input price of $5.00/MTok for Opus 4.7)[2]. Cache writes cost more upfront ($6.25/MTok for a 5-minute cache, $10.00/MTok for a 1-hour cache)[2]. You pay more to write the cache, you pay much less to read from it. Over a 100-call session, a well-cached 10,000-token system prompt costs roughly $0.50 in cache reads instead of $5.00 at base rate [2]. The savings ratio is 10x and it compounds across every call that hits the cache [2].
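The arithmetic behind that estimate can be checked with a few lines. This is a back-of-the-envelope sketch using only the Opus 4.7 prices quoted above; the function name session_cost is made up for the example, and it assumes every call after the first arrives while the 5-minute cache entry is still alive.

```python
MTOK = 1_000_000

# Opus 4.7 prices quoted above, in dollars per million tokens.
BASE_INPUT = 5.00
CACHE_READ = 0.50
CACHE_WRITE_5MIN = 6.25


def session_cost(prompt_tokens: int, calls: int, cached: bool) -> float:
    """Dollar cost of sending the same system prompt on every call of a session."""
    if not cached:
        return calls * prompt_tokens * BASE_INPUT / MTOK
    # The first call writes the cache; the remaining calls read from it.
    write = prompt_tokens * CACHE_WRITE_5MIN / MTOK
    reads = (calls - 1) * prompt_tokens * CACHE_READ / MTOK
    return write + reads


uncached = session_cost(10_000, 100, cached=False)
cached = session_cost(10_000, 100, cached=True)
print(f"uncached: ${uncached:.2f}  cached: ${cached:.4f}")
```

Counting the one-time cache write, the cached session comes to about $0.56 rather than exactly $0.50, so the realized saving is a little under 10x and approaches 10x as the number of calls grows.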

The ordering rule: static content first, dynamic content after the cache breakpoint. Reversing this kills the cache because every call has a unique prefix. The minimum to qualify for caching is 4,096 tokens for Opus 4.7 or 1,024 tokens for Sonnet 4.6 [2].

def assemble_system_prompt(task_context: str = "") -> list[dict]:
    """
    Returns a list of TextBlockParam dicts.
    Static instructions get cache_control; dynamic task context does not.
    Prefix order enforced by Anthropic: tools -> system -> messages.
    """
    static_block = {
        "type": "text",
        "text": (
            "You are a specialized harness agent.\n\n"
            "## Rules\n"
            "- Use read_file to inspect files before answering.\n"
            "- Never guess at file contents.\n"
            "- Report what you find, not what you expect.\n"
            # Add your static domain instructions here.
            # The longer this static block, the more caching saves you.
            # Minimum 1,024 tokens for Sonnet 4.6 to qualify for caching.
        ),
        "cache_control": {"type": "ephemeral"},  # cache breakpoint here
    }

    if task_context:
        dynamic_block = {
            "type": "text",
            "text": f"Current task context:\n{task_context}",
            # No cache_control on dynamic blocks. This changes per call.
        }
        return [static_block, dynamic_block]

    return [static_block]

Verify caching is working by checking response.usage.cache_read_input_tokens > 0 after the second call.
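One way to make that check routine is a small helper that reads the usage fields and reports what happened. This is a sketch: cache_report is a hypothetical name, and the SimpleNamespace object stands in for response.usage so the example runs without an API call.

```python
from types import SimpleNamespace


def cache_report(usage) -> str:
    """Summarize cache behavior from a response.usage-like object."""
    read = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    if read > 0:
        return f"cache hit: {read} tokens read"
    if written > 0:
        return f"cache write: {written} tokens stored"
    return "no caching: prefix too short or cache_control missing"


# Stand-ins for the usage objects of a first and a second call.
first_call = SimpleNamespace(cache_read_input_tokens=0, cache_creation_input_tokens=1800)
second_call = SimpleNamespace(cache_read_input_tokens=1800, cache_creation_input_tokens=0)

print(cache_report(first_call))
print(cache_report(second_call))
```

If the second call still reports a write or no caching, the usual suspects are a prefix below the minimum size or dynamic content placed ahead of the breakpoint.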

Figure 4 - Two-column diagram showing correct prompt ordering (static block with cache_control breakpoint, then dynamic block) on the left versus incorrect ordering (dynamic block first, static block second) on the right, with annotations showing cache hit vs cache miss consequences

Figure 4 - Static First, Dynamic Second: The cache breakpoint must fall on the last static block. Dynamic content after the breakpoint is fine, it does not invalidate the static prefix. Dynamic content before the static block means every call has a unique prefix and the cache never fires. This single ordering decision can reduce per-call costs by 10x on a long-running session.

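KEY INSIGHT: Prompt caching does not require a beta header or a special feature flag, but it is not automatic: you opt in by placing cache_control on a block, as the static block in prompts.py does. After that, the only variable is whether your content ordering earns cache hits. Static first, dynamic last, cache breakpoint on the last static block.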

registry.py: the tool registry

Every tool your harness can call lives here. The registry has three methods: register() to add a tool, get(name) to retrieve a handler for dispatch, and descriptors() to return the lightweight list the API needs (name, description, input schema). The model sees the descriptors, not the handlers.

Keep descriptors concise but precise. The model uses the description to decide whether to call the tool. Vague descriptions produce wrong calls; the harness then runs error recovery, which costs tokens and latency.
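The article does not show registry.py, so the following is only one possible shape for it, under the assumption that handlers take keyword arguments matching the input schema. The Tool dataclass, the word_count example tool, and the way dispatch_tool uses the registry are all illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Tool:
    name: str
    description: str
    input_schema: dict
    handler: Callable[..., str]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, name: str, description: str, input_schema: dict,
                 handler: Callable[..., str]) -> None:
        self._tools[name] = Tool(name, description, input_schema, handler)

    def get(self, name: str) -> Optional[Callable[..., str]]:
        """Return the handler for dispatch, or None if the tool is unknown."""
        tool = self._tools.get(name)
        return tool.handler if tool else None

    def descriptors(self) -> list[dict]:
        """Return only what the API needs; handlers never leave the process."""
        return [
            {"name": t.name, "description": t.description, "input_schema": t.input_schema}
            for t in self._tools.values()
        ]


registry = ToolRegistry()
registry.register(
    name="word_count",
    description="Count whitespace-separated words in the given text.",
    input_schema={
        "type": "object",
        "properties": {"text": {"type": "string"}},
        "required": ["text"],
    },
    handler=lambda text: str(len(text.split())),
)


def dispatch_tool(name: str, tool_input: dict) -> str:
    """Look up the handler and call it with the model-supplied arguments."""
    handler = registry.get(name)
    if handler is None:
        return f"Error: unknown tool {name}"
    return handler(**tool_input)
```

Keeping handlers out of descriptors() is the point of the split: the model only ever sees names, descriptions, and schemas.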

agents.py: sub-agent spawning

Sub-agents are just a second client.messages.create call with isolated context. There is no dedicated sub-agent API in the standard Messages API as of May 2026. The pattern: spawn, restrict, collect.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from environment


def spawn_sub_agent(
    task: str,
    sub_system: str,
    sub_tools: list,
    model: str = "claude-haiku-4-5",  # cheaper for narrow sub-tasks
    max_tokens: int = 2048,
) -> str:
    """
    Spawn an isolated sub-agent with its own context and restricted toolset.
    The sub-agent runs its own while loop and returns a string result.
    """
    # Assumption: dispatch_tool is the router defined in engine.py. Importing
    # it lazily avoids a circular import when engine.py imports this module.
    from engine import dispatch_tool

    messages = [{"role": "user", "content": task}]

    def call_model():
        return client.messages.create(
            model=model,
            max_tokens=max_tokens,
            system=sub_system,
            tools=sub_tools,
            messages=messages,
        )

    response = call_model()

    # Sub-agents need an iteration cap too. A tool that consistently
    # returns errors can spin a sub-agent indefinitely otherwise.
    sub_iterations = 0
    max_sub_iterations = 5
    while response.stop_reason == "tool_use" and sub_iterations < max_sub_iterations:
        sub_iterations += 1
        # Answer every tool_use block in the response, not just the first.
        tool_results = [
            {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": dispatch_tool(block.name, block.input),
            }
            for block in response.content
            if block.type == "tool_use"
        ]
        messages.append({"role": "assistant", "content": response.content})
        messages.append({"role": "user", "content": tool_results})
        response = call_model()

    return next((b.text for b in response.content if b.type == "text"), "")

Each sub-agent gets a fresh context window with no pollution from the parent’s history. That is the context isolation benefit. It pairs with the model-selection answer from Question 3: Haiku handles narrow tasks while Sonnet or Opus handles orchestration. The cost difference is real.

Note: Anthropic’s Managed Agents platform (a separate hosted service requiring the beta header anthropic-beta: managed-agents-2026-04-01) adds session durability and webhooks on top [3]. That is a different product. Your first harness uses the simple spawn pattern above.

primitives.py: built-in tools

The baseline operations every harness needs: read a file, run a bash command, write to a file. These must use only the standard library, no framework dependencies. The agent needs to take actual actions in a clean environment.

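from pathlib import Path
import subprocess


def read_file(path: str) -> str:
    """Read the contents of a file. Returns an error string on failure."""
    try:
        return Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return f"Error: file not found: {path}"
    except (OSError, UnicodeDecodeError) as exc:
        # Directories, permission problems, and binary files land here.
        return f"Error: could not read {path}: {exc}"


def run_bash(command: str, cwd: str = ".", timeout: float = 60.0) -> str:
    """Run a shell command. Returns stdout, or an error string with stderr."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            cwd=cwd, timeout=timeout,  # the 60-second default is an arbitrary choice
        )
    except subprocess.TimeoutExpired:
        return f"Error: command timed out after {timeout} seconds"
    if result.returncode != 0:
        return f"Error (exit {result.returncode}):\n{result.stderr}"
    return result.stdout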

Register these in registry.py with their appropriate permission levels. Read file is READ permission; bash gets classified dynamically by permissions.py.
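The list of baseline operations includes writing a file, which the block above does not show. A possible sketch follows; the write_file name and the workspace parameter are assumptions for this example. It refuses any path that resolves outside the workspace directory, which matches the WORKSPACE permission level described later in the article.

```python
from pathlib import Path


def write_file(path: str, content: str, workspace: str = ".") -> str:
    """Write content to a file inside the workspace. Returns a status string."""
    root = Path(workspace).resolve()
    target = (root / path).resolve()
    # Refuse paths that escape the workspace, such as ../../etc/passwd
    # or an absolute path somewhere else on disk.
    if not target.is_relative_to(root):
        return f"Error: {path} is outside the workspace"
    try:
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")
    except OSError as exc:
        return f"Error: could not write {path}: {exc}"
    return f"OK: wrote {len(content)} characters to {path}"
```

Path.is_relative_to needs Python 3.9 or newer; on older versions a comparison against target.parents would do the same job.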

memory.py: append-only session log

Long sessions are stateful. If the process crashes, all context is lost unless you write state to disk. The pattern: append every significant event as a JSON line the moment it happens. The replay() method reads back the full session on restart.

import json
from pathlib import Path


class SessionMemory:
    def __init__(self, path: str):
        self.path = Path(path)

    def append(self, event: dict) -> None:
        """Write one JSON event line. Opens in append mode; flushes immediately."""
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")
            f.flush()  # hand the line to the OS before continuing

    def replay(self) -> list[dict]:
        """Read back all events line by line to reconstruct session state."""
        if not self.path.exists():
            return []
        events = []
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                try:
                    events.append(json.loads(line))
                except json.JSONDecodeError:
                    # A crash mid-write can leave a torn final line; skip it.
                    continue
        return events

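The f.flush() call pushes the line out of Python’s buffer to the operating system before the next line of code runs, so a crash of the harness process after that point does not lose the event. It does not by itself force the data onto physical disk; if you need to survive a power loss or a kernel crash, follow the flush with os.fsync(f.fileno()). Closing the file at the end of the with block also flushes, so the explicit call matters most if you later keep the handle open across events, where skipping it can lose the last several events.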

Because the file is append-only, two concurrent readers (a monitoring process, a different agent run) can share the same log without conflicts.

hooks.py: pre-tool and post-tool callbacks

Hooks are the extensibility layer. They inject logic before or after tool execution without touching the core while-loop. Two types:

Pre-tool hook: fires before any tool runs. Receives tool name and input. Can allow, deny, or modify the call. This is where permission enforcement happens, dangerous bash commands get intercepted, and human-in-the-loop approval gates fire.

Post-tool hook: fires after the tool runs. Receives tool name, input, and output. Cannot block anything. This is where audit logging, telemetry, and observability go.

from typing import Callable

# The "dict | None" syntax needs Python 3.10 or newer.
PreToolHook = Callable[[str, dict], tuple[bool, dict | None]]
PostToolHook = Callable[[str, dict, str], None]


class HookRegistry:
    def __init__(self):
        self._pre: list[PreToolHook] = []
        self._post: list[PostToolHook] = []

    def pre(self, fn: PreToolHook) -> PreToolHook:
        self._pre.append(fn)
        return fn  # returning fn lets this double as a decorator

    def post(self, fn: PostToolHook) -> PostToolHook:
        self._post.append(fn)
        return fn

    def run_pre(self, name: str, tool_input: dict) -> tuple[bool, dict]:
        for hook in self._pre:
            allow, maybe_input = hook(name, tool_input)
            if not allow:
                return False, tool_input
            if maybe_input is not None:  # an empty dict is a valid replacement
                tool_input = maybe_input
        return True, tool_input

    def run_post(self, name: str, tool_input: dict, output: str) -> None:
        for hook in self._post:
            hook(name, tool_input, output)


# Shared instance; engine.py calls hooks.run_pre(...) and hooks.run_post(...).
hooks = HookRegistry()

permissions.py: the bash classifier

The permissions layer separates useful tools from dangerous ones. Three levels: READ (inspection only), WORKSPACE (modify files in the current session directory), FULL (arbitrary system access).

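import re
from enum import Enum


class Permission(Enum):
    READ = "read"
    WORKSPACE = "workspace"
    FULL = "full"


# These sets are design decisions specific to your domain.
# Adjust for your risk profile.
SAFE_COMMANDS = {
    "ls", "cat", "grep", "find", "echo", "pwd",
    "wc", "head", "tail", "sort", "uniq", "diff"
}
DANGEROUS_COMMANDS = {
    "rm", "sudo", "shutdown", "reboot", "dd",
    "mkfs", "chmod", "chown", "curl", "wget", "pip"
}

# Split on ;, &&, ||, |, & and newlines so every chained command is checked.
# Checking only the first word would classify "ls; rm -rf /" as READ.
_SEPARATORS = re.compile(r"&&|\|\||[;|&\n]")
# Command substitution can run anything, so it cannot be classified statically.
_SUBSTITUTION = re.compile(r"`|\$\(")


def classify_bash(command: str) -> Permission:
    """Classify a bash command into read / workspace / full permission level.

    The most restrictive segment wins, and ambiguous input errs upward.
    """
    if not command.strip():
        return Permission.WORKSPACE
    if _SUBSTITUTION.search(command):
        return Permission.FULL
    level = Permission.READ
    for segment in _SEPARATORS.split(command):
        words = segment.split()
        if not words:
            continue
        first_word = words[0].rsplit("/", 1)[-1]  # treat /bin/rm as rm
        if first_word in DANGEROUS_COMMANDS:
            return Permission.FULL
        # Unknown commands and output redirection can modify files.
        if first_word not in SAFE_COMMANDS or ">" in segment:
            level = Permission.WORKSPACE
    return level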

Wire this into your pre-tool hook so that any FULL-permission command either triggers human approval or gets blocked outright, depending on your harness’s security policy.
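A possible wiring is sketched below. To keep the example self-contained it takes the classifier as a parameter and uses a toy stand-in; in the harness you would pass something like lambda c: classify_bash(c).value. The names make_bash_gate and approve are hypothetical, and the tool name run_bash is assumed to match the primitive registered earlier.

```python
from typing import Callable, Optional


def make_bash_gate(
    classify: Callable[[str], str],
    approve: Optional[Callable[[str], bool]] = None,
):
    """Build a pre-tool hook that gates run_bash commands classified as "full"."""

    def pre_hook(name: str, tool_input: dict) -> tuple[bool, Optional[dict]]:
        if name != "run_bash":
            return True, None  # not a bash call: allow, input unchanged
        command = tool_input.get("command", "")
        if classify(command) != "full":
            return True, None
        # FULL permission: block outright unless an approver says yes.
        allowed = approve(command) if approve is not None else False
        return allowed, None

    return pre_hook


def toy_classify(command: str) -> str:
    """Stand-in for classify_bash(command).value from permissions.py."""
    return "full" if command.split()[:1] == ["rm"] else "read"


block_gate = make_bash_gate(toy_classify)
approve_gate = make_bash_gate(toy_classify, approve=lambda command: True)
```

The two policies from the paragraph above map to the two constructions: omit approve to block outright, or pass a function that asks a human and returns their answer.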

context.py: compaction logic

This file handles what happens when the message history grows too long. The basic strategy: when total message tokens exceed a threshold (say, 80% of the model’s context window), summarize the oldest messages and replace them with the summary. The most recent messages stay verbatim.

For a first harness, this can be a stub that just monitors token usage and logs a warning. Build the actual compaction logic once you’ve seen how your harness actually behaves over long sessions.
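A stub along those lines might look like the following sketch. The function names are illustrative, the window sizes are copied from the model table earlier in the article, and the SimpleNamespace object stands in for response.usage. It sums the cached and uncached input fields on the assumption that input_tokens alone does not include tokens served from or written to the prompt cache.

```python
from types import SimpleNamespace

# Context window sizes from the model table earlier in this article.
CONTEXT_WINDOWS = {
    "claude-opus-4-7": 1_000_000,
    "claude-sonnet-4-6": 1_000_000,
    "claude-haiku-4-5": 200_000,
}


def total_input_tokens(usage) -> int:
    """Sum uncached and cached input tokens from a response.usage-like object."""
    fields = ("input_tokens", "cache_read_input_tokens", "cache_creation_input_tokens")
    return sum(getattr(usage, name, 0) or 0 for name in fields)


def context_fraction(model: str, usage) -> float:
    """Fraction of the model's context window the last request consumed."""
    return total_input_tokens(usage) / CONTEXT_WINDOWS[model]


def should_compact(model: str, usage, threshold: float = 0.80) -> bool:
    """True once usage crosses the threshold; the stub only reports, it does not compact."""
    return context_fraction(model, usage) >= threshold


usage = SimpleNamespace(
    input_tokens=150_000, cache_read_input_tokens=12_000, cache_creation_input_tokens=0
)
print(should_compact("claude-haiku-4-5", usage))   # 162,000 of 200,000 tokens
print(should_compact("claude-sonnet-4-6", usage))  # 162,000 of 1,000,000 tokens
```

The same token count is a problem for Haiku and a non-event for Sonnet, which is one reason the threshold is expressed as a fraction rather than an absolute number.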

Figure 5 - Diagram mapping each of the 9 Python files to its corresponding harness component, with brief function descriptions alongside each file name

Figure 5 - Nine Components, Nine Files: Each component maps to one file. The while loop in engine.py is the only file that imports all the others. primitives.py, memory.py, permissions.py, and prompts.py have no dependencies within the harness. Build in that order.

The remaining 8 questions, in build order

The four pre-code questions set the architecture. These eight come up as you build. Each has a recommended first-harness answer.

Question 5: File system and workspace design

What files does your harness create during execution? Where do they live? Does the workspace persist across sessions?

First-harness answer: one directory per run, session ID as the folder name. Each phase writes its output to a named file (phase-01-extraction.txt, phase-02-classification.json). If the run fails partway through, you can restart from the last completed phase by checking which files exist.
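The restart check can be as small as the sketch below. The two file names come from the example above; the helper names run_directory and next_phase_index are made up for illustration, and the sketch assumes a phase file only appears once its phase has finished.

```python
from pathlib import Path

# Output file per phase, in execution order.
PHASE_FILES = [
    "phase-01-extraction.txt",
    "phase-02-classification.json",
]


def run_directory(base: str, session_id: str) -> Path:
    """Create (or reuse) the directory for one run, named after the session ID."""
    run_dir = Path(base) / session_id
    run_dir.mkdir(parents=True, exist_ok=True)
    return run_dir


def next_phase_index(run_dir: Path) -> int:
    """Index of the first phase whose output file is missing."""
    for index, filename in enumerate(PHASE_FILES):
        if not (run_dir / filename).exists():
            return index
    return len(PHASE_FILES)  # every phase already has output
```

To make the "file exists means phase finished" assumption hold, a phase can write to a temporary name and rename it into place as its last step.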

Question 6: State management approach

An explicit state machine tracks which phase is current, whether each phase succeeded or failed, and what each phase produced. The alternative (inferring state from the message history) is fragile and hard to debug.

First-harness answer: a simple Python dataclass or dictionary tracking phase status (pending, in_progress, completed, failed). Store it as JSON in the session directory so it survives restarts.
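A minimal version of that dataclass is sketched here. The class name HarnessState and its method names are assumptions; the four status values are the ones listed above.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

VALID_STATUSES = {"pending", "in_progress", "completed", "failed"}


@dataclass
class HarnessState:
    session_id: str
    phases: dict[str, str] = field(default_factory=dict)  # phase name -> status

    def set_status(self, phase: str, status: str) -> None:
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown status: {status}")
        self.phases[phase] = status

    def save(self, path: str) -> None:
        """Persist the state as JSON so it survives a restart."""
        Path(path).write_text(json.dumps(asdict(self), indent=2), encoding="utf-8")

    @classmethod
    def load(cls, path: str) -> "HarnessState":
        return cls(**json.loads(Path(path).read_text(encoding="utf-8")))
```

Calling save() after every set_status() keeps the file on disk in step with the run, so a restart can load the state and pick up at the first phase that is not completed.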

Question 7: Memory persistence

Short-term memory lives in the messages list. Long-term memory needs to persist outside a single process. The append-only SessionMemory class in memory.py handles this. For a first harness, one log file per session is enough.

First-harness answer: SessionMemory in memory.py logging every tool call, tool result, and phase transition.

Question 8: Sub-agent delegation

Do you need sub-agents? For a single-phase first harness, no. For a multi-phase harness with parallel work (analyzing multiple documents simultaneously, for example), yes.

First-harness answer: add sub-agents only when you have a concrete task that benefits from context isolation. Don’t add them speculatively.

Question 9: Context compaction threshold

At what point do you start summarizing older messages? Starting too aggressively loses useful context. Starting too late risks hitting the context limit mid-task.

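First-harness answer: monitor response.usage.input_tokens on each iteration. With prompt caching on, add cache_read_input_tokens and cache_creation_input_tokens to that figure, because input_tokens alone leaves out the cached part of the prompt. Log a warning when you hit 70% of the model’s context window, which leaves headroom before the roughly 80% point where context.py would start compacting. Build the actual compaction logic after you have data on how your specific tasks grow context.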

Question 10: Human-in-the-loop gates

Where do humans approve before the harness proceeds? Code these as explicit checks in the phase sequence. Do not leave them to the model’s judgment, because the model will not consistently decide when to ask.

First-harness answer: one HITL gate before any action that is difficult to reverse (writing to external systems, sending emails, executing destructive file operations).
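An explicit gate can be a plain function call in the phase sequence. The sketch below is one way to write it; hitl_gate is a hypothetical name, and the ask parameter defaults to the built-in input so a test or a web UI can substitute its own prompt function.

```python
from typing import Callable


def hitl_gate(action: str, ask: Callable[[str], str] = input) -> bool:
    """Ask a human to approve an action. Anything other than y/yes is a refusal."""
    answer = ask(f"Approve '{action}'? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


def guarded(action: str, run: Callable[[], str], ask: Callable[[str], str] = input) -> str:
    """Run a hard-to-reverse step only after the gate passes."""
    if not hitl_gate(action, ask):
        return f"skipped: '{action}' was not approved"
    return run()
```

Defaulting to refusal means an empty answer or a typo never triggers the irreversible step.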

Question 11: Validation loops

Each phase’s output should be verified before it passes to the next phase. For code: run the tests. For structured data: validate the JSON schema. For text: at minimum, check that the expected fields are present and non-empty.

First-harness answer: deterministic validation (schema check, length check, required-field check) at the exit of each phase. Add LLM-based validation (the adversarial evaluator pattern) only when deterministic checks aren’t sufficient for your domain.
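As an example of a deterministic exit check, the sketch below validates a phase output that is supposed to be a JSON object. The function name and the required field names (contract_type, confidence) are invented for illustration; substitute whatever your phase actually produces.

```python
import json


def validate_phase_output(
    raw: str, required: tuple[str, ...] = ("contract_type", "confidence")
) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    for key in required:
        value = data.get(key)
        if value is None or value == "":
            problems.append(f"missing or empty field: {key}")
    return problems
```

Returning the full list of problems, rather than stopping at the first, gives you something concrete to feed back to the model if you retry the phase.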

Question 12: Observability from day one

Log every phase start and end with a timestamp and token count. Log every sub-agent spawn with its task. Log every tool call with input and output. Structured JSON logs are better than print statements because they’re machine-parseable.

First-harness answer: instrument the post-tool hook to write structured log lines. Use memory.append() with an event_type field so you can filter by phase transitions, tool calls, and errors later.
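A sketch of such a hook follows. To stay self-contained it writes JSON lines directly; in the harness the body of post_hook would call memory.append(event) instead. The field names other than event_type, and the max_chars truncation, are choices made for this example.

```python
import json
import time


def make_logging_hook(log_path: str, max_chars: int = 2000):
    """Build a post-tool hook that appends one structured line per tool call."""

    def post_hook(name: str, tool_input: dict, output: str) -> None:
        event = {
            "event_type": "tool_call",
            "ts": time.time(),
            "tool": name,
            "input": tool_input,
            "output": output[:max_chars],  # keep log lines bounded
            "is_error": output.startswith("Error"),
        }
        with open(log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

    return post_hook


def read_events(log_path: str, event_type: str) -> list[dict]:
    """Filter the log by event_type, for example to list only tool calls."""
    with open(log_path, encoding="utf-8") as f:
        events = [json.loads(line) for line in f if line.strip()]
    return [e for e in events if e.get("event_type") == event_type]
```

The is_error flag relies on the convention used by the primitives in this article, where failed tools return a string that starts with "Error".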

Figure 6 - Timeline diagram showing 12 decisions organized into three rows: decisions 1-4 at the far left labeled "before code", decisions 5-8 in the middle labeled "during build", decisions 9-12 at the right labeled "during operation"

Figure 6 - When Each Decision Bites You: The first 4 questions are architectural: get them wrong and you refactor the whole project. Questions 5-8 are structural: they shape how the harness handles state and delegation. Questions 9-12 are operational: they determine whether you can debug, cost-control, and improve the running harness.

Reference example: a contract review harness pattern

The most instructive concrete example we can point to is the contract-review harness pattern described by legal-tech engineering teams in interviews on the Latent Space podcast [7], reinforced by Anthropic’s own engineering guidance on orchestrator-worker patterns [6]. The specific implementation details and exact token counts vary across teams; what is consistent across reports is the eight-phase sequential structure described below.

The sequence is fixed, eight phases in order:

Phase 1: extract text.
Phase 2: verify extraction (deterministic check).
Phase 3: classify contract type (LLM call with structured output schema).
Phase 4: ask clarifying questions (explicit HITL gate).
Phase 5: load playbook from a knowledge base (RAG).
Phase 6: extract clauses, with chunking for large documents.
Phase 7: run risk analysis.
Phase 8: generate redlines plus an executive summary and a Word document from a template.

The architecture answer (Question 1) was single-threaded supervisor. The planning answer (Question 2) was fixed: as practitioners building this class of harness describe it, the goal is to keep the model on deterministic rails rather than letting it generate its own task sequence at runtime.

The sub-agent answer (Question 8) was the interesting one. Phase 7 (risk analysis) runs dozens of dedicated sub-agents, one per clause, each with its own isolated context window loaded with the playbook and relevant knowledge base content. The main orchestrator coordinates; the sub-agents do the analysis. This produces a striking token profile: the orchestrator consumes a small fraction of the total tokens. Practitioners building at this scale have reported the orchestrator staying below 10K tokens while the full thread runs into the hundreds of thousands [7]. That leaves the orchestrator lean enough to coordinate cleanly.

Each phase writes its output to a file (Question 5). If the harness fails at phase 6, it restarts from phase 5’s output file rather than rerunning the entire pipeline. The state machine (Question 6) tracks phase status. The observability layer (Question 12) logs every sub-agent spawn with its clause ID so failed analyses can be rerun individually.
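The per-clause fan-out and the "rerun failed analyses individually" idea can be sketched with a thread pool. This is not the implementation those teams describe, only an illustration: fan_out is a hypothetical name, and analyze stands in for a function that would call something like spawn_sub_agent for one clause.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fan_out(
    clauses: dict[str, str],
    analyze: Callable[[str, str], str],
    max_workers: int = 8,
) -> dict[str, str]:
    """Run one analysis per clause; a failure in one clause does not stop the rest."""

    def safe(item: tuple[str, str]) -> tuple[str, str]:
        clause_id, text = item
        try:
            return clause_id, analyze(clause_id, text)
        except Exception as exc:  # record the failure against its clause ID
            return clause_id, f"Error: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe, clauses.items()))


def failed_clauses(results: dict[str, str]) -> list[str]:
    """Clause IDs whose analysis should be rerun."""
    return [cid for cid, result in results.items() if result.startswith("Error")]
```

Threads fit here because each analysis spends its time waiting on an API call; the orchestrator only ever holds the short result strings, not the sub-agents' full contexts.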

This is what a fixed plan with explicit sub-agent delegation actually looks like in practice. The total token cost sounds high. For a law firm processing hundreds of contracts per week, the reliability and the structured Word document output are worth it.

Figure 7 - State machine diagram showing the 8-phase contract review sequence with phase names, decision points, HITL gate at phase 4, and the multi-sub-agent fan-out at phase 7

Figure 7 - Contract Review State Machine: Eight phases, one direction, one explicit HITL gate at phase 4. The fan-out at phase 7 (one sub-agent per clause for risk analysis) is the architectural move that keeps the main orchestrator lean while the sub-agents do the analytical work. Each phase outputs a file; the next phase reads that file as its input.

KEY INSIGHT: Sub-agents are not just a scalability pattern. They are a context discipline. Each sub-agent sees only what it needs for its task. The main orchestrator sees only the sub-agents’ summaries. A small orchestrator can coordinate a very large amount of analysis work.

Knowing whether your harness works

Building the harness is half the project. Knowing whether it’s any good requires measurement, and most first-harness builders skip this step entirely.

Cursor’s engineering team measures harness quality with a metric they call keep rate: what fraction of the code changes the agent produces remain in the user’s codebase after fixed intervals of time? [5] Cursor reports achieving 10x error reduction for the same model through harness tuning alone [5]. The model did not change; the harness did.

You can run an informal version of keep rate on your own harness over 20 trials. Note every output you accepted versus every output you rewrote. Accepted output is a signal that the harness gave the model what it needed. Rewritten output is a signal that something in the context, the tools, or the phase structure is producing incorrect or low-quality work.
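If you want that tally in code rather than on paper, a sketch like the following is enough. The `Trial` record and `keep_rate` function are hypothetical names for this example; this is the informal accept-versus-rewrite proxy described above, not the way Cursor computes its own metric.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    task: str
    accepted: bool  # True if you kept the output as-is, False if you rewrote it
    note: str = ""  # free-text reason for a rewrite, useful when classifying failures


def keep_rate(trials: list[Trial]) -> float:
    """Fraction of trials whose output was accepted without a rewrite."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(t.accepted for t in trials) / len(trials)
```

With 20 trials of which 14 were accepted, `keep_rate` returns 0.7. The `note` field is the part that pays off later: it is the raw material for deciding what kind of failure each rewrite was.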

Before you can improve keep rate, though, you need to know what kind of failure produced the rewrite. Cursor’s taxonomy classifies harness errors into three categories:

1. InvalidArguments: the model passed bad arguments to a tool. This is a model mistake. Harness fixes: improve the tool description, give the model better examples of valid inputs, or add input validation that returns a helpful error message rather than a stack trace.
2. UnexpectedEnvironment: the model’s assumption about the environment state was wrong. The model expected a file to exist and it didn’t. The model expected a command to succeed, but the environment configuration was different. This is also a model mistake, but the harness fix is different: give the model tools to inspect the environment before acting, rather than letting it assume.
3. ProviderError: the Anthropic API actually failed. Rate limit, service interruption, transient error. This is not a model mistake and not a harness design mistake. Handle it with exponential backoff and a fallback, not by redesigning your prompts.

Categories 1 and 2 are where harness tuning lives. Category 3 is where infrastructure handling lives. Mixing them up is a common source of wasted effort on first harness projects.
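One way to keep the two kinds of fix separate is to classify the exception before deciding what to do with it. The sketch below is an assumption-heavy illustration: the three exception classes, `classify`, and `call_with_backoff` are hypothetical names, and in a real harness you would map your SDK's exceptions (for example `anthropic.RateLimitError`) onto the third category. It retries only provider errors. Categories 1 and 2 are re-raised immediately so the harness can feed them back to the model. The fallback step is left out.

```python
import random
import time
from enum import Enum
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ErrorCategory(Enum):
    INVALID_ARGUMENTS = 1       # model passed bad arguments to a tool
    UNEXPECTED_ENVIRONMENT = 2  # model's assumption about the environment was wrong
    PROVIDER_ERROR = 3          # the API itself failed


# Hypothetical exception types; map your own tool and SDK errors onto them.
class ToolArgumentError(Exception):
    pass


class EnvironmentMismatchError(Exception):
    pass


class ProviderUnavailableError(Exception):
    pass


def classify(exc: Exception) -> Optional[ErrorCategory]:
    """Return the taxonomy category for an exception, or None if it is unclassified."""
    if isinstance(exc, ToolArgumentError):
        return ErrorCategory.INVALID_ARGUMENTS
    if isinstance(exc, EnvironmentMismatchError):
        return ErrorCategory.UNEXPECTED_ENVIRONMENT
    if isinstance(exc, ProviderUnavailableError):
        return ErrorCategory.PROVIDER_ERROR
    return None


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry fn on provider errors only, with exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            is_provider = classify(exc) is ErrorCategory.PROVIDER_ERROR
            if not is_provider or attempt == max_attempts - 1:
                raise  # model mistakes and exhausted retries go back to the caller
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
    raise AssertionError("unreachable")
```

The `sleep` parameter is injectable so you can test the retry path without waiting. Logging the category alongside each failure gives you the breakdown you need when a keep-rate drop has to be explained.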

Figure 8 - Three-column diagram showing the Cursor error taxonomy: InvalidArguments (model mistake, fix the tool description or input validation), UnexpectedEnvironment (model mistake, fix by giving the model inspection tools), ProviderError (vendor failure, fix with retry/backoff)

Figure 8 - Three-Category Error Taxonomy: Cursor’s taxonomy separates model mistakes (categories 1 and 2, where harness tuning helps) from vendor failures (category 3, where retries and backoff help). Every misclassified error leads to the wrong fix: rewriting prompts for a service outage, or adding retries for a bad tool description.

The trap: mid-chat model switching#

You will be tempted to switch models mid-session. You start with Sonnet for the main loop, and then you hit a complex reasoning step and reach for Opus. Or you have finished the hard planning and want to switch to Haiku for the execution phase to save cost.

Don’t. Cursor is explicit: “We generally recommend staying with one model for the duration of a conversation unless you have a reason to switch” [5].

Three things go wrong when you switch models mid-conversation:

Out-of-distribution tool calls. The new model inherits a conversation history that was produced by a different model. It sees tool calls and intermediate outputs generated with different tool formats and behavioral profiles, a history that is out of distribution relative to its own training. The result: misinterpreted tool outputs, irrelevant tool calls, inconsistent behavior.

Prompt cache invalidation. Prompt caches are provider-specific and model-specific. Switching on turn N means the new model starts with zero cache hits on everything built up so far. The first turn after the switch pays full price for context that was previously cached.

Behavioral mismatch. Anthropic models tend to be more intuitive and tolerant of imprecise instructions. OpenAI models tend to be more literal and precise. These are not prompt engineering observations; they reflect training distribution differences. A system prompt tuned for one profile will perform suboptimally for the other.

The correct pattern for multi-model workflows is sub-agents with clean context boundaries. Each sub-agent starts fresh, picks the right model for its specific task, and passes structured output back to the orchestrator. The orchestrator never sees the sub-agent’s raw conversation history. No OOD contamination. No cache penalty. No behavioral mismatch.
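A minimal sketch of that boundary, under stated assumptions: `ModelCall` is a hypothetical adapter that would wrap one fresh API call per sub-agent, and `SubAgentResult`, `run_subagent`, and `orchestrate` are names invented for this example. The model is injected as a plain callable so the sketch runs without the SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical adapter: (model, system_prompt, user_message) -> reply text.
# In a real harness this would wrap one fresh API call per sub-agent.
ModelCall = Callable[[str, str, str], str]


@dataclass(frozen=True)
class SubAgentResult:
    task_id: str
    model: str
    summary: str  # the only thing the orchestrator ever sees


def run_subagent(
    task_id: str, model: str, system: str, task_input: str, call_model: ModelCall
) -> SubAgentResult:
    # Fresh context: no orchestrator history and no other sub-agent's
    # history is passed in, only this task's own input.
    reply = call_model(model, system, task_input)
    return SubAgentResult(task_id=task_id, model=model, summary=reply.strip())


def orchestrate(
    tasks: list[tuple[str, str, str]], system: str, call_model: ModelCall
) -> list[SubAgentResult]:
    """Each task is (task_id, model, input); each one gets its own model and context."""
    return [
        run_subagent(task_id, model, system, text, call_model)
        for task_id, model, text in tasks
    ]
```

The design choice that matters is the return type. Because `run_subagent` hands back a `SubAgentResult` and not a message list, there is no path by which one model's raw tool calls can leak into another model's history.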

The maturity arc#

Your first harness is not your final harness. This is a feature, not a failure.

Every component you write encodes an assumption about what the model cannot do alone. As models improve, those assumptions expire. When they expire, the component becomes overhead that can actively hurt performance. Anthropic engineers have documented cases of this: a specific harness removed context resets entirely when moving from one model generation to the next, because the newer model no longer exhibited the context anxiety that made resets necessary. The harness got simpler and faster.

The practical implication: document the assumption behind each component you add. “We added context resets because the model terminates tasks prematurely when the context window fills.” When you upgrade the model and that behavior is gone, the reset logic is a candidate for removal.
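The documentation can live next to the code as a small registry. The following is one possible shape, not a prescribed format: `ComponentAssumption`, its fields, and `removal_candidates` are hypothetical, and the model names in any registry you build are whatever identifiers your harness already uses.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ComponentAssumption:
    component: str        # e.g. "context_reset"
    assumption: str       # the model limitation this component compensates for
    added_for_model: str  # model the limitation was observed on
    added_on: date
    check: str            # how to test whether the limitation still exists


def removal_candidates(
    registry: list[ComponentAssumption], current_model: str
) -> list[str]:
    """Components whose assumption was recorded against a different model.

    These are candidates to retest, not components to delete automatically.
    """
    return [a.component for a in registry if a.added_for_model != current_model]
```

After a model upgrade, run `removal_candidates` with the new model name and work through each entry's `check` before touching the component.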

Teams building at scale have reported rewriting their harnesses multiple times within a year as model capabilities shifted. Engineering practitioners report removing large fractions of an agent’s tool catalog and seeing better performance, not worse. More structure is not always better. The components you prune are usually the ones that were compensating for model limitations that no longer exist.

Start minimal. Add components only when you have measured evidence that a specific failure mode requires them. When the model improves, run the same tests and remove what is no longer needed.

The next article in this series covers the Bootstrap Framework: a system for tracking harness evolution across teams and projects, so the pruning decisions don’t have to be rediscovered from scratch each time.

Figure 9 - Evolution timeline showing a harness from month 1 (9 components, several with "compensating for model limitation" labels) through month 6 (5 components, simplified, with removed components shown as faded boxes labeled "model absorbed this")

Figure 9 - The Harness Maturity Curve: A harness over six months typically gets simpler, not more complex. Components added for Sonnet 4.5 become unnecessary on Opus 4.6. The question is whether you document the assumption behind each component so you know when to remove it.

KEY INSIGHT: The components you build this weekend are not permanent. They are educated guesses about what the current model needs. Write down the assumption behind each one. The model will prove you wrong on at least two of them within a year.

Where to go next#

You have a working mental model, a nine-file structure, and twelve design decisions with recommended first-harness answers. The SDK skeleton in engine.py is verified against the Anthropic SDK v0.102.0 (May 2026) [4]. The cost figures are from the vendor-direct model table [1]. The error taxonomy and the keep-rate metric are from Cursor’s engineering blog [5].

Build the simplest harness that makes your specific task work reliably. Measure it with an informal keep rate over 20 runs. Classify your failures with the three-category taxonomy. Fix categories 1 and 2 with harness changes. Fix category 3 with retry logic.

Six months from now, check your assumptions. Remove the components that are compensating for model limitations that no longer exist. The harness that survives is the one that stays close to what the model actually needs and nothing more.

Figure 10 - Summary diagram showing the full path from "blank Python file" through the 9-component structure, the 12 decisions, the keep-rate measurement loop, back to "harness that improves over time"

Figure 10 - Build, Measure, Prune: The full cycle is to build the 9-component structure, measure with keep rate and error classification, and prune what model improvements make obsolete. First harnesses skip the measurement and prune steps. Production harnesses rely on them.

The Series#

This is Part 4 of the five-part Harness Fundamentals sub-series on Claude Code engineering:

What Is an Agent Harness, Really? Nine Components Most Builders Miss — a working definition and the nine components every modern harness needs
Three Eras of AI Engineering: Prompt to Context to Harness — how the discipline moved and what each era absorbed from the one before
The Harness Evolution Principle: Why Mature Harnesses Look Like Pruning — the V1/V2 case study, the Boris anchor, and a practitioner’s pruning playbook
Building Your First Specialized Harness in Python: 9 Components, 12 Design Decisions (this article) — hands-on construction of a minimal harness with all nine components mapped to working code
Skills, Slash Commands, and Harnesses: A Discipline Hierarchy — where individual skills fit inside the broader harness and how the three layers interact

References#

[1] Anthropic, “Models overview,” Claude Platform Documentation, 2026. https://docs.claude.com/en/docs/about-claude/models/overview

[2] Anthropic, “Prompt caching,” Claude Platform Documentation, 2026. https://docs.claude.com/en/docs/build-with-claude/prompt-caching

[3] Anthropic, “Tool use with Claude,” Claude Platform Documentation, 2026. https://docs.claude.com/en/docs/agents-and-tools/tool-use/overview

[4] Anthropic, “anthropics/anthropic-sdk-python,” GitHub Repository, 2026. https://github.com/anthropics/anthropic-sdk-python

[5] Cursor, “Continually Improving Our Agent Harness,” Cursor Engineering Blog, 2026. https://cursor.com/blog/continually-improving-agent-harness

[6] E. Schluntz and B. Zhang, “Building effective agents,” Anthropic Engineering Blog, Dec 2024. https://www.anthropic.com/engineering/building-effective-agents

[7] swyx and Alessio, “Building an AI Lawyer: Harvey AI Engineering,” Latent Space Podcast, 2026. https://www.latent.space/p/harvey