What Is an Agent Harness, Really? Nine Components Most Builders Miss

This is a Claude Code article on what an agent harness actually is. If you build with Claude Code, Cursor, Codex CLI, or any other agentic coding tool, you are already using a harness, whether or not you call it that. Knowing what is inside the box is what separates a power user from someone who just clicks “accept.”

Figure 1 - Cinematic abstract hero showing a luminous language model core surrounded by a structured architectural shell of nine interconnected harness components.

Figure 1 - The harness wraps the model: A working agent is never just a language model. The model sits at the center, but it is the surrounding harness, the loop, the tools, the memory, the hooks, the permissions, that turns raw text generation into something you can trust on a Tuesday afternoon. This article opens up that shell and walks through every component inside.

The hook that woke us up#

LangChain published a blog post in early 2026 that is, we think, the cleanest evidence we have for what is going on inside this generation of AI tools.

Their Deep Agents CLI was sitting outside the top 30 on Terminal Bench 2.0, scoring 52.8% [1]. They held the model fixed at gpt-5.2-codex, made three changes to the surrounding architecture, and rank-jumped to fifth place at 66.5% [1]. Same model, same benchmark, same prompts as far as the eye can see. Plus 13.7 points, no model upgrade.

Their phrasing: “we only changed the harness” [1].

Figure 4 - Bar chart comparing LangChain Deep Agents CLI on Terminal Bench 2.0 before and after harness changes, 52.8% at rank 30 versus 66.5% at rank 5, with the model held constant at gpt-5.2-codex.

Figure 4 - Same model, +13.7 points: LangChain published Terminal Bench 2.0 numbers in early 2026 [1] showing that three architectural changes around a fixed gpt-5.2-codex model were enough to move them from outside the top 30 (52.8%) to fifth place (66.5%). No model upgrade. The visible 13.7-point gap is the harness itself, doing work the model could not do alone.

We sat with that one for a while. For the last two years, the dominant assumption has been that the model itself is the lever: that every meaningful improvement in agent quality comes from the next checkpoint shipped by Anthropic or OpenAI. The Terminal Bench result says the opposite. Most of the variance now lives in the layer wrapping the model, not in the model itself. Academic work backs this up: ablation studies on the NLAH (natural-language agent harnesses) system out of Tsinghua University and the Harbin Institute of Technology show double-digit benchmark swings from changing harness modules while holding the model fixed [2], and the Meta-Harness paper out of Stanford, MIT, and KRAFTON reinforces the same finding across multiple held-out models [3]. LangChain’s “we only changed the harness” line is the same finding in a more digestible form.

So the obvious next question is, what is a harness?

KEY INSIGHT: When the same model produces double-digit benchmark swings under different harnesses, the cheapest performance lever you have is no longer the model upgrade, it is the architecture wrapped around it.

Why the term is used so loosely#

If you scroll any developer thread on agents, you will see people use “harness,” “framework,” “agent skill,” and “agent” almost interchangeably. The word has gone fuzzy because the field moved faster than the vocabulary. We want to define a working definition before we walk through the components, because the definition is what makes the rest of the article useful instead of word-soup.

A harness is a fixed architecture that turns a language model into an agent. The model produces text. The harness gives the model the ability to take action, observe consequences, and keep going until a problem is actually solved.

That sentence is not just academic. The distinction it sets up matters at the architectural level.

Harness versus framework#

These two words describe opposite things, and the conflation is the source of most of the confused agent debates online.

Dimension	Framework	Harness
Examples	LangChain, LangGraph, AutoGen, CrewAI	Claude Code, Cursor, Codex CLI, Windsurf
What it provides	Abstractions: state graphs, chains, memory connections, retrievers	A complete working agent
Assembly required	Yes; you wire the pieces together	No; arrives pre-wired
Built for	A human architect to configure an agent	The agent itself, to do a task
Starting point	You define the architecture	Fixed architecture ships with the tool

A framework is a kit of parts. A harness is a finished vehicle. Both are legitimate tools and the right choice depends on what you are doing. If you are building a new agentic product from scratch, you reach for a framework. If you want to ship code on a Tuesday afternoon, you reach for a harness.

Figure 3 - Side-by-side comparison diagram of two architectural camps: framework on the left as labeled disconnected parts (chains, graphs, retrievers, memory) waiting for assembly, harness on the right as a complete pre-wired agent with the model nested inside.

Figure 3 - Two camps, one common confusion: Frameworks like LangChain, LangGraph, AutoGen, and CrewAI ship a kit of parts that a human architect assembles. Harnesses like Claude Code, Cursor, Codex CLI, and Windsurf ship a complete pre-wired agent. Both are legitimate, but they answer different questions, which is why agent debates online so often go in circles.

KEY INSIGHT: Reach for a framework when you are designing the architecture. Reach for a harness when you are trying to ship code by Friday. The two tools are not competitors, they are answers to different questions.

The operating-system analogy#

The clearest mental model we have found is the one that maps onto operating systems. If you are old enough to remember explaining what a kernel was, this lands instantly. If not, it still works.

The raw LLM is the CPU. Powerful, but inert without instructions.
The context window is the RAM. Small, fast, lost when the process ends.
External databases are the disk. Persistent, larger, slower.
Tool integrations are the device drivers. They let the CPU talk to the outside world.
The harness is the operating system. It decides what the CPU sees and when, schedules the work, manages memory, enforces permissions, and keeps the whole machine from setting itself on fire.

Figure 2 - Layered architectural diagram mapping computer components to harness components: LLM as CPU at the bottom, context window as RAM, external databases as disk, tool integrations as device drivers, and the harness shell labeled as the operating system orchestrating the stack.

Figure 2 - The harness is the operating system: A useful mental model maps the agent stack to a familiar one. The model is the CPU, the context window is the RAM, external databases are the disk, tool integrations are the device drivers, and the harness is the operating system above them all. The OS is what schedules work, enforces permissions, manages memory, and keeps the CPU from running into the wall. The same job description fits the harness exactly.

Anthropic uses a different analogy that we also like: the model is a horse and the harness is what gets the cart somewhere [4]. The horse has raw power but, left to itself, it goes wherever it wants. The harness controls the power, sets it in a direction, and turns motion into useful work. Without the harness, you have an animal in a field instead of transport.

Pick whichever analogy lands for you. The point is the same: the model alone is not the agent.

The nine components#

Every modern agentic coding harness we have looked at, including Claude Code, has the same nine components. Some are more obvious than others. A couple of them are the difference between a demo and something you would trust on a Friday afternoon. We are going to walk through all nine, with one named example per component where it helps.

Figure 5 - Circular architecture diagram with the while loop at the center and the other eight harness components arranged around it: context management, tools and skills, sub-agent management, built-in skills, session persistence, system prompt assembly, lifecycle hooks, and permissions and safety.

Figure 5 - The nine components arranged around the loop: The while loop is the foundation, but the other eight components are what make the loop trustworthy. Each one earns its place by absorbing a different failure mode: runaway context, runaway cost, runaway permissions, runaway scope. Reading this diagram clockwise from the top is roughly the order in which a from-scratch builder discovers they need each piece.

1. The while loop#

This is the foundation. On every iteration of the loop:

The model reads the assembled system prompt and the current context.
The model decides which tool to call.
The harness runs the tool.
The tool result is fed back into the context.
The loop repeats.

The loop continues until the model produces a text-only response with no tool call, or it hits a maximum-iteration cap that the harness sets to keep things from running away.

That is it. Five lines of pseudo-code:

1
while not done and iterations < max_iters:
2
    response = model.complete(context)
3
    if response.is_text_only:
4
        done = True
5
    else:
6
        result = registry.dispatch(response.tool_call)
7
        context.append(response, result)
8
    iterations += 1

Everything else in the harness exists to support those five lines. When you read the “an agent is just a while loop” tweets, this is what they mean. The tweets are not wrong. They are just incomplete, because the other eight components are what make the while loop trustworthy.

2. Context management#

On every turn, the context tree grows: user messages, tool calls, tool results, sub-agent reports, file reads. Left unmanaged, the context will eventually exceed whatever window the model was trained on. When that happens, things get bad in interesting ways: the model starts forgetting what it just did, contradicts earlier decisions, and on long tasks it can wrap things up prematurely just to escape the pressure.

The harness’s job is to decide what to keep verbatim, what to summarize, and what to discard. This process is called compaction.

Claude Code’s window was 200K tokens for a long time and then expanded to 1M for Opus. At roughly 80 to 90 percent capacity, compaction triggers automatically. Recent messages stay verbatim because they are most relevant. Older messages get summarized because their cost-to-value ratio is poor.

Compaction done badly has consequences. Sonnet 4.5 famously showed what people came to call context anxiety, where the model would notice the context filling up and rush through remaining steps. Opus 4.6 is much less twitchy about this and Anthropic actually removed the explicit context-reset machinery from their internal long-running agents harness because the new model handles compaction natively. That is a story for another article, but it is a useful preview: the harness does not just grow over time, it also prunes.

Figure 7 - Context-window timeline diagram showing three stages: a growing context tree at 60% capacity (verbatim), the compaction trigger at 80-90% capacity (recent verbatim, older summarized), and the post-compaction window with summary blocks compressed and recent activity preserved.

Figure 7 - Compaction in flight: Context grows on every turn until it bumps the compaction trigger around 80 to 90 percent capacity. The harness keeps recent messages verbatim because they are most relevant to the current step, summarizes older messages because their cost-to-value ratio has dropped, and discards what no longer matters. Done well, the agent stays oriented. Done badly, you get context anxiety and rushed conclusions.

3. Tools and skills#

We want to draw a distinction here that the wider community is still settling on.

Tools are primitives: read file, edit file, run bash, search code, fetch URL. They are universal in the sense that the same tool works across teams and projects with no domain knowledge.

Skills are organizational knowledge encoded above the tool layer, usually as Markdown files. Skills are specific to your team, your workflow, your domain. The dotzlaw.com project where this article was drafted has a write-article skill, a review-servoy-code skill, a generate-linkedin-post skill, and so on. Each one is a Markdown file the agent reads when invoked. It tells the agent how we like things done here.

Both tools and skills are managed by the same component: a registry. The registry maps every name to a permission level, a handler function, and a one-line description. Three operations:

register(name, handler, perms, desc) adds a new tool or skill.
get(name) retrieves the handler for dispatch.
descriptors() returns a lightweight list of name, permission, and description that the harness sends to the model so it knows what is available.

The descriptor list matters more than people realize. It is what the model sees in its system prompt as “here are your tools.” Get the descriptions wrong and the model picks the wrong tool. Get them too verbose and you waste the prompt budget. Get them right and the model usually does the right thing on the first try.

4. Sub-agent management#

When a task is too large or too parallel for a single conversation thread, the harness spawns sub-agents. Each sub-agent gets its own session with an isolated context, its own restricted tool set, and a focused system prompt scoped to the task at hand. The pattern is spawn, restrict, collect.

There are three sub-agent archetypes we see across reference implementations:

Exploration sub-agents have broad read permissions and a discovery-focused prompt. They go find things.
General sub-agents have a standard toolset for normal work.
Verification sub-agents are read-only or very limited write, with a validation-focused prompt. They check work.

The dotzlaw.com pipeline uses this pattern explicitly. The article-researcher (exploration), the article-writer (general), and the article-reviewer (verification) are three sub-agents the team-lead orchestrator dispatches in sequence. The reviewer cannot edit drafts. The writer cannot publish. Each one is bounded to its job, and the orchestrator collects the outputs.

Parallel sub-agents are where this component starts paying for itself. Manus, the research agent, has been reported to run hundreds of web pages in parallel as separate sub-agents, then collect the results [5]. Practitioners building contract-review harnesses at scale describe similar fan-out: dozens of per-clause sub-agents running simultaneously, with the parent orchestrator staying under 10K tokens while the cumulative sub-agent work runs into the hundreds of thousands [5]. That is the kind of fan-out you cannot do in a single thread without melting the context window.

Figure 6 - Parent-and-children orchestration diagram showing a top-level agent dispatching to three labeled sub-agent archetypes (exploration, general, verification) each running in an isolated context with its own restricted tool set, with arrows showing spawn, restrict, and collect.

Figure 6 - Spawn, restrict, collect: The pattern is the same in every harness that scales beyond a single thread. The orchestrator spawns a sub-agent with an isolated context, restricts the tool set to what the task actually needs, and collects the result. The contract-review example above is the structural argument for this component: a lean parent orchestrator holding the strategy, dozens of focused children doing the per-clause analysis. You cannot do that in one conversation; you can do it across many small ones.

5. Built-in skills#

Every harness ships a baseline set of non-negotiable primitives that work out of the box. If your agent cannot read or edit files, it is not a coding agent.

The primitive set:

File operations: read, write, edit, search.
Cell or shell execution: bash, REPLs, scripts.
Code navigation: find references, jump to definition, structural search.

Modern harnesses ship higher-level built-ins on top of those primitives:

How to make a git commit (with a sane message format).
How to open a pull request (with a body, not just a title).
How to run tests and interpret results (parse the failure, tell the user what broke).

These are not optional. The reason your agent feels useful from minute one with Claude Code or Cursor is that someone already wrote those built-ins for you. The reason a from-scratch LangChain agent feels janky for the first three weeks is that you are still wiring the file-edit tool yourself.

One non-obvious detail: built-in skills must use pure standard libraries. No framework dependencies. The agent will run in environments where the developer has not installed your framework, and a built-in skill that breaks on a missing import is no better than no built-in skill at all.

6. Session persistence and memory#

Long agent sessions are stateful, and processes crash. If the harness does not write state to disk, every crash erases the entire session. That is fine for a five-second toy demo and unacceptable for an agent that is running for an hour while you are at lunch.

The modern approach is the boring one, and it is the right one: append-only JSON or Markdown files. Every event the agent generates, every message, every tool call, every tool result, every compaction event, gets written as one line. The append() method opens the file in append mode, writes the event, flushes to disk, and closes. If the process dies on the next line, this one is already safe.

The replay() method reads the file back line by line and reconstructs the session. Because the file is append-only, two concurrent runs can share the log without locking, and you can tail -f the file in another terminal to watch the agent think in real time. We do this often. It is genuinely fun, and you also catch problems early.

Anthropic’s recent managed-agent work explored building session management as a separate service from the harness itself, rather than embedding it inside. That is interesting for very long-running production agents but for most people the embedded append-only-log approach is plenty.

7. System prompt assembly#

The system prompt is not a static string. It is the output of a pipeline that walks ancestor directories looking for specific instruction files (CLAUDE.md, AGENTS.md, and similar) and injects their contents dynamically. That is how Claude Code knows the conventions of your project the moment you open the terminal in the project root.

There is a critical ordering rule here that almost every from-scratch builder gets wrong on the first attempt:

Static content must come first. Dynamically loaded content comes second.

Reverse the order and you break prefix caching. Prompt caching in modern LLM APIs works by hashing the token prefix of your system prompt across calls [8]. Identical prefixes hit the cache, which can be 5x or more cheaper and faster [8]. If your dynamically loaded CLAUDE.md content sits at the front of the prompt, every call has a unique prefix because the file’s modification time has bumped a header somewhere, and you get zero cache hits. Your bills go up, your latency goes up, and you do not understand why because the diff in your system prompt looks tiny. The fix is one line: put the static boilerplate first, dynamic content second, so the cache-hit prefix is as long as possible.

It is a single-line fix, and it routinely cuts inference cost by half on long-running agent sessions.

8. Lifecycle hooks#

The extensibility layer. Hooks let you inject custom logic before or after a tool runs, without touching the harness core.

There are two types and they are not symmetric.

Pre-tool hooks fire before the tool runs. They receive the tool name and tool input. They can do one of three things:

Allow the call to proceed.
Deny the call and return an error to the model.
Modify the call (rewrite arguments, swap one tool for another).

Pre-tool hooks are how you implement organizational policy. “Never run rm -rf against /home.” “Always pipe pytest through this wrapper.” “Block writes to any file matching secrets/*.” These rules live in the hook, not the model. The model is not asked to be trusted with the policy; the harness enforces it.

Post-tool hooks fire after the tool runs. They receive the tool output. They cannot block anything, by design. They are for auditing, logging, observability, and telemetry. If your security team needs an audit trail, this is where it goes.

Figure 8 - Lifecycle hook timing diagram showing a single tool call in the loop with two labeled extension points: a pre-tool hook gate before dispatch (allow, deny, or modify) and a post-tool hook audit after dispatch (log, observe, telemetry only, no block).

Figure 8 - Two hooks, one loop iteration: Pre-tool hooks gate the call before it runs and can allow, deny, or rewrite it. Post-tool hooks audit the call after it runs and cannot block anything. The asymmetry is intentional: enforcement happens up front, observability happens at the back. Both are how you wire organizational policy and audit infrastructure into a harness without touching the core loop.

Hooks are also the mechanism by which enterprises plug harnesses into existing approval workflows and audit infrastructure. The harness does not care what you put in the hook, it just guarantees the hook will fire at the right point in the loop.

9. Permissions and safety#

The component that turns a useful tool into one you will let near production.

Most harnesses use a three-level permission hierarchy:

read: can read files and directories; no modifications.
workspace: can modify files within the session workspace.
full: can execute arbitrary commands and modify any accessible path.

Each tool declares its minimum required permission level. The harness enforces the declared permission at dispatch time, before the tool runs. A read tool simply cannot be invoked under a read permission session if it tries to escalate, and that check is in the harness, not in the model.

The interesting case is bash. The same bash tool can be safe or dangerous depending on what command string the model wants to run. So the harness does dynamic command classification:

Safe commands like ls, cat, grep, pwd get classified as read-only.
Dangerous commands like rm, sudo, shutdown, mkfs jump straight to full access (which the model probably does not have, so the call fails fast).
Everything else lands at the workspace level.

The classifier is not perfect, and that is why the final piece exists: interactive approvals. For any destructive action, the harness can pause and request explicit user approval before execution. You see this in Claude Code as the “do you want to allow this command?” prompt. That gate has to be in the harness, not in the model. A model can be talked out of caution by a sufficiently clever prompt; a hard-coded “ask the human first” check cannot.

Figure 9 - Permission hierarchy diagram with three stacked tiers (read, workspace, full), a side branch showing the dynamic bash classifier routing safe commands to read, dangerous commands to full, and everything else to workspace, plus a final interactive approval gate for destructive actions.

Figure 9 - Three tiers, one classifier, one human gate: The harness checks permission at dispatch time, before the tool runs. The bash tool gets a special path because the same tool can be safe or dangerous depending on the command, so a classifier sorts commands into the right tier on the fly. Anything destructive then hits the interactive approval gate, which lives in the harness on purpose. A clever prompt can talk a model out of caution; it cannot talk a hard-coded “ask the human first” check out of firing.

How the nine components fit Anthropic’s five canonical patterns#

Anthropic published a useful piece of architectural vocabulary called the five canonical patterns [4] for how the model gets called inside a harness. Every production agent we have looked at combines two or more of these.

Prompt chaining: sequential LLM calls where each output feeds the next.
Routing: classify the input, then route it to a specialized sub-handler.
Parallelization: fan out work to multiple concurrent LLM calls.
Orchestrator-workers: one LLM decomposes the work and coordinates multiple worker LLMs.
Evaluator-optimizer loops: a generator and an evaluator in a feedback loop.

Figure 10 - Five-panel diagram showing each of Anthropic's canonical agent patterns as a small flow: prompt chaining (sequential), routing (classifier branches), parallelization (fan-out, fan-in), orchestrator-workers (one parent, many children), and evaluator-optimizer (generator into evaluator into feedback).

Figure 10 - The five canonical patterns: Anthropic’s vocabulary for agent design at the architectural level rather than the prompt level. Most production agents combine two or more of these. Once you can name the pattern you are building, the conversation moves from “what should I tell the model” to “what is the shape of the system that calls the model,” which is the move from Era 2 (context engineering) to Era 3 (harness engineering).

The nine components are how you implement those patterns. The while loop (#1) gives you sequential execution, which is prompt chaining. Sub-agent management (#4) plus tools (#3) give you orchestrator-workers and parallelization. A verification sub-agent (#4) plus a pre-tool hook (#8) give you the evaluator-optimizer pattern. Routing is just a tool selection that happens to use a classifier as its dispatcher.

Memorize the patterns and you have a vocabulary for talking about agent design at the architectural level rather than at the prompt level. That is the move from Era 2 (context engineering) into Era 3 (harness engineering), which is its own subject and the topic of the next article in this series.

Practical advantages of thinking this way#

When you have a working definition and a component map in your head, three things change.

You stop arguing semantics in agent threads. When someone says “harness,” you can ask whether they mean the LangChain kind (framework) or the Claude Code kind (harness). The conversation gets shorter and clearer.
You stop blaming the model when the harness is the problem. The Terminal Bench result is a permanent reminder that 13.7 points of benchmark performance can hide in three middleware components. If your agent is underperforming, the model is rarely the first place to look.
You stop being intimidated by from-scratch builds. Nine components, a handful of files each, mostly under 200 lines of Python apiece. The minimum viable harness for a coding agent fits in an afternoon. The good harnesses are not bigger; they have just had more pruning passes.

The third point is the one we want to leave you with. There is a specific principle that mature harness work converges on, which is that the harness shrinks as the model improves. Components that encoded assumptions about what the model could not do become dead weight when the model can suddenly do them. Manus has reported rewriting their harness repeatedly as model capabilities shifted [5]. Vercel’s engineering team has reported removing roughly 80% of an agent’s tools and getting better results, not worse [6]. Anthropic dropped context resets entirely when Opus 4.6 arrived because the model no longer needed them [7]. We will write that article next, with the historical eras frame and a closer look at the pruning principle.

Figure 11 - Forward-arrow timeline diagram showing three labeled eras of AI engineering with a glowing arrow advancing from Era 1 (prompt engineering) through Era 2 (context engineering) to Era 3 (harness engineering), with the harness era highlighted as the current frontier and a faint trailing label pointing toward the next article in the series.

Figure 11 - Three eras, one direction of travel: The field has moved from prompt engineering through context engineering and landed on harness engineering. The next article in this series tells the historical story behind the arrow and explains why mature harnesses get smaller, not bigger, as the model gets stronger. This article was the definition, the next one is the trajectory.

KEY INSIGHT: A mature harness shrinks as the model improves. The components you cannot remove later are the ones worth building today.

For now, the working knowledge is enough: a harness is a fixed architecture that turns a model into an agent, it has nine components, and it is where most of the wins live in this generation of AI tools.

The Series#

This is Part 1 of the five-part Harness Fundamentals sub-series on Claude Code engineering:

What Is an Agent Harness, Really? Nine Components Most Builders Miss (this article) — vocabulary, components, and why first-party harnesses have a performance head start
Three Eras of AI Engineering: Prompt to Context to Harness — how the discipline moved and what each era absorbed from the one before
The Harness Evolution Principle: Why Mature Harnesses Look Like Pruning — the V1/V2 case study, the Boris anchor, and a practitioner’s pruning playbook
Building Your First Specialized Harness in Python: 9 Components, 12 Design Decisions — hands-on construction of a minimal harness with all nine components mapped to working code
Skills, Slash Commands, and Harnesses: A Discipline Hierarchy — where individual skills fit inside the broader harness and how the three layers interact

References#

[1] LangChain, “Deep Agents on Terminal Bench 2,” LangChain Blog, Feb 2026. https://blog.langchain.com/deep-agents-terminal-bench-2/

[2] Y. Pan et al., “Natural Language Agent Harnesses (NLAH),” Tsinghua University and Harbin Institute of Technology, arXiv preprint arXiv:2603.25723, 2026. https://arxiv.org/abs/2603.25723

[3] D. Lee, N. Nair, P. Zhang, J. Lee, O. Khattab, and C. Finn, “Meta-Harness: Automated Harness Discovery for Coding Agents,” Stanford, MIT, and KRAFTON, arXiv preprint arXiv:2603.28052, 2026. https://arxiv.org/abs/2603.28052

[4] E. Schluntz and B. Zhang, “Building effective agents,” Anthropic Engineering Blog, Dec 2024. https://www.anthropic.com/engineering/building-effective-agents

[5] Anthropic, “Effective harnesses for long-running agents,” Anthropic Engineering Blog, Nov 2025. https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents

[6] Vercel, “Lessons from building an AI agent at Vercel,” Vercel Engineering Blog, 2026. https://vercel.com/blog/lessons-from-building-an-ai-agent

[7] P. Rajasekaran, “Continuous reinvention: A brief history of Claude Code’s development,” Anthropic Engineering Blog, Mar 2026. https://www.anthropic.com/engineering/continuous-reinvention-a-brief-history-of-claude-code

[8] Anthropic, “Prompt caching,” Claude Platform Documentation, 2026. https://docs.claude.com/en/docs/build-with-claude/prompt-caching

[9] Anthropic, “Tool use with Claude,” Claude Platform Documentation, 2026. https://docs.claude.com/en/docs/agents-and-tools/tool-use/overview