Anthropic's 2026 Code with Claude: What Doubled Limits and Infinite Context Mean for Production

The interesting thing about Anthropic’s Code with Claude 2026 conference is that four of the five “announcements” had already shipped. Memory in Managed Agents entered public beta on April 23 with its own announcement (this article skips a dedicated Memory walk-through because the file-system-mount and workspace-scoping design is the part most readers already understand). Dreams hit research preview on April 21. Managed Agents webhooks entered public beta on May 7. The 1M-token context window has been generally available at flat pricing since March 13. Only the doubled rate limits landed days before the conference, on May 6; the conference itself ran the week of May 10. Everything else was a confirmation, not a reveal.

That is the part worth reading carefully. The conference was not a roadmap announcement; it was Anthropic ratifying a roadmap that had already shipped quietly. The headline is “these primitives are now the production surface, plan accordingly.”

Figure 1 - A timeline showing five Anthropic feature ship dates from March to May 2026, with the Code with Claude conference at the right marking the confirmation point rather than the launch point.

Figure 1 - The conference confirmed what had already shipped: Four of the five Code with Claude 2026 announcements had been live for weeks before the conference. The conference was a ratification of the roadmap, not a launch event. That distinction matters for production teams trying to decide what to plan around.

This article walks through Compute, 1M Context, Outcomes, Dreams, and Webhooks with the question a senior engineer actually wants answered: what changes in my production architecture now that this exists, and what stays the same. The customer metrics are flagged where they come from Anthropic’s own benchmarks rather than independent third parties, and the beta-versus-GA distinctions are kept explicit throughout.

Compute as the moat

Four days before the conference, on May 6, Anthropic published a news post titled “Higher usage limits for Claude and a compute deal with SpaceX.” Two things landed together. The first was a compute partnership using all of the capacity at SpaceX’s Colossus 1 data center: more than 220,000 NVIDIA GPUs, over 300 megawatts, online within a month of announcement. The second was a doubling of Claude Code’s five-hour rate limits across the Pro, Max, Team, and Enterprise consumer plans, with the peak-hour throttle that had been in place since late March 2026 lifted entirely.

The cause and effect chain is the part that matters. Anthropic now has the compute floor to support the token-hungry orchestration work that the rest of the conference talked about: multi-agent runs, rubric-graded iteration loops, out-of-band memory consolidation. None of that is economically feasible without the throughput increase backing it.

A small but load-bearing distinction on the rate-limit framing. The word “doubled” applies to the consumer-plan five-hour bucket. On the API side, Anthropic’s news post says rate limits were “considerably” increased for Claude Opus models. Third-party coverage from WebProNews and Slashdot cites “almost 17x” higher API limits for some Opus tiers, which is consistent with the post-doubling tier table at platform.claude.com/docs/en/api/rate-limits. The “17x” framing is third-party, not Anthropic-direct. Use “doubled” for the consumer plan and look up the table for API-side capacity sizing.

Figure 2 - A diagram showing 220,000 GPUs and 300 megawatts at SpaceX Colossus 1 connecting to a stack labeled 'Doubled rate limits, multi-agent orchestration, Outcomes, Dreams' with labeled arrows.

Figure 2 - Compute floor, feature ceiling: The Colossus 1 partnership is not a vanity announcement. The 220K GPUs and 300 MW are the structural backing for everything else the conference confirmed. Without the throughput increase, the new orchestration features would have hit rate-limit walls at production scale.

The orbital-compute reference is an expression of interest, not a commitment. Anthropic’s news post discussed multi-gigawatt orbital data centers as a forward-looking exploration. Nothing is signed for orbit. If you see it quoted as a product roadmap item, it is not one.

What this means for your team is straightforward. The throughput floor moved. Workloads that ran into rate-limit walls in March can be retried in May without architectural change. The interesting question is whether your orchestration logic was the bottleneck or the rate-limit ceiling was. If it was the ceiling, you have headroom now. If it was the orchestration, the headroom does not help, and you are about to discover that.

One million tokens, flatly priced

First, a framing note on the “infinite context window” rhetoric. Lucas, the Anthropic Research PM who gave the Expanding Toolkit keynote, said it directly: “this is how we get much closer to the feeling of an infinite context window that was mentioned at the morning keynote today.” Emphasis his. The technical number is 1,000,000 tokens, available on Claude Mythos Preview (gated research preview, Project Glasswing partners only), Opus 4.7, Opus 4.6, and Sonnet 4.6. Sonnet 4.5 and Haiku 4.5 are still 200K-context models. Nothing is literally infinite.

The 1M context predates the conference. On March 13, Anthropic moved 1M to general availability at standard pricing for both Opus 4.6 and Sonnet 4.6. A 900K-token request now bills at the same per-token rate as a 9K-token request. That is the economic detail that drives everything else.

What feels infinite is the combination of three things. First, the 1M window itself. Second, server-side compaction, which is the platform absorbing the rolling-summary work that every long-running-agent harness used to ship its own version of. Third, context editing, which lets you clear stale tool results (screenshots, search outputs, file reads) per turn while keeping the decisions those results informed. Lucas’s tip from the keynote: “every n turns you actually clear tool results.”
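
A client-side sketch of that tip, assuming the transcript is kept as a plain list of role/content dicts. The platform’s context editing does this server-side, so this is the shape of the work being absorbed, not a call into any API; the "tool" role name and stub text are illustrative assumptions.

```python
# A sketch of "every n turns you actually clear tool results," assuming the
# transcript is a plain list of {"role": ..., "content": ...} dicts. The
# platform's context editing does this server-side; this shows the shape of
# the work being absorbed, not a call into it.

def clear_stale_tool_results(messages: list[dict], keep_last_n: int = 2) -> list[dict]:
    """Stub out tool-result bodies older than the last N tool turns, keeping
    the assistant decisions those results informed."""
    tool_turns = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    stale = set(tool_turns[:-keep_last_n]) if keep_last_n else set(tool_turns)
    return [
        {**m, "content": "[stale tool result cleared]"} if i in stale else m
        for i, m in enumerate(messages)
    ]
```

Run it every few turns before sending the next request; a screenshot-heavy computer-use loop sheds most of its window this way without losing the decision trail.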

Figure 3 - A horizontal bar showing a context window labeled 1,000,000 tokens with three callouts labeled 'flat pricing,' 'server-side compaction,' and 'context editing,' each pointing to a different region of the bar.

Figure 3 - “Infinite context” is three things, not one: The practitioner experience that Anthropic markets as infinite is the combination of the 1M window priced flat across its full length, server-side compaction, and per-turn context editing. None alone is sufficient. Together they reduce most of the context-window pressure long-running agent harnesses had to engineer around in 2025.

KEY INSIGHT: When the platform absorbs your summarization scaffolding, the value of having written a clever summarizer drops to zero. Audit your harness for components whose only job is to compensate for the context window being too small.

For production teams, the architectural question is whether your harness was built around context-window pressure. A lot of 2025 harnesses ship their own RAG layer, per-turn summarization, and cache-breakpoint placement logic. On 1M-context Opus 4.7 with server-side compaction, much of that scaffolding becomes overhead that actively hurts performance, because it is competing with the platform’s version of the same work. The Opus 4.7 tokenizer caveat is the only friction worth flagging: it can use up to 35% more tokens for the same fixed text versus Opus 4.6, so a request that costs $X on 4.6 may cost up to $1.35X on 4.7 for tokens-in alone.

Outcomes: the rubric grader as a primitive

This one is the most interesting of the five if you have been writing your own evaluator harnesses. The transcript I worked from called it an “outcome loop system.” Anthropic’s official name is Outcomes, with the docs page titled “Define outcomes.” The Cookbook entry describes it as “agents that verify their own work.”

Here is what Outcomes actually does. You define a rubric, a markdown document with explicit gradeable criteria. You attach it to a session via a user.define_outcome event. The harness automatically provisions a grader, which is a separate agent in a fresh context window with only the rubric and the artifact. The grader scores each criterion independently. If the artifact does not satisfy the rubric, the original agent revises. Default max_iterations is 3, max is 20.

The architectural detail that matters is the separate context window. Same-context self-critique is biased by the path the agent took to produce the output. A second agent with only the rubric and the artifact has no such bias. This is the adversarial-evaluator pattern, shipped as a platform primitive. If you rolled your own evaluator in 2025, you wrote approximately this code, and yours probably had to fight the bias problem because spinning up a separate context was expensive enough that it usually got skipped.
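
For concreteness, the hand-rolled 2025 shape looked roughly like the sketch below. complete() is a hypothetical stand-in for whatever model call your harness makes, and the PASS/FAIL convention is likewise an assumption, not anything from Anthropic’s docs.

```python
# Approximately the hand-rolled 2025 evaluator pattern that Outcomes now
# ships as a primitive. complete() is a hypothetical stand-in for your model
# client; the PASS/FAIL marking convention is an assumption.

def complete(prompt: str) -> str:
    raise NotImplementedError("swap in your model client here")

def grade(rubric_md: str, artifact: str) -> str:
    # Fresh context on purpose: nothing from the producing run leaks in,
    # which is what removes the self-critique bias.
    return complete(
        "Score the artifact against each rubric criterion independently. "
        "Mark each criterion PASS or FAIL with one sentence of evidence.\n\n"
        f"## Rubric\n{rubric_md}\n\n## Artifact\n{artifact}"
    )

def produce_with_retries(task: str, rubric_md: str, max_iterations: int = 3) -> str:
    artifact = complete(task)
    for _ in range(max_iterations - 1):
        verdict = grade(rubric_md, artifact)
        if "FAIL" not in verdict:
            return artifact
        artifact = complete(f"{task}\n\nRevise the previous attempt to fix:\n{verdict}")
    return artifact
```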

Figure 4 - A diagram showing an agent producing an artifact, the artifact and the rubric being handed to a separate grader agent in a fresh context, the grader returning per-criterion scores, and a feedback loop back to the original agent with a max_iterations counter.

Figure 4 - Outcomes loop with separate-context grader: The agent produces an artifact. The rubric and the artifact are handed to a separate grader agent in a fresh context window. The grader scores per criterion. If the artifact fails, the original agent revises. Default ceiling is 3 iterations; maximum is 20. The separate context is what distinguishes Outcomes from naive same-context self-critique.

```json
{
  "type": "user.define_outcome",
  "description": "Build a DCF model for Costco in .xlsx",
  "rubric": {"type": "text", "content": "# DCF Model Rubric\n..."},
  "max_iterations": 5
}
```

(max_iterations is optional. Default is 3; maximum is 20. The example above uses 5 to make the parameter visible.)

The performance number worth quoting is +10 percentage points task success versus standard prompting loops, with the largest gains on the hardest tasks. File-format breakdown: +8.4% on .docx outputs, +10.1% on .pptx outputs. Outcomes pays back most where the deliverable is structured, not where it is a chat reply. Both numbers come from Anthropic’s internal benchmarks via the Cookbook; they are not independent.

The load-bearing variable for Outcomes quality is rubric authoring. Vague criteria produce noisy graders. The docs are explicit:

Structure the rubric as explicit, gradeable criteria, such as “The CSV contains a price column with numeric values” rather than “The data looks good.” The grader scores each criterion independently, so vague criteria produce noisy evaluations.
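
For a sense of what that looks like in practice, here is an illustrative rubric body for the DCF payload above. The criteria are invented for this article, not taken from the docs.

```python
# Illustrative content for the rubric.content field of the define_outcome
# payload shown earlier. Every criterion is independently checkable; none of
# them reduces to "the data looks good."
RUBRIC = """\
# DCF Model Rubric
- The workbook contains a sheet named "DCF".
- Five projection years are present, and every projected cell is numeric with no formula errors.
- WACC appears in its own labeled cell rather than hard-coded inside formulas.
- A terminal-value row exists and feeds the enterprise-value cell.
- The enterprise-value cell equals the sum of discounted cash flows plus the discounted terminal value.
"""
```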

If you wrote retry-with-grader loops in 2025, the work that compounds is your rubrics, not your loop logic. The loop is now Anthropic’s.

Sessions emit three new event types (span.outcome_evaluation_start, span.outcome_evaluation_ongoing, span.outcome_evaluation_end) with terminal result values satisfied, needs_revision, max_iterations_reached, failed, or interrupted. The event stream is the integration point you wire up in production.
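
A minimal sketch of that wiring. The span event names and terminal result values are from the docs; the flat event-dict shape and both downstream hooks are illustrative assumptions.

```python
# Gate a pipeline on the terminal Outcomes event. The three span.* names and
# the five terminal result values come from the docs; the flat event dict and
# the two downstream hooks are assumptions for illustration.

TERMINAL = {"satisfied", "needs_revision", "max_iterations_reached",
            "failed", "interrupted"}

def publish_artifact(session_id: str) -> None:
    print(f"publish {session_id}")            # stand-in for your release step

def page_reviewer(session_id: str, result: str) -> None:
    print(f"review {session_id}: {result}")   # stand-in for your alerting path

def handle_event(event: dict) -> None:
    if event.get("type") != "span.outcome_evaluation_end":
        return  # _start and _ongoing are progress markers, not decisions
    result = event.get("result", "")
    session_id = event.get("session_id", "")
    if result == "satisfied":
        publish_artifact(session_id)
    elif result in TERMINAL:
        page_reviewer(session_id, result)     # anything non-satisfied gets a human
```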

Dreams: out-of-band memory curation

Naming first. The transcript called this “Dreaming.” Anthropic’s canonical name for the resource is Dreams, with “dreaming” used as the verb form for the process. The docs page is at platform.claude.com/docs/en/managed-agents/dreams. Beta header is dreaming-2026-04-21.

Status: research preview. This is the only one of the five conference primitives still in research preview. Memory, Outcomes, and webhooks are all in public beta; rate limits are live. Dreams is the announcement Anthropic is least sure of, and that affects how to plan around it.

What a dream actually is, verbatim from the docs:

A dream reads an existing memory store alongside past session transcripts, then produces a new, reorganized memory store: duplicates merged, stale or contradicted entries replaced with the latest value, and new insights surfaced.

The input store is never modified, so you can review the output and discard it if you don’t like the result.

The review-before-attach flow is part of the design, not an optional nicety. The output is a candidate memory store; you decide whether to swap it in.

Figure 5 - A diagram showing many parallel agent sessions writing into a shared memory store on the left, an asynchronous Dreams process pulling from the store and the session transcripts in the middle, and a curated output memory store on the right with a manual-review gate.

Figure 5 - Dreams as out-of-band curation: A dream consumes an existing memory store and recent session transcripts, runs asynchronously, and produces a candidate curated store. The input store is never modified. The candidate goes through a review gate before replacing the live store. Up to 100 sessions per dream, supported on Opus 4.7 and Sonnet 4.6 during research preview.

The operational ceiling worth knowing about is up to 100 sessions per dream. Not unlimited continuous learning. If you are running 1,000 agents per day, you do not feed all 1,000 transcripts into a single dream; you feed slices, scheduled appropriately. Supported models during research preview are claude-opus-4-7 and claude-sonnet-4-6 only. The 4,096-character instructions field is enough to bias what the dream focuses on without writing its prompt for it. Billing is at standard API rates.
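
As a scheduling sketch, slicing sessions into a dream might look like the following. The beta header value, supported model, 100-session cap, and 4,096-character instructions limit come from the docs; the endpoint path, JSON field names, and response shape are guesses.

```python
# Sketch of kicking off a dream over a slice of recent sessions. Endpoint
# path and payload field names are hypothetical; the header value, model
# name, session cap, and instructions limit are from the docs.

import requests

def start_dream(api_key: str, memory_store_id: str, session_ids: list[str],
                instructions: str = "") -> dict:
    assert len(session_ids) <= 100, "a dream reads at most 100 sessions"
    assert len(instructions) <= 4096, "instructions field caps at 4,096 chars"
    resp = requests.post(
        "https://platform.claude.com/v1/managed-agents/dreams",  # hypothetical path
        headers={
            "x-api-key": api_key,
            "anthropic-beta": "dreaming-2026-04-21",  # documented beta header
        },
        json={
            "model": "claude-sonnet-4-6",        # one of the two supported models
            "memory_store_id": memory_store_id,  # input store; never modified
            "session_ids": session_ids,          # a slice, not your whole fleet
            "instructions": instructions,        # bias what the dream focuses on
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()  # candidate store: review before swapping it in
```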

For teams running compiled-knowledge layers, the Karpathy wiki pattern, or any version of “we curate the memory store on a separate cadence from the agent’s task,” the primitive caught up to the pattern. The async-batch consolidation shape is correct; Anthropic now sells it as a managed service. Whether you should switch depends on how customized your existing curation logic is, and whether the 100-sessions ceiling fits your access pattern. Without periodic curation the memory store gets stale and bloated and agents repeat each other’s mistakes; Dreams is one way to do that curation, and not the only way.

Managed Agents webhooks

This shipped May 7 in public beta, alongside Outcomes and the multi-agent orchestration features. The naming distinction worth getting right: the platform-API surface is Managed Agents webhooks. The earlier transcript called this “Webhook Support,” which is not the canonical name. There is also a separate consumer-side product called Claude Code Routines that uses cron-and-HTTP-triggered saved configs; that is a different feature.

Eight session events fire as outbound webhooks: session.status_run_started, session.status_idled, session.status_rescheduled, session.status_terminated, session.thread_created, session.thread_idled, session.thread_terminated, and session.outcome_evaluation_ended. The last one is the integration glue between Outcomes and your downstream system: when a grader reaches a terminal result, the platform pushes; your receiver fetches the full session state via the API to read outcome_evaluations[].result.

There are also vault-credential events for credential lifecycle, including vault_credential.refresh_failed. Treat that one as an operational alert, not a normal-path event.

Figure 6 - A grid showing the eight session event names organized into three groups: status lifecycle (run_started, idled, rescheduled, terminated), thread lifecycle (thread_created, thread_idled, thread_terminated), and outcomes (outcome_evaluation_ended), with a separate vault_credential.refresh_failed event called out on the side.

Figure 6 - The eight session events plus vault credentials: Webhooks fire on four status-lifecycle events, three thread-lifecycle events, and one outcome-evaluation event. The session.outcome_evaluation_ended event is the integration glue between Outcomes and the receiver’s downstream system. Vault credential events sit in their own bucket as operational alerts.

The payload shape is the GitHub and Stripe pattern: deliberately thin, containing only the event type and id. Full object details require a callback to the API. Three properties of the delivery semantics are worth committing to memory (the receiver sketch after the list handles all three):

  • At-least-once delivery with retries. Receivers must be idempotent on id. Use the id as a dedupe key in your handler.
  • No ordering guarantees. A session.thread_idled followed by session.thread_terminated may arrive in either order. Receivers that need ordered state should reconstruct from the API on receipt, not from event order.
  • X-Webhook-Signature header with 5-minute replay protection. Verify on every request; reject events older than 5 minutes.
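
A minimal receiver sketch under two assumptions the docs would need to confirm: that X-Webhook-Signature is an HMAC-SHA256 over the raw request body, and that the payload carries a created_at epoch timestamp.

```python
# Receiver sketch honoring all three delivery properties. The HMAC-SHA256
# signing scheme and the created_at field are assumptions; check the docs
# for the actual recipe before shipping.

import hashlib
import hmac
import time

from flask import Flask, abort, request  # any HTTP framework works

app = Flask(__name__)
SECRET = b"your-webhook-secret"   # shared signing secret (assumption)
seen_ids: set[str] = set()        # use a durable store (Redis, DB) in production

@app.post("/claude-webhooks")
def receive():
    # 1. Verify the signature on every request.
    expected = hmac.new(SECRET, request.data, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(request.headers.get("X-Webhook-Signature", ""), expected):
        abort(401)
    event = request.get_json(force=True)
    # 2. Reject events older than the 5-minute replay window.
    if time.time() - event.get("created_at", 0) > 300:
        abort(400)
    # 3. At-least-once delivery: dedupe on id before doing any work.
    if event["id"] in seen_ids:
        return "", 200
    seen_ids.add(event["id"])
    # Thin payload: call back to the API for full session state rather than
    # trusting event arrival order, then dispatch on event type.
    dispatch(event["type"], event["id"])
    return "", 200

def dispatch(event_type: str, event_id: str) -> None:
    print(event_type, event_id)  # stand-in for the API callback + routing
```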

Figure 7 - A side-by-side diagram showing a 'before' panel with an agent session and a polling client that calls GET /sessions/id repeatedly, and an 'after' panel showing the same agent session pushing a webhook to a receiver with a callback arrow back to the API.

Figure 7 - From polling to push: Before webhooks, the only way to know a Managed Agents session reached a terminal state was to poll. For long-running outcome-oriented sessions that may iterate several times, polling was expensive and noisy. Webhooks turn the integration shape from pull-based to push-based. The receiver still calls the API for full state, but only when the platform tells it something happened.

Webhooks is the most production-ready of the conference primitives. It is in public beta with documented signature verification, retry semantics, and a stable event schema. If you are running outcome-oriented sessions that take 10+ minutes to iterate, replacing polling with webhooks is the cleanest production win available out of the conference, with the standard public-beta caveat that the API may evolve before GA.

What the conference confirmed

Step back from the individual announcements and look at the pattern. Five primitives: rate-limit headroom, long-context architecture, retry-with-grader loops, memory consolidation, and agent event push. Every one of them is something production teams hand-rolled in 2025. Every one of them is now part of Anthropic’s product surface.

Lucas, the Anthropic Research PM, put the rule cleanly in his keynote:

Any code that you are writing that is compensating for model unreliability will have a half-life of just months. You should leave that work to us. We will continue to make Claude more reliable and more capable through this expanding toolkit that comes with the model.

The corollary is the part that compounds:

Code that connects your model to your world tends to compound. Your custom tools, your data, your auth, your specific context. The model can’t absorb what it can’t see.

| In 2025 you owned | In 2026 the platform owns |
| --- | --- |
| Tool routers and retry decorators | Native tool selection plus error recovery |
| Per-turn summarization, RAG for memory | Server-side compaction, 1M context |
| VM lifecycle for code execution | Hosted code execution tool |
| Image-scaling math for computer use | Native 1440p screenshots, 1:1 coordinates |
| Bespoke memory abstractions | Memory in Managed Agents (file system) |
| Per-team adversarial evaluators | Outcomes (built-in grader) |
| Cross-session learning loops | Dreams (out-of-band curation) |
| Polling for session state | Managed Agents webhooks |

Figure 8 - A two-column diagram with the left column labeled 'Compensating code' showing items like routers, summarizers, eval harnesses with downward arrows and a 'half-life: months' label, and the right column labeled 'Connecting code' showing custom tools, data, auth, and proprietary context with upward arrows and a 'compounds over time' label.

Figure 8 - Compensating code versus connecting code: Lucas’s organizing rule from the Expanding Toolkit keynote. Code that compensates for model unreliability has a half-life of months. Code that connects the model to your data, auth, tools, and proprietary context tends to compound. Mature harness work is mostly subtraction on the left side and investment on the right side.

The conference made the timeline visible. Every harness component compensating for model unreliability is on a depreciation schedule whether you do anything about it or not. If the model absorbs the scaffolding on a months-scale clock, your engineering effort needs to spend itself somewhere that compounds.

KEY INSIGHT: Audit every harness component against the question “what model limitation does this compensate for, and is that limitation still real?” Most components fail that audit on the next model upgrade. The ones that survive are the ones connecting the model to something only you have.

Practitioner caveats

Three things look smaller than they are.

First, almost none of this is GA. Webhooks, Memory, and Outcomes are all in public beta, and Dreams is in research preview; only the rate-limit changes and the 1M context window are fully live. Plan deployment accordingly. Beta APIs change. Research-preview APIs change a lot. If your production architecture depends on Dreams behavior staying exactly as documented today, you are taking on schedule risk that does not show up on a feature checklist.

Second, the doubled rate limits are tier-dependent and asymmetric. The “doubled” framing applies to the Claude Code consumer-plan five-hour bucket. On the API side, the increase is “considerable” but not uniform across tiers; the actual numbers are in the post-doubling tier table at platform.claude.com/docs/en/api/rate-limits. WebProNews and Slashdot’s “almost 17x” framing applies to specific Opus tiers, not to all of them. If you are sizing capacity for a specific workload, look up your actual tier; do not assume 2x or 17x as a uniform multiplier.

Third, the customer metrics from the conference are early-customer datapoints from one vendor’s marketing, not independent benchmarks. Mahes cited “Rocketon 90%” in his keynote; third-party Wisedocs coverage names different customers (Netflix, Rakuten, Wisedocs, Ando) with different metrics for Memory specifically. Treat them as directional. The +10pp Outcomes data point is the most rigorously sourced of the bunch (Anthropic’s internal benchmarks via the Cookbook), and even that is a vendor-internal number.

The honest production read:

  • Webhooks are deployable today.
  • Outcomes is deployable for structured-deliverable workloads if you treat rubric authoring as a first-class deliverable.
  • 1M context plus flat pricing means you can audit your harness for context-pressure scaffolding that no longer earns its keep.
  • Dreams is worth piloting on a non-critical memory store while it is in research preview.
  • Memory adoption is mostly a question of whether the workspace-scoped directory model fits your existing access patterns.

Where this lands for production teams

The five primitives confirmed at Code with Claude 2026 are the production surface for the next 12 months. The harness work that compounds is the connecting layer (your data, your auth, your tools, your specific context), not the compensating layer the platform now owns.

If your team is rethinking a Claude Code workflow for a real codebase or building production AI on a legacy platform, that is the kind of work we do at Dotzlaw Consulting (multi-agent pipelines, knowledge-harvest loops, production rubrics that survive contact with reviewers who actually read them). Reach out via the contact page if a conversation about your harness would be useful.
