From Agentic RAG to Compiled Knowledge: Why Karpathy's Wiki Idea Is Spreading

The “RAG is dead” headlines have been running for about twelve months and most of them are noise. So when Pinecone (the company that arguably defined the RAG era, with 800,000+ active developers and 9,000+ paying customers running their AI on the platform) published a blog post called Better Models Won’t Save Your Agent, we paid attention.

Figure 1 - Hero diagram showing the architectural shift from agentic RAG retrieval loops to a compiled knowledge layer that the agent queries once.

Figure 1 - The Compiled Knowledge Pivot: Agentic RAG wraps retrieval in an agent loop that searches, evaluates, and re-searches until it has enough context. The compiled-knowledge pattern moves that work upstream. The agent stops rediscovering knowledge on every request and starts reading from a pre-synthesized artifact that has already done the cross-referencing. This is the architectural shift four independent teams converged on in the spring of 2026.

Pinecone’s two May 4 blog posts (the framing post Pinecone Nexus: The Knowledge Engine for Agents and the engineering post Better Models Won’t Save Your Agent) say the quiet part out loud. Roughly 85% of an agent’s effort goes to knowledge retrieval, per the framing post [1]. Task completion rates sit at 50-60% in production [1]. Token costs run away. Agentic RAG, the current standard, has fundamental problems that better models alone cannot fix. And Pinecone is hardly the only team that has noticed. Andrej Karpathy posted a gist a month earlier describing the same architectural argument as a personal LLM wiki [3]. Microsoft Fabric IQ [6] and Google Cloud Knowledge Catalog [7] land in the same neighborhood from different starting points. Four implementations of the same idea, shipping at the same time, is not a coincidence. It is a pattern.

This article is a tour through that pattern: what agentic RAG actually does, the three failure modes Pinecone is reacting to, the wiki gist Karpathy wrote, the Knowledge Engine Pinecone shipped, and the question every team running AI agents in production should be asking by the end of 2026.

Where agentic RAG sits today

Naive RAG retrieves chunks based on the incoming question and stuffs them into the prompt. It works for trivial lookups and not much else. The current standard is agentic RAG, sometimes called RAG 2.0, which wraps retrieval in an agent loop. The agent decides when to query, evaluates what came back, decides whether to query again with a different strategy, and repeats until it is satisfied. Layered on top: query decomposition, query expansion, hybrid retrieval mixing vector search with BM25, re-ranking. Real improvements over naive RAG. Real complexity.

Figure 2 - Diagram showing the agentic RAG loop: question goes to the agent, which queries the retriever, evaluates the chunks returned, and either stops or loops back to query again, often producing seven or eight passes per question.

Figure 2 - The Agentic RAG Retrieval Loop: Each pass through the loop costs tokens and latency. The agent rarely stops after one query because it does not know what it does not know. Pinecone’s KRAFTBench measurements averaged 7.77 passes per question on a 150-question SEC 10-K test set [2]. The retrieval loop is doing the bulk of the work, not the reasoning model on top of it.
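The loop described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the corpus, the keyword-matching `retrieve` function, and the `is_sufficient` check are all hypothetical stand-ins for a vector store and for the LLM's own judgment about whether it has enough context.

```python
from dataclasses import dataclass, field

# Toy corpus standing in for a vector store; the contents are invented.
CORPUS = {
    "revenue": "FY2024 revenue was reported in the income statement.",
    "risk": "Risk factors include supply chain concentration.",
    "segment": "Segment revenue is broken out by geography.",
}

@dataclass
class LoopResult:
    chunks: list = field(default_factory=list)
    passes: int = 0

def retrieve(query: str) -> list:
    """Hypothetical retriever: keyword match instead of vector search."""
    return [text for key, text in CORPUS.items() if key in query.lower()]

def is_sufficient(chunks: list, needed: int) -> bool:
    """Stand-in for the LLM deciding whether it has enough context."""
    return len(chunks) >= needed

def agentic_rag(question: str, rewrites: list, needed: int = 2, max_passes: int = 8) -> LoopResult:
    result = LoopResult()
    queries = [question] + rewrites  # a real agent would generate the rewrites itself
    for query in queries[:max_passes]:
        result.passes += 1  # every pass costs tokens and latency
        for chunk in retrieve(query):
            if chunk not in result.chunks:
                result.chunks.append(chunk)
        if is_sufficient(result.chunks, needed):
            break
    return result

result = agentic_rag(
    "What was revenue?",
    rewrites=["revenue by segment", "risk to revenue"],
)
print(result.passes, len(result.chunks))
```

Even in this toy version the first query is not enough, so the loop runs a second pass before stopping. The stopping rule is the part a real agent has to guess at, which is where the extra passes come from.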

Agentic RAG is not the wrong tool. For exploratory agents on long-tail question spaces it is still the right answer. The argument Pinecone is making is more specific: when your question patterns are predictable and your reliability targets are high, the loop itself is the problem.

The three failure modes

Pinecone’s framing post outlines three.

Non-deterministic. Ask the same question ten times and the agent may use ten different retrieval strategies. With ten or thirty tools available, the agent does not consistently pick the right ones. The retrieval decisions are generated rather than determined. More agent complexity creates more decision points, which in turn create more variance. For compliance, financial, or legal workflows, “it works most of the time” is not acceptable. Businesses cannot audit inconsistent outputs.

Reasoning cannot compensate for poor retrieval. If the underlying chunks are unreliable, no amount of intelligence on top fixes that. A smarter model with better reasoning will use the unreliable retrieval with higher confidence, not avoid it. Pinecone calls this the “ten blue links” era of agentic retrieval. Vector search returns ten independent chunks. You can re-rank them, expand the query, run hybrid retrieval. None of it helps if the relevant information is not in the candidate chunks. Web search transitioned from ranked links to direct answers years ago. Agentic retrieval has not.

Token blowout and latency. Every loop iteration costs tokens and time. Every sub-agent spawn, every reflection, every tool call. Pinecone built a benchmark called KRAFTBench (Knowledge Retrieval Assessment Framework for Text) on 493 S&P 500 SEC 10-K filings totaling around 245MB, against 150 hard questions covering nine sectors and ten financial topics [2]. The constraints were 120 seconds and 1M tokens per question. The numbers are striking.

Figure 3 - Bar chart comparing average tokens per question across three approaches: Pinecone Nexus at 6,733, agentic RAG at 49,103, and a coding agent at 528,301, on the same 150-question SEC 10-K benchmark.

Figure 3 - KRAFTBench Token Comparison: Same 150 hard questions, same SEC 10-K corpus, three retrieval architectures. Agentic RAG averaged 49,103 tokens and 7.77 steps per question with 98.7% completion [2]. The same task wrapped in a coding-agent harness averaged 528,301 tokens and 14.77 steps with 62.7% completion [2]. Pinecone Nexus, querying its compiled knowledge layer, averaged 6,733 tokens and 1.69 steps with 100% completion [2].

The wow stat in that figure is not the Nexus column. It is the gap between agentic RAG and the coding agent: 49,103 versus 528,301 tokens for the same SEC 10-K question. The coding agent is doing more work (file loading, sub-agent context management, tool routing) and burning roughly an order of magnitude more tokens to do it. Both are reasonable architectural choices. Both are also paying a substantial tax for treating retrieval as an open-ended exploration problem on every single query.
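The gaps are easier to compare as ratios. The short calculation below uses only the three per-question token averages quoted above; the helper names are ours.

```python
# Average tokens per question, as quoted from the KRAFTBench comparison above.
TOKENS = {"nexus": 6_733, "agentic_rag": 49_103, "coding_agent": 528_301}

def ratio(a: str, b: str) -> float:
    """How many times more tokens approach `a` uses than approach `b`."""
    return TOKENS[a] / TOKENS[b]

def reduction_pct(baseline: str, candidate: str) -> float:
    """Percentage of the baseline's tokens that the candidate saves."""
    return 100 * (1 - TOKENS[candidate] / TOKENS[baseline])

coding_vs_rag = ratio("coding_agent", "agentic_rag")
rag_vs_nexus = ratio("agentic_rag", "nexus")
nexus_saving = reduction_pct("agentic_rag", "nexus")
print(f"{coding_vs_rag:.1f}x, {rag_vs_nexus:.1f}x, {nexus_saving:.0f}%")
```

On these averages the coding agent uses about 10.8 times the tokens of agentic RAG, agentic RAG uses about 7.3 times the tokens of Nexus, and Nexus saves about 86% relative to agentic RAG.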

KEY INSIGHT: By Pinecone’s estimate, roughly 85% of an agent’s effort goes to knowledge retrieval, not reasoning [1]. If your agents feel slow and expensive, the retrieval loop is almost certainly where the budget is going. Compiled artifacts are how you reclaim it.

The convergent answer

A month before Pinecone’s blog posts, on April 4, Karpathy posted a gist titled LLM Wiki on his personal GitHub [3]. It went viral within a week. The argument is short and unsentimental.

The wiki is a persistent, compounding artifact. The knowledge is compiled once and then kept current, not re-derived on every query.

Karpathy is making the same architectural argument Pinecone is making, with different vocabulary. Stop forcing the LLM to rediscover knowledge from scratch on every question. Compute the synthesis once at ingestion time. Store the result in a queryable form. When the agent needs knowledge, it retrieves the pre-synthesized artifact instead of doing fresh retrieval-and-synthesis.

This is the move from query-time reasoning to build-time reasoning. The expensive work happens once per source update, not once per question.
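A rough break-even calculation makes the trade concrete. The sketch below assumes a one-off compile cost and a fixed per-query cost for each approach. The per-query figures are the KRAFTBench averages quoted earlier in this article; the 5,000,000-token compile cost is an invented number used purely for illustration.

```python
import math
from typing import Optional

def break_even_queries(compile_tokens: int, loop_tokens: int, compiled_tokens: int) -> Optional[int]:
    """Number of queries needed before the one-off compile cost has been recovered."""
    saving_per_query = loop_tokens - compiled_tokens
    if saving_per_query <= 0:
        return None  # compiling never pays back if each query is not cheaper
    return math.ceil(compile_tokens / saving_per_query)

# 49,103 and 6,733 are the benchmark averages; 5,000,000 is a hypothetical compile cost.
queries_needed = break_even_queries(
    compile_tokens=5_000_000,
    loop_tokens=49_103,
    compiled_tokens=6_733,
)
print(queries_needed)
```

The same function also shows the failure case: if sources change so often that the compile cost recurs before the break-even point is reached, the build-time approach loses.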

Karpathy’s three layers and three workflows

Karpathy’s gist describes the wiki as a three-layer structure with three operational workflows. It is small enough that one motivated developer can build it on top of any agent harness, and Karpathy explicitly says he runs his own research wiki this way at substantial scale.

Figure 4 - Three-layer architecture diagram of Karpathy's LLM Wiki: raw immutable sources at the bottom, the LLM-owned wiki in markdown at the middle, and a CLAUDE.md schema at the top defining structural conventions.

Figure 4 - The Three-Layer Wiki: Raw sources are immutable. The LLM reads them but never modifies them. The wiki layer is LLM-owned: summary pages, entity pages, concept pages, and syntheses, all in markdown. Each new source touches ten to fifteen wiki pages on ingest. The schema layer is a CLAUDE.md (or equivalent) configuration that defines naming conventions, page structures, and the operational workflows the LLM must follow when maintaining the wiki. The top layer is what makes the whole thing run unattended.

The three workflows give the wiki its lifecycle. Ingest: a new source arrives, the LLM reads it, writes a summary page, updates the index, revises relevant entity and concept pages, flags any contradictions it noticed, and logs the action. Query: the LLM searches relevant wiki pages, synthesizes an answer with citations, and may file new pages as discoveries. Every query is a chance to compound the artifact. Lint: a periodic health check that finds contradictions, stale claims superseded by newer sources, orphan pages, and knowledge gaps.

Figure 5 - Cycle diagram showing the three operational workflows of the wiki - Ingest, Query, and Lint - feeding back into the wiki itself so explorations and corrections compound over time.

Figure 5 - Ingest, Query, Lint: Ingest feeds new sources in. Query reads from the wiki and may file discoveries back as new pages. Lint runs periodically to find contradictions and stale entries. The cycle is what turns a flat directory of markdown into something that gets better the more it is used.

There are two pieces of infrastructure that hold the wiki together: an index.md content catalog and a log.md append-only audit trail. Karpathy frames these as the bookkeeper layer. They are what make the wiki navigable and auditable when a single LLM session has updated dozens of pages and forgotten which.
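To make the bookkeeping concrete, here is a minimal sketch of the ingest and lint workflows over a directory of markdown files. It is not Karpathy's code. The `summarize` function stands in for the LLM call, the `[[page]]` link format and the log line format are assumptions, and the query workflow is omitted.

```python
import tempfile
from datetime import date
from pathlib import Path

def summarize(text: str) -> str:
    """Stand-in for the LLM call: keeps only the first sentence."""
    return text.split(".")[0].strip() + "."

def ingest(wiki: Path, source_name: str, text: str) -> Path:
    """Write a summary page, register it in index.md, and append to log.md."""
    page = wiki / f"{Path(source_name).stem}.md"
    page.write_text(f"# {source_name}\n\n{summarize(text)}\n")
    with (wiki / "index.md").open("a") as index:
        index.write(f"- [[{page.stem}]]\n")
    with (wiki / "log.md").open("a") as log:
        log.write(f"{date.today().isoformat()} ingest {source_name} -> {page.name}\n")
    return page

def lint(wiki: Path) -> list:
    """Return pages that exist on disk but are missing from index.md (orphans)."""
    index_path = wiki / "index.md"
    indexed = index_path.read_text() if index_path.exists() else ""
    bookkeeping = {"index.md", "log.md"}
    return sorted(
        p.name for p in wiki.glob("*.md")
        if p.name not in bookkeeping and f"[[{p.stem}]]" not in indexed
    )

with tempfile.TemporaryDirectory() as tmp:
    wiki = Path(tmp)
    ingest(wiki, "paper-a.txt", "Compiled knowledge moves work upstream. More detail follows.")
    (wiki / "stray-note.md").write_text("# stray\n")  # written outside the ingest workflow
    orphans = lint(wiki)
    log_lines = (wiki / "log.md").read_text().splitlines()
    print(orphans, len(log_lines))
```

The page written outside the ingest workflow is the only one lint reports, because it never made it into index.md. That is the kind of drift the bookkeeper layer exists to catch.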

KEY INSIGHT: The cross-references are already there. The contradictions have already been flagged. The synthesis already reflects everything you’ve read. That is the entire pitch for compiled knowledge in three sentences.

The honest weakness: summaries are lossy. LLM-written synthesis can compound errors over time. You are treating compile-time synthesis as ground truth, and that synthesis can drift from the actual sources as documents update. Karpathy acknowledges this directly. The wiki is only as good as the synthesis quality. It is not a substitute for the raw sources, only a faster path to answering most of the questions you will actually ask.

Pinecone Nexus: the production-scale version

Pinecone Nexus is the same idea built for enterprise. The Pinecone framing post is blunt: “Web search already transitioned from ranked links to direct answers. Knowledge infrastructure needs the same leap” [1]. Nexus is what they think the leap looks like.

Figure 6 - Architecture diagram of Pinecone Nexus showing the four primitives - Artifact, Context, Knowledge, Knowledge Engine - with the Context Compiler operating at build time and KnowQL serving structured queries at query time.

Figure 6 - Pinecone Nexus Components: The Context Compiler is an autonomous coding agent that runs at build time, not query time. It consumes data sources (CRMs, vector databases, document stores) and produces typed governed artifacts. KnowQL is the declarative query language agents use at query time. It is structured like SQL, not procedural. The agent sends one structured query with intent, filtering, scope, and response shape. It receives one structured typed response. No retrieval loop.
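The one-request, one-response shape described above can be illustrated with plain data classes. This is not KnowQL syntax, which is not reproduced here. The class names, field names, artifact keys, and values below are all hypothetical, chosen only to mirror the four parts the caption lists: intent, filtering, scope, and response shape.

```python
from dataclasses import dataclass

@dataclass
class KnowledgeQuery:
    intent: str            # what the agent wants, e.g. "total_revenue"
    filters: dict          # narrowing conditions, e.g. {"fiscal_year": 2024}
    scope: str             # which entity or collection to look in
    response_shape: tuple  # the fields the agent wants back

@dataclass
class KnowledgeResponse:
    found: bool
    fields: dict
    source_refs: tuple

# Artifacts built ahead of time, keyed by (scope, intent, fiscal_year). Values are invented.
ARTIFACTS = {
    ("ACME", "total_revenue", 2024): {
        "value_usd_m": 1200,
        "currency": "USD",
        "source_refs": ("acme-10k-2024#item8",),
    },
}

def answer(query: KnowledgeQuery) -> KnowledgeResponse:
    """Single lookup against precompiled artifacts: no loop, no re-querying."""
    key = (query.scope, query.intent, query.filters.get("fiscal_year"))
    artifact = ARTIFACTS.get(key)
    if artifact is None:
        return KnowledgeResponse(found=False, fields={}, source_refs=())
    fields = {name: artifact[name] for name in query.response_shape if name in artifact}
    return KnowledgeResponse(found=True, fields=fields, source_refs=artifact["source_refs"])

hit = answer(KnowledgeQuery("total_revenue", {"fiscal_year": 2024}, "ACME", ("value_usd_m",)))
miss = answer(KnowledgeQuery("total_revenue", {"fiscal_year": 2023}, "ACME", ("value_usd_m",)))
print(hit.fields, miss.found)
```

The miss matters as much as the hit. A query outside what was compiled returns nothing in one round trip, instead of triggering further search.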

The architectural mapping to Karpathy’s wiki is almost line-for-line. The Context Compiler plays the role of the CLAUDE.md schema plus the bookkeeper, building artifacts that match the agent’s anticipated questions. The artifacts themselves play the role of the wiki markdown pages. Where Karpathy uses natural language and a log.md, Pinecone uses typed schemas and a query language. Same pattern, different ergonomics.

The constraint that defines Nexus is the eval set. For the Context Compiler to build useful artifacts, you have to define a representative set of tasks with known right answers up front. Without an eval set, the compiler has no target to optimize against. With one, it produces governed artifacts the agent can query in a single round trip.
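An eval set in this sense is just a list of tasks with known answers, plus a way to score a candidate against it. The sketch below is a generic illustration of that idea, not Pinecone's format. The task contents and the dictionary standing in for a compiled artifact are invented.

```python
from dataclasses import dataclass

@dataclass
class EvalTask:
    question: str
    expected: str

def score(eval_set, answer_fn):
    """Return (accuracy, failed_questions) for a candidate answer function."""
    failures = [task.question for task in eval_set if answer_fn(task.question) != task.expected]
    accuracy = 1 - len(failures) / len(eval_set)
    return accuracy, failures

# Hypothetical tasks with known right answers, defined before anything is compiled.
EVAL_SET = [
    EvalTask("acme revenue fy2024", "1200"),
    EvalTask("acme headcount fy2024", "5400"),
    EvalTask("acme auditor fy2024", "Example LLP"),
    EvalTask("acme segments fy2024", "3"),
]

# A candidate artifact that covers three of the four tasks.
candidate_artifact = {
    "acme revenue fy2024": "1200",
    "acme headcount fy2024": "5400",
    "acme auditor fy2024": "Example LLP",
}

accuracy, failures = score(EVAL_SET, candidate_artifact.get)
print(accuracy, failures)
```

The failure list is the useful output: it names the question patterns the next compile pass still has to cover.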

KEY INSIGHT: Nexus is not a magic retrieval upgrade. It is a build-time investment that pays off when your question patterns are stable. Known patterns get cheap. The long tail gets impossible.

That tradeoff defines the entire compiled-knowledge family. Pinecone’s framing of the payoff is “up to 90% reduction in token usage” [2] on the question patterns covered by the eval set. Pinecone’s framing of the cost is, implicitly, that a question nobody anticipated at compile time has no artifact to answer it. You move the work from query time to build time. You also move the failure mode from “my agent is slow” to “my agent does not know.”

Microsoft and Google: the same idea, different bet

Pinecone and Karpathy are on the synthesis side of the family. Both compile artifacts from sources. Both accept lossy LLM synthesis as the cost of getting to a single-round-trip query. Microsoft and Google are on the other side.

Microsoft Fabric IQ is a compiled ontology that sits above raw data inside Microsoft Fabric. The shape of the ontology is defined upfront and bound to specific data sources. The agent asks a question, the ontology pushes the query down to the underlying data sources, and the result comes back through the ontology. Less LLM-driven. More governed. No LLM-synthesis risk because Microsoft is not synthesizing.

Google Cloud Knowledge Catalog, announced at Google Cloud Next ’26, is a continuously refreshed semantic layer over data sources. Agents access it via MCP, which lets them navigate the data without rediscovering structure on every query.

Figure 7 - Convergence diagram plotting four implementations on two axes: synthesis (Karpathy and Pinecone) versus view-on-source (Microsoft and Google), and personal scale (Karpathy) versus enterprise scale (Pinecone, Microsoft, Google).

Figure 7 - Four Implementations, Two Bets: Karpathy’s wiki and Pinecone Nexus build synthesized artifacts and accept lossy LLM synthesis. Microsoft Fabric IQ and Google Cloud Knowledge Catalog expose semantic views that push queries down to the source. The two camps have different fidelity profiles, different update costs, and different governance stories. They share the architectural claim that motivated all four products: compile knowledge upstream, query a structured artifact downstream, stop making the agent rediscover everything on every query.

The fidelity tradeoff is real. Microsoft and Google avoid LLM-synthesis risk because they push queries to the source. Pinecone and Karpathy accept LLM-synthesis risk in exchange for query-time speed and a query interface that does not require deep familiarity with the underlying schemas. Neither is obviously right. Both are reasonable bets, and the choice depends on whether your governance team will accept LLM-synthesized artifacts as evidence.
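One part of that tradeoff, freshness, fits in a few lines. The sketch below is a deliberately simplified illustration with invented data. It contrasts an artifact copied out of the source at build time with a view that reads the source at query time. It does not model the lossiness of LLM synthesis itself, and it resembles neither vendor's implementation.

```python
source = {"acme_headcount": 1200}  # stands in for the system of record

def compile_artifact(src: dict) -> dict:
    """Synthesis bet: derive a statement from the source once, at build time."""
    return {"acme_headcount_summary": f"ACME employs {src['acme_headcount']} people."}

def view_on_source(src: dict):
    """View bet: keep only a mapping and read the source at query time."""
    return lambda: f"ACME employs {src['acme_headcount']} people."

artifact = compile_artifact(source)
live_view = view_on_source(source)

source["acme_headcount"] = 1350  # the source changes after compilation

stale = artifact["acme_headcount_summary"]
fresh = live_view()
print(stale)
print(fresh)
```

The compiled artifact keeps reporting the old figure until it is rebuilt, while the view always pays the cost of reaching the source. Which of those is acceptable is the governance question raised above.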

KEY INSIGHT: Four teams, two bets, one architectural claim. When Karpathy, Pinecone, Microsoft, and Google ship the same architectural shift in the same six-week window from different starting points, that is the signal worth paying attention to.

When compiled knowledge actually makes sense

The interesting question is not “should we adopt this.” It is “where in our stack does this pay back the build-time cost.” Here is how we think about it.

Figure 8 - Decision matrix splitting workloads into two columns: known repeatable question patterns where compiled knowledge wins, versus exploratory long-tail patterns where agentic RAG remains the right tool.

Figure 8 - When To Compile And When Not To: Compiled knowledge wins on stable, repeatable, governance-heavy, high-frequency question patterns. Agentic RAG keeps winning on exploratory, long-tail, rapidly-changing, or low-volume work. Most production stacks will end up running both layers, with the compiled layer absorbing the high-volume known patterns and the agentic layer handling the remainder. The decision is not which architecture to bet on. It is which workload belongs in which layer.

Compiled knowledge is a good fit when:

You have a large volume of data sources that need a unified view.
Your task patterns are repeatable: you know the questions that will be asked thousands of times.
Your governance requirements are tight enough that field-level citations beat LLM-cited paragraphs.
The cost of repeated retrieval at scale is starting to dominate the cost of running the agent.

It’s a poor fit when:

Your agents are exploratory and the question patterns are not knowable in advance.
Your query space has a long tail. You cannot build artifacts for every possible question.
Your sources change rapidly and recompilation costs would exceed retrieval savings.
Your dataset is small enough that the overhead is not worth the engineering investment.

The pragmatic progression: start with agentic RAG because the setup cost is lower and the coverage is broader. Improve retrieval with re-ranking, query decomposition, and hybrid retrieval. If you still cannot reach your reliability target, build compiled knowledge artifacts for the highest-frequency question patterns and leave the long tail on agentic RAG. You almost never want to pick one and abandon the other. You want both layers, with the compiled layer absorbing the high-volume known patterns and agentic RAG handling the exploratory remainder.
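A two-layer setup of this kind needs a router in front of it. The sketch below shows one minimal way to do that, assuming exact matching on a normalized question string. The compiled entries and the fallback function are hypothetical, and a production router would more likely match on intent than on literal text.

```python
from typing import Callable, Optional

# Hypothetical compiled answers for high-frequency question patterns.
COMPILED = {"total revenue fy2024": "USD 1,200M (acme-10k-2024#item8)"}

def normalize(question: str) -> str:
    """Lowercase, drop question marks, and collapse whitespace."""
    return " ".join(question.lower().replace("?", "").split())

def compiled_lookup(question: str) -> Optional[str]:
    return COMPILED.get(normalize(question))

def route(question: str, fallback: Callable[[str], str]) -> tuple:
    """Try the compiled layer first; hand the long tail to the agentic loop."""
    hit = compiled_lookup(question)
    if hit is not None:
        return "compiled", hit
    return "agentic_rag", fallback(question)

def fake_agentic_rag(question: str) -> str:
    """Stand-in for the retrieval loop."""
    return f"(searched for: {question})"

known = route("Total revenue  FY2024?", fake_agentic_rag)
long_tail = route("Why did margins fall in Q3?", fake_agentic_rag)
print(known[0], long_tail[0])
```

Logging which layer served each question also tells you which long-tail patterns have become frequent enough to be worth compiling.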

The DIY route

If your artifact structure is well-understood, you do not need Nexus or any branded knowledge engine. The minimum viable compiled-knowledge layer is a Postgres table with a JSONB column, a batch job that populates it from your sources, and delta updates as documents change. The query language is SQL with appropriate indexes. There is no LLM in the query path. The synthesis risk is whatever your batch job’s logic introduces.
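The shape of that batch job can be sketched without a database server. The example below swaps in SQLite from the Python standard library for Postgres and stores the artifact as JSON text where Postgres would use a JSONB column, so it shows the pattern and not the production setup. The table layout, the `build_artifact` logic, and the hash-based change detection are all assumptions.

```python
import hashlib
import json
import sqlite3

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def build_artifact(doc_id: str, text: str) -> dict:
    """Deterministic stand-in for whatever extraction the batch job performs."""
    return {"doc_id": doc_id, "word_count": len(text.split()), "title": text.split(".")[0]}

def sync(conn: sqlite3.Connection, sources: dict) -> int:
    """Rebuild artifacts only for sources whose content changed; return the rebuild count."""
    rebuilt = 0
    for doc_id, text in sources.items():
        digest = content_hash(text)
        row = conn.execute(
            "SELECT source_hash FROM artifacts WHERE doc_id = ?", (doc_id,)
        ).fetchone()
        if row and row[0] == digest:
            continue  # unchanged source: skip (this is the delta update)
        conn.execute(
            "INSERT OR REPLACE INTO artifacts (doc_id, source_hash, body) VALUES (?, ?, ?)",
            (doc_id, digest, json.dumps(build_artifact(doc_id, text))),
        )
        rebuilt += 1
    conn.commit()
    return rebuilt

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE artifacts (doc_id TEXT PRIMARY KEY, source_hash TEXT NOT NULL, body TEXT NOT NULL)"
)

sources = {"a": "Alpha report. Body text.", "b": "Beta report. Body text."}
first = sync(conn, sources)   # initial load builds every artifact
sources["b"] = "Beta report, revised. Body text."
second = sync(conn, sources)  # only the changed document is rebuilt
body = json.loads(conn.execute("SELECT body FROM artifacts WHERE doc_id = 'b'").fetchone()[0])
print(first, second, body["title"])
```

The query path is a plain SELECT with no model call in it, so any synthesis error has to come from `build_artifact`. That keeps the place to audit small.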

We will walk through exactly that pattern in Build Your Own Compiled Knowledge Engine in Postgres, shipping in a few weeks. For now, the takeaway is that the architectural pattern is genuinely simple. The branded products are wrapping a small idea in a lot of operational polish. You can build the pattern without buying the polish, as long as you accept that you will own the bookkeeper layer yourself.

A footnote: this just became a production primitive

While we were writing this article, Anthropic shipped Memory in Managed Agents to public beta on April 23 and Dreams as a research preview on April 21. Memory is a filesystem-mounted persistence layer the agent reads and writes with bash and code-execution tools. Dreams is a scheduled background pass that consolidates the memory store between sessions, the way a search index gets rebuilt overnight. The Karpathy wiki idea is no longer just an architectural argument from a respected practitioner or a competitive bet from one vector database company. It is a production primitive on a frontier model lab’s platform, in the same six-week window the rest of the convergence happened in.

Figure 9 - Timeline diagram showing the six-week convergence window from April 4 to May 14, 2026.

Figure 9 - The Six-Week Window: April 4, Karpathy posts the LLM Wiki gist. April 21, Anthropic ships Dreams as a research preview in the Managed Agents API. April 23, Anthropic ships Memory in Managed Agents to public beta. May 4, Pinecone opens Nexus early access alongside the Better Models Won’t Save Your Agent benchmark. May 14, this article ships. Microsoft Fabric IQ and Google Cloud Knowledge Catalog land in the same neighborhood from Ignite and Cloud Next. When this many independent teams ship the same architectural shift inside a six-week window, the question is not whether the pattern is real. It is when it shows up in your stack.

We will cover Memory and Dreams in depth in Anthropic’s Memory and Dreaming Primitives on July 2. What matters here is that the trend has crossed from theory into a vendor-shipped product on the platform a substantial number of production agents already run on. If your team is making architectural bets in 2026, the compiled-knowledge layer is no longer a thing you experiment with on your own. It is a thing the major labs are productizing in real time.

The Series

This is Part 1 of the three-part Compiled Knowledge sub-series:

From Agentic RAG to Compiled Knowledge: Why Karpathy’s Wiki Idea Is Spreading (this article) — the architectural case for moving from query-time retrieval loops to compile-time synthesis, and the four-team convergence in spring 2026
Build Your Own Compiled Knowledge Engine in Postgres — the practical walkthrough: one artifacts table, three indexes, JSONB plus pgvector plus tsvector, no vendor product required
Memory and Dreaming: How Anthropic Just Shipped the Karpathy Wiki Pattern — the production primitive in Anthropic’s Managed Agents API and what it means for teams already running on the platform

References

[1] Pinecone, “Pinecone Nexus: The Knowledge Engine for Agents,” Pinecone Blog, May 2026. https://www.pinecone.io/blog/knowledge-infrastructure-for-agents/

[2] Pinecone, “Better Models Won’t Save Your Agent,” Pinecone Blog, May 2026. https://www.pinecone.io/blog/introducing-nexus-knowledge-engine/

[3] A. Karpathy, “LLM Wiki,” GitHub Gist, Apr 2026. https://gist.github.com/karpathy/442a6bf555914893e9891c11519de94f

[4] Anthropic, “Memory in Managed Agents (Public Beta),” Anthropic News, Apr 2026. https://www.anthropic.com/news/managed-agents-memory

[5] Anthropic, “Dreams (Research Preview),” Anthropic News, Apr 2026. https://www.anthropic.com/news/dreams

[6] Microsoft, “What is Fabric IQ?”, Microsoft Fabric Documentation, 2026. https://learn.microsoft.com/en-us/fabric/iq/overview

[7] Google Cloud, “Knowledge Catalog (formerly Dataplex),” Google Cloud Documentation, 2026. https://cloud.google.com/products/knowledge-catalog