Our first attempt at tracing the project scheduling module did not involve a code graph at all.
We asked the researcher agent to explore the module by reading source files and using text search. The agent read the entry point function, found some callers by searching for the function name as a string, and produced a summary that looked plausible. Then we checked it against the code. Half the callers it listed were not callers at all. They were comments, variable names, and strings that happened to contain the function name. Two real callers were missed entirely. The call tree it reconstructed had six functions in it. The actual call tree had twenty-one.
Text search treats code like a document. A code graph treats it like what it actually is: a network of function calls. That distinction is what makes the difference between an agent that produces plausible-looking wrong answers and one that can actually trace how your system works.
Knowledge Trapped in Code
The application is a Servoy enterprise system backed by PostgreSQL. It has 22 modules, 10,000+ functions across 1,000+ files, and over a decade of accumulated business logic. The rules governing how project schedules are calculated, how costs are allocated across project tasks, and how data isolation is enforced all live in the code, but they are not documented anywhere.
This creates three compounding problems that get worse over time.
For developers, working in an unfamiliar module means reading thousands of lines of code just to understand the call graph before making any change. For AI agents, there is no way to reason about business rules they have never seen. Tell an agent to “add a milestone status check to the billing flow” and it will generate syntactically correct code with no knowledge of the validation gates that must fire first, the tables that require data isolation, or the approval workflow that must complete before invoicing.
For the organization, the sharpest form of this problem is turnover. When a domain specialist leaves, their knowledge leaves with them. The code remains, but the intent behind it (the edge cases, the historical gotchas, the “never do it this way” rules) is gone. New developers discover these rules through bugs.
The traditional response is to ask specialists to write documentation. This rarely happens at the level of detail that matters. Specialists are busy, documentation is tedious, and the result is usually high-level descriptions that contain none of the specific knowledge an agent or a new developer actually needs.
The approach we built inverts the process. An AI agent does the systematic tracing, and the result becomes a reusable skill that any future agent can load on demand.

Figure 1 - The Code Graph: 10,000+ function nodes connected by CALLS relationships, indexed into Neo4j. Clusters correspond to modules. The density of connections between clusters reveals integration boundaries. An agent can query any function’s full neighborhood in milliseconds.
Text Search vs. a Code Graph
The core problem with text search for code understanding is that it treats code as text. Text search tools can find where a function name appears. They cannot tell you whether that appearance is a call, a comment, a string, or a coincidence.
A code graph indexes the actual call relationships as first-class data. The difference in what questions you can answer is not incremental. It is categorical.
| Question | Text search | Code graph |
|---|---|---|
| Where is buildProjectSchedule defined? | Finds the function definition (and every mention of the string) | Returns the single function node with exact file and line |
| What functions does buildProjectSchedule call? | Returns all occurrences of called function names, including comments, strings, and similar names | Returns the exact list of callees from parsed CALLS relationships |
| Who calls buildProjectSchedule? | Returns all occurrences of the string “buildProjectSchedule” | Returns exact callers with file and line, with zero false positives |
| What is the full call tree from the entry point? | Not feasible at depth | Walk the graph iteratively to any depth |
| What are the most complex functions in the module? | Not possible | Cyclomatic complexity analysis on function nodes |
| Are there functions with no callers? | Manual audit of every function | find_dead_code query in seconds |
When we tried the text search approach for the project scheduling pilot, the researcher agent found callers of buildProjectSchedule through string matching. Some text matches were non-call references. Some real callers used indirect call patterns that text search did not catch. More importantly, text search could not reconstruct the call tree at all. The depth-4 structure that the graph can return in one query would have required the agent to read dozens of files and manually piece together the relationships.
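The failure mode is easy to reproduce in miniature. Here is a minimal sketch, using a toy source string and a hand-built call map (both invented for illustration), of why string matching over-counts while a call index does not:

```python
import re

# A toy "codebase": the name buildProjectSchedule appears four times,
# but only one occurrence is an actual call.
source = '''
// buildProjectSchedule is the entry point    <- comment mention
var label = "run buildProjectSchedule now";   <- string mention
function buildProjectSchedule() { ... }       <- definition
function refresh() { buildProjectSchedule(); } <- the one real call
'''

# Text search: every occurrence of the name, real call or not.
text_hits = len(re.findall(r"buildProjectSchedule", source))

# Code graph: call relationships indexed as first-class data.
calls = {"refresh": ["buildProjectSchedule"]}  # caller -> callees
graph_callers = [fn for fn, callees in calls.items()
                 if "buildProjectSchedule" in callees]

print(text_hits)       # 4 string matches, 3 of them false positives
print(graph_callers)   # ['refresh'] -- the single real caller
```

The text-search answer degrades as the codebase grows; the graph answer does not, because only parsed calls ever enter the index.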

Figure 2 - Text Search vs. Code Graph: The same structural question answered two ways. Text search returns noisy string matches that include comments, strings, and near-misses. The code graph returns exact call relationships derived from parsed source. For tracing at scale, the difference determines whether the agent’s map of the codebase is trustworthy.
Building the Neo4j Graph
We built custom tooling that bridges the JavaScript codebase and the graph database. It parses every JS source file, extracts function definitions, resolves call relationships, and writes everything into Neo4j. Each function becomes a node. Each function call becomes a CALLS relationship.
After indexing all 22 modules, the graph contains 10,000+ function nodes across 1,000+ source files. Every node carries the function name, file path, line number, parameter list, and the full source code of the function. The source storage is deliberate: an agent can read function bodies directly from graph query results without a separate file read, which matters when traversing dozens of functions in a single session.
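As a rough illustration of the indexing pass — the real tooling does proper JavaScript parsing, so the regex extraction, file name, and function names below are purely illustrative — a sketch that turns source into function nodes and CALLS edges, then renders them as Cypher statements:

```python
import re

# Toy JS module; the functions and file name are invented for this sketch.
js = """
function computeCost(task) { return task.hours * rate(task); }
function rate(task) { return 100; }
function buildProjectSchedule() { computeCost({hours: 2}); }
"""

nodes = {}   # name -> node properties (file, line, source)
for i, line in enumerate(js.splitlines(), start=1):
    m = re.match(r"\s*function\s+(\w+)\s*\(", line)
    if m:
        nodes[m.group(1)] = {"file": "scheduling.js", "line": i, "source": line}

edges = []   # (caller, callee) CALLS relationships, naively resolved
for name, props in nodes.items():
    for other in nodes:
        if other != name and re.search(rf"\b{other}\s*\(", props["source"]):
            edges.append((name, other))

# Each node and edge becomes a Cypher statement (parameterized in practice).
stmts = [f"MERGE (:Function {{name: '{n}'}})" for n in nodes]
stmts += [f"MATCH (a:Function {{name: '{a}'}}), (b:Function {{name: '{b}'}}) "
          f"MERGE (a)-[:CALLS]->(b)" for a, b in edges]
print(sorted(edges))
```

The essential design point survives the simplification: by the time anything reaches Neo4j, a "call" is a resolved relationship between two known nodes, not a string match.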
The tooling exposes the graph through an MCP server that agents call directly during development and discovery sessions. The tools agents actually use are:
| Graph Tool | What it returns |
|---|---|
| find_code | Function node with file and line (the locator) |
| analyze_code_relationships (calls) | Exact callees of a function |
| analyze_code_relationships (callers) | Exact callers of a function |
| find_most_complex_functions | Highest cyclomatic complexity in a module, for prioritizing where to start |
| find_dead_code | Functions with zero callers, which are potential entry points or candidates for removal |
| execute_cypher_query | Arbitrary graph queries for cases the built-in tools do not cover |
The execute_cypher_query tool is the escape hatch. The built-in tools cover the majority of needs. When an agent needs something more specific (all functions in a particular file ordered by line number, cross-module callers filtered to two specific modules, a two-level call tree in a single result), raw graph queries provide it without any intermediate tooling.
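A sketch of what such an ad-hoc query might look like. The label, relationship type, and property names here (:Function, CALLS, module, name, file, line) are assumptions for illustration; the actual schema belongs to the custom tooling, and a production version would pass parameters rather than format strings:

```python
# Composing an ad-hoc Cypher query for the escape hatch. Schema names are
# hypothetical; in practice the query would be parameterized.
def cross_module_callers(callee_module: str, caller_module: str) -> str:
    """Callers in one module that reach into another -- one of the ad-hoc
    questions the built-in tools do not cover."""
    return (
        "MATCH (a:Function)-[:CALLS]->(b:Function) "
        f"WHERE a.module = '{caller_module}' AND b.module = '{callee_module}' "
        "RETURN a.name, a.file, a.line ORDER BY a.file, a.line"
    )

query = cross_module_callers("scheduling", "billing")
print(query)
```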

Figure 3 - Graph Architecture: JS source files are parsed into function nodes and CALLS relationships, stored in Neo4j. The MCP server exposes five primary tools plus an arbitrary query interface to agents running in VS Code. Agents never read the database directly. They use the MCP tools, which abstract the graph schema.
How Agents Use the Graph
The researcher agent queries Neo4j at the start of every development workflow. Before any code is read or changed, the researcher maps the scope: which functions are involved, who calls them, what they call, and where the complexity hotspots are. This structural picture tells the rest of the agent team what they are dealing with before they encounter it in the source.
The researcher’s query pattern is consistent. It starts with find_code to locate the entry point or the function under discussion, then calls analyze_code_relationships (callers) to understand the upstream impact surface, then analyze_code_relationships (calls) to understand the downstream call graph. When the task involves a full module, find_most_complex_functions identifies where to focus first.
For deep domain exploration, the researcher walks the call tree iteratively: find the entry point, get its callees, trace each callee to the next level, classify each function encountered as core domain logic, utility, or external domain boundary. The graph makes this traversal fast. Each step is a single query rather than a file read.
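The traversal can be sketched as a breadth-first walk over an in-memory call map. The graph shape, module assignments, and function names below are invented for illustration; in the real workflow each step is a graph query rather than a dictionary lookup:

```python
from collections import deque

# Toy call map and module assignments (all names hypothetical).
calls = {
    "buildProjectSchedule": ["layoutTasks", "calcCost"],
    "layoutTasks": ["sortByDate", "renderRow"],
    "calcCost": [], "sortByDate": [], "renderRow": [],
}
module_of = {
    "buildProjectSchedule": "scheduling", "layoutTasks": "scheduling",
    "sortByDate": "scheduling", "calcCost": "calculations",
    "renderRow": "ui",
}

def trace(entry, home_module):
    """Walk callees breadth-first; classify functions outside the home
    module as domain boundaries and stop there instead of recursing."""
    core, boundaries = [], []
    queue, seen = deque([entry]), {entry}
    while queue:
        fn = queue.popleft()
        if module_of[fn] != home_module:
            boundaries.append(fn)      # external domain: record, don't expand
            continue
        core.append(fn)                # core domain logic: keep tracing
        for callee in calls.get(fn, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return core, boundaries

core, edges_out = trace("buildProjectSchedule", "scheduling")
print(core)       # ['buildProjectSchedule', 'layoutTasks', 'sortByDate']
print(edges_out)  # ['calcCost', 'renderRow']
```

Stopping at boundaries is what keeps the trace bounded: the walk covers the home domain exhaustively while recording, not exploring, everything beyond it.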
The skill-builder also queries Neo4j directly. When creating or updating a domain skill, the skill-builder uses graph queries for deep context: the full call tree for the domain’s entry points, the module’s complexity profile, and the external domain boundaries. This ensures that skills reflect the actual structural topology of the code, not just what was described in a document.

Figure 4 - How Agents Query the Graph: The researcher agent maps scope and structure at the start of every workflow. The skill-builder queries for deep domain context when creating or updating skills. Both agents query Neo4j directly through the MCP tools. The graph is the shared source of structural truth.
Structural Queries That Change Everything
The queries that agents run against the Neo4j graph are not sophisticated in themselves. What changes is that they are fast, exact, and composable in ways that text search is not.
Finding callers takes a millisecond. The researcher agent can ask “who calls this function” and get an exact list with file paths and line numbers. No false positives, no missed results, no need to open any files. This single capability eliminates the most time-consuming part of manual code review for any non-trivial change.
Call trees to configurable depth are returned in a single query. A researcher tracing a module change can get the full downstream impact tree from the entry point in one round trip. The tree includes every function at every depth, their files, and the CALLS relationships between them. What would have required reading dozens of files becomes a single structured result.
Complexity analysis identifies where to focus. The find_most_complex_functions query returns the highest cyclomatic complexity functions in a module, ranked. For a module with 200 functions, this tells the researcher agent where the business logic is concentrated before a single line of source is read.
Dead code detection finds functions with zero callers. In a codebase this size, dead code accumulates over a decade. The find_dead_code query surfaces it in seconds. Some dead functions are true candidates for removal. Others are entry points called by external systems not indexed in the graph. The query identifies them. The developer decides.
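Conceptually, the query amounts to a set difference: every function minus every function that appears as a callee. A minimal in-memory sketch with invented names:

```python
# Toy call map; names are illustrative.
calls = {
    "buildProjectSchedule": ["layoutTasks"],
    "layoutTasks": [],
    "legacyExport": [],   # zero callers: dead code, or an external entry point
    "nightlySync": [],    # likewise: may be invoked by an unindexed scheduler
}

# Everything that appears as a callee anywhere in the graph.
called = {callee for callees in calls.values() for callee in callees}

# Functions with zero callers -- candidates, not verdicts.
dead_or_entry = sorted(fn for fn in calls if fn not in called)
print(dead_or_entry)  # ['buildProjectSchedule', 'legacyExport', 'nightlySync']
```

Note that the entry point itself shows up in the result, which is exactly the caveat from the text: zero callers in the graph means unindexed, not necessarily unused.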
Module-level summaries are possible through arbitrary Cypher queries. How many functions are in this module? What is the average depth of the call tree from the main entry point? Which modules does this module call into? These questions have answers in the graph. They are guesses or manual audits without it.
Domain boundary identification is structural rather than documentary. When the researcher traces a call tree and encounters a function that belongs to a different module, the graph relationship reveals that. The boundary is in the data, not in any documentation that may or may not exist.
The Project Scheduling Example
The project scheduling module was the first domain we ran through the full workflow. Project scheduling drives a sophisticated Gantt-style layout with multiple levels of sub-tasks: a domain with significant algorithmic complexity and no documentation.
Entry point: buildProjectSchedule().
The researcher agent started by querying the code graph for complexity hotspots in the module, then located buildProjectSchedule and began tracing. The structural picture that emerged, across two sessions totaling under 30 minutes:
| Metric | Result |
|---|---|
| Functions traced | 21 |
| Database tables documented | 11 |
| Domain boundaries identified | 14 |
| Business rules documented | 13 |
| Gotchas captured | 11 |
| Sessions required | 2 |
| Total elapsed time | Under 30 minutes |
| Final knowledge document | 29KB |
The 14 domain boundaries are significant. Project scheduling touches calculations, resource allocation, cost tracking, and UI rendering, each of which is a separate domain. The researcher identified each boundary by its structural position in the call graph, classified it, and stopped at the boundary rather than recursing into it. Without the graph, distinguishing “this is scheduling logic” from “this is a calculation domain call” would require reading each function’s source to make the judgment. The graph’s structural view makes the boundaries visible without reading every file.
The skill-builder produced the project scheduling skill from the researcher’s findings. The graph shows the researcher what the code does. Business rules about why a validation exists, which branches handle real edge cases versus dead code, and what domain terminology actually means get captured into skills through the self-improving loop as the team works in the domain over time.

Figure 5 - Project Scheduling Example Results: 21 functions traced across 2 sessions in under 30 minutes. The domain boundaries metric is the one that surprised us: 14 separate external domains touching the project scheduling module, each correctly identified structurally and stopped at the boundary rather than recursed into. A text-search-based approach would have had no mechanism to identify those boundaries without reading every downstream function.
Skills Built from Graph Intelligence
The researcher’s graph-powered tracing feeds directly into skill creation. After a domain has been traced and reviewed, the skill-builder uses graph queries to verify the structural picture before assembling the skill file.
The project scheduling skill contains 13 business rules, 11 gotchas, a complete data model for 11 tables, workflow recipes for the four most common scheduling changes, and a data isolation checklist for the patterns specific to this domain. It is the densest knowledge document in the skill library because the domain itself is complex, and the graph made it possible to trace that complexity systematically rather than selectively.
After the project scheduling pilot proved the workflow, additional skills followed. The process is repeatable: the researcher starts at the domain’s entry point, traces through the call graph, captures findings, and the skill-builder assembles the skill. Total time for a mid-complexity domain: two to three sessions, under an hour. The skill library now holds 18 domain skills.
Every skill reflects the actual structural topology of the codebase, verified against the graph and validated by the skill-auditor.

Figure 6 - From Graph Intelligence to Domain Skills: The researcher’s graph-powered tracing produces structured knowledge. The skill-builder assembles it into a SKILL.md structured for progressive disclosure. Quick start at the top for common tasks. Full context for complex changes. Reference files for the dense material. 18 skills in the library. Each one grounded in the graph.
What the Graph Makes Possible
The development workflow runs efficiently because the code graph removes the hard parts of code exploration.
Finding callers takes a millisecond instead of reading every file. Identifying domain boundaries takes a structural query instead of reading every downstream function. Discovering complexity hotspots takes an index scan instead of a manual review. Agents spend their token budget on understanding what functions do (reading source, extracting business rules, identifying gotchas) rather than on the mechanical work of finding which functions to read.
The project scheduling example showed what that looks like in practice: 21 functions traced, 11 tables documented, 14 domain boundaries identified, 13 business rules captured, 11 gotchas recorded. Under 30 minutes. The skill-builder packaged the result into a form that will remain useful for every future agent working in that domain.
18 domains done. Knowledge that took years to accumulate, now queryable in seconds.
KEY INSIGHT: Text search treats code as a document. A code graph treats it as a network. The difference is not a speed improvement. It is a category change in what questions you can answer. Callers, call trees, and complexity analysis are only possible when the structure is indexed as structure.
KEY INSIGHT: The graph shows agents what the code does. Business rules about why a validation exists, which branches handle real edge cases versus dead code, and what domain terminology actually means get captured over time through the self-improving loop. The AI traces the structure. The reviewer captures the meaning. The skill-builder packages it. The skill-auditor verifies it.
Coming up in Part 4: The 18 domain skills are useful individually. The self-improving loop is what makes them compound over time. Every code review the reviewer agent runs is also a knowledge audit, scanning for business rules, gotchas, and patterns that are not yet in the existing skills. New findings queue for the skill-builder to apply. The skill-auditor verifies that updates do not introduce contradictions. Each review cycle makes every agent in the system slightly smarter, without additional developer effort.
The Series
This is Part 3 of a 5-part series on building an AI development methodology with GitHub Copilot:
- Beyond Code Completion. The enterprise AI gap and why agent mode changes everything
- The Development Workflow. How seven agents turn a ticket into reviewed code
- Neo4j Code Graph (this article). How a code graph database makes AI agents understand your codebase
- The Knowledge Flywheel. How code reviews feed a self-improving knowledge loop
- Enterprise AI Lessons. What building an AI methodology taught us about enterprise software