1,024 notes. Zero manual links. 2,757 bidirectional connections discovered automatically.
By the Dotzlaw Team

Figure 1 — Initial disconnected state (left) vs. semantic knowledge graph (right). 1,024 notes, zero manual links, 2,757 auto-connections.
The Transformation
1,024 isolated notes. 2,757 bidirectional links. A knowledge graph where every note connects to 3-5 semantically similar notes — all discovered automatically.
Before we built this system, the vault looked like a galaxy of orphans. Open Obsidian’s graph view and you’d see clusters of notes huddled together by folder, with vast empty space between them. A note about RAG architectures sat in one corner. A note about vector databases floated in another. A third about LangChain drifted somewhere else entirely. They were obviously related — but nothing connected them.

Figure 2 — The disconnected state: related concept clusters (RAG Architectures, Vector Databases) sit in isolated corners with no awareness of each other.
After running the semantic linking pipeline, the graph view transformed. Those same notes now form a dense web of blue lines, every dot radiating connections outward. Click into any note and you can see exactly what else in your vault relates to it. The note about RAG architectures links to vector databases, which links to embedding models, which links to LangChain retrieval patterns. Knowledge that previously required searching or remembering now surfaces through navigation.
The key insight behind this transformation: when we link Note A to Note B, we also link Note B back to Note A. Bidirectional linking is what turns a flat collection of files into a navigable knowledge graph.
How Semantic Similarity Works
Two notes are related if they’re about similar things. “Similar things” is fuzzy for humans but precise for math: convert notes to vectors (embeddings) and measure the angle between them.
```
Note A: "RAG architectures combine retrieval and generation..."
        --> embed [0.23, -0.45, 0.78, 0.12, ...]  (1536 dimensions)

Note B: "Vector databases enable semantic search..."
        --> embed [0.21, -0.42, 0.75, 0.15, ...]  (1536 dimensions)

Cosine similarity: 0.87  <-- These notes are related!
```
Each note becomes a point in 1,536-dimensional space. Notes about similar topics cluster near each other. A note about Docker Compose will land far from a note about trading psychology, but close to a note about container orchestration. Cosine similarity measures how closely two vectors point in the same direction — 1.0 means identical, 0.0 means completely unrelated.
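The similarity measure itself is a few lines of arithmetic. A minimal pure-Python sketch (a production system would use NumPy or let the vector database compute this):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction,
    0.0 = orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```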
We use OpenAI’s text-embedding-3-small model to generate these vectors. It’s fast, cheap, and produces embeddings that capture semantic meaning well enough to find non-obvious connections across a thousand-note vault.

Figure 3 — Vector representation and cosine similarity. Related topics (Docker Compose, Container Orchestration, PostgreSQL) cluster tightly in vector space while unrelated topics (Trading Psychology) are distant.
KEY INSIGHT: What you embed matters more than how you embed it. Title + description + tags + a content preview produces dramatically better similarity matches than embedding the raw note body, because frontmatter captures the intent of a note while the body contains noise.
Building the Index
Every note in the vault gets indexed, but we don’t embed the entire note. Raw body text is noisy — full of filler words, code blocks, and formatting artifacts that dilute the semantic signal. Instead, we extract the parts that carry the most meaning.
```python
def build_embedding_text(frontmatter: dict, body: str) -> str:
    """Build text for embedding from note components."""
    parts = []

    # Title carries heavy semantic weight
    if title := frontmatter.get("title"):
        parts.append(f"Title: {title}")

    # Description is a concise summary
    if desc := frontmatter.get("description"):
        parts.append(f"Description: {desc}")

    # Tags indicate topic areas
    if tags := frontmatter.get("tags"):
        tag_str = ", ".join(tags)
        parts.append(f"Topics: {tag_str}")

    # First 500 chars of body for context
    if body:
        preview = body[:500].strip()
        parts.append(f"Content: {preview}")

    return " | ".join(parts)
```
This produces embedding text like:
```
Title: Building RAG Systems | Description: A guide to retrieval-augmented generation | Topics: ai/rag, ai/llm, coding/python | Content: RAG combines the power of large language models with external knowledge retrieval...
```
The title carries the heaviest semantic weight. Tags act as topic classifiers. The description provides a concise summary. And the first 500 characters of body text add just enough context without drowning the signal in noise. This composition consistently outperformed full-body embeddings in our similarity testing.

Figure 4 — Signal vs. noise: optimizing embedding input by filtering raw body text down to high-semantic-weight components. Intent outperforms content.
The generated embedding gets stored in Qdrant, a self-hosted vector database running on our Proxmox server. We chose it because our notes contain personal and business information — keeping the vector database on-premise means the data never leaves our network. Alongside the vector, we store metadata (file path, title, tags, content hash) as a Qdrant payload, so search results come back with everything we need to create links.
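The upsert itself is a standard qdrant-client call; the interesting part is what goes into each point. A sketch of the point construction — the payload field names here are illustrative assumptions, not the exact schema:

```python
import hashlib

def build_point(note_id: int, vector: list[float], path: str,
                title: str, tags: list[str], body: str) -> dict:
    """Assemble a Qdrant-style point: the embedding vector plus a
    metadata payload (file path, title, tags, content hash)."""
    return {
        "id": note_id,
        "vector": vector,
        "payload": {
            "file_path": path,       # where to write the wiki link
            "title": title,          # what to put inside [[...]]
            "tags": tags,
            # hash lets the indexer skip unchanged notes later
            "content_hash": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        },
    }
```

Because the payload travels with the vector, a search result already contains everything needed to create a link — no second lookup required.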

Figure 5 — Architectural flow of the privacy-focused embedding stack. Personal data stays on-premise; only transient text for embedding touches the API.
We also track content hashes in PostgreSQL to support incremental indexing. When re-indexing the vault, only notes whose content has actually changed get re-embedded. This makes a full vault re-index fast — unchanged notes are skipped entirely.
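The skip logic reduces to a hash comparison. A sketch that models the PostgreSQL hash store as a plain dict (the real table and column names are not shown in the article):

```python
import hashlib

def notes_to_reindex(notes: dict[str, str],
                     stored_hashes: dict[str, str]) -> list[str]:
    """Return the paths whose content hash differs from the stored hash.
    New notes (no stored hash) and edited notes both qualify."""
    changed = []
    for path, body in notes.items():
        current = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if stored_hashes.get(path) != current:
            changed.append(path)
    return changed
```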
Finding Similar Notes
When we need to find notes related to a given note, we generate an embedding for it and query Qdrant for the nearest neighbors. The query excludes the source note itself and filters results by a similarity threshold.
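The query's two constraints — exclude self, filter by threshold — can be modeled in pure Python (in production this filtering runs inside Qdrant, not application code):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def find_similar(query_vec: list[float], index: dict[str, list[float]],
                 source_path: str, threshold: float = 0.70,
                 limit: int = 10) -> list[tuple[str, float]]:
    """Return (path, score) pairs above the threshold, best first,
    excluding the source note itself."""
    scored = [
        (path, cosine(query_vec, vec))
        for path, vec in index.items()
        if path != source_path          # never match the note to itself
    ]
    above = [(p, s) for p, s in scored if s >= threshold]
    return sorted(above, key=lambda x: x[1], reverse=True)[:limit]
```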
The 0.70 Threshold
The threshold determines the quality-quantity tradeoff. We tuned it empirically across our vault:
| Threshold | Results |
|---|---|
| 0.90+ | Almost no matches (too strict) |
| 0.80-0.90 | Very high quality, few matches |
| 0.70-0.80 | Good balance of quality and quantity |
| 0.60-0.70 | More matches, some noise |
| Below 0.60 | Too many irrelevant matches |

Figure 6 — Impact of similarity threshold on connection quality. The 0.70-0.80 sweet spot discovers cross-domain connections a human might miss.
At 0.70, the matches are genuinely related. You won’t get a note about Docker linked to a note about philosophy. But you will get a note about “Testing AI Agents” linked to “Integration Testing Best Practices” — a cross-domain connection a human might miss but that makes perfect sense once you see it.
We cap results at 10 similar notes per query, though most notes return 3-5 matches above the threshold. This natural distribution means the system self-regulates: highly specific notes get fewer links, broad topics get more.
Bidirectional Linking
This is the key differentiator. Most similarity systems stop at “here are notes related to X.” We go further: when we link Note A to Note B, we also write a link from Note B back to Note A.
KEY INSIGHT: Bidirectional links are what turn a collection of notes into a knowledge graph. If A relates to B, B relates to A — always link both directions. Without this, you get a tree. With it, you get a web.

Figure 7 — Unidirectional vs. bidirectional linking. One-way links create a tree; two-way links create the dense web that makes a knowledge graph navigable.
The logic is straightforward:
```
for each similar_note found:
    # Forward link: source --> target
    add "[[target]]" to source's Related Notes section

    # Backward link: target --> source
    add "[[source]]" to target's Related Notes section
```
The linking function finds or creates a “Related Notes” section at the bottom of each markdown file and appends Obsidian wiki links ([[Note Title]]). Before adding a link, it checks whether the link already exists to avoid duplicates. The result in each note looks like:
```markdown
## Related Notes

- [[Building RAG Systems]]
- [[Vector Database Comparison]]
- [[LangChain Deep Dive]]
```
These are standard Obsidian wiki links. Click one and you navigate directly to the related note. Obsidian’s graph view picks them up automatically, which is what produces the dense visual web of connections.
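A minimal sketch of the append-with-dedup logic, assuming (as the article describes) that the “Related Notes” section sits at the end of the file:

```python
def add_related_link(note_text: str, target_title: str) -> str:
    """Append a wiki link under 'Related Notes', creating the section
    if missing and skipping links that already exist."""
    link = f"- [[{target_title}]]"
    if link in note_text:
        return note_text  # already linked; avoid duplicates
    if "## Related Notes" not in note_text:
        note_text = note_text.rstrip() + "\n\n## Related Notes\n"
    return note_text.rstrip() + "\n" + link + "\n"

def link_pair(notes: dict[str, str], a: str, b: str) -> None:
    """Bidirectional: A gains a link to B, and B a link back to A."""
    notes[a] = add_related_link(notes[a], b)
    notes[b] = add_related_link(notes[b], a)
```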
Self-Reference Detection
A subtle edge case nearly undermined the whole system: notes linking to themselves. A note about “Building RAG Systems” might exist as Building-RAG-Systems.md or Building_RAG_Systems.md or Building RAG Systems.md, depending on how the filename was sanitized during creation. Without careful matching, the note would appear as its own top similarity match — because nothing is more similar to a note than itself.
We handle this at two levels. First, we exclude the source note’s file path from the Qdrant query, which catches exact path matches. But that’s not enough. The same note might be referenced by a slightly different path or a truncated filename. So we also compare sanitized versions of the title and link target — stripping special characters, normalizing spaces, and comparing the first 50 characters to catch truncation. This layered approach catches self-references even when the title, filename, and stored path don’t match exactly.
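A sketch of that sanitized comparison — the exact normalization rules are assumptions beyond what the article states (strip special characters, normalize spaces, compare the first 50 characters):

```python
import re

def sanitize(name: str) -> str:
    """Normalize a title or filename for comparison: drop the .md
    extension, treat dashes/underscores as spaces, strip special
    characters, collapse whitespace, lowercase."""
    name = re.sub(r"\.md$", "", name)
    name = name.replace("-", " ").replace("_", " ")
    name = re.sub(r"[^A-Za-z0-9 ]", "", name)
    return re.sub(r"\s+", " ", name).strip().lower()

def is_self_reference(source_title: str, candidate: str) -> bool:
    """Compare only the first 50 sanitized characters so truncated
    filenames still match their source note."""
    return sanitize(source_title)[:50] == sanitize(candidate)[:50]
```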

Figure 8 — The self-reference loop bug and its multi-layered fix. Filename variations can trick vector search into linking a note to itself.
The Results
We ran the batch linking script on our entire vault: 1,024 files processed, 2,757 links added, average 2.7 links per note. Processing time: about 15 minutes.
| Metric | Value |
|---|---|
| Files processed | 1,024 |
| Links added | 2,757 |
| Average links per note | 2.7 |
| Processing time | ~15 minutes |

Figure 9 — The result: a connected brain. Every dot radiates connections. The blue web represents 2,757 relationships discovered entirely by algorithm.
What those numbers mean in practice: every note in the vault now connects to at least one other note, and most connect to three to five. The connections surface relationships that aren’t obvious from titles or folder structure alone.
Some examples from the vault:
| Note | Auto-Linked To |
|---|---|
| "RAG Architecture Deep Dive" | "Vector Database Comparison", "LangChain Retrieval", "Embedding Models" |
| "Trading Psychology" | "Risk Management", "Journal Review Process", "Emotional Discipline" |
| "Docker Compose Patterns" | "Container Orchestration", "Development Environment", "PostgreSQL Setup" |

Figure 10 — The UI transformation in Obsidian: before (manual metadata only) vs. after (automated WikiLinks enabling immediate knowledge surfing).
The system finds connections humans would miss. It links across topic boundaries and folder structures. And because new notes get linked automatically on save — the same embed-search-link pipeline runs every time a note is created — the knowledge graph grows denser over time without any manual effort.

Figure 11 — The automated curation cycle that scales the knowledge graph. Broad topics accumulate many links; niche topics get few — the system self-regulates.
KEY INSIGHT: The 0.70 similarity threshold is where signal separates from noise. Below it, you get spurious connections that erode trust. Above 0.80, you miss the cross-domain links that make a knowledge graph valuable. The sweet spot produces connections you didn’t expect but immediately recognize as correct.

Figure 12 — Navigation over search: the semantic note network surfaces knowledge through interconnection rather than recall.
The Series
This is Part 3 of a 5-part series on building an AI-powered knowledge management system:
- From YouTube to Knowledge Graph — Turning 1,000+ videos into an interconnected knowledge base for $1.50
- Anthropic Batch API in Production — 50% cost savings at scale, and the bug that almost corrupted everything
- Building a Semantic Note Network (this article) — Vector search turned 1,024 isolated notes into a dense knowledge graph
- Obsidian Vault Curation at Scale — Three years of tag chaos, fixed in 30 minutes for $1.50
- Ask Your Vault Anything — A RAG chatbot that answers from your notes in 2.5 seconds
Next: Obsidian Vault Curation at Scale — What happens when you hand an AI 1,280 chaotic tags, including a hex color that somehow became a category