“What techniques help with trading discipline?” Two and a half seconds. Five source notes. One click to Obsidian.
By the Dotzlaw Team
The Demo

Figure 1 — The chatbot in action: a natural language question returns a grounded answer in 2.5 seconds, citing five source notes with relevance scores. Zero hallucinations — every fact traces back to an actual note.
“What techniques help with trading discipline?”
Two and a half seconds later, an answer appears — drawn entirely from our own notes, with clickable source attribution:
Assistant: Based on your notes, several techniques can help with trading discipline:
- Pre-trade checklists - From “Trading Psychology Fundamentals”: Having a written checklist before each trade forces you to verify your setup meets all criteria.
- Journal reviews - “My Trading Journal Process” describes reviewing each trade weekly to identify emotional patterns.
- Position sizing rules - “Risk Management 101” emphasizes that fixed position sizing removes emotional decision-making.
- Cool-down periods - After a losing trade, “Emotional Control in Trading” suggests waiting 30 minutes before the next entry.
Sources: Trading Psychology Fundamentals (78%), My Trading Journal Process (72%), Risk Management 101 (68%), Emotional Control in Trading (65%), Discipline Systems for Traders (52%)
Five source notes. Relevance scores from 78% down to 52%. Click any source to preview the full note. Click “Open in Obsidian” to jump straight to it in your vault.
This is what happens when 1,000+ indexed notes, semantic search, and an LLM work together. The system did not hallucinate a single technique — every answer is grounded in notes we actually wrote. And each source is one click away from verification.
What RAG Actually Is
RAG stands for Retrieval-Augmented Generation. The name is clunky, but the idea is simple: before asking an LLM to answer a question, first retrieve the relevant content from your own data. The LLM generates its answer using that retrieved content as context — not its training data, not the open internet, just your notes.
The pipeline has four steps. Embed the question into a vector. Search Qdrant for note chunks with similar vectors. Build a context window from the top matches. Pass the question plus context to Claude Haiku for generation. The answer comes back grounded in your actual notes, with source attribution built in.
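The retrieval half of those four steps can be sketched as a cosine-similarity top-k search. This is a minimal illustration in plain TypeScript, not the production code: toy 3-dimensional vectors stand in for the 1,536-dimension `text-embedding-3-small` embeddings, and an in-memory array stands in for the Qdrant collection.

```typescript
// Sketch of the retrieval step: rank stored chunk vectors by cosine
// similarity to the question vector. In production this is a single
// Qdrant query; here a plain array stands in for the collection.

interface Chunk {
  noteTitle: string;
  text: string;
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function topK(question: number[], chunks: Chunk[], k: number) {
  return chunks
    .map((c) => ({ chunk: c, score: cosineSimilarity(question, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy vectors; real embeddings have 1,536 dimensions.
const collection: Chunk[] = [
  { noteTitle: "Trading Psychology Fundamentals", text: "Use a pre-trade checklist.", vector: [0.9, 0.1, 0.0] },
  { noteTitle: "Risk Management 101", text: "Fix your position size.", vector: [0.2, 0.9, 0.1] },
];

const results = topK([1, 0, 0], collection, 1);
// The chunk whose vector points closest to the question vector wins.
```

The same idea scales unchanged: Qdrant just does this ranking over 5,000 vectors with an index instead of a linear scan.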
The difference between RAG and just asking ChatGPT is the difference between “what does the internet say about trading discipline” and “what do my notes say about trading discipline.” One gives you generic advice. The other gives you your own curated knowledge back, synthesized and searchable.

Figure 2 — Generic advice vs. curated knowledge: ChatGPT returns statistical averages from the internet. RAG returns synthesized strategies from your own notes, grounded in your experience.
KEY INSIGHT: The LLM doesn’t hallucinate because it’s grounded in your actual notes. No making things up — only synthesizing what’s already in your vault.
See It In Action
Here is a second interaction, this time asking a technical comparison question:
User: “How does LangGraph differ from CrewAI?”
Assistant: According to your notes, the key differences are:
LangGraph (from “LangGraph Architecture Deep Dive”):
- Graph-based workflow definition
- Explicit state management
- Fine-grained control over agent transitions
- Better for complex, conditional workflows
CrewAI (from “CrewAI Framework Overview”):
- Role-based agent definition
- Automatic task delegation
- Simpler setup for common patterns
- Better for team-style agent collaboration
Your note “Comparing Agent Frameworks” summarizes: “LangGraph when you need control, CrewAI when you need speed.”
Sources: LangGraph Architecture Deep Dive (82%), CrewAI Framework Overview (79%), Comparing Agent Frameworks (71%)
Three source notes, each from a different video we processed months ago. The chatbot pulled the relevant comparison points, synthesized them into a coherent answer, and cited where each fact came from. We did not have to remember which video covered which framework — the system found the right notes and surfaced the information.
Performance Characteristics
| Operation | Time | Notes |
|---|---|---|
| Embed question | ~100ms | OpenAI API call |
| Search Qdrant | ~10ms | 5,000 vectors, local |
| Generate answer | ~2s | Haiku is fast |
| Total chat latency | ~2.5s | Feels instant |
| Index one note | ~500ms | Embed + upsert |
| Index full vault | ~9 minutes | 1,051 notes |
The bottleneck is LLM generation, not retrieval. Qdrant’s ~10ms search means the experience feels snappy — you type a question, and the answer is there before you finish reaching for your coffee.
How It Works

Figure 3 — The RAG pipeline: a question is embedded into a vector, matched against 5,000 chunks in Qdrant via cosine similarity, and the top results are passed to Claude Haiku for grounded answer generation. The vector search itself takes only milliseconds.
The RAG pipeline has three stages: chunk the notes, retrieve the relevant chunks, and generate an answer. Each stage has decisions that matter.
Chunking: Why 2,000 Characters
Full notes are too long for effective retrieval. A 10,000-character note about trading psychology would dominate search results even if only one paragraph mentioned discipline techniques. The solution is to split notes into overlapping chunks before embedding them.
We chose 2,000 characters (~500 tokens) with 400-character overlap, and the splitting is paragraph-aware — chunks break at double newlines, not mid-sentence.
```python
# Chunking parameters
chunk_size = 2000                     # chars (~500 tokens) - enough context, not too broad
overlap = 400                         # chars (~100 tokens) - catches info spanning boundaries
splitting = "paragraph-aware"         # break at ¶ boundaries, not mid-sentence
first_chunk = "includes frontmatter"  # title + tags + description for topic signal
```
Figure 4 — The physics of chunking: too small (500 chars) fragments context, too large (5,000 chars) drowns signal in noise. The sweet spot at 2,000 characters with 400-char overlap produces coherent, specific chunks. 1,051 notes became ~5,000 searchable chunks.
Why these numbers? Chunks that are too small (500 characters) lose context — you get a sentence fragment about position sizing with no indication it came from a trading psychology discussion. Chunks that are too large (5,000 characters) retrieve irrelevant content alongside the relevant passage. At 2,000 characters, each chunk is coherent enough to stand alone and specific enough to match precisely.
The 400-character overlap ensures that if important information spans a chunk boundary, at least one chunk contains it completely. And the first chunk of each note includes frontmatter metadata — title, description, tags — so the embedding captures what the note is about, not just what the opening paragraph says.
For 1,051 notes, this produced approximately 5,000 chunks — an average of 4.8 chunks per note.
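A paragraph-aware splitter with overlap might look like the following. This is a minimal sketch under the stated parameters, not the production implementation; the `chunkNote` name and its exact accumulation strategy are assumptions.

```typescript
// Sketch of paragraph-aware chunking: accumulate paragraphs until the
// chunk budget would be exceeded, then start the next chunk with the
// tail of the previous one so that facts spanning a boundary survive
// intact in at least one chunk.

function chunkNote(text: string, chunkSize = 2000, overlap = 400): string[] {
  const paragraphs = text.split(/\n\n+/); // break at paragraph boundaries
  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    if (current && current.length + para.length + 2 > chunkSize) {
      chunks.push(current);
      current = current.slice(-overlap); // carry 400-char overlap forward
    }
    current = current ? current + "\n\n" + para : para;
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Note that a single paragraph longer than `chunkSize` would still produce an oversized chunk in this sketch; a real implementation needs a fallback split for that case.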
Retrieval: Threshold Tuning and Deduplication
When a question comes in, we embed it using the same model (OpenAI text-embedding-3-small) and search Qdrant for the most similar chunks. Two decisions shape the quality of results: the similarity threshold and deduplication.

Figure 5 — The threshold paradox: chat retrieval uses a lower threshold (0.50) than note linking (0.70) because the LLM acts as the final relevance filter. Deduplication keeps only the highest-scoring chunk per note to ensure source diversity.
We use a 0.50 threshold for chat — significantly lower than the 0.70 we use for note-to-note linking. That might seem counterintuitive. Why accept less-similar results? Because the use cases are fundamentally different. Note linking needs high precision — a bad link clutters your vault permanently. Chat is ephemeral. The LLM can read five retrieved chunks and decide which ones actually answer the question. A chunk at 0.55 similarity might contain the exact sentence the user needs. Better to retrieve it and let the model judge relevance than to filter it out at the gate.
KEY INSIGHT: A lower retrieval threshold (0.50) works for chat because the LLM can filter noise — let it retrieve more and judge relevance itself.
Without deduplication, a single highly relevant note might return three or four chunks, drowning out other perspectives. We keep only the highest-scoring chunk per note, ensuring the user sees diverse sources. If four of the top five chunks all come from “Trading Psychology Fundamentals,” the user learns nothing they would not have learned from reading that one note. Deduplication forces breadth.
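Threshold filtering and per-note deduplication can be expressed together as one pass over the search hits. A sketch — the `Hit` shape and `filterHits` name are illustrative, not the actual code:

```typescript
// Sketch of post-search filtering: drop hits below the 0.50 chat
// threshold, then keep only the best-scoring chunk per source note so
// the final context draws from diverse notes.

interface Hit {
  noteTitle: string;
  text: string;
  score: number;
}

function filterHits(hits: Hit[], threshold = 0.5): Hit[] {
  const bestPerNote = new Map<string, Hit>();
  for (const hit of hits) {
    if (hit.score < threshold) continue; // below threshold: discard
    const existing = bestPerNote.get(hit.noteTitle);
    if (!existing || hit.score > existing.score) {
      bestPerNote.set(hit.noteTitle, hit); // keep the best chunk per note
    }
  }
  // Highest-relevance sources first, exactly one chunk per note.
  return [...bestPerNote.values()].sort((a, b) => b.score - a.score);
}
```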
Generation: Grounded Answers
The generation step is the simplest part. We pass the question plus the retrieved context to Claude Haiku with a system prompt that sets three rules: use the context, be honest when the context does not contain an answer, and cite which source note each fact came from. Context is capped at 8,000 characters (~2,000 tokens) to leave room for the response. Five chunks at ~2,000 characters each fit comfortably within this limit.
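A sketch of how that context might be assembled under the 8,000-character cap — the `buildContext` helper, the source-labeling format, and the prompt wording here are illustrative assumptions, not the system’s actual prompt:

```typescript
// Sketch of context assembly: concatenate retrieved chunks under a
// character cap, labeling each with its source note so the model can
// cite where every fact came from.

function buildContext(
  chunks: { noteTitle: string; text: string }[],
  maxChars = 8000,
): string {
  let context = "";
  for (const c of chunks) {
    const block = `[Source: ${c.noteTitle}]\n${c.text}\n\n`;
    if (context.length + block.length > maxChars) break; // stop at the cap
    context += block;
  }
  return context;
}

// Illustrative stand-in for the three-rule system prompt.
const systemPrompt =
  "Answer using only the provided context. " +
  "If the context does not contain the answer, say so honestly. " +
  "Cite the source note for each fact you state.";
```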
New notes are indexed automatically when saved — whether they come from YouTube processing or from Obsidian vault curation. No manual re-indexing required. Ask a question about a video you processed five minutes ago, and it is already in the index.
The “Open in Obsidian” Loop

Figure 6 — Closing the loop: the “Open in Obsidian” button transforms the chatbot from a dead end into a discovery tool. Every cited source is one click away from the full note in your vault.
Source attribution is what separates a useful RAG chatbot from a novelty. Every answer includes a collapsible list of sources with similarity scores. Click any source to preview the full note in the right panel. But the killer feature is a small button next to each source: Open in Obsidian.
```typescript
function getObsidianUri(filePath: string, vaultName: string): string {
  const pathParts = filePath.replace(/\\/g, '/').split('/');
  const fileName = pathParts[pathParts.length - 1].replace(/\.md$/, '');
  const subFolder = pathParts[pathParts.length - 2] || '';
  const fileRef = subFolder ? `${subFolder}/${fileName}` : fileName;
  return `obsidian://open?vault=${encodeURIComponent(vaultName)}&file=${encodeURIComponent(fileRef)}`;
}
// → obsidian://open?vault=GaryD_1&file=5.%20Notes/Trading%20Psychology%20Fundamentals
```
Click it, and Obsidian opens directly to that note. You can read the full context, follow the Related Notes links to connected topics, check the tags, and continue your research in the graph view. The chatbot becomes a discovery tool — a fast way to find the right note — not a replacement for your actual vault.
This closes the loop between the chatbot and Obsidian. Without it, the chatbot is a dead end: you get an answer, and then what? With it, every answer is a doorway back into your knowledge graph. The answer says “Trading Psychology Fundamentals mentions pre-trade checklists,” you click through, and now you are reading the original note with all its context, links, and related notes visible.
KEY INSIGHT: Completing the loop matters. “Open in Obsidian” makes the chatbot a discovery tool, not a replacement for your notes.
The Full Journey: Every Layer Compounds

Figure 7 — The compounding architecture: each layer builds on the one below it. Remove any layer and everything above it breaks. Dependencies flow down; value flows up.
This chatbot did not appear from nowhere. It is the capstone of a system where every layer builds on the one below it, and removing any single layer would break the one above.
The throughline of this series: every layer of automation compounds. Transcript extraction enabled structured notes. Batch processing made scale affordable. Semantic indexing enabled auto-linking. Clean metadata enabled accurate retrieval. And retrieval enabled conversation. No single piece is revolutionary — but stacked together, they transform a folder of files into a living knowledge base you can talk to.
From YouTube to Knowledge Graph built the foundation. Without transcript extraction and AI-powered note generation, there would be no notes to index.
Anthropic Batch API in Production made scale affordable. Processing 1,000 notes at full API price would have cost $3.00; the Batch API cut that to $1.50.
Building a Semantic Note Network created the vector infrastructure. The Qdrant collections, the embedding pipeline, the similarity search — all of it was built for note-to-note linking. But the same infrastructure powers the chatbot.
Obsidian Vault Curation at Scale cleaned up the metadata. Three years of tag chaos — #ai, #AI, #artificialIntelligence — collapsed into a clean hierarchy of 1,040 curated tags. Clean tags and descriptions flow into the embedding text, improving retrieval quality for every question.
And now this article: the chatbot that ties it all together. You ask a question, and 2.5 seconds later you have an answer drawn from notes that were extracted, batch-processed, semantically indexed, and curated by every previous layer of the system.
That is the compound effect. Each layer is useful on its own, but the combination is more than the sum of its parts.

Figure 8 — The compound effect: each capability enables the next. Extraction enables indexing, indexing enables linking, linking enables curation, and curation enables conversation with your own knowledge.
The Series
This is Part 5 of a 5-part series on building an AI-powered knowledge management system:
- From YouTube to Knowledge Graph — Turning 1,000+ videos into an interconnected knowledge base for $1.50
- Anthropic Batch API in Production — 50% cost savings at scale, and the bug that almost corrupted everything
- Building a Semantic Note Network — Vector search turned 1,024 isolated notes into a dense knowledge graph
- Obsidian Vault Curation at Scale — Three years of tag chaos, fixed in 30 minutes for $1.50
- Ask Your Vault Anything (this article) — A RAG chatbot that answers from your notes in 2.5 seconds
The throughline: Every layer of automation compounds — extraction enables indexing, indexing enables linking, linking enables curation, and curation enables conversation with your own knowledge.