
RAG Document Assistant: From Single-Purpose Chatbot to Multi-Repository Document Platform

A RAG-based document assistant that ingests 51,000+ chunks across 4 file formats, answers natural language questions in under 3 seconds using hybrid search with cross-encoder re-ranking, and required zero frontend changes to transform from a single-purpose chatbot into a general-purpose document platform.

51,000+ document chunks indexed
Under 650ms retrieval pipeline
Under 3 second total query time
Zero frontend changes for Phase 2
4 file formats (MD, HTML, PDF, DOCX)
Python 3.11 + FastAPI · Qdrant Vector Database · sentence-transformers (all-mpnet-base-v2) · Claude API (Anthropic) · React 18 + Vite + Tailwind CSS v4 · Headless PDF conversion

RAG Document Assistant: From Single-Purpose Chatbot to Multi-Repository Document Platform

51,000+ document chunks, 4 file formats, hybrid search with cross-encoder re-ranking, and zero frontend changes to get there.


Figure 1 - RAG Document Assistant Architecture: The complete system processes documents through a parse, chunk, embed, upsert pipeline into Qdrant, then answers queries via hybrid vector and keyword search with cross-encoder re-ranking. A single Qdrant collection serves all repositories, with payload fields handling filtering and access control.


The system answers a natural language question across 51,000+ document chunks in under 3 seconds. A user types “What are the password requirements in FIPS 200?” and gets a sourced answer with clickable links to the original NIST PDF, rendered natively in the browser. The retrieval pipeline, from embedding the query through hybrid search to cross-encoder re-ranking, finishes in under 650ms. Qdrant search itself takes 50ms. The remaining ~2.3 seconds is Claude generating the answer.

Six weeks earlier, this was a single-purpose chatbot that could only answer questions about one product’s 500+ help articles. The transformation into a general-purpose document platform that handles Markdown, HTML, PDF, and DOCX required zero changes to the React frontend. That fact shaped every architectural decision that followed.

The Problem

Every organization accumulates documentation across formats and systems. HR policies live in Markdown files. Legal contracts arrive as DOCX. Compliance frameworks ship as PDF. Technical documentation is HTML. When someone needs an answer that spans these sources, they open four different applications, search each one manually, and synthesize the results themselves.

The original system solved a narrow version of this problem. It embedded 500+ help articles from a single product into Qdrant, searched them via semantic similarity, and generated answers with Claude. It worked well for that one corpus. But it was hardcoded to a single source, a single format (HTML), and a single use case.

The harder problem was making it general without making it complex. Adding a second document source should not require a second search path, a second collection, or a second viewer component. The architecture needed to absorb new formats and new repositories without multiplying the code paths that serve them.

Before: Single-Purpose Help Chatbot

| Before | After |
| --- | --- |
| 500+ help articles, one product | 51,000+ chunks across 4 repositories |
| HTML only | Markdown, HTML, PDF, DOCX |
| Hardcoded article paths | Dynamic document serving by format |
| Vector-only search | Hybrid vector + BM25 keyword with RRF fusion |
| No re-ranking | Cross-encoder re-ranking (under 100ms overhead) |
| No admin interface | 4-tab admin UI with ingestion, documents, repos, analytics |
| No analytics | SQLite query logging with gap analysis |
| Single viewer path | Format-aware viewing: native PDF, server-rendered DOCX, dark mode HTML/MD |

Our Approach

The Iframe Insight: Zero Frontend Changes

The original React frontend loaded articles in an iframe using a local_url field from the API response. The ERP viewer used the same pattern. Neither component knew or cared what was behind that URL. It could be an HTML page, a PDF file, or anything else a browser can render in an iframe.

This meant the entire Phase 2 transformation was backend work. No React components needed modification, and no new viewer components had to be built. The existing ArticleViewer iframe accepted PDF FileResponse objects from FastAPI just as readily as HTML content. We verified this early, and it held true for every format.

The constraint shaped everything. Instead of building format-specific viewer components, we built format-specific serving logic behind the same URL pattern: /help/custom/{doc_id}/view. For PDF, that endpoint returns a FileResponse with media_type="application/pdf". For Markdown and HTML, it returns transformed HTML with dark mode CSS injection. For DOCX, it returns a server-rendered PDF.

One URL pattern. One iframe. Four formats. Zero frontend changes across the entire transformation.
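That dispatch can be sketched as a small planning function. This is an illustrative sketch only: names like rendered.pdf are hypothetical, and the real endpoint wraps the result in FastAPI FileResponse/HTMLResponse objects.

```python
from pathlib import Path

def plan_view_response(file_type, doc_dir, theme="light"):
    """Decide how /help/custom/{doc_id}/view should answer for each format.

    Returns (kind, detail) tuples; a sketch of the serving logic, not the
    actual FastAPI endpoint.
    """
    doc_dir = Path(doc_dir)
    if file_type == "pdf":
        # Browser-native PDF viewing of the original file.
        return ("file", {"path": doc_dir / "original.pdf",
                         "media_type": "application/pdf"})
    if file_type == "docx":
        # Serve the PDF rendered at ingestion time (hypothetical filename).
        return ("file", {"path": doc_dir / "rendered.pdf",
                         "media_type": "application/pdf"})
    if file_type in ("md", "html"):
        # Transformed HTML, with dark-mode CSS injected when requested.
        return ("html", {"dark": theme == "dark"})
    raise ValueError(f"unsupported file type: {file_type}")
```

The frontend never sees this branching; it only ever renders whatever the URL returns.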

This same abstraction makes the system embeddable in any web application. Any site that can host an iframe, whether it is a company intranet, an internal portal, an ERP system, or a standalone tool, can embed the document assistant without modifying its own frontend. The backend API handles the complexity. The hosting application only needs to render the iframe and pass through the URL.

KEY INSIGHT: When your frontend already uses a generic container (iframe, webview, embed), new content types are a backend problem. Check the abstraction boundary before planning frontend work.


Figure 2 - Repository Selector: The chat panel dropdown lists all repositories with document counts. Users filter to a specific repo or search across all repositories. The same iframe viewer renders PDF, HTML, Markdown, and DOCX without any format-specific frontend code.

Repository System: Payload Fields, Not Separate Collections

We chose to keep all documents in a single Qdrant collection and distinguish repositories using a repo_id payload field. The alternative was one collection per repository. Separate collections would have meant separate search calls, separate result merging, and separate index management.

A single collection with payload filtering gives cross-repo search for free. A user searching “password requirements” across all repositories gets results from NIST cybersecurity PDFs, HR policy Markdown files, and legal contract DOCX files in one query. Filtering to a single repo is a one-line filter condition. Future access control maps directly to user -> [repo_ids] without collection-level permissions.

The four repositories today hold distinct document types:

| Repository | Source | Format | Documents | Chunks |
| --- | --- | --- | --- | --- |
| HR Policies | Clef + Basecamp handbooks | Markdown | 55 | 600 |
| IT Security | NIST Cybersecurity documents | PDF | 171 | 42,000 |
| Legal Contracts | Open Agreements templates | DOCX | 34 | 388 |
| Python Docs | Python 3.14 official documentation | HTML | 409 | 8,000 |

Qdrant payload indexes on repo_id, source_type, doc_id, chunk_type, and article_id keep filtered queries fast even at 51,000+ points. The indexes are created automatically on application startup.
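The repo filter itself is tiny. A sketch of the idea using Qdrant's REST-style filter shape (assuming the repo_id payload field named above; the real code builds the equivalent qdrant_client Filter objects):

```python
def build_repo_filter(repo_id=None):
    """Build a Qdrant REST-style payload filter for one repository.

    None means no filter: search every repository in the shared
    collection. Sketch of the single-collection filtering approach.
    """
    if repo_id is None:
        return None
    return {"must": [{"key": "repo_id", "match": {"value": repo_id}}]}
```

Future per-user access control is the same shape with a "match any of these repo_ids" condition instead of a single value.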


Figure 3 - One URL, Four Formats: Every document type resolves through /help/custom/{doc_id}/view. The endpoint inspects file_type from the manifest and returns the appropriate response: FileResponse for PDF, HTMLResponse for MD/HTML, or a server-rendered PDF for DOCX. The iframe handles all of them.

Hybrid Search: Vector + BM25 Keyword Matching

Vector-only search worked well for intent-based questions like “how do I create a purchase order?” It failed on exact-term queries. Searching for “FIPS 200” returned results about general cybersecurity frameworks but missed the specific FIPS 200 document. Searching for “argparse” in the Python docs returned results about command-line parsing concepts but ranked the actual argparse module documentation lower than it should have been.

The fix was hybrid search combining vector similarity with BM25 keyword matching via Qdrant’s built-in prefetch and Reciprocal Rank Fusion (RRF). The implementation uses two prefetch branches: one for pure semantic search and one for semantic search filtered to keyword matches. RRF fusion combines the rankings.

```python
results = self.client.query_points(
    collection_name=COLLECTION_NAME,
    prefetch=[
        Prefetch(query=query_vector, filter=query_filter, limit=limit * 2),
        Prefetch(query=query_vector, filter=keyword_filter, limit=limit * 2),
    ],
    query=FusionQuery(fusion=Fusion.RRF),
    limit=limit,
).points
```

No second search system. No external keyword index. Qdrant handles both vector and keyword search natively. The text index is created lazily on first use and persists across restarts.

The fallback matters too. If hybrid search fails for any reason, the system drops to vector-only search automatically. Degraded results are better than no results.
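RRF itself is simple to state: each ranked list contributes 1/(k + rank) per result, with k = 60 by convention. A pure-Python sketch of the formula (Qdrant computes this internally; this is only an illustration):

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of IDs via Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every ID it contains;
    the fused ordering is by total score, descending.
    """
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A doc that ranks well in both branches beats one that tops only one list:
rrf_fuse([["a", "b", "c"], ["b", "c", "a"]])  # → ["b", "a", "c"]
```

The constant k damps the influence of any single list's top rank, which is why RRF fuses heterogeneous scores (cosine similarity vs. BM25) without normalizing them.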

KEY INSIGHT: Hybrid search is not an optimization. It is a correctness fix. Vector-only search systematically misses exact-term queries (document IDs, acronyms, compliance codes) that users expect to find.


Figure 4 - Hybrid Search in Action: A user asks “What is FIPS 200?” The system retrieves the correct NIST document (SP800-18, Guide for Developing Security Plans) and renders the original PDF natively in the viewer iframe. The answer panel on the left shows the generated response with source attribution. The keyword “FIPS 200” matched via BM25. Without hybrid search, this exact-term query returned lower-relevance results about general cybersecurity concepts.

Cross-Encoder Re-ranking: Precision After Recall

The bi-encoder (all-mpnet-base-v2) embeds queries and documents independently. It retrieves fast but scores coarsely because it never sees the query and document together. A cross-encoder (ms-marco-MiniLM-L-6-v2) takes both the query and each candidate document as input and produces a relevance score that accounts for their interaction.

The pipeline retrieves broadly with the bi-encoder, deduplicates by document (keeping the highest-scored chunk per document), then re-ranks the deduplicated results with the cross-encoder. The re-ranking step adds approximately 100ms of latency for 5 candidate documents. Against a total query time of under 3 seconds dominated by the Claude API call, 100ms is negligible.

The cross-encoder loads lazily on first use with local_files_only=True to avoid network checks. Enabling or disabling it is a configuration toggle (RERANK_ENABLED in .env), not a code change.


Figure 5 - The Re-ranking Pipeline: The bi-encoder retrieves 15 chunks (3x the requested limit). Deduplication reduces these to 5 unique documents. The cross-encoder re-scores each against the actual query. The final ordering reflects true query-document relevance, not just embedding proximity.


Figure 6 - Source Attribution: Every answer includes clickable source links. For the “What is FIPS 200?” query, the system returned 5 NIST documents from the IT Security repository (171 documents). Clicking any source opens the original document in the viewer. The source list serves as both evidence for the answer and a navigation tool for further reading.

The DOCX Viewing Journey: Three Failures Before Success

DOCX viewing was the hardest format problem. The requirements were straightforward: render a Word document in a browser iframe with reasonable fidelity. The path to a working solution was not.

Attempt 1: HTML conversion. We tried converting DOCX to semantic HTML. The conversion preserved headings, paragraphs, lists, and tables but stripped all Word styling. The resulting HTML was functional but ugly. Legal contracts with careful formatting looked like plain text with headings. We kept this as a fallback but it was not good enough for primary viewing.

Attempt 2: Python PDF conversion. We tried converting DOCX to PDF using a Python library’s conversion capabilities. The results were no better than HTML for complex documents. Some layout elements were lost entirely.

Attempt 3: Word COM automation. We tried using Microsoft Word’s COM automation on Windows to produce perfect PDF output. It worked in testing. In production, it hung indefinitely. Word’s COM automation opens dialog boxes that block headless execution. On a server without a display, there is no way to dismiss them. The process hung until it was killed.

Attempt 4: Headless PDF rendering. We found a headless document rendering engine that runs without a GUI. It produces Word-quality PDF output, handles complex formatting, and completes in 1-2 seconds per file. No hanging. No dialogs. No COM automation.

The conversion runs at ingestion time, not at view time. Each DOCX document gets a rendered PDF for viewing and an HTML fallback. The serving endpoint checks for the PDF first.

KEY INSIGHT: When converting between document formats for viewing, the rendering engine matters more than the conversion library. The right headless rendering tool gives you Word’s quality without Word’s automation problems. We tried three approaches before finding one that worked.


Figure 7 - The DOCX Viewing Journey: Three approaches failed before a headless rendering engine solved the problem. HTML conversion was ugly. Python PDF conversion lost layout elements. Word COM automation hung on dialog boxes. The fourth approach renders Word-quality PDFs in 1-2 seconds without a GUI.

Performance: Killing 30+ Network Requests Per Query

The first query after a cold start took 5-8 seconds; instrumented, it landed around 5,530ms. The slowdown was not model loading. Both the embedding model and the cross-encoder were already cached locally. The culprit was HuggingFace Hub.

On every first use, the SentenceTransformer and CrossEncoder constructors make HTTP requests to HuggingFace to check whether newer model versions exist. For two models with multiple files each, this added up to 30+ HTTP requests. Each one included DNS resolution, TLS handshake, and response parsing. Combined with the initial hybrid-search fallback path, first-query latency sat at 5,530ms.

The fix was two lines:

```python
os.environ.setdefault("HF_HUB_OFFLINE", "1")
self.model = SentenceTransformer("all-mpnet-base-v2", local_files_only=True)
```

With local_files_only=True, the constructors skip all network checks and load directly from the local cache. If the model is not cached (first run ever), the code catches the OSError, removes the environment variable, and downloads normally. Every subsequent startup loads instantly.

This is a one-time download, permanent local cache pattern. The models do not change between application restarts. Checking for updates on every cold start was pure waste.
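The pattern generalizes beyond sentence-transformers. A sketch with an injected loader function (the real code calls the SentenceTransformer and CrossEncoder constructors directly; `load` here is a stand-in):

```python
import os

def load_cached_model(load, name):
    """Load a model from the local HF cache; hit the network only on a miss.

    `load` stands in for a model constructor that accepts
    local_files_only. Sketch of the one-time-download,
    permanent-local-cache pattern.
    """
    os.environ.setdefault("HF_HUB_OFFLINE", "1")
    try:
        return load(name, local_files_only=True)
    except OSError:
        # First run ever: the model is not cached, so allow one download.
        os.environ.pop("HF_HUB_OFFLINE", None)
        return load(name, local_files_only=False)
```

Every startup after the first download loads straight from disk with zero HTTP requests.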

The measured result: total query time dropped from 5,530ms to 3,946ms after killing the HuggingFace round-trips, then to 2,912ms once a 400 Bad Request bug in the hybrid-search path was fixed and hybrid search stopped silently falling back to vector-only. The breakdown at the end of this work:

| Step | Time |
| --- | --- |
| Embedding | ~300ms |
| Qdrant hybrid search | ~50ms |
| Cross-encoder re-ranking | ~100ms |
| Claude API generation | ~2,300ms |
| Total | ~2,912ms |

The retrieval pipeline (embed + search + re-rank) runs in under 650ms. Claude API generation is 79% of the total. That is about as fast as the system gets without switching to a smaller LLM or adopting streaming responses.


Figure 8 - Cold Start Performance Fix: Before: 30+ HTTP requests to HuggingFace on first query, total query time 5,530ms. After the offline flag and the hybrid-search fix: zero network requests to HuggingFace, hybrid search working cleanly, total query time 2,912ms with Qdrant at just 50ms.

System Architecture

RAG Document Assistant - System Architecture

Ingestion (admin): folder of docs → DocumentParser (MD/HTML/PDF/DOCX) → TextChunker (heading-aware split) → SentenceTransformer (batch encode) → Qdrant upsert (batch 100) → ManifestManager (track docs)

Search + Answer (user): natural language query → embed query (768-dim) → hybrid search (vector + BM25) → RRF fusion → dedup by doc_id → cross-encoder re-ranking → Claude API (generate answer) → answer + sources

Viewing (iframe): doc_id → manifest lookup → branch on file_type: PDF served natively (FileResponse), DOCX served as server-rendered PDF, HTML/MD served with dark mode CSS (HTMLResponse)

Storage: Qdrant (51K+ points, 5 payload indexes, single collection)
Analytics: SQLite (queries, clicks, gaps)
Config: .env (pydantic-settings)


Figure 9 - Complete System Architecture: Three data flows converge on Qdrant. Ingestion writes document chunks. Search reads and ranks them. Viewing retrieves the original file for display. All three use the same doc_id as the primary key, and all three work across all four file formats without branching logic in the frontend.

Key Achievements

| Metric | Value |
| --- | --- |
| Document chunks indexed | 51,000+ across 4 repositories and 4 file formats |
| Retrieval latency | Under 650ms for embed, hybrid search, and cross-encoder re-ranking combined |
| Qdrant hybrid search time | ~50ms against 51,000+ points with 5 payload indexes |
| Total query time | ~2,912ms end-to-end (~2.3s of that is Claude API generation) |
| Cold start improvement | 5,530ms → 2,912ms total query time after killing HuggingFace checks and fixing hybrid search |
| Frontend changes for Phase 2 | Zero React components modified for the entire transformation |
| Document formats supported | 4 (Markdown, HTML, PDF, DOCX) with format-aware viewing |
| DOCX-to-PDF conversion time | 1-2 seconds per file via headless rendering |
| Payload indexes | 5 fields indexed for near-instant filtered queries at scale |
| Operating cost | ~$0 fixed infrastructure (Qdrant via Docker, models run locally); Claude API is the only variable cost, billed per query |
| Python HTML docs | 409 files ingested and searchable with dark mode support |
| NIST cybersecurity PDFs | 171 documents, 42,000 chunks, browser-native PDF viewing |
| Legal contracts (DOCX) | 34 templates, Word-quality PDF rendering in-app |
| HR policy docs (Markdown) | 55 files with HTML conversion and dark mode CSS |

Technical Deep Dives

The Ingestion Pipeline: Parse, Chunk, Embed, Upsert

The DocumentIngestor orchestrates four steps for every file. The parser extracts plain text and a title using format-specific libraries for each file type. The chunker splits text on heading boundaries first, then paragraph boundaries, targeting 500-1,000 characters per chunk with overlap. The embedding model encodes all chunks in a single batch call (batch_size=64). Qdrant receives the points in batches of 100.
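The heading-aware split can be sketched for Markdown-style headings. This is a simplified illustration: the real TextChunker also adds overlap between chunks and targets the 500-1,000 character range.

```python
import re

def chunk_text(text, max_len=1000):
    """Heading-aware splitter sketch: break on Markdown headings first,
    then split oversize sections on paragraph boundaries."""
    # Split at the start of any line beginning with '# ' .. '###### '.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks = []
    for section in sections:
        if not section.strip():
            continue
        if len(section) <= max_len:
            chunks.append(section.strip())
            continue
        # Oversize section: pack paragraphs up to max_len per chunk.
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) > max_len:
                chunks.append(buf.strip())
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(buf.strip())
    return chunks
```

Splitting on headings first keeps each chunk topically coherent, which matters for both embedding quality and the usefulness of retrieved context.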

Each ingested file gets a UUID-based doc_id. The original file is copied to data/custom/docs/{uuid}/original.{ext}. A rendered version (HTML or PDF) is created alongside it for viewing. The manifest tracks every document with its repo, title, file type, source path, chunk count, and file hash.

This is a synchronous pipeline today. For 55 Markdown files, it completes in seconds. For 171 NIST PDFs, it took long enough that the HTTP request risked timing out. Async ingestion with job tracking is the next planned improvement.


Figure 10 - Admin Ingest Tab: The ingestion interface accepts a repository selection and a folder path. Clicking “Ingest Folder” triggers the full pipeline: scan for supported files (.md, .html, .pdf, .docx), parse each one, chunk the text, embed in batch, and upsert to Qdrant. The four supported formats are listed in the description text.

Admin UI: Four Tabs for Document Management

The admin interface provides four capabilities in a tabbed layout. The Ingest tab accepts a folder path and a repository selector, then runs the ingestion pipeline. The Documents tab lists all ingested documents with metadata, and supports delete and re-ingest operations per document. The Repositories tab manages repo creation and deletion (deleting a repo removes all its documents from Qdrant). The Analytics tab shows query volume, top queries, average latency, and a zero-result gap analysis that identifies questions users asked that the corpus could not answer.


Figure 11 - Document Management: The Documents tab lists all 669 ingested documents across the four repositories. Each row shows the document title, file type (MD, HTML, PDF, DOCX), repository assignment, chunk count, and ingestion date. The action buttons on the right allow deleting a document (removes all its chunks from Qdrant and files from disk) or re-ingesting it (re-parse, re-chunk, re-embed with fresh content).


Figure 12 - Repository Management: The Repositories tab shows all four active repositories with their IDs, descriptions, and document counts. HR Documents holds 55 Markdown files. IT Security holds 171 NIST PDFs. Legal Contracts holds 34 DOCX templates. Python docs holds 409 HTML files. The create form above the table accepts an ID (lowercase, no spaces), display name, and description. Deleting a repository removes all its documents from Qdrant.

Analytics are SQLite-backed with append-only event logging. Every chat and search query records the query text, repo filter, result count, and latency. Source clicks record which documents users actually opened. The gap analysis surfaces queries with zero results, sorted by frequency, so administrators know which documentation to add next.


Figure 13 - Search Analytics: The Analytics tab reveals usage patterns. During initial testing, 14 chat queries were logged with an average latency of 4,580ms. The top queries list shows what users ask most frequently. Usage by repository shows which document sets get the most traffic. The Content Gaps section surfaces zero-result queries, highlighting where the corpus needs expansion. These gaps become the roadmap for which documents to add next.

Dark Mode CSS Injection: Server-Side Theme Support

HTML and Markdown documents are served with optional dark mode support. The approach is server-side CSS injection, not client-side theme switching. The viewing endpoint checks for a ?theme=dark query parameter and injects a <style> block before the closing </head> tag (or </body>, or prepends it if neither tag exists).

This pattern was inherited from the original help article viewer, which also stripped external hyperlinks (they would 404 outside the source application). The same injection approach extended naturally to custom documents.

The CSS overrides are aggressive (!important on all rules) because ingested documents carry their own stylesheets. Background, text color, link color, code blocks, tables, blockquotes, and image opacity all get dark mode treatment. It is not a full theme system. It is a reliable dark mode that works on arbitrary HTML content without knowing the document’s original stylesheet.
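The injection itself is a string operation. A minimal sketch with abbreviated CSS (the real override list also covers code blocks, tables, blockquotes, and image opacity):

```python
DARK_CSS = """<style>
  /* Aggressive overrides: ingested docs ship their own stylesheets. */
  body { background: #111 !important; color: #ddd !important; }
  a { color: #7ab8ff !important; }
</style>"""

def inject_dark_css(html):
    """Insert the dark-mode <style> before </head>, else </body>,
    else prepend it. Mirrors the serving endpoint's ?theme=dark path."""
    for tag in ("</head>", "</body>"):
        if tag in html:
            return html.replace(tag, DARK_CSS + tag, 1)
    return DARK_CSS + html
```

Because the server does the injection, the iframe content needs no JavaScript and no knowledge of the host page's theme state.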

KEY INSIGHT: Server-side CSS injection for iframe content is simpler and more reliable than trying to communicate theme state across the iframe boundary. The server already knows the theme from the query parameter. The iframe content never needs JavaScript.

Technologies

Backend:

  • Python 3.11+ with FastAPI (async endpoints, pydantic request validation)
  • Qdrant Vector Database (hybrid search, payload indexes, single collection architecture)
  • SQLite (append-only analytics logging, zero-result gap analysis)

AI and Search:

  • sentence-transformers all-mpnet-base-v2 (768-dimensional embeddings, local-only loading)
  • cross-encoder/ms-marco-MiniLM-L-6-v2 (re-ranking, approximately 100ms per 5 candidates)
  • Claude API via Anthropic SDK (answer generation from retrieved context)
  • Reciprocal Rank Fusion (vector + BM25 keyword search combination)

Document Processing:

  • Format-specific parsers for HTML, PDF, DOCX, and Markdown text extraction
  • Headless PDF rendering engine (DOCX-to-PDF conversion, 1-2 seconds per file)
  • Server-side HTML conversion for Markdown and DOCX fallback viewing

Frontend:

  • React 18 with TypeScript (Vite 5 build, TanStack Query for data fetching)
  • Tailwind CSS v4 with shadcn/ui components
  • Iframe-based document viewer (format-agnostic, zero changes for Phase 2)

Infrastructure:

  • Docker Compose for Qdrant
  • uv for Python dependency management
  • pydantic-settings for configuration from .env
  • concurrently for parallel backend + frontend development startup

Features

  • Multi-Repository Search: Select a repository or search across all repos simultaneously. Cross-repo queries pull from Markdown, HTML, PDF, and DOCX sources in a single search call
  • 4-Format Document Ingestion: Point at a folder of Markdown, HTML, PDF, or DOCX files and ingest them into any repository. Parse, chunk, embed, and upsert in one API call
  • Source Attribution: Every answer includes clickable links to the specific documents it drew from. Clicking a source opens the original document in the viewer
  • Admin UI with 4 Tabs: Ingest documents, manage individual files (delete, re-ingest), create and delete repositories, and view search analytics with gap analysis
  • Search Analytics and Gap Analysis: SQLite logs every query, its result count, latency, and source clicks. Zero-result queries surface as documentation gaps, sorted by frequency
  • Pluggable LLM Backend: Swap between Anthropic (Claude) and Requesty (OpenAI-compatible) providers via a single environment variable. No code changes required
  • Lazy Singleton Initialization: The HelpAgent, embedding model, and cross-encoder load on first request, not at startup. Cold starts are fast. Model downloads happen once and are cached permanently
  • Graceful Degradation: If hybrid search fails, the system falls back to vector-only search. If the cross-encoder is disabled, results skip re-ranking. If the PDF renderer is unavailable, DOCX files fall back to HTML conversion

What Happens Next

The synchronous ingestion pipeline is the primary bottleneck. Ingesting 171 NIST PDFs pushed against HTTP request timeout limits. Phase 2d adds async ingestion with job tracking, ProcessPoolExecutor parallelism for PDF parsing, and file hash-based change detection for incremental re-ingestion.

The architecture supports these additions without structural changes. The DocumentIngestor already processes files individually. Wrapping each file in an async task and reporting progress through a job status endpoint is additive work, not a redesign.
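That additive change can be sketched with asyncio. The job dict and per-file callable here are hypothetical shapes; asyncio.to_thread keeps the existing synchronous per-file ingestion code untouched:

```python
import asyncio

async def ingest_folder_async(files, ingest_one, job):
    """Run synchronous per-file ingestion in worker threads while a
    shared `job` dict tracks progress for a status endpoint.

    Sketch of the planned async pipeline; `ingest_one` stands in for
    the existing DocumentIngestor's per-file path.
    """
    job.update(total=len(files), done=0, errors=[])

    async def run(path):
        try:
            await asyncio.to_thread(ingest_one, path)
        except Exception as exc:
            job["errors"].append((path, str(exc)))
        finally:
            # Runs on the event loop thread, so no lock is needed.
            job["done"] += 1

    await asyncio.gather(*(run(p) for p in files))
    return job
```

A status endpoint would read `job["done"]` and `job["total"]` to report progress, and the HTTP request that kicks off ingestion returns immediately instead of racing a timeout.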

The system is designed to work two ways: as a standalone application accessible over the network, or embedded into an existing web application. Any site that can render an iframe, whether a company intranet, an internal portal, a customer-facing knowledge base, or an ERP system, can embed the document assistant without modifying its own frontend. The API handles all search, retrieval, and document rendering. The hosting application only needs to pass through the URL.

The larger opportunity is access control. The single-collection, payload-filtered architecture was chosen specifically because it maps to user -> [repo_ids] permissions. Adding authentication and per-user repo access is a filter condition change, not an architectural migration.
