RAG: Grounding AI with Real-World Knowledge
Part 2 of 4 in the Introductory AI series

As large language models (LLMs) become more capable, they still face a fundamental limitation: they cannot access information beyond their training cutoff date. When asked about events after that cutoff, LLMs will either be unable to answer or will confidently hallucinate one.

This challenge is a critical barrier preventing the widespread adoption of LLMs in environments where accuracy matters. This is where Retrieval-Augmented Generation (RAG) comes into play, fundamentally transforming how we deploy LLMs by grounding their responses in real, verifiable data.

Instead of relying solely on parameters learned during training, RAG-enabled systems dynamically fetch relevant information from external sources, incorporating this context into their responses.

In this article, we’ll dive into:

  • What RAG is and how it enhances LLMs
  • The technical architecture that powers RAG systems
  • Various implementation approaches and their tradeoffs
  • Key metrics for evaluating RAG performance
  • Practical applications across different industries
  • Strategies for integrating RAG into AI systems
  • The benefits, limitations, and future directions of RAG

Understanding Retrieval-Augmented Generation#

What is RAG?#

Essentially, Retrieval-Augmented Generation is an AI framework that enhances the performance of generative models by integrating external data sources into their response generation process.

The process involves two core components working in harmony: retrieval and generation. During the retrieval phase, the system searches through external databases or knowledge repositories to find information relevant to the user’s query. This might involve keyword matching, but more commonly uses sophisticated vector similarity searches that understand semantic meaning. In the generation phase, the retrieved information is seamlessly incorporated into the LLM’s input, providing crucial context that shapes the final response.

Here’s a simple example to demonstrate the difference:

# Traditional LLM approach (without RAG)
def traditional_llm_query(question):
    # LLM only has access to its training data
    response = llm.generate(prompt=question)
    return response

# RAG-enhanced approach
def rag_query(question, knowledge_base):
    # Step 1: Retrieve relevant documents
    relevant_docs = retrieve_similar_documents(question, knowledge_base)

    # Step 2: Augment the prompt with retrieved context
    augmented_prompt = f"""
    Context: {' '.join(relevant_docs)}
    Question: {question}
    Please answer based on the provided context.
    """

    # Step 3: Generate response with additional context
    response = llm.generate(prompt=augmented_prompt)
    return response

Why RAG Enhances Large Language Models#

LLMs, despite their impressive capabilities, suffer from several critical shortcomings:

Knowledge Cutoff: Every LLM has a training cutoff date. GPT-4, for instance, might have comprehensive knowledge up to a certain point, but ask about events after that date, and it’s operating blind. This affects any domain where information changes rapidly, like financial markets or medical research.

Hallucinations: Perhaps more concerning than ignorance is the LLM’s tendency to generate plausible-sounding but entirely fabricated information. An LLM might confidently cite a research paper that doesn’t exist or quote statistics it has essentially made up. These hallucinations can have serious consequences in healthcare or legal applications where accuracy is paramount.

Lack of Personalization: Traditional LLMs can’t access private or organization-specific information. They might know general business principles but can’t reference your company’s specific policies, procedures, or data.

RAG addresses these issues by grounding the generation process in retrieved factual data. When asked about recent events, a RAG system can pull from updated news sources. When queried about company policies, it can reference the actual policy documents. This dynamic access to external knowledge transforms LLMs from isolated oracles into connected, context-aware assistants.

Technical Architecture of RAG#

Figure 1: RAG Pipeline Architecture – This flowchart depicts the complete RAG pipeline from user query through document processing, chunking, embedding creation, retrieval, and final response generation, showing how each component interconnects to deliver contextually grounded answers.

Overview of the RAG Pipeline#

The RAG pipeline is a sophisticated integration of multiple components, each playing a crucial role in delivering accurate and contextually relevant responses. Understanding this architecture is key to implementing effective RAG systems.

Document Processing#

The RAG pipeline begins with ingesting external data sources. This data can be unstructured like PDFs and web pages, semi-structured like CSV files, or fully structured databases. The processing stage transforms this diverse content into a format suitable for efficient retrieval.

During this phase, documents undergo several transformations:

  • Text extraction: Converting various formats into plain text
  • Cleaning: Removing formatting artifacts, headers, footers
  • Normalization: Standardizing date formats, acronyms, and terminology
  • Metadata extraction: Capturing document properties like creation date, author, category
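
To make these steps concrete, here is a minimal sketch of a processing function; it assumes plain text has already been extracted, and the function and field names are illustrative rather than taken from any particular library:

import re
from datetime import datetime

def process_document(raw_text, source, author=None):
    """Clean raw text and attach basic metadata (illustrative sketch)."""
    # Cleaning: collapse repeated whitespace and excessive blank lines
    text = re.sub(r"[ \t]+", " ", raw_text)
    text = re.sub(r"\n{3,}", "\n\n", text).strip()

    # Normalization: standardize a simple date pattern (MM/DD/YYYY -> YYYY-MM-DD)
    text = re.sub(
        r"\b(\d{2})/(\d{2})/(\d{4})\b",
        lambda m: f"{m.group(3)}-{m.group(1)}-{m.group(2)}",
        text,
    )

    # Metadata extraction: capture document properties for later filtering
    metadata = {
        "source": source,
        "author": author,
        "ingested_at": datetime.utcnow().isoformat(),
        "length_chars": len(text),
    }
    return {"text": text, "metadata": metadata}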

Chunking#

One of the most critical yet often overlooked aspects of RAG is chunking: dividing documents into smaller, semantically coherent segments. Since LLMs have limited context windows, retrieving entire documents would quickly exhaust the available space. Additionally, most queries only require specific information from a document, not the entire content.

Effective chunking strategies include:

  • Fixed-size chunking: Splitting text every N characters or tokens
  • Semantic chunking: Using NLP techniques to identify topic boundaries
  • Hierarchical chunking: Preserving document structure (chapters, sections, paragraphs)
  • Sliding window chunking: Creating overlapping chunks to preserve context at boundaries
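
As a small illustration of the fixed-size and sliding-window strategies, the sketch below splits text into overlapping word-based chunks (a real pipeline would count tokens with the embedding model’s tokenizer rather than words):

def sliding_window_chunks(text, chunk_size=200, overlap=50):
    """Split text into overlapping word-based chunks (illustrative sketch)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break
    return chunks

# Example: 200-word chunks, with 50 words repeated at each boundary to preserve context
# chunks = sliding_window_chunks(document_text)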

Embedding Creation#

Once we have our chunks, the next step transforms them from text into numerical representations called embeddings. These high-dimensional vectors capture the semantic meaning of the text, enabling mathematical operations like similarity comparison.

Here is a brief example of embedding creation:

from sentence_transformers import SentenceTransformer

# Initialize embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def create_embeddings(chunks):
    """
    Transform text chunks into vector embeddings
    """
    embeddings = []
    for chunk in chunks:
        # Convert text to vector representation
        # The model outputs a 384-dimensional vector for this example
        embedding = model.encode(chunk)
        embeddings.append({
            'text': chunk,
            'embedding': embedding
        })
    return embeddings

# Example usage
chunks = [
    "RAG enhances LLMs by providing external context.",
    "Vector databases store embeddings for efficient retrieval.",
    "Chunking strategies affect retrieval quality."
]
embedded_chunks = create_embeddings(chunks)
print(f"Created {len(embedded_chunks)} embeddings")
print(f"Embedding dimension: {embedded_chunks[0]['embedding'].shape}")

Output:

Created 3 embeddings
Embedding dimension: (384,)

Figure 2: RAG-Enhanced LLM Flow – This diagram shows how a user query flows through the RAG system, with the retrieval system accessing external knowledge sources to augment the LLM prompt, resulting in responses grounded in retrieved data instead of solely relying on training knowledge.

Retrieval Operations#

The retrieval mechanism begins when a user submits a query. The query itself is converted into an embedding using the same model that processed the documents. This ensures that query and document embeddings exist in the same vector space, making similarity comparisons meaningful.

The retrieval process typically involves:

  1. Query embedding: Converting the user’s question into a vector
  2. Similarity search: Finding the most similar document embeddings
  3. Ranking: Ordering results by relevance score
  4. Filtering: Applying any metadata-based constraints

Most RAG systems use cosine similarity or Euclidean distance to measure how closely a query matches stored documents. The choice of similarity metric can significantly impact retrieval quality.
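
To make this concrete, here is a minimal retrieval function using cosine similarity; it assumes the embedded_chunks and model from the embedding example above, and the function name is our own:

import numpy as np

def retrieve_top_k(query, embedded_chunks, model, k=3):
    """Return the k chunks whose embeddings are closest to the query (cosine similarity)."""
    query_vec = model.encode(query)
    scored = []
    for item in embedded_chunks:
        doc_vec = item["embedding"]
        # Cosine similarity: dot product divided by the product of vector norms
        score = np.dot(query_vec, doc_vec) / (
            np.linalg.norm(query_vec) * np.linalg.norm(doc_vec)
        )
        scored.append((score, item["text"]))
    # Ranking: order results by relevance score, highest first
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Example usage, reusing the SentenceTransformer model from the embedding step:
# top_chunks = retrieve_top_k("How does RAG reduce hallucinations?", embedded_chunks, model)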

Contextualization#

Retrieved chunks rarely work well in isolation. The contextualization phase enriches them with additional information to create a coherent context for the LLM. This might involve:

  • Adding document metadata (title, date, source)
  • Including surrounding text for better context
  • Ordering chunks by relevance or logical flow
  • Summarizing multiple chunks if too many are retrieved
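
A minimal contextualization step might look like the sketch below; it assumes each retrieved chunk carries the text and metadata captured during document processing (the field names are illustrative):

def build_context(retrieved_chunks, max_chars=4000):
    """Format retrieved chunks and their metadata into a single context string (sketch)."""
    parts = []
    total = 0
    # Chunks are assumed to arrive already ordered by relevance
    for chunk in retrieved_chunks:
        header = f"[Source: {chunk['metadata'].get('source', 'unknown')}]"
        block = f"{header}\n{chunk['text']}\n"
        if total + len(block) > max_chars:
            break  # Stay within a rough character budget for the context window
        parts.append(block)
        total += len(block)
    return "\n".join(parts)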

Response Generation#

In the final phase, the LLM receives an augmented prompt containing both the original query and the retrieved context. The challenge here is to craft prompts that effectively guide the model to use the provided information while keeping the response natural and coherent.

Implementation Approaches for RAG#

Basic RAG Architecture#

The simplest form of RAG implements a straightforward pipeline: retrieve relevant documents, append them to the prompt, and generate a response. While this basic approach can be surprisingly effective for many use cases, it has limitations when dealing with complex queries or nuanced information needs.

Basic RAG works well when:

  • Queries are straightforward and well-defined
  • Retrieved documents are highly relevant
  • The required information is contained within a few chunks
  • Context window limitations aren’t a constraint

However, as applications grow more sophisticated, several advanced approaches have emerged to address these limitations.

Hybrid Search Techniques#

One significant enhancement to basic RAG involves combining multiple retrieval methods. Pure vector similarity search excels at capturing semantic meaning but can miss exact matches for specific terms. Conversely, keyword-based search finds exact matches but struggles with synonyms or paraphrases.

Hybrid search leverages both approaches:

  • Vector search for semantic understanding
  • Keyword search (BM25, TF-IDF) for precision matching
  • Metadata filtering for domain-specific constraints

The results from different search methods are typically combined using weighted scoring or rank fusion techniques, providing more comprehensive retrieval coverage.
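
Reciprocal rank fusion (RRF) is one widely used rank-fusion technique; the sketch below merges ranked lists of document IDs, assuming each retriever simply returns an ordered list:

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple ranked lists of document IDs using reciprocal rank fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); k dampens the influence of any single list
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse vector-search and BM25 keyword-search results
# fused_ids = reciprocal_rank_fusion([vector_results, bm25_results])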

Self-Query Retrieval#

Instead of directly searching with the user’s query, self-query systems first analyze the query to extract metadata filters and search parameters.

For example, a query like “What were Apple’s Q3 2024 earnings?” would be decomposed into:

  • Semantic search: “earnings financial results”
  • Metadata filters: company=“Apple”, time_period=“Q3 2024”
  • Document type: “earnings report” or “financial statement”

This approach significantly improves precision by narrowing the search space before applying vector similarity.
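
A minimal sketch of this decomposition uses the LLM itself to produce a structured search request; the prompt and field names below are our own illustration rather than a specific library’s API, and llm.generate is the same placeholder used in earlier examples:

import json

def build_structured_query(user_query, llm):
    """Ask the LLM to split a question into semantic text and metadata filters (sketch)."""
    prompt = f"""
Extract a search request from the question below. Respond only with JSON containing
"semantic_query" (text to use for vector search) and "filters" (metadata constraints).

Question: {user_query}
"""
    raw = llm.generate(prompt=prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to plain semantic search if the LLM output is not valid JSON
        return {"semantic_query": user_query, "filters": {}}

# For "What were Apple's Q3 2024 earnings?" the model would ideally return something like:
# {"semantic_query": "earnings financial results",
#  "filters": {"company": "Apple", "time_period": "Q3 2024"}}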

Query Transformation Techniques#

Not all user queries are created equal. Some are vague, others overly specific, and many could benefit from refinement before retrieval. Query transformation techniques address this by modifying or expanding the original query to improve retrieval results.

Common transformation strategies include:

  • Query expansion: Adding related terms or synonyms
  • Query decomposition: Breaking complex questions into sub-queries
  • Query rewriting: Clarifying ambiguous or poorly formed questions
  • Back-translation: Generating multiple query variations
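
As one illustration of query expansion and rewriting, the sketch below asks the LLM for several rewrites of the question, retrieves with each, and merges the results; llm.generate is the placeholder used earlier, and a retriever with a search(query, top_k=...) method returning document IDs or text strings is assumed:

def multi_query_retrieve(question, llm, retriever, n_variations=3, top_k=5):
    """Retrieve with several rewrites of the question and merge the results (sketch)."""
    prompt = (
        f"Rewrite the following question in {n_variations} different ways, one per line:\n"
        f"{question}"
    )
    variations = [question] + [
        line.strip() for line in llm.generate(prompt=prompt).splitlines() if line.strip()
    ]
    seen, merged = set(), []
    for variant in variations:
        for doc in retriever.search(variant, top_k=top_k):
            if doc not in seen:  # de-duplicate documents retrieved by multiple variants
                seen.add(doc)
                merged.append(doc)
    return merged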

Hypothetical Document Embeddings (HyDE)#

HyDE represents one of the more innovative approaches to improving retrieval accuracy. Instead of searching directly with the query embedding, HyDE first generates a hypothetical answer to the question, then uses that hypothetical document’s embedding for retrieval.

The intuition is clever: a hypothetical answer, even if not entirely accurate, will be more similar in style and content to actual documents than a short query. This can bridge the semantic gap between how users ask questions and how information is stored in documents.
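
A minimal HyDE sketch, reusing the llm placeholder and the retrieve_top_k helper from the retrieval example above, looks like this:

def hyde_retrieve(question, llm, model, embedded_chunks, k=3):
    """HyDE sketch: embed a hypothetical answer instead of the raw question."""
    # Step 1: Generate a hypothetical answer (it need not be factually correct)
    hypothetical = llm.generate(
        prompt=f"Write a short passage that answers the question:\n{question}"
    )
    # Step 2: Use the hypothetical passage's embedding to search the document collection
    return retrieve_top_k(hypothetical, embedded_chunks, model, k=k)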

Evaluation Metrics for RAG Performance#

Core Metrics for Retrieval Components#

Evaluating RAG systems requires careful consideration of both retrieval and generation performance. For the retrieval component, traditional information retrieval metrics provide a solid foundation:

Retrieval Accuracy: The proportion of retrieved documents that contain relevant information for answering the query. This fundamental metric tells us whether our retrieval system is finding the right information.

Relevance Scores: More nuanced than binary relevance, these scores measure how closely retrieved documents align with the query intent. Modern evaluation frameworks often use graded relevance (highly relevant, somewhat relevant, not relevant) rather than simple binary judgments.

Precision@K: The fraction of retrieved documents in the top K results that are relevant. This metric is particularly important for RAG systems since only a limited number of documents can fit within the LLM’s context window.

Recall@K: The fraction of all relevant documents that appear in the top K results. While perfect recall is rarely necessary for RAG (we don’t need every relevant document, just enough to answer the query), very low recall indicates retrieval problems.
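
Both metrics are simple to compute once relevance judgments exist for a set of test queries; a minimal sketch:

def precision_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top-k results."""
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(relevant_ids)

# Example: 2 of the top 5 results are relevant, out of 4 relevant documents overall
# precision_at_k(results, relevant, 5) -> 0.4, recall_at_k(results, relevant, 5) -> 0.5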

Metrics for Generation Components#

The generation phase introduces its own evaluation challenges. Unlike retrieval, where relevance can be somewhat objectively assessed, generation quality involves multiple dimensions:

Response Coherence: Does the generated response flow logically? Are ideas connected sensibly? Coherence metrics evaluate the internal consistency of the generated text.

Content Coverage: How completely does the response address the user’s query? This metric assesses whether all aspects of multi-part questions receive attention.

Factual Accuracy: Does the response accurately reflect the information in the retrieved documents? This includes both avoiding hallucinations and correctly interpreting the source material.

Source Attribution: Can the response’s claims be traced back to specific retrieved documents? Proper attribution is crucial for building trust and enabling verification.

Latency and Efficiency Metrics#

Performance is not only about response quality; speed and resource usage matter just as much. Key metrics include:

  • End-to-end latency: Total time from query submission to response delivery
  • Retrieval latency: Time spent finding relevant documents
  • Generation latency: Time spent producing the response
  • Throughput: Queries handled per second under load
  • Resource utilization: CPU, memory, and GPU usage patterns
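
One simple way to break end-to-end latency into its retrieval and generation components is to time each phase separately; the sketch below reuses the rag_query structure (and placeholders) from the opening example:

import time

def timed_rag_query(question, knowledge_base):
    """Measure retrieval and generation latency separately (illustrative sketch)."""
    start = time.perf_counter()
    relevant_docs = retrieve_similar_documents(question, knowledge_base)
    retrieval_latency = time.perf_counter() - start

    start = time.perf_counter()
    response = llm.generate(
        prompt=f"Context: {' '.join(relevant_docs)}\nQuestion: {question}"
    )
    generation_latency = time.perf_counter() - start

    return response, {
        "retrieval_s": retrieval_latency,
        "generation_s": generation_latency,
        "end_to_end_s": retrieval_latency + generation_latency,
    }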

Use Cases Across Industries#

RAG has found applications across many industries, each leveraging its unique ability to combine pre-trained language understanding with dynamic, domain-specific knowledge access.

Enterprise Search#

Organizations are drowning in data scattered across various systems. Traditional search returns a list of potentially relevant documents, leaving users to dig through each one. RAG-powered enterprise search transforms this experience entirely.

Instead of presenting document links, these systems directly answer questions by retrieving relevant sections from multiple documents and synthesizing coherent answers. This capability is particularly valuable for:

  • Onboarding new employees who need quick answers about company procedures
  • Executives requiring rapid insights from across the organization
  • Compliance officers searching for policy violations or regulatory requirements

Customer Support#

In customer service, the difference between a frustrated customer and a satisfied one often comes down to how quickly and accurately their issues are resolved. RAG systems are revolutionizing customer support by providing agents with instant access to relevant information from knowledge bases, past tickets, and product documentation.

Consider a customer asking about a specific error message in a software product. A RAG-enabled support system can:

  1. Search through technical documentation for that exact error
  2. Retrieve similar resolved tickets
  3. Find relevant sections from troubleshooting guides
  4. Generate a comprehensive response that addresses the specific issue

This approach dramatically reduces resolution time while ensuring consistency in support quality.

Document Question Answering#

Legal professionals analyzing contracts, researchers reviewing literature, or analysts processing reports all benefit from RAG’s ability to extract specific answers from large document collections.

Unlike traditional keyword searches that return entire documents, RAG-powered QA systems provide direct answers with citations. A lawyer asking “What are the termination conditions in the Acme Corp contract?” receives not just the relevant section but a natural language summary with specific clause references.

Knowledge Management Systems#

RAG-powered knowledge management systems create dynamic, queryable repositories that adapt to how people naturally seek information. These systems excel in domains like:

  • Healthcare: Retrieving relevant clinical guidelines, research findings, and treatment protocols based on specific patient presentations
  • Manufacturing: Accessing maintenance procedures, safety protocols, and troubleshooting guides for specific equipment models
  • Finance: Finding relevant regulations, compliance requirements, and internal policies for specific scenarios

Integration with AI Systems#

The Role of RAG in Modern AI Architecture#

RAG has evolved from an interesting research concept to an essential component in production AI systems. It serves as the bridge between the linguistic capabilities of LLMs and the specific, current information needs of real-world applications.

In modern AI architectures, RAG typically functions as a middleware layer:

class RAGMiddleware:
    def __init__(self, retriever, llm, config):
        self.retriever = retriever
        self.llm = llm
        self.config = config

    def process_query(self, query, context=None):
        """
        Process a query through the RAG pipeline
        """
        # Step 1: Analyze query intent
        query_metadata = self.analyze_query(query)

        # Step 2: Retrieve relevant documents
        if self.should_retrieve(query_metadata):
            retrieved_docs = self.retriever.search(
                query,
                filters=query_metadata.get('filters'),
                top_k=self.config['retrieval_top_k']
            )
        else:
            retrieved_docs = []

        # Step 3: Generate response
        response = self.generate_response(query, retrieved_docs, context)
        return {
            'answer': response,
            'sources': self.format_sources(retrieved_docs),
            'confidence': self.calculate_confidence(response, retrieved_docs)
        }

Performance Considerations#

The performance of RAG systems depends on careful optimization across both hardware and software dimensions.

Hardware Factors:

  • CPU performance affects document processing and embedding generation
  • RAM capacity determines how many embeddings can be kept in memory
  • GPU acceleration can significantly speed up embedding generation and vector operations
  • Storage I/O impacts document loading and index access times

Software Optimization:

  • Embedding model selection: Balancing quality versus speed
  • Batch processing: Grouping operations to maximize throughput
  • Caching strategies: Storing frequently accessed embeddings and results
  • Index optimization: Choosing appropriate data structures for vector search
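
As a small example of the caching strategy above, query embeddings can be memoized so that repeated questions skip the embedding model entirely; model here is the SentenceTransformer instance from the embedding example:

from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_query_embedding(query_text):
    """Cache query embeddings so repeated questions skip the embedding model."""
    # tuple() returns an immutable copy, so cached values cannot be mutated by callers
    return tuple(model.encode(query_text))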

Scaling RAG Systems#

As RAG applications grow from prototypes to production systems serving millions of queries, scaling becomes crucial. Common scaling strategies include:

Horizontal Scaling: Distributing the retrieval workload across multiple servers. This typically involves:

  • Sharding the document collection across nodes
  • Load balancing queries across retrieval servers
  • Implementing distributed caching layers

Vertical Optimization: Maximizing single-node performance through:

  • GPU acceleration for embedding operations
  • Optimized vector indexes (HNSW, IVF)
  • Efficient memory management
  • Query result caching

Hybrid Approaches: Combining hot and cold storage tiers:

  • Frequently accessed documents in high-speed memory
  • Archived content in cheaper storage
  • Intelligent prefetching based on query patterns

Benefits and Limitations#

Advantages of RAG#

RAG offers several compelling benefits such as:

  1. Reduced Hallucinations: By grounding responses in retrieved factual data, RAG significantly reduces the tendency of LLMs to generate plausible but false information.
  2. Access to Private/Recent Data: RAG allows LLMs to work with information they’ve never seen during training. This transforms LLMs from static knowledge repositories to dynamic information systems.
  3. Enhanced Accuracy: The combination of retrieval and generation typically produces more accurate responses than either approach alone. The retrieval component provides factual grounding, while the generation component handles natural language understanding and synthesis.
  4. Scalability: Unlike fine-tuning, which requires retraining models as new information becomes available, RAG systems can be updated simply by adding new documents to the retrieval dataset. This makes maintaining current information significantly more efficient.
  5. Transparency and Verifiability: RAG systems can provide citations for their responses, allowing users to verify information and dig deeper into source materials when needed.

Challenges and Limitations#

Despite its advantages, RAG faces several significant challenges:

  1. Context Window Constraints: LLMs have finite input lengths, which limits how much retrieved information can be provided. As models expand context windows, this becomes less constraining, but it remains a fundamental limitation.
  2. Retrieval Quality Issues: The system is only as good as its retrieval component. Poor retrieval leads to irrelevant context, which can confuse the model or lead to incorrect responses. Issues include:
    • Semantic gaps between queries and documents
    • Difficulty with multi-step reasoning
    • Challenges with temporal or conditional information
  3. Computational Overhead: RAG systems require additional infrastructure beyond the LLM itself, such as vector databases, embedding models, and retrieval pipelines, all of which add complexity and computational cost.
  4. Integration Complexity: Building production-ready RAG systems requires carefully orchestrating multiple components, handling edge cases, and ensuring robust performance under load.
  5. Latency Considerations: The retrieval step adds latency to response generation. While often acceptable, this can be problematic for real-time applications.

Future Directions#

Multi-Modal RAG#

The future of RAG extends beyond text. Emerging systems are beginning to incorporate additional data types, such as images, video, audio, and structured data, into both the retrieval and generation phases.

Real-Time Knowledge Updates#

Current RAG systems typically work with static document collections updated periodically. Future systems will likely incorporate streaming data sources, enabling real-time knowledge updates. This could include:

  • Live news feeds
  • Real-time sensor data
  • Dynamic knowledge graphs
  • Continuous learning from user interactions

Improved Reasoning Capabilities#

While current RAG excels at finding and presenting relevant information, future systems will likely demonstrate enhanced reasoning capabilities:

  • Multi-step inference across multiple retrieved documents
  • Temporal reasoning about how information changes over time
  • Causal reasoning about relationships between events
  • Hypothetical reasoning based on retrieved patterns

Personalized RAG#

Future RAG systems will likely adapt to individual users or use cases:

  • Learning from user feedback to improve retrieval
  • Personalizing response style and detail level
  • Maintaining user-specific context across sessions
  • Adapting to domain-specific terminology and conventions

Autonomous RAG Agents#

Integrating RAG with autonomous agents could allow these agents to:

  • Proactively retrieve information based on anticipated needs
  • Continuously update their knowledge base
  • Collaborate with other agents to answer complex queries
  • Self-evaluate and improve their retrieval strategies

Practical Takeaways#

If you’re considering implementing RAG in your AI system, here are our key recommendations:

  1. Start Simple: Begin with a basic RAG pipeline before adding sophisticated features. Even simple retrieval can dramatically improve LLM responses for many use cases.
  2. Invest in Document Preparation: The quality of your RAG system depends heavily on how well you process and chunk your documents. Spend time optimizing this often-overlooked step.
  3. Choose Appropriate Embedding Models: Select embedding models that match your domain and use case. Domain-specific models often outperform general-purpose ones.
  4. Implement Robust Evaluation: Establish clear metrics for both retrieval and generation quality. Regular evaluation helps identify and address issues before they impact users.
  5. Plan for Scale: Consider future scaling needs from the start. Choices made in prototype systems can create bottlenecks in production.

Conclusion#

RAG represents a fundamental shift in how we deploy LLMs for real-world applications. By combining the linguistic capabilities of LLMs with dynamic access to external knowledge, RAG addresses critical limitations around accuracy and verifiability that have historically hindered enterprise AI adoption.

The evolution from basic keyword search to semantic, retrieval-augmented generation is a significant milestone in information access. We’ve moved from systems that return documents to ones that provide direct, contextually grounded answers. This transformation enables new applications across industries, from intelligent customer support to sophisticated research assistants.

While RAG comes with challenges (managing retrieval quality, handling computational overhead, and orchestrating complex pipelines), the benefits often outweigh these limitations. As the technology evolves, increasingly sophisticated approaches are addressing these limitations while further expanding what RAG can do.

As we continue to push the boundaries of what’s possible with AI, RAG stands as a crucial bridge between the vast potential of language models and the practical requirements of real-world applications.

As retrieval techniques become more sophisticated and language models more capable, we can expect RAG to evolve from a useful technique to an indispensable component of intelligent systems. For AI practitioners, understanding and mastering RAG is necessary when preparing for a future where AI systems seamlessly blend learned knowledge with dynamic information access.




