Retrieval-Augmented Generation
Combining retrieval systems with generative models to produce accurate, grounded responses that go beyond the training data.
Retrieval-Augmented Generation (RAG)
RAG augments LLM generation with documents retrieved at query time, which reduces hallucinations and lets the knowledge base be updated without retraining the model.
How RAG Works
- Embed the user query into a vector
- Retrieve the top-k most similar document chunks from a vector store
- Augment the prompt with retrieved context
- Generate a grounded response
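The four steps above can be sketched end to end in a few lines of Python. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is a placeholder for an LLM call, and `CHUNKS`, `retrieve`, `build_prompt`, and `answer` are hypothetical names that do not come from any library.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, then return the top-k most similar chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 3: augment the prompt with the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )

def generate(prompt: str) -> str:
    """Step 4 placeholder: a real system would send `prompt` to an LLM here."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

# Hypothetical document chunks, standing in for a vector store's contents.
CHUNKS = [
    "pgvector is a Postgres extension for vector similarity search.",
    "Chroma is an embeddable vector store with a Python client.",
    "Sourdough bread needs a long fermentation time.",
]

def answer(query: str, chunks: list[str] = CHUNKS, k: int = 2) -> str:
    context = retrieve(query, chunks, k)
    return generate(build_prompt(query, context))

if __name__ == "__main__":
    print(answer("Which Postgres extension does vector search?"))
```

In a real deployment the sorted scan over `chunks` would be replaced by a query against a vector store, and chunk embeddings would be computed once at indexing time rather than on every request.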
Embedding
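Documents and queries are converted to dense vectors using an embedding model (e.g. nomic-embed-text or OpenAI's text-embedding-3 models).
Similarity between a query vector and a chunk vector is typically computed with cosine similarity or the dot product; the two produce the same ranking when all vectors are normalized to unit length.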
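To make the difference between the two measures concrete, here is a small pure-Python sketch. The vectors and the names `doc_short` and `doc_long` are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean (L2) length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Dot product divided by the product of lengths: compares direction only."""
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

query = [1.0, 2.0, 0.0]
doc_short = [2.0, 4.0, 0.0]   # same direction as the query
doc_long = [20.0, 0.0, 0.0]   # different direction, much larger magnitude

if __name__ == "__main__":
    # The raw dot product favors the long vector; cosine favors the aligned one.
    print(dot(query, doc_short), dot(query, doc_long))
    print(cosine_similarity(query, doc_short), cosine_similarity(query, doc_long))
    # On unit-length vectors the dot product equals cosine similarity.
    print(dot(normalize(query), normalize(doc_long)))
```

The raw dot product rewards vector magnitude, so it can rank `doc_long` above the better-aligned `doc_short`. Normalizing vectors once at indexing time lets a store use the cheaper dot product while ranking by cosine similarity.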
Vector Databases
Popular options:
- pgvector: Postgres extension, SQL-native
- Chroma: Python-native, embeddable
- Weaviate: hosted or self-hosted
- Pinecone: managed cloud
Chunking Strategies
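- Fixed-size: simple, but may cut a sentence or idea in half at a chunk boundary
- Sentence/paragraph: preserves semantic units
- Recursive character: splits on a hierarchy of separators (paragraphs, then lines, then words), so it respects document structure
- Semantic: groups adjacent text by meaning (expensive, since the text must be embedded before it can be split)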
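The first two strategies are simple enough to sketch directly. The functions below are hypothetical helpers, not a library API: `fixed_size_chunks` cuts by character count with an optional overlap, and `sentence_chunks` packs whole sentences into chunks of at most `max_chars` characters, using a naive regular expression to find sentence boundaries.

```python
import re

def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text every `size` characters; consecutive chunks share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def sentence_chunks(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks no longer than `max_chars`.

    A single sentence longer than `max_chars` is kept whole rather than cut.
    """
    # Naive boundary rule: split on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

if __name__ == "__main__":
    text = (
        "RAG retrieves documents. It then augments the prompt. "
        "Finally it generates an answer."
    )
    print(fixed_size_chunks(text, size=30, overlap=5))
    print(sentence_chunks(text, max_chars=60))
```

The overlap in the fixed-size variant is a common way to soften the boundary problem: text cut at the end of one chunk reappears at the start of the next. The sentence splitter here will mis-split on abbreviations such as "e.g.", so a production system would likely use a proper sentence tokenizer.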
Advanced RAG
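- Reranking: Use a cross-encoder to rescore and reorder the retrieved chunks before they reach the prompt
- HyDE (Hypothetical Document Embeddings): Have the LLM generate a hypothetical answer document, embed it, and retrieve with that embedding instead of the raw query
- Multi-query: Generate several rephrasings of the query and merge their results to improve recall
- Parent-child chunking: Match on small chunks, but return the larger parent section they came from as context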
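As one illustration from the list above, parent-child chunking can be sketched as follows. This is a minimal sketch under stated assumptions: `overlap_score` is a toy word-overlap measure standing in for vector similarity, sentences serve as the child chunks, and `PARENTS`, `build_index`, and `retrieve_parents` are hypothetical names.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, text: str) -> int:
    """Toy relevance score (shared word count); stands in for vector similarity."""
    return len(tokens(query) & tokens(text))

def build_index(parents: list[str]) -> list[tuple[str, int]]:
    """Split each parent into sentence-level children, remembering the parent id."""
    index = []
    for parent_id, parent in enumerate(parents):
        for child in re.split(r"(?<=[.!?])\s+", parent.strip()):
            if child:
                index.append((child, parent_id))
    return index

def retrieve_parents(query, parents, index, k=2):
    """Rank children against the query, then return their parents without duplicates."""
    ranked = sorted(index, key=lambda item: overlap_score(query, item[0]), reverse=True)
    seen, results = set(), []
    for _child, parent_id in ranked[:k]:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
    return results

# Hypothetical parent sections.
PARENTS = [
    "Reranking reorders retrieved chunks. "
    "A cross-encoder scores each query and chunk pair jointly.",
    "HyDE generates a hypothetical document. "
    "The hypothetical document is embedded and used for retrieval.",
]

if __name__ == "__main__":
    index = build_index(PARENTS)
    for section in retrieve_parents("How does a cross-encoder score chunks?", PARENTS, index):
        print(section)
```

Small children give precise matches, while returning the parent gives the model enough surrounding text to answer. Deduplicating by parent id keeps the same section from appearing twice in the prompt when several of its children match.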
References
- Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks