
Retrieval-Augmented Generation

Combining retrieval systems with generative models to produce accurate, grounded responses that go beyond the training data.

1 February 2024

Retrieval-Augmented Generation (RAG)

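RAG augments LLM generation with documents retrieved at query time. Grounding the answer in retrieved text reduces hallucinations, and updating the document store updates the system's knowledge without retraining the model.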

How RAG Works

1. Embed the user query into a vector
2. Retrieve the top-k most similar document chunks from a vector store
3. Augment the prompt with the retrieved context
4. Generate a grounded response
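The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production pipeline: `embed` is a toy bag-of-words stand-in for a real embedding model, `generate` is a placeholder where an LLM call would go, and `CHUNKS`, `retrieve`, and `build_prompt` are hypothetical names chosen for the example.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding model: a sparse word-count vector.
    # A real system would call a learned embedding model here.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Steps 1 and 2: embed the query, then rank chunks by similarity.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # Step 3: place the retrieved chunks in front of the question.
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )

def generate(prompt: str) -> str:
    # Step 4: placeholder for an LLM call (assumption: swap in a real client).
    return f"[LLM response to a {len(prompt)}-character prompt]"

CHUNKS = [
    "pgvector is a Postgres extension for vector search.",
    "Chroma is an embeddable vector store for Python.",
    "Bananas are rich in potassium.",
]

query = "Which vector store is a Postgres extension?"
top = retrieve(query, CHUNKS, k=2)
prompt = build_prompt(query, top)
answer = generate(prompt)
```

In a real deployment the chunk embeddings would be computed once and stored in a vector database rather than recomputed on every query, as this sketch does for brevity.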

Embedding

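Documents and queries are converted to dense vectors using an embedding model (e.g. nomic-embed-text, or OpenAI's text-embedding-3-small and text-embedding-3-large). The same model must be used for both, so that query and document vectors live in the same space.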

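Similarity is computed via cosine similarity or dot product. Cosine similarity ignores vector length and compares direction only; when vectors are normalized to unit length, the two measures give the same ranking.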
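To make the difference concrete, here is a small sketch of both measures on plain Python lists. The vectors `q`, `d_long`, and `d_other` are made-up examples, and the helper names are chosen for illustration only.

```python
import math

def dot(a, b):
    # Dot product: sensitive to both direction and length.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean length of a vector.
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    # Dot product divided by the product of the lengths: direction only.
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a):
    # Scale a vector to unit length.
    n = norm(a)
    return [x / n for x in a]

q = [1.0, 2.0, 2.0]
d_long = [2.0, 4.0, 4.0]    # same direction as q, twice the length
d_other = [2.0, 0.0, 1.0]   # different direction

raw_dot = dot(q, d_long)                    # grows with vector length
cos_same = cosine_similarity(q, d_long)     # 1.0 up to rounding: identical direction
cos_other = cosine_similarity(q, d_other)
dot_unit = dot(normalize(q), normalize(d_other))  # matches cos_other
```

Many setups normalize embeddings once at indexing time so that the cheaper dot product can be used at query time.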

Vector Databases

Popular options:

pgvector — Postgres extension, SQL-native
Chroma — Python-native, embeddable
Weaviate — hosted or self-hosted
Pinecone — managed cloud

Chunking Strategies

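Fixed-size: simple, but may split a sentence or idea across two chunks
Sentence/paragraph: preserves semantic units
Recursive character: splits on the largest separators first (paragraphs, then lines, then sentences), so chunks follow the document structure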
Semantic: groups by meaning (expensive)
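The first two strategies are easy to sketch with the standard library. The code below is an illustration under stated assumptions: the `overlap` parameter is a common mitigation for cut context that the list above does not mention, the sentence splitter is a naive regex rather than a proper tokenizer, and the function names are hypothetical.

```python
import re

def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    # Slice the text every (size - overlap) characters.
    # Overlap repeats the tail of one chunk at the head of the next,
    # which softens (but does not remove) the cut-context problem.
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

def sentence_chunks(text: str, max_chars: int) -> list[str]:
    # Naive splitter: break after ., ! or ? followed by whitespace,
    # then pack whole sentences into chunks of at most max_chars.
    # A single sentence longer than max_chars is kept whole.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

text = (
    "RAG retrieves documents. It then augments the prompt. "
    "Finally it generates an answer."
)
fixed = fixed_size_chunks(text, size=30, overlap=5)
by_sentence = sentence_chunks(text, max_chars=60)
```

On this sample the fixed-size chunks end mid-word, while the sentence-based chunks always end at a sentence boundary, which is the trade-off the list describes.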

Advanced RAG

Reranking: Use a cross-encoder to reorder retrieved chunks
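HyDE: Have the LLM generate a hypothetical answer document, embed it, and retrieve the real chunks most similar to it
Multi-query: Rewrite the query into several variants, retrieve for each, and merge the results to improve recall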
Parent-child chunking: Retrieve small chunks, return parent context
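As one example from the list above, parent-child chunking can be sketched as follows. The idea is to match against small child chunks (here, single sentences) but hand the larger parent passage to the LLM. Everything in this block is illustrative: `overlap_score` is a toy word-overlap score standing in for embedding similarity, and `PARENTS`, `children`, and `retrieve_parents` are hypothetical names.

```python
import re
from collections import Counter

def tokens(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, text: str) -> int:
    # Toy relevance score: number of word occurrences shared by both texts.
    # Assumption: a real system would compare embeddings instead.
    return sum((tokens(query) & tokens(text)).values())

# Parent passages, keyed by a made-up id.
PARENTS = {
    "doc-vector": (
        "pgvector adds vector search to Postgres. "
        "It is queried with plain SQL."
    ),
    "doc-chunk": (
        "Fixed-size chunking is simple. "
        "Semantic chunking groups text by meaning. "
        "It costs more to compute."
    ),
}

# Child index: one entry per sentence, each pointing back to its parent.
children = [
    (parent_id, sentence)
    for parent_id, passage in PARENTS.items()
    for sentence in re.split(r"(?<=[.!?])\s+", passage)
]

def retrieve_parents(query: str, k: int = 1) -> list[str]:
    # Rank the small child chunks, then return their distinct parents.
    ranked = sorted(
        children, key=lambda c: overlap_score(query, c[1]), reverse=True
    )
    parent_ids: list[str] = []
    for parent_id, _ in ranked:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
        if len(parent_ids) == k:
            break
    return [PARENTS[pid] for pid in parent_ids]

result = retrieve_parents("How does semantic chunking group text?", k=1)
```

The query matches only the middle sentence of the second passage, yet the returned context also contains the follow-up sentence about cost, which a sentence-level retriever alone would have dropped.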

References

Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
