🚀 Applications

Retrieval-Augmented Generation

Combining retrieval systems with generative models to produce accurate, grounded responses that go beyond the training data.

Retrieval-Augmented Generation (RAG)

RAG augments LLM generation with retrieved documents, which reduces hallucinations and lets the knowledge base be updated without retraining the model.

How RAG Works

  1. Embed the user query into a vector
  2. Retrieve the top-k most similar document chunks from a vector store
  3. Augment the prompt with retrieved context
  4. Generate a grounded response
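
The four steps can be sketched end to end in a few lines of plain Python. This is only an illustration under loud assumptions: embed is a toy bag-of-words stand-in for a real embedding model, generate is a placeholder where an LLM call would go, and all function names here are hypothetical rather than taken from any library.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Step 3: put the retrieved chunks in front of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )

def generate(prompt: str) -> str:
    """Step 4 placeholder: a real system would call an LLM here."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

def rag_answer(query: str, chunks: list[str], k: int = 2) -> str:
    return generate(build_prompt(query, retrieve(query, chunks, k)))

chunks = [
    "pgvector is a Postgres extension for vector search.",
    "Chroma is an embeddable vector store written for Python.",
    "Cross-encoders rerank retrieved passages.",
]
print(build_prompt("Which vector store is a Postgres extension?",
                   retrieve("Which vector store is a Postgres extension?", chunks, k=1)))
```

In a real deployment the chunk embeddings would be computed once and stored in a vector database instead of being recomputed for every query.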

Embedding

Documents and queries are converted to dense vectors using an embedding model (e.g., nomic-embed-text or OpenAI text-embedding-3-small).

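Similarity between a query vector and a document vector is typically computed via cosine similarity or dot product; the two produce the same ranking when the vectors are normalized to unit length.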
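
A minimal sketch of the two similarity measures, using made-up three-dimensional vectors (real embeddings have hundreds or thousands of dimensions). It illustrates that the dot product is sensitive to vector length while cosine similarity is not, and that the dot product of unit-normalized vectors equals their cosine similarity.

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def norm(a: list[float]) -> float:
    return math.sqrt(dot(a, a))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (norm(a) * norm(b))

def normalize(a: list[float]) -> list[float]:
    n = norm(a)
    return [x / n for x in a]

# Illustrative vectors, not output of any real embedding model.
query = [1.0, 2.0, 0.0]
doc_same_direction = [2.0, 4.0, 0.0]   # same direction as query, twice the length
doc_other = [0.0, 1.0, 3.0]

print(dot(query, doc_same_direction))                # grows with vector length
print(cosine_similarity(query, doc_same_direction))  # depends only on direction
print(dot(normalize(query), normalize(doc_other)))   # matches the cosine below
print(cosine_similarity(query, doc_other))
```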

Vector Databases

Popular options:

  • pgvector: Postgres extension, SQL-native
  • Chroma: Python-native, embeddable
  • Weaviate: hosted or self-hosted
  • Pinecone: managed cloud

Chunking Strategies

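  • Fixed-size: simple, but may split a sentence or idea across chunk boundaries
  • Sentence/paragraph: preserves semantic units
  • Recursive character: splits on a hierarchy of separators (paragraphs, then lines, then words), so it tends to respect document structure
  • Semantic: groups text by meaning (expensive, since it requires embedding the text while chunking)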
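
The first two strategies are simple enough to sketch with the standard library. The function names below are hypothetical, and the overlap parameter is an added assumption: a common way to soften the boundary problem of fixed-size chunks is to repeat a few characters between neighbours.

```python
import re

def fixed_size_chunks(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of `size` characters; neighbours share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once a window has reached the end, so no chunk is a pure repeat.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + size] for i in range(0, last_start, step)]

def paragraph_chunks(text: str) -> list[str]:
    """Split on blank lines so each chunk is one paragraph."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

print(fixed_size_chunks("abcdefghij", size=4))
print(fixed_size_chunks("abcdefghij", size=4, overlap=2))
print(paragraph_chunks("First paragraph.\n\nSecond paragraph.\n\n\nThird."))
```

Production systems usually measure chunk size in tokens rather than characters; characters are used here only to keep the sketch dependency-free.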

Advanced RAG

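  • Reranking: Use a cross-encoder to rescore and reorder the retrieved chunks
  • HyDE (Hypothetical Document Embeddings): Have the LLM generate a hypothetical answer document, embed it, and retrieve with that embedding instead of the raw query
  • Multi-query: Expand the query into several rephrasings and merge their results to improve recall
  • Parent-child chunking: Match on small chunks, but return the larger parent chunk as context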
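
As one example, parent-child chunking can be sketched as follows. This is a toy under stated assumptions: word-overlap (Jaccard) scoring stands in for embedding similarity, sentences serve as the child chunks, and every function name is hypothetical.

```python
import re

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_score(query: str, chunk: str) -> float:
    """Jaccard word overlap; a stand-in for embedding similarity."""
    q, c = tokens(query), tokens(chunk)
    return len(q & c) / len(q | c) if q | c else 0.0

def build_index(parents: list[str]) -> list[tuple[str, int]]:
    """Split each parent into sentence-level children, remembering the parent id."""
    index = []
    for parent_id, parent in enumerate(parents):
        for sentence in re.split(r"(?<=[.!?])\s+", parent.strip()):
            if sentence:
                index.append((sentence, parent_id))
    return index

def retrieve_parents(query: str, parents: list[str],
                     index: list[tuple[str, int]], k: int = 1) -> list[str]:
    """Rank the small children, then return their (deduplicated) parents."""
    ranked = sorted(index, key=lambda item: overlap_score(query, item[0]), reverse=True)
    seen: set[int] = set()
    results: list[str] = []
    for _, parent_id in ranked:
        if parent_id not in seen:
            seen.add(parent_id)
            results.append(parents[parent_id])
        if len(results) == k:
            break
    return results

parents = [
    "HyDE generates a hypothetical answer first. The answer is embedded and used as the search query.",
    "Reranking rescores retrieved chunks. A cross-encoder reads the query and chunk together.",
]
index = build_index(parents)
print(retrieve_parents("How does a cross-encoder rerank chunks?", parents, index, k=1))
```

The small children give precise matches, while the returned parent gives the model enough surrounding text to answer from.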

References

  • Lewis et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks