How does retrieval work in RAG?
In Retrieval-Augmented Generation, the retrieval step is responsible for finding relevant external information before the LLM generates an answer.
Without retrieval, the LLM relies only on the knowledge it acquired during training.
With retrieval, it can fetch updated or domain-specific data from documents, databases, PDFs, websites, or vector stores.
Simple Interview Definition
Retrieval in RAG is the process of searching and fetching the most relevant documents or chunks from an external knowledge source based on the user query, and providing them to the LLM for answer generation.
High-Level RAG Flow
User Query
↓
Convert Query into Embedding
↓
Search Vector Database
↓
Retrieve Relevant Chunks
↓
Send Context + Query to LLM
↓
Generate Final Answer
Step-by-Step Retrieval Process
1. Documents are split into chunks
Large documents are divided into smaller pieces.
Example:
Document:
"React uses virtual DOM for efficient rendering..."
Chunks:
1. "React uses virtual DOM..."
2. "React components manage UI..."
3. "Hooks allow functional state..."
Why chunking?
- LLM context windows are limited
- Smaller chunks improve search accuracy
- Helps retrieve only relevant information
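A minimal sketch of chunking, assuming a simple word-based splitter. The function name chunk_text, the chunk_size and overlap parameters, and the idea of overlapping neighboring chunks are illustrative assumptions, not something these notes prescribe; real systems often split on sentences, headings, or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Neighboring chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in at least one chunk (an assumed design choice).
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_text(sample, chunk_size=4, overlap=1)
```

With ten words, a chunk size of 4, and an overlap of 1, this yields three chunks, each starting on the last word of the previous one.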
2. Convert chunks into embeddings
Each chunk is converted into a vector (list of numbers) using an embedding model.
Example:
"React uses virtual DOM"
→ [0.12, -0.44, 0.98, ...]
Embeddings capture semantic meaning.
Important Interview Point
Similar meanings produce similar vectors.
Example:
“car” and “vehicle” will have close embeddings.
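A small illustration of "similar meanings produce similar vectors". The three-dimensional vectors below are invented by hand for the example; a real embedding model produces vectors with hundreds or thousands of dimensions, and "banana" is added only as an unrelated contrast word.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption: not output of any real model).
toy_embeddings = {
    "car": [0.9, 0.1, 0.0],
    "vehicle": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

related = cosine_similarity(toy_embeddings["car"], toy_embeddings["vehicle"])
unrelated = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])
print(f"car vs vehicle: {related:.3f}")
print(f"car vs banana:  {unrelated:.3f}")
```

The related pair scores close to 1, while the unrelated pair scores close to 0.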
3. Store embeddings in vector database
The vectors are stored in a vector DB or a vector-capable search service such as:
- Pinecone
- Weaviate
- Chroma
- Milvus
- MongoDB (Atlas Vector Search)
- Azure AI Search
The database stores:
- chunk text
- embedding vector
- metadata
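To make the stored record concrete, here is a minimal in-memory sketch. ChunkRecord and InMemoryStore are hypothetical names for illustration and do not correspond to the API of any of the products listed above; the vector is a shortened toy value.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    """One stored item: the chunk text, its embedding, and its metadata."""
    text: str
    embedding: list[float]
    metadata: dict[str, object] = field(default_factory=dict)


class InMemoryStore:
    """Toy stand-in for a vector database (no indexing, no persistence)."""

    def __init__(self) -> None:
        self.records: list[ChunkRecord] = []

    def add(self, text: str, embedding: list[float], **metadata: object) -> ChunkRecord:
        record = ChunkRecord(text, list(embedding), dict(metadata))
        self.records.append(record)
        return record


store = InMemoryStore()
store.add("React uses virtual DOM", [0.12, -0.44, 0.98], source="react-notes", topic="frontend")
```

Keeping the text next to the vector matters because the LLM needs the original text, not the numbers, once a chunk is retrieved.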
What happens during retrieval?
Suppose the user asks:
"Why is virtual DOM fast?"
4. Query embedding generation
The user query is also converted into an embedding vector.
"Why is virtual DOM fast?"
→ [0.11, -0.40, 0.96, ...]
5. Similarity search
The vector DB compares the query vector with stored vectors.
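It scores each stored vector against the query using a similarity or distance metric such as:
- Cosine similarity (higher = more similar)
- Euclidean distance (lower = more similar)
- Dot product (higher = more similar)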
The most semantically similar chunks are retrieved.
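The three metrics can be sketched in a few lines. The vectors here are toy three-dimensional values built from only the leading components shown in the earlier examples (the elided remainder is not reconstructed), so the numbers are illustrative only.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    # math.dist is available in Python 3.8+
    return math.dist(a, b)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


query_vec = [0.11, -0.40, 0.96]   # toy query vector
chunk_vec = [0.12, -0.44, 0.98]   # toy chunk vector

print("cosine similarity :", round(cosine_similarity(query_vec, chunk_vec), 4))
print("euclidean distance:", round(euclidean_distance(query_vec, chunk_vec), 4))
print("dot product       :", round(dot_product(query_vec, chunk_vec), 4))
```

When all vectors are normalized to unit length, the three metrics produce the same ranking, which is why many systems normalize embeddings and use the dot product.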
Example
Stored chunks:
Chunk A: "React uses virtual DOM for faster updates"
Chunk B: "Angular uses TypeScript"
Chunk C: "CNNs are used in image processing"
Query:
"Why is virtual DOM efficient?"
Retrieved chunk:
Chunk A
because its embedding is closest.
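The same example as a sketch in code. The vectors for Chunks A, B, C and for the query are assigned by hand to mimic what an embedding model might produce; they are assumptions for illustration, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# name -> (chunk text, hand-made toy vector)
chunks = {
    "Chunk A": ("React uses virtual DOM for faster updates", [0.9, 0.1, 0.0]),
    "Chunk B": ("Angular uses TypeScript", [0.5, 0.8, 0.1]),
    "Chunk C": ("CNNs are used in image processing", [0.0, 0.1, 0.9]),
}

# Toy vector standing in for the embedding of "Why is virtual DOM efficient?"
query_vector = [0.85, 0.15, 0.05]

scores = {name: cosine_similarity(query_vector, vec) for name, (_, vec) in chunks.items()}
best_name = max(scores, key=scores.get)
print(best_name, "->", chunks[best_name][0])
```

Chunk A wins because its vector points in nearly the same direction as the query vector; Chunk B is somewhat related (another frontend topic) and Chunk C is almost orthogonal.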
6. Retrieved chunks sent to LLM
The retrieved context is added to the prompt.
Example:
Context:
"React uses virtual DOM for faster updates..."
Question:
"Why is virtual DOM efficient?"
Then the LLM generates the final answer.
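A minimal sketch of how the prompt might be assembled. The function name build_prompt, the instruction sentence, and the numbered-context layout are assumptions for illustration; actual prompt templates vary by application.

```python
def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n"
    )


prompt = build_prompt(
    "Why is virtual DOM efficient?",
    ["React uses virtual DOM for faster updates"],
)
print(prompt)
```

Numbering the chunks is optional, but it makes it easier to ask the model to cite which chunk supports its answer.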
Types of Retrieval in RAG
1. Dense Retrieval
Uses embeddings/vector similarity.
Most modern RAG systems use this.
Example:
- semantic search
- vector DB search
2. Sparse Retrieval
Traditional keyword search.
Uses:
- BM25
- TF-IDF
Good for exact keyword matching.
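A rough sketch of keyword scoring with a simplified TF-IDF. The whitespace tokenizer and the smoothed IDF formula used here are assumptions chosen for brevity; BM25 adds term-frequency saturation and document-length normalization on top of the same idea.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def tfidf_scores(query: str, docs: list[str]) -> list[float]:
    """Score each document by summing tf * idf over the query terms it contains."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if counts[term]:
                tf = counts[term] / len(tokens)
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0  # smoothed
                score += tf * idf
        scores.append(score)
    return scores


docs = [
    "React uses virtual DOM for faster updates",
    "Angular uses TypeScript",
    "CNNs are used in image processing",
]
keyword_scores = tfidf_scores("virtual DOM", docs)
print(keyword_scores)
```

Only the first document contains the query words, so it is the only one with a non-zero score. A query phrased with different words for the same idea would score zero everywhere, which is the gap dense retrieval fills.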
3. Hybrid Retrieval
Combines:
- semantic search
- keyword search
This is widely used in production systems.
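One common way to combine the two result lists is reciprocal rank fusion (RRF). These notes do not say which fusion method to use, so treat this as one possible approach; the document IDs and the constant k = 60 are illustrative assumptions (60 is a frequently used default).

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs.

    Each list contributes 1 / (k + rank) to a document's score, so documents
    ranked highly by both retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["A", "B", "C"]    # hypothetical semantic-search result order
sparse_ranking = ["B", "D", "A"]   # hypothetical keyword-search result order
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

RRF uses only ranks, not raw scores, which avoids having to put cosine similarities and keyword scores on a common scale.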
Important Interview Concepts
Top-K Retrieval
The system retrieves the K most relevant chunks, ranked by similarity score.
Example:
Top 3 chunks
Top 5 chunks
More chunks = more context but higher token cost.
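Top-K selection can be sketched with the standard library. The chunk IDs and scores below are made-up values for illustration.

```python
import heapq


def top_k(scored_chunks: list[tuple[str, float]], k: int) -> list[tuple[str, float]]:
    """Return the k (chunk_id, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])


scored = [("a", 0.2), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
top_three = top_k(scored, k=3)
print(top_three)
```

heapq.nlargest avoids sorting the whole list, which helps when K is small compared to the number of candidates. Production vector databases typically go further and use approximate nearest-neighbor indexes instead of scoring every vector.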
Re-ranking
After retrieval, another model ranks results again to improve relevance.
Flow:
Retrieve 20 chunks
↓
Re-ranker selects best 5
↓
Send to LLM
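A sketch of the two-stage flow. In practice the re-ranker is usually a separate model (often a cross-encoder) that reads the query and chunk together; the word-overlap function below is only a stand-in so the example runs without any model, and the names rerank and overlap_score are hypothetical.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    keep: int = 5,
) -> list[str]:
    """Re-score first-stage candidates with score_fn and keep the best `keep`."""
    ranked = sorted(candidates, key=lambda chunk: score_fn(query, chunk), reverse=True)
    return ranked[:keep]


def overlap_score(query: str, chunk: str) -> float:
    """Stand-in scorer: Jaccard overlap of lowercase words (not a real re-ranker)."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    union = q_words | c_words
    return len(q_words & c_words) / len(union) if union else 0.0


first_stage = [
    "Angular uses TypeScript",
    "React uses virtual DOM for faster updates",
    "virtual DOM diffing explained",
]
best_two = rerank("virtual DOM diffing", first_stage, overlap_score, keep=2)
print(best_two)
```

The first stage is tuned for recall (retrieve many candidates cheaply), and the re-ranker is tuned for precision (score a few candidates carefully).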
Metadata Filtering
Restricts the search to documents whose metadata matches given conditions, applied before or alongside the similarity search.
Example:
department = HR
year = 2026
document_type = policy
Useful in enterprise RAG.
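A minimal sketch of metadata filtering using the conditions above. The document texts and the matches helper are invented for illustration; vector databases expose their own filter syntax, which is not shown here.

```python
def matches(metadata: dict, filters: dict) -> bool:
    """True if every filter key is present in metadata with an equal value."""
    return all(metadata.get(key) == value for key, value in filters.items())


# Hypothetical documents with metadata attached at indexing time.
documents = [
    {"text": "Leave policy",
     "metadata": {"department": "HR", "year": 2026, "document_type": "policy"}},
    {"text": "Old leave policy",
     "metadata": {"department": "HR", "year": 2024, "document_type": "policy"}},
    {"text": "Deployment guide",
     "metadata": {"department": "Engineering", "year": 2026, "document_type": "guide"}},
]

filters = {"department": "HR", "year": 2026, "document_type": "policy"}
candidates = [doc for doc in documents if matches(doc["metadata"], filters)]
print([doc["text"] for doc in candidates])
```

Only the documents that pass the filter go on to similarity scoring, which both narrows the search and keeps users from seeing content outside their scope.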
Common Interview Question
Q: Why not send the entire database to the LLM?
Answer:
- Too expensive
- Token limits
- Slow inference
- Irrelevant information reduces answer quality
Retrieval selects only useful context.
Advantages of Retrieval
- Provides access to up-to-date information
- Reduces hallucinations
- Supports private/company data
- Improves factual accuracy
- Often cheaper than fine-tuning large models
Limitations
- Bad retrieval → bad answers
- Chunking strategy matters
- Embedding quality matters
- Vector DB tuning is important
Short Interview Answer
In RAG, retrieval works by splitting documents into chunks, converting the chunks into embeddings, and storing those embeddings in a vector database. When a user asks a question, the query is embedded with the same model and compared against the stored embeddings using similarity search. The most relevant chunks are retrieved and passed to the LLM as context, allowing the model to generate more accurate and up-to-date answers.
