How does retrieval work in RAG?

In Retrieval-Augmented Generation, the retrieval step is responsible for finding relevant external information before the LLM generates an answer.

Without retrieval, the LLM relies only on what it learned during pre-training.
With retrieval, it can draw on up-to-date or domain-specific data from documents, databases, PDFs, websites, or vector stores.

Simple Interview Definition

Retrieval in RAG is the process of searching and fetching the most relevant documents or chunks from an external knowledge source based on the user query, and providing them to the LLM for answer generation.

High-Level RAG Flow

User Query
    ↓
Convert Query into Embedding
    ↓
Search Vector Database
    ↓
Retrieve Relevant Chunks
    ↓
Send Context + Query to LLM
    ↓
Generate Final Answer

Step-by-Step Retrieval Process

1. Documents are split into chunks

Large documents are divided into smaller pieces.

Example:

Document:
"React uses virtual DOM for efficient rendering..."

Chunks:
1. "React uses virtual DOM..."
2. "React components manage UI..."
3. "Hooks allow functional state..."

Why chunking?

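LLM context windows are limited, so whole documents often do not fit
Smaller chunks improve search accuracy because each embedding covers one focused topic
Only the relevant parts of a document are retrieved, not the whole file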
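
A minimal sketch of fixed-size chunking, in case it helps to see it concretely. The function name, the character-based window, and the overlap between neighbouring chunks are assumptions for illustration; real pipelines often split on sentences, paragraphs, or tokens instead.

```python
def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


document = "React uses virtual DOM for efficient rendering. Components manage UI. Hooks allow state."
for number, chunk in enumerate(split_into_chunks(document), start=1):
    print(number, repr(chunk))
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence cut at a boundary still appears whole in at least one chunk.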

2. Convert chunks into embeddings

Each chunk is converted into a vector (list of numbers) using an embedding model.

Example:

"React uses virtual DOM"
→ [0.12, -0.44, 0.98, ...]

Embeddings capture semantic meaning.

Important Interview Point

Similar meanings produce similar vectors.

Example:

"car"
"vehicle"

will have embeddings that lie close together.
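
To make "close" concrete, here is a small sketch that compares hand-made vectors. The three-dimensional numbers are invented for illustration and do not come from any real embedding model, which would produce hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors, for illustration only.
toy_embeddings = {
    "car": [0.90, 0.80, 0.10],
    "vehicle": [0.85, 0.75, 0.20],
    "banana": [0.10, 0.20, 0.95],
}

car_vs_vehicle = cosine_similarity(toy_embeddings["car"], toy_embeddings["vehicle"])
car_vs_banana = cosine_similarity(toy_embeddings["car"], toy_embeddings["banana"])
print(f"car vs vehicle: {car_vs_vehicle:.3f}")
print(f"car vs banana:  {car_vs_banana:.3f}")
```

With these toy numbers, "car" and "vehicle" score far higher than "car" and "banana", which is the behaviour a real embedding model is expected to show for related and unrelated words.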

3. Store embeddings in vector database

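The vectors are stored in a vector database, or in a general-purpose database or search service with vector search support, such as:

Pinecone
Weaviate
Chroma
Milvus
MongoDB Atlas Vector Search
Azure AI Search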

The database stores:

chunk text
embedding vector
metadata

What happens during retrieval?

Suppose the user asks:

"Why is virtual DOM fast?"

4. Query embedding generation

The user query is also converted into an embedding vector.

"Why is virtual DOM fast?"
→ [0.11, -0.40, 0.96, ...]

5. Similarity search

The vector DB compares the query vector with the stored vectors.

It scores each candidate with a similarity or distance metric such as:

Cosine similarity
Euclidean distance
Dot product

The most semantically similar chunks are retrieved.
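
The three metrics can be sketched in a few lines. The vectors below reuse only the three components shown in the examples above (the real vectors are longer), and the helper names are illustrative.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    # Smaller is more similar; 0.0 means identical vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Larger is more similar; ignores vector length, compares direction only.
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def normalize(v: list[float]) -> list[float]:
    length = math.sqrt(dot_product(v, v))
    return [x / length for x in v]


chunk_vec = [0.12, -0.44, 0.98]   # "React uses virtual DOM" (first three values only)
query_vec = [0.11, -0.40, 0.96]   # "Why is virtual DOM fast?" (first three values only)

print("cosine:   ", cosine_similarity(chunk_vec, query_vec))
print("euclidean:", euclidean_distance(chunk_vec, query_vec))
print("dot:      ", dot_product(chunk_vec, query_vec))
```

One point worth knowing for interviews: when vectors are normalized to length 1, the dot product equals the cosine similarity, so both produce the same ranking.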

Example

Stored chunks:

Chunk A: "React uses virtual DOM for faster updates"
Chunk B: "Angular uses TypeScript"
Chunk C: "CNNs are used in image processing"

Query:

"Why is virtual DOM efficient?"

Retrieved chunk:

Chunk A

because its embedding is closest.
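
The same example as a runnable sketch. Note the big assumption: toy_embed is a word-count vector standing in for a real embedding model, so it only matches shared words ("virtual", "DOM"). A real model would also match on meaning.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: counts lowercase words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


chunks = {
    "Chunk A": "React uses virtual DOM for faster updates",
    "Chunk B": "Angular uses TypeScript",
    "Chunk C": "CNNs are used in image processing",
}

query_vec = toy_embed("Why is virtual DOM efficient?")
scores = {name: cosine(query_vec, toy_embed(text)) for name, text in chunks.items()}
best = max(scores, key=scores.get)
print(best, scores)
```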

6. Retrieved chunks sent to LLM

The retrieved context is added to the prompt.

Example:

Context:
"React uses virtual DOM for faster updates..."

Question:
"Why is virtual DOM efficient?"

Then the LLM generates the final answer.
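
A minimal sketch of how the prompt might be assembled. The instruction wording and the numbered-chunk layout are assumptions; prompt templates vary a lot between systems.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks before the question in a single prompt string."""
    context = "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n"
    )


prompt = build_prompt(
    "Why is virtual DOM efficient?",
    ["React uses virtual DOM for faster updates..."],
)
print(prompt)
```

Numbering the chunks makes it easier to ask the model to cite which chunk supports its answer.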

Types of Retrieval in RAG

1. Dense Retrieval

Uses embeddings/vector similarity.

Most modern RAG systems use this.

Example:

semantic search
vector DB search

2. Sparse Retrieval

Traditional keyword search.

Uses:

BM25
TF-IDF

Good for exact keyword matching.
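
A compact sketch of BM25 scoring, to show what "keyword search" computes. The k1 and b values are commonly used defaults, the IDF uses the non-negative variant with 1 added inside the logarithm, and the tokenizer is a simplification; production systems usually rely on a search engine or library for this.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # exact-match only: unseen terms contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "React uses virtual DOM for faster updates",
    "Angular uses TypeScript",
    "CNNs are used in image processing",
]
print(bm25_scores("virtual DOM", docs))
```

Because only exact terms count, a query for "efficient rendering" would score zero against all three documents here, which is the gap dense retrieval is meant to fill.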

3. Hybrid Retrieval

Combines:

semantic search
keyword search

This is widely used in production systems.
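
One common way to combine the two result lists is reciprocal rank fusion (RRF), sketched below. The text above does not say which fusion method is used, so treat RRF as one option among several (weighted score sums are another); k = 60 is a conventional constant and the chunk IDs are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of IDs; items ranked high in many lists win."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_ranking = ["A", "B", "C"]    # placeholder result of semantic search
sparse_ranking = ["A", "D", "B"]   # placeholder result of keyword search
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

RRF uses only rank positions, so it avoids having to put cosine scores and BM25 scores on the same scale.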

Important Interview Concepts

Top-K Retrieval

The system retrieves the K chunks with the highest similarity scores.

Example:

Top 3 chunks
Top 5 chunks

More chunks = more context but higher token cost.
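
A small sketch of the trade-off. The similarity scores are invented, and counting whitespace-separated words is only a rough proxy for tokens (real tokenizers count differently).

```python
import heapq


def top_k(scored_chunks: list[tuple[float, str]], k: int) -> list[str]:
    """Return the texts of the k highest-scoring chunks, best first."""
    best = heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])
    return [text for _, text in best]


def rough_token_count(chunks: list[str]) -> int:
    # Word count as a crude stand-in for a tokenizer.
    return sum(len(chunk.split()) for chunk in chunks)


scored = [  # (similarity score, chunk text); scores are made up
    (0.12, "Angular uses TypeScript"),
    (0.91, "React uses virtual DOM for faster updates"),
    (0.05, "CNNs are used in image processing"),
]

for k in (1, 2, 3):
    selected = top_k(scored, k)
    print(k, rough_token_count(selected), selected)
```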

Re-ranking

After the first retrieval pass, a second model (often a cross-encoder) re-scores each retrieved chunk against the query to improve relevance.

Flow:

Retrieve 20 chunks
↓
Re-ranker selects best 5
↓
Send to LLM
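
The two-stage flow can be sketched as below. The scoring function is passed in as a parameter; word_overlap_score is a deliberately simple stand-in for a real re-ranking model, and all names are illustrative.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    rerank_score: Callable[[str, str], float],
    final_k: int = 5,
) -> list[str]:
    """Re-score already retrieved candidates and keep the best final_k."""
    ordered = sorted(candidates, key=lambda chunk: rerank_score(query, chunk), reverse=True)
    return ordered[:final_k]


def word_overlap_score(query: str, chunk: str) -> float:
    # Stand-in scorer: fraction of query words that also appear in the chunk.
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


first_pass = [  # pretend these came back from the vector search
    "Angular uses TypeScript",
    "React uses virtual DOM for faster updates",
    "CNNs are used in image processing",
]
best = rerank("Why is virtual DOM efficient?", first_pass, word_overlap_score, final_k=1)
print(best)
```

The first pass is cheap and broad; the re-ranker is slower per chunk, which is why it runs only on the short candidate list.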

Metadata Filtering

Restricts the search to chunks whose metadata matches given conditions, so similarity search runs only over eligible documents.

Example:

department = HR
year = 2026
document_type = policy

Useful in enterprise RAG.
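
A minimal sketch of the filter step using the fields from the example above. The chunk texts and the dict layout are hypothetical; real vector DBs expose their own filter syntax.

```python
def matches_filter(metadata: dict, required: dict) -> bool:
    """True when every required key is present with exactly the required value."""
    return all(metadata.get(key) == value for key, value in required.items())


def filter_chunks(chunks: list[dict], required: dict) -> list[dict]:
    return [chunk for chunk in chunks if matches_filter(chunk["metadata"], required)]


chunks = [  # hypothetical records
    {"text": "Leave policy ...", "metadata": {"department": "HR", "year": 2026, "document_type": "policy"}},
    {"text": "Old leave policy ...", "metadata": {"department": "HR", "year": 2024, "document_type": "policy"}},
    {"text": "Expense guide ...", "metadata": {"department": "Finance", "year": 2026, "document_type": "guide"}},
]

eligible = filter_chunks(chunks, {"department": "HR", "year": 2026, "document_type": "policy"})
print([chunk["text"] for chunk in eligible])
```

Only the eligible chunks would then go through similarity search, which keeps out-of-scope documents out of the results regardless of how similar they look.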

Common Interview Question

Q: Why not send the entire database to the LLM?

Answer:

Too expensive
Token limits
Slow inference
Irrelevant information reduces answer quality

Retrieval selects only useful context.

Advantages of Retrieval

Provides up-to-date information without retraining the model
Reduces hallucinations by grounding answers in retrieved text
Supports private/company data
Improves factual accuracy
Often cheaper than fine-tuning large models

Limitations

Bad retrieval → bad answers
Chunking strategy matters (chunk size, overlap, split boundaries)
Embedding quality matters
Vector DB tuning is important (index type, similarity metric, value of K)

Short Interview Answer

In RAG, retrieval works by splitting documents into chunks, converting the chunks into embeddings, and storing them in a vector database. When a user asks a question, the query is converted into an embedding and compared against the stored embeddings using similarity search. The most relevant chunks are retrieved and passed to the LLM as context, allowing the model to generate more accurate and up-to-date answers.