How does retrieval work in RAG?

In Retrieval-Augmented Generation, the retrieval step is responsible for finding relevant external information before the LLM generates an answer.

Without retrieval, the LLM relies only on its pre-trained knowledge.
With retrieval, it can fetch updated or domain-specific data from documents, databases, PDFs, websites, or vector stores.

Simple Interview Definition

Retrieval in RAG is the process of searching and fetching the most relevant documents or chunks from an external knowledge source based on the user query, and providing them to the LLM for answer generation.

High-Level RAG Flow

User Query
    ↓
Convert Query into Embedding
    ↓
Search Vector Database
    ↓
Retrieve Relevant Chunks
    ↓
Send Context + Query to LLM
    ↓
Generate Final Answer

Step-by-Step Retrieval Process

1. Documents are split into chunks

Large documents are divided into smaller pieces.

Example:

Document:
"React uses virtual DOM for efficient rendering..."

Chunks:
1. "React uses virtual DOM..."
2. "React components manage UI..."
3. "Hooks allow functional state..."

Why chunking?

  • LLM context windows are limited

  • Smaller, focused chunks tend to match a query more precisely than one embedding for a whole document

  • Helps retrieve only relevant information
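
A minimal sketch of chunking, assuming fixed-size word windows with a small overlap between neighbouring chunks. The function name chunk_text, the default sizes, and the overlap itself are illustrative assumptions, not something the notes above prescribe; real systems often split on sentences, headings, or tokens instead.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words, so text near a chunk
    boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny window sizes are used here only to keep the example readable.
example_chunks = chunk_text(
    "React uses virtual DOM for efficient rendering", chunk_size=4, overlap=1
)
```

A larger overlap lowers the chance of cutting an idea in half, at the price of storing and embedding more duplicated text.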

2. Convert chunks into embeddings

Each chunk is converted into a vector (list of numbers) using an embedding model.

Example:

"React uses virtual DOM"
→ [0.12, -0.44, 0.98, ...]

Embeddings capture semantic meaning.

Important Interview Point

Similar meanings produce similar vectors.

Example:

  • “car”

  • “vehicle”

will have close embeddings.

3. Store embeddings in vector database

The vectors are stored in a vector DB such as:

  • Pinecone

  • Weaviate

  • Chroma

  • Milvus

  • MongoDB Atlas Vector Search

  • Azure AI Search

The database stores:

  • chunk text

  • embedding vector

  • metadata
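
To make the stored record concrete, here is a rough in-memory sketch of what one entry could look like. ChunkRecord and InMemoryVectorStore are hypothetical names invented for this illustration; they are not the API of any of the databases listed above.

```python
from dataclasses import dataclass, field


@dataclass
class ChunkRecord:
    """One stored entry: the chunk text, its embedding, and metadata."""
    chunk_id: str
    text: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


class InMemoryVectorStore:
    """Toy stand-in for a vector DB: keeps records in a dict keyed by id."""

    def __init__(self, dim: int) -> None:
        self.dim = dim
        self._records: dict[str, ChunkRecord] = {}

    def upsert(self, record: ChunkRecord) -> None:
        # All vectors in one index must have the same dimensionality.
        if len(record.embedding) != self.dim:
            raise ValueError(
                f"expected {self.dim} dimensions, got {len(record.embedding)}"
            )
        self._records[record.chunk_id] = record

    def get(self, chunk_id: str) -> ChunkRecord:
        return self._records[chunk_id]

    def __len__(self) -> int:
        return len(self._records)


store = InMemoryVectorStore(dim=3)
store.upsert(
    ChunkRecord(
        chunk_id="react-1",
        text="React uses virtual DOM",
        embedding=[0.12, -0.44, 0.98],  # toy 3-dimensional vector
        metadata={"topic": "react"},
    )
)
```

Keeping the text next to the vector matters because the LLM needs the original words, not the numbers.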

What happens during retrieval?

Suppose the user asks:

"Why is virtual DOM fast?"

4. Query embedding generation

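The user query is also converted into an embedding vector, using the same embedding model that was used for the chunks so that both live in the same vector space.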

"Why is virtual DOM fast?"
→ [0.11, -0.40, 0.96, ...]

5. Similarity search

The vector DB compares the query vector with stored vectors.

It uses similarity metrics such as:

  • Cosine similarity

  • Euclidean distance

  • Dot product

The most semantically similar chunks are retrieved.
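
The three metrics can be sketched in plain Python as follows. The vectors are toy 3-dimensional values chosen for illustration (real embeddings have hundreds or thousands of dimensions), and the function names are assumptions of this sketch.

```python
import math


def dot_product(a: list[float], b: list[float]) -> float:
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance: smaller means more similar."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between a and b: larger means more similar."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot_product(a, b) / norm


query_vec = [0.11, -0.40, 0.96]   # toy vector for the query
chunk_vec = [0.12, -0.44, 0.98]   # toy vector for a stored chunk

similarity = cosine_similarity(query_vec, chunk_vec)
distance = euclidean_distance(query_vec, chunk_vec)
```

Note the direction of each metric: cosine similarity and dot product are "higher is better", Euclidean distance is "lower is better". When all vectors are normalized to unit length, cosine similarity and dot product give the same ranking.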

Example

Stored chunks:

Chunk A: "React uses virtual DOM for faster updates"
Chunk B: "Angular uses TypeScript"
Chunk C: "CNNs are used in image processing"

Query:

"Why is virtual DOM efficient?"

Retrieved chunk:

Chunk A

because its embedding is closest.

6. Retrieved chunks sent to LLM

The retrieved context is added to the prompt.

Example:

Context:
"React uses virtual DOM for faster updates..."

Question:
"Why is virtual DOM efficient?"

Then the LLM generates the final answer.
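
One possible way to assemble such a prompt is sketched below. The function name build_prompt, the numbering of chunks, and the wording of the instruction line are assumptions made for illustration; the exact template varies between systems.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt."""
    # Number the chunks so the answer can refer back to a specific one.
    context = "\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n"
    )


prompt = build_prompt(
    "Why is virtual DOM efficient?",
    ["React uses virtual DOM for faster updates"],
)
```

The resulting string is what gets sent to the LLM in place of the bare question.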

Types of Retrieval in RAG

1. Dense Retrieval

Uses embeddings/vector similarity.

Most modern RAG systems use this.

Example:

  • semantic search

  • vector DB search

2. Sparse Retrieval

Traditional keyword search.

Uses:

  • BM25

  • TF-IDF

Good for exact keyword matching.
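
A compact sketch of BM25 scoring, assuming whitespace tokenization, lowercase matching, and the commonly used defaults k1 = 1.5 and b = 0.75 (all assumptions of this example). It uses the idf variant with log(1 + ...) so that scores never go negative. Production engines add proper tokenization, stemming, and inverted indexes.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


def bm25_scores(query: str, docs: list[str],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue  # term absent from this document
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


chunks = [
    "React uses virtual DOM for faster updates",
    "Angular uses TypeScript",
    "CNNs are used in image processing",
]
keyword_scores = bm25_scores("virtual DOM", chunks)
```

Only the first chunk contains the query words, so it is the only one with a non-zero score. A query phrased as "efficient rendering" would score zero everywhere here, which is the gap dense retrieval is meant to cover.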

3. Hybrid Retrieval

Combines:

  • semantic search

  • keyword search

This is widely used in production systems.
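
The notes do not say how the two result lists are combined. One common option is Reciprocal Rank Fusion (RRF), sketched here as an assumption rather than as the method any particular system uses. The constant k = 60 is the conventional default for RRF, and the chunk ids are made up.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of ids into one list.

    Each id earns 1 / (k + rank) from every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), then by id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense_ranking = ["A", "B", "C"]    # hypothetical semantic search result
sparse_ranking = ["B", "C", "D"]   # hypothetical keyword search result

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
fused_ids = [doc_id for doc_id, _ in fused]
```

RRF works on ranks only, so it avoids having to put BM25 scores and cosine similarities on a common scale. Here chunks B and C, found by both retrievers, end up ahead of A and D.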

Important Interview Concepts

Top-K Retrieval

The system retrieves the K most relevant chunks.

Example:

Top 3 chunks
Top 5 chunks

More chunks give the LLM more context, but they raise token cost and latency and can add noise.
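
A brute-force sketch of top-K retrieval using cosine similarity. The index contents are toy 3-dimensional vectors invented for the example. Real vector databases usually replace the full scan with an approximate nearest-neighbour index so that search stays fast on millions of vectors.

```python
import heapq
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], index: dict[str, list[float]],
          k: int = 3) -> list[tuple[str, float]]:
    """Return the k chunk ids most similar to the query, best first."""
    scored = (
        (cosine_similarity(query_vec, vec), chunk_id)
        for chunk_id, vec in index.items()
    )
    best = heapq.nlargest(k, scored)  # avoids sorting the whole index
    return [(chunk_id, score) for score, chunk_id in best]


# Toy index: ids map to made-up embeddings of the three example chunks.
index = {
    "A": [0.90, 0.10, 0.00],  # "React uses virtual DOM for faster updates"
    "B": [0.20, 0.90, 0.10],  # "Angular uses TypeScript"
    "C": [0.00, 0.10, 0.95],  # "CNNs are used in image processing"
}
query = [0.85, 0.20, 0.05]    # made-up embedding of the user query

results = top_k(query, index, k=2)
result_ids = [chunk_id for chunk_id, _ in results]
```

With k=2 the sketch returns chunk A first and chunk B second; chunk C is dropped because it is the least similar.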

Re-ranking

After retrieval, a second model (often a cross-encoder) re-scores the retrieved chunks against the query to improve relevance.

Flow:

Retrieve 20 chunks
↓
Re-ranker selects best 5
↓
Send to LLM
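
The two-stage flow can be sketched as below. The scoring function overlap_score is a deliberately crude word-overlap placeholder standing in for a learned re-ranking model; it and the name rerank are assumptions made for this example.

```python
from typing import Callable


def overlap_score(query: str, chunk: str) -> float:
    """Placeholder relevance score: share of query words found in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & chunk_words) / len(query_words)


def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float] = overlap_score,
           keep: int = 5) -> list[str]:
    """Re-score first-stage candidates and keep only the best `keep`."""
    ranked = sorted(
        candidates, key=lambda chunk: score_fn(query, chunk), reverse=True
    )
    return ranked[:keep]


# Candidates as they might come back from a first-stage retriever.
candidates = [
    "Angular uses TypeScript",
    "CNNs are used in image processing",
    "React uses virtual DOM for faster updates",
]
best = rerank("why is virtual dom efficient", candidates, keep=1)
```

The first stage is tuned for speed over many chunks, while the re-ranker can afford a slower, more accurate comparison because it only sees a short candidate list.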

Metadata Filtering

Restricts the search to chunks whose metadata matches given conditions, typically before or during the similarity search.

Example:

department = HR
year = 2026
document_type = policy

Useful in enterprise RAG.
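
A small sketch of exact-match metadata filtering, using the fields from the example above. The record layout (a dict with "text" and "metadata" keys) and the function names are assumptions; vector databases expose their own filter syntax for this.

```python
def matches(metadata: dict, filters: dict) -> bool:
    """True when every filter key is present with exactly the given value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def filter_chunks(chunks: list[dict], filters: dict) -> list[dict]:
    """Keep only chunks whose metadata satisfies all filters."""
    return [chunk for chunk in chunks if matches(chunk["metadata"], filters)]


# Hypothetical stored chunks with metadata attached.
stored_chunks = [
    {"text": "Leave policy for 2026",
     "metadata": {"department": "HR", "year": 2026, "document_type": "policy"}},
    {"text": "Leave policy for 2025",
     "metadata": {"department": "HR", "year": 2025, "document_type": "policy"}},
    {"text": "Deployment runbook",
     "metadata": {"department": "Engineering", "year": 2026,
                  "document_type": "runbook"}},
]

filters = {"department": "HR", "year": 2026, "document_type": "policy"}
allowed = filter_chunks(stored_chunks, filters)
```

Similarity search then runs only over the chunks that pass the filter, which both narrows the results and keeps users away from documents they should not see.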

Common Interview Question

Q: Why not send the entire database to the LLM?

Answer:

  • Too expensive

  • Token limits

  • Slow inference

  • Irrelevant information reduces answer quality

Retrieval selects only useful context.

Advantages of Retrieval

  • Gives the LLM access to up-to-date information

  • Reduces hallucinations

  • Supports private/company data

  • Improves factual accuracy

  • Often cheaper than fine-tuning a large model to add new knowledge

Limitations

  • Bad retrieval → bad answers

  • Chunking strategy matters

  • Embedding quality matters

  • Vector DB tuning is important

Short Interview Answer

In RAG, retrieval works by splitting documents into chunks, converting the chunks into embeddings, and storing them in a vector database. When a user asks a question, the query is embedded with the same model and compared against the stored embeddings using similarity search. The most relevant chunks are retrieved and passed to the LLM as context, helping the model generate more accurate and up-to-date answers.