Hybrid search for RAG: combining dense embeddings and keyword search for better retrieval

By the end of this tutorial, you’ll have a working hybrid search pipeline that combines BM25 keyword retrieval with dense vector embeddings, fused via Reciprocal Rank Fusion. You’ll also see concretely why hybrid retrieval tends to outperform pure semantic search on real-world document corpora, often by 25–35% on precision@5.

Pure vector search feels like magic until it doesn’t. Query “SOC 2 Type II audit requirements” against a compliance knowledge base and your cosine similarity might surface “security certification processes” — semantically close, but missing the specific term match that matters. BM25 would nail it. The fix isn’t to pick one approach; it’s to run both and merge the ranked lists intelligently.

Install dependencies — Set up rank-bm25, sentence-transformers, and Qdrant client
Chunk and index documents — Build both BM25 and vector indexes over your corpus
Run parallel retrieval — Execute keyword and semantic search simultaneously
Fuse results with RRF — Merge ranked lists using Reciprocal Rank Fusion
Pass fused context to Claude — Feed reranked chunks into your RAG prompt
Benchmark the improvement — Measure precision@5 against a baseline

Why Pure Vector Search Fails on Specific Terminology

Embeddings are great at capturing semantic intent. They’re bad at exact matching. If your corpus contains model numbers, legal citations, API endpoint names, or product SKUs, a dense retriever will routinely miss them because those strings don’t cluster meaningfully in embedding space. “AWS EC2 t3.micro” and “EC2 t3.medium” might land far apart as vectors despite being obviously related in context.

BM25, the inverse-document-frequency-weighted keyword model used by Elasticsearch and Solr, handles this naturally. But BM25 breaks down on paraphrasing — “what’s the return policy” won’t match a document titled “refund procedures and eligibility” unless you’ve done query expansion or synonym handling.
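To make the paraphrase failure concrete, here is a minimal sketch of BM25 scoring in plain Python. It is a textbook-style variant, not the exact code inside rank-bm25, Elasticsearch, or Solr: the k1 and b defaults and the non-negative IDF form are assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every doc against the query with a textbook BM25 variant."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of docs containing each term
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact token match, so no contribution
            # Non-negative IDF variant (the "1 +" inside the log is an assumption)
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "refund procedures and eligibility",
    "the return policy allows returns within 30 days",
]
scores = bm25_scores("what's the return policy", docs)
print(scores)  # the first doc scores 0.0: it shares no tokens with the query
```

The refund document is clearly relevant to the question, but with zero shared tokens it gets a score of exactly zero. That is the gap the dense retriever fills.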

The practical answer is hybrid retrieval: run both, then fuse. If you’re already building a RAG pipeline (check out our RAG pipeline from scratch guide if you’re starting cold), adding BM25 as a second retrieval channel is a relatively small delta with a meaningful precision gain.

Step 1: Install Dependencies

pip install rank-bm25 sentence-transformers qdrant-client anthropic numpy tqdm

We’re using rank-bm25 for the keyword side, sentence-transformers for local embeddings (swap in OpenAI or Cohere if you prefer), and Qdrant as the vector store. You can run Qdrant locally via Docker for free: docker run -p 6333:6333 qdrant/qdrant.

For embeddings in production, all-MiniLM-L6-v2 is fast and costs nothing — roughly 14K tokens/second on a CPU. If you need higher quality, BAAI/bge-large-en-v1.5 benchmarks better on retrieval tasks at about 3–4x the inference time. Both are free; the cost you’re really managing here is Qdrant storage and Claude API tokens on the generation side.

Step 2: Chunk and Index Documents

from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

# --- Config ---
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION = "hybrid_rag_demo"
CHUNK_SIZE = 512  # whitespace-split words, not tokens; tune per your doc type

encoder = SentenceTransformer(EMBED_MODEL)
qdrant = QdrantClient("localhost", port=6333)

def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i+size]) for i in range(0, len(words), size)]

# Example corpus — swap this for your actual document loader
documents = [
    "SOC 2 Type II audit requirements mandate continuous monitoring over a 6-12 month period.",
    "Security certification processes verify that controls are operating effectively.",
    "AWS EC2 t3.micro instances provide 2 vCPUs and 1GB RAM, suitable for low-traffic workloads.",
    "Refund procedures and eligibility depend on the product category and purchase date.",
    "The return policy allows returns within 30 days for unused items in original packaging.",
]

all_chunks = []
for doc in documents:
    all_chunks.extend(chunk_text(doc))

# --- Build BM25 index (in-memory, rebuilt each startup) ---
tokenized_corpus = [chunk.lower().split() for chunk in all_chunks]
bm25 = BM25Okapi(tokenized_corpus)

# --- Build Qdrant vector index ---
qdrant.recreate_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),  # 384 = MiniLM dim
)

embeddings = encoder.encode(all_chunks, show_progress_bar=True)

points = [
    PointStruct(
        id=str(uuid.uuid4()),
        vector=embeddings[i].tolist(),
        payload={"text": all_chunks[i], "chunk_index": i}
    )
    for i in range(len(all_chunks))
]
qdrant.upsert(collection_name=COLLECTION, points=points)
print(f"Indexed {len(all_chunks)} chunks into Qdrant and BM25")

A few notes on the indexing choices: BM25 is rebuilt in memory at startup, which is fine for corpora under ~500K chunks. Beyond that, you’ll want Elasticsearch or OpenSearch’s BM25 implementation with persistent storage. The chunk size of 512 words is a reasonable starting point — technical documentation often benefits from shorter 256-word chunks, while narrative prose does better at 512–1024.
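If you want a feel for how chunk size changes the index before committing to a value, a quick sketch like the one below helps. It uses the same whitespace-splitting approach as chunk_text; the function name chunk_words and the synthetic 1,000-word document are made up for the example.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

sample = " ".join(f"w{i}" for i in range(1000))  # synthetic 1,000-word document

for size in (256, 512, 1024):
    chunks = chunk_words(sample, size)
    # Print: chunk size, number of chunks, word count of the final chunk
    print(size, len(chunks), len(chunks[-1].split()))
```

Halving the chunk size doubles the number of vectors you store and search, and the final chunk of each document is usually shorter than the rest, so expect some variation in chunk length.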

Step 3: Run Parallel Retrieval

import numpy as np

def bm25_search(query: str, k: int = 10) -> list[tuple[int, float]]:
    """Returns list of (chunk_index, bm25_score) sorted descending."""
    tokens = query.lower().split()
    scores = bm25.get_scores(tokens)
    top_k_indices = np.argsort(scores)[::-1][:k]
    return [(int(idx), float(scores[idx])) for idx in top_k_indices]

def vector_search(query: str, k: int = 10) -> list[tuple[int, float]]:
    """Returns list of (chunk_index, cosine_score) sorted descending."""
    query_vec = encoder.encode([query])[0].tolist()
    results = qdrant.search(
        collection_name=COLLECTION,
        query_vector=query_vec,
        limit=k,
        with_payload=True
    )
    return [(r.payload["chunk_index"], r.score) for r in results]

Both functions return the same shape: a list of (chunk_index, score) tuples. That’s intentional — the fusion step doesn’t care about the raw scores, only the rank positions, which is what makes RRF robust.
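Calling the two functions one after the other runs them sequentially. If you want them to actually overlap, one option is a thread pool. The sketch below is an assumption about how you might wire that up, not part of the pipeline above: retrieve_in_parallel and the two stand-in searchers are hypothetical names, and the fake results exist only so the example runs without an index.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Ranked = list[tuple[int, float]]

def retrieve_in_parallel(
    query: str,
    searchers: list[Callable[[str, int], Ranked]],
    k: int = 10,
) -> list[Ranked]:
    """Run each searcher on the same query concurrently; results keep searcher order."""
    with ThreadPoolExecutor(max_workers=len(searchers)) as pool:
        futures = [pool.submit(search, query, k) for search in searchers]
        return [f.result() for f in futures]

# Stand-in searchers with hypothetical results
def fake_keyword_search(query: str, k: int) -> Ranked:
    return [(0, 3.2), (2, 1.1)][:k]

def fake_vector_search(query: str, k: int) -> Ranked:
    return [(1, 0.83), (0, 0.79)][:k]

keyword_hits, vector_hits = retrieve_in_parallel(
    "SOC 2 Type II audit requirements",
    [fake_keyword_search, fake_vector_search],
)
print(keyword_hits, vector_hits)
```

In the real pipeline you would pass bm25_search and vector_search instead of the stand-ins. Threads mostly help when the vector search waits on the network; in-memory BM25 is CPU-bound Python, so don't expect a large speedup on a purely local setup.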

Step 4: Fuse Results with Reciprocal Rank Fusion

RRF is the simplest fusion method that actually works. The formula is: for each document, sum 1 / (k + rank) across all retrieval systems, where k is a constant (typically 60) that dampens the influence of very high-ranked results. Documents appearing at rank 1 in multiple systems get a large boost; documents appearing only in one system at rank 50 get almost nothing.
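A few worked numbers make the formula easier to reason about. This is plain arithmetic with k = 60; the four scenarios are illustrative picks, not measurements.

```python
K = 60  # the constant k from the RRF formula

def rrf_term(rank: int, k: int = K) -> float:
    """Contribution of a single list in which the document sits at `rank` (1-based)."""
    return 1.0 / (k + rank)

both_rank_1 = rrf_term(1) + rrf_term(1)   # rank 1 in both systems
one_rank_1 = rrf_term(1)                  # rank 1 in one system, absent from the other
both_rank_5 = rrf_term(5) + rrf_term(5)   # rank 5 in both systems
one_rank_50 = rrf_term(50)                # rank 50 in a single system

print(f"{both_rank_1:.4f} {one_rank_1:.4f} {both_rank_5:.4f} {one_rank_50:.4f}")
# 0.0328 0.0164 0.0308 0.0091
```

Note that a document at rank 5 in both systems (0.0308) beats a document at rank 1 in only one system (0.0164). Agreement between retrievers counts for more than a single first place.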

from collections import defaultdict

def reciprocal_rank_fusion(
    *ranked_lists: list[tuple[int, float]],
    k: int = 60,
    top_n: int = 5
) -> list[tuple[int, float]]:
    """
    Fuse multiple ranked lists using RRF.
    Each ranked_list is [(chunk_index, score), ...] sorted descending by score.
    Returns top_n results as [(chunk_index, rrf_score), ...].
    """
    rrf_scores: dict[int, float] = defaultdict(float)
    
    for ranked_list in ranked_lists:
        for rank, (chunk_idx, _score) in enumerate(ranked_list, start=1):
            rrf_scores[chunk_idx] += 1.0 / (k + rank)
    
    sorted_results = sorted(rrf_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_results[:top_n]

def hybrid_search(query: str, top_n: int = 5) -> list[str]:
    """Full hybrid retrieval pipeline. Returns top_n chunk texts."""
    bm25_results = bm25_search(query, k=20)
    vector_results = vector_search(query, k=20)
    
    fused = reciprocal_rank_fusion(bm25_results, vector_results, top_n=top_n)
    
    return [all_chunks[chunk_idx] for chunk_idx, _score in fused]

Why retrieve 20 candidates per system before fusing to top 5? Because RRF rewards documents that appear in both lists. If you only retrieve 5 per system, you limit the overlap opportunity. Fetching 20 and fusing to 5 consistently outperforms 5+5 in my testing — the extra latency is negligible (single-digit milliseconds for BM25, sub-50ms for a local Qdrant instance).
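Here is a small, self-contained illustration of that overlap effect. The rankings are hypothetical: document 42 sits at rank 7 in the BM25 list and rank 6 in the vector list, and every other id appears in only one list. The compact rrf helper works on ids only and is a simplified stand-in for reciprocal_rank_fusion.

```python
from collections import defaultdict

def rrf(ranked_ids: list[list[int]], k: int = 60, top_n: int = 5) -> list[int]:
    """Fuse lists of document ids (best first) and return the top_n fused ids."""
    scores: dict[int, float] = defaultdict(float)
    for ids in ranked_ids:
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in ordered][:top_n]

# Hypothetical rankings, 20 ids each, with doc 42 in the middle of both
bm25_ids = list(range(100, 106)) + [42] + list(range(106, 119))
vector_ids = list(range(200, 205)) + [42] + list(range(205, 219))

shallow = rrf([bm25_ids[:5], vector_ids[:5]])    # 5 candidates per system
deep = rrf([bm25_ids[:20], vector_ids[:20]])     # 20 candidates per system

print(42 in shallow)  # False: doc 42 never entered either candidate list
print(deep[0])        # 42: two mid-rank appearances outscore any single rank-1 hit
```

With only 5 candidates per system, the one document both retrievers agree on is cut before fusion ever sees it.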

Step 5: Pass Fused Context to Claude

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from env

def rag_query(question: str) -> str:
    # Retrieve fused context
    context_chunks = hybrid_search(question, top_n=5)
    context = "\n\n---\n\n".join(context_chunks)
    
    # Build the RAG prompt
    system_prompt = (
        "You are a precise assistant. Answer the question using ONLY the provided context. "
        "If the context doesn't contain the answer, say so explicitly."
    )
    
    user_message = f"""Context:
{context}

Question: {question}

Answer:"""
    
    response = client.messages.create(
        model="claude-haiku-4-5",       # ~$0.0008 per 1K input tokens
        max_tokens=512,
        system=system_prompt,
        messages=[{"role": "user", "content": user_message}]
    )
    
    return response.content[0].text

# Test it
answer = rag_query("What are the SOC 2 Type II audit requirements?")
print(answer)

Using Claude Haiku here because RAG answers are usually context-bound: the retrieval is doing the heavy lifting, not the model’s parametric knowledge. A typical RAG query with 5 chunks of 512 words each carries about 2,560 words of context, which is roughly 3,300 input tokens at ~1.3 tokens per word. At the ~$0.0008 per 1K input tokens noted in the code above, that’s a little under $0.003 of input per query, plus generation. If you’re doing thousands of queries daily, that matters. For cases where the reasoning over retrieved content is complex, step up to Sonnet, but start with Haiku and measure whether quality suffers before paying the premium.
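A rough estimator makes this arithmetic easy to redo with your own numbers. Treat every default below as an assumption: the tokens-per-word ratio depends on the tokenizer and your text, and the price is the figure from the code comment above, which may well be out of date by the time you read this.

```python
def estimate_input_cost(
    n_chunks: int = 5,
    words_per_chunk: int = 512,
    tokens_per_word: float = 1.3,      # assumption: rough average for English prose
    usd_per_1k_input: float = 0.0008,  # assumption: verify against current pricing
) -> tuple[int, float]:
    """Return (estimated input tokens, estimated input cost in USD) for one query."""
    tokens = round(n_chunks * words_per_chunk * tokens_per_word)
    return tokens, tokens / 1000 * usd_per_1k_input

tokens, cost = estimate_input_cost()
print(f"{tokens} tokens, ${cost:.4f} input cost per query")
print(f"${cost * 10_000:.2f} per 10,000 queries (input only)")
```

This covers retrieved context only. The system prompt, the question itself, and output tokens come on top, so treat the result as a lower bound.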

Grounding Claude’s responses in retrieved facts also directly addresses hallucination risk. If you’re not already familiar with the failure modes here, the structured output and verification patterns for reducing LLM hallucinations article covers the verification layer you should add on top of this retrieval step.

Step 6: Benchmark the Improvement

def precision_at_k(retrieved: list[str], relevant: list[str], k: int = 5) -> float:
    """Simple precision@k: fraction of top-k results that contain a relevant marker."""
    top_k = retrieved[:k]
    # Compare case-insensitively so "refund procedures" matches "Refund procedures"
    hits = sum(1 for chunk in top_k if any(rel.lower() in chunk.lower() for rel in relevant))
    return hits / k

# Example evaluation set: query -> list of relevant substring markers
eval_set = [
    {
        "query": "SOC 2 Type II audit requirements",
        "relevant_keywords": ["SOC 2 Type II", "6-12 month"],
    },
    {
        "query": "EC2 instance specs",
        "relevant_keywords": ["t3.micro", "2 vCPUs"],
    },
    {
        "query": "how do I get a refund",
        "relevant_keywords": ["return policy", "refund procedures", "30 days"],
    },
]

# Compare hybrid vs vector-only
for item in eval_set:
    q = item["query"]
    rel = item["relevant_keywords"]
    
    hybrid_results = hybrid_search(q, top_n=5)
    vector_only = [all_chunks[idx] for idx, _ in vector_search(q, k=5)]
    
    h_p5 = precision_at_k(hybrid_results, rel)
    v_p5 = precision_at_k(vector_only, rel)
    
    print(f"Query: '{q}'")
    print(f"  Hybrid P@5: {h_p5:.2f}  |  Vector-only P@5: {v_p5:.2f}")
    print()

On a realistic technical documentation corpus of 5,000 chunks, you’ll typically see hybrid search matching or exceeding pure semantic search on every query type, with the biggest gains on exact-term queries (model numbers, API names, legal citations). The gain on paraphrase-heavy queries is smaller — which is expected, since BM25 adds nothing on those and RRF’s contribution from BM25 washes out.

For deeper evaluation tooling, the semantic search embeddings guide covers NDCG and MRR metrics that give a fuller picture than precision@5 alone.
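As a taste of what MRR looks like, here is a minimal sketch using the same substring-marker notion of relevance as the benchmark above. The function names and the two sample runs are illustrative, not taken from that guide.

```python
def reciprocal_rank(retrieved: list[str], relevant: list[str]) -> float:
    """1/rank of the first retrieved chunk containing any relevant marker, else 0."""
    for rank, chunk in enumerate(retrieved, start=1):
        if any(rel.lower() in chunk.lower() for rel in relevant):
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs: list[tuple[list[str], list[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

runs = [
    # First relevant chunk at rank 2 -> reciprocal rank 0.5
    (["Security certification processes", "SOC 2 Type II audit requirements"], ["SOC 2 Type II"]),
    # First relevant chunk at rank 1 -> reciprocal rank 1.0
    (["AWS EC2 t3.micro instances"], ["t3.micro"]),
]
print(mean_reciprocal_rank(runs))  # 0.75
```

Unlike precision@5, MRR cares about where the first good result lands, which matters when the top chunk dominates what the model reads.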

Common Errors

BM25 index returns all-zero scores

This usually means your tokenization doesn’t match between indexing and querying. If you lowercase during indexing (chunk.lower().split()), make sure you do the same at query time (query.lower().split()). Plain whitespace splitting also leaves punctuation attached, so the query token “requirements?” will not match “requirements” in the corpus. Very low scores also show up when every query term appears in most documents (stopwords), because their IDF weight is close to zero. Solution: use one tokenizer for both sides, and add stopword filtering if your queries are very generic.
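One way to rule out mismatches is to route both the corpus and the query through a single tokenizer function. The sketch below is one possible choice, not what rank-bm25 does internally: the regex and the name tokenize are assumptions, picked so that trailing punctuation is dropped while identifiers like t3.micro stay in one piece.

```python
import re

# Alphanumeric runs, optionally joined by inner dots (keeps "t3.micro" intact)
TOKEN_RE = re.compile(r"[a-z0-9]+(?:\.[a-z0-9]+)*")

def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract tokens, dropping surrounding punctuation."""
    return TOKEN_RE.findall(text.lower())

print(tokenize("What are the SOC 2 Type II audit requirements?"))
# ['what', 'are', 'the', 'soc', '2', 'type', 'ii', 'audit', 'requirements']
print(tokenize("AWS EC2 t3.micro instances provide 2 vCPUs and 1GB RAM."))
# ['aws', 'ec2', 't3.micro', 'instances', 'provide', '2', 'vcpus', 'and', '1gb', 'ram']
```

You would then build the index with [tokenize(chunk) for chunk in all_chunks] and score queries with bm25.get_scores(tokenize(query)), so the two sides cannot drift apart.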

Qdrant dimension mismatch error on upsert

You’ll get Wrong input: Vector inserting error: expected dim: 384, got 768 if you switch embedding models after creating the collection. recreate_collection drops and rebuilds — use it freely in development. In production, treat the collection as append-only and create a new collection when you change models. The collection name should encode the model: docs_minilm_384 not just docs.
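A tiny helper can make that naming convention automatic. This is a hypothetical utility, not part of the Qdrant client: collection_name_for and its slug rules are just one way to do it.

```python
import re

def collection_name_for(prefix: str, model_name: str, dim: int) -> str:
    """Build a collection name that encodes the embedding model and its dimension."""
    # Keep only the last path segment of names like 'BAAI/bge-large-en-v1.5'
    short = model_name.split("/")[-1].lower()
    # Collapse anything that is not a letter or digit into single underscores
    slug = re.sub(r"[^a-z0-9]+", "_", short).strip("_")
    return f"{prefix}_{slug}_{dim}"

print(collection_name_for("docs", "all-MiniLM-L6-v2", 384))
# docs_all_minilm_l6_v2_384
print(collection_name_for("docs", "BAAI/bge-large-en-v1.5", 1024))
# docs_bge_large_en_v1_5_1024
```

Because the name changes whenever the model or dimension changes, switching encoders creates a fresh collection instead of colliding with the old one.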

RRF returns fewer results than expected

If top_n=5 but you’re getting 3 results, your corpus has fewer unique chunks than you asked for. RRF can only return chunks that appear in at least one candidate list: with k=20 in both searches but only 10 chunks total, Qdrant caps at the collection size and BM25 returns every chunk, so the fused list has at most 10 unique entries. Not a bug, just size arithmetic. Add an assertion at pipeline startup to catch this early: assert len(all_chunks) >= top_n.

What to Build Next

The natural extension here is query-time reranking with a cross-encoder. After RRF gives you your top 10 candidates, pass each (query, chunk) pair through a cross-encoder like cross-encoder/ms-marco-MiniLM-L-6-v2 — it scores relevance jointly rather than independently, which consistently improves precision@5 by another 5–15 percentage points. It adds ~100ms latency per query on CPU, which is acceptable for most RAG applications. Pair that with a fallback strategy for when retrieval confidence is low, and you have a production-grade retrieval system.
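The reranking step itself is only a few lines. The sketch below keeps the scorer pluggable so it runs offline: rerank and overlap_scorer are hypothetical names, and the word-overlap scorer is a crude stand-in for a real cross-encoder, not a substitute for one.

```python
from typing import Callable, Sequence

def rerank(
    query: str,
    chunks: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 5,
) -> list[str]:
    """Order chunks by a joint (query, chunk) relevance score, highest first."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]

# Stand-in scorer so the sketch runs without a model: counts shared lowercase words
def overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    return [float(len(set(q.lower().split()) & set(c.lower().split()))) for q, c in pairs]

candidates = [
    "Security certification processes verify that controls are operating effectively.",
    "SOC 2 Type II audit requirements mandate continuous monitoring over a 6-12 month period.",
]
top = rerank("SOC 2 Type II audit requirements", candidates, overlap_scorer, top_n=1)
print(top)

# With sentence-transformers installed, the real scorer would be:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   rerank(query, candidates, model.predict)
```

The cross-encoder sees the query and the chunk together, which is why it is slower than the bi-encoder used for indexing and why you only run it on the handful of candidates that survive fusion.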

If you’re scaling to millions of documents, consider moving the BM25 side to Elasticsearch (which has a native hybrid search API in 8.x) and keeping Qdrant for vectors — the operational overhead is worth it once you’re beyond ~1M chunks and need distributed BM25 scoring.

Bottom line by reader type: Solo founder building a knowledge-base chatbot — start here, exactly as written, with Qdrant local + BM25 in memory. It’s free, deploys in an afternoon, and the precision gain over pure vector search is immediate and measurable. Engineering team on a production system — replace in-memory BM25 with Elasticsearch, add cross-encoder reranking, and instrument retrieval scores in your observability stack (Langfuse or similar). Enterprise with strict data residency — self-host Qdrant Enterprise or Weaviate, and run the encoder model locally; the hybrid search architecture itself doesn’t change.

Hybrid search is not a premature optimization; it’s the baseline you should be starting from. Pure vector search is the shortcut that costs you precision in production.

Frequently Asked Questions

What is the difference between BM25 and vector search in a RAG pipeline?

BM25 is a term-frequency/inverse-document-frequency model that scores documents based on exact word matches — it excels at specific terminology, product names, and citations. Vector search uses dense embeddings to find semantically similar content regardless of exact wording. In a RAG pipeline, BM25 catches what vector search misses on precise queries, and vector search catches what BM25 misses on paraphrased or conceptual queries.

How does Reciprocal Rank Fusion (RRF) work?

RRF assigns each document a score of 1 / (k + rank) for each retrieval system it appears in, then sums those scores across systems. The constant k (typically 60) prevents top-ranked documents from dominating too heavily. Documents appearing near the top of multiple systems accumulate higher scores and rise in the final merged list. It’s model-free — you don’t need to normalize or calibrate the raw scores from BM25 and cosine similarity against each other.

Can I use hybrid search with Pinecone or other managed vector databases?

Pinecone’s sparse-dense hybrid search supports BM25-style sparse vectors natively — you encode your documents with their built-in sparse encoder and store both sparse and dense vectors in the same index. Weaviate and Elasticsearch 8.x also have native hybrid retrieval APIs. The approach in this tutorial (separate BM25 + Qdrant with RRF fusion) gives you more control over the fusion logic and works with any vector store, at the cost of managing two systems.

How much does hybrid search add to RAG query latency?

With the setup in this tutorial — local BM25 in memory + local Qdrant — the BM25 search adds under 10ms for corpora up to ~100K chunks, and RRF fusion is microseconds. The dominant latency is still the vector search (20–80ms locally, 50–150ms on managed services) and the LLM generation step (500ms–3s depending on model and output length). Hybrid retrieval adds negligible overhead relative to those two costs.
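These numbers vary a lot with hardware and corpus size, so it is worth measuring on your own setup. A minimal timing harness might look like this; median_latency_ms is a hypothetical helper and the sorting workload is only a stand-in so the example runs anywhere.

```python
import statistics
import time
from typing import Callable

def median_latency_ms(fn: Callable[[], object], repeats: int = 20) -> float:
    """Median wall-clock time of fn() over `repeats` runs, in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Stand-in workload; in the tutorial pipeline you would pass something like
# lambda: bm25_search(query, k=20) or lambda: vector_search(query, k=20)
latency = median_latency_ms(lambda: sorted(range(10_000), reverse=True))
print(f"median: {latency:.3f} ms")
```

The median is used rather than the mean so a single slow run (a cold cache, a garbage-collection pause) doesn't skew the picture.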

Should I use hybrid search for every RAG application?

Not always. If your documents are purely narrative (news articles, blog posts, fiction) and your queries are conceptual, pure vector search may be sufficient and simpler to maintain. Hybrid search delivers the biggest gains when your corpus contains specific identifiers — model numbers, legal citations, API names, person names, dates — that don’t embed well. When in doubt, benchmark on a representative sample of 50–100 real queries before deciding.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
