Sunday, April 5

If your RAG agent is hallucinating or returning irrelevant context, the problem is almost never the LLM — it’s your retrieval layer. Bad semantic search embeddings mean the right chunks never reach Claude, so it fabricates answers from whatever did show up. This tutorial walks you through choosing embedding models, building a working retrieval pipeline, and tuning it so your agent actually finds what users are asking for.

By the end, you’ll have a working Python-based semantic search pipeline backed by a vector store, with concrete techniques for measuring and improving retrieval quality.

  1. Install dependencies — set up sentence-transformers, Qdrant client, and supporting libraries
  2. Choose your embedding model — compare OpenAI, Cohere, and open-source options with real tradeoffs
  3. Chunk and embed your knowledge base — build the index with sensible defaults
  4. Run semantic queries — wire up retrieval and inspect results
  5. Tune retrieval quality — reranking, hybrid search, and score thresholds
  6. Integrate with your Claude agent — feed retrieved context cleanly

Step 1: Install Dependencies

You’ll need sentence-transformers for open-source embeddings, qdrant-client for the vector store, and anthropic for the agent layer. If you want to compare OpenAI embeddings, add openai.

pip install sentence-transformers qdrant-client anthropic openai cohere tiktoken

We’re using Qdrant in-memory mode for this tutorial, which means no infrastructure to spin up. For production, you’ll want a persistent Qdrant instance or a managed alternative — see our comparison of Pinecone, Weaviate, and Qdrant for RAG agents for a thorough breakdown of which fits which use case.

Step 2: Choose Your Embedding Model

This decision matters more than most tutorials admit. The wrong model will cap your retrieval quality regardless of how much you tune downstream.

The shortlist in 2024

  • text-embedding-3-small (OpenAI) — 1536 dims, ~$0.02 per million tokens. Best default for most teams. Strong general performance, fast, cheap.
  • text-embedding-3-large (OpenAI) — 3072 dims, ~$0.13 per million tokens. Measurably better on technical and multilingual content. Worth it if you’re embedding large corpora once.
  • embed-english-v3.0 (Cohere) — purpose-built for retrieval, supports int8 quantization. Competitive with OpenAI at similar pricing. Has a native reranker that works well with it.
  • BAAI/bge-m3 (open-source) — runs locally, supports 100+ languages, 8192 token context window. ~600ms per batch on CPU, much faster on GPU. Zero ongoing cost after compute.
  • all-MiniLM-L6-v2 — tiny (22M params), fast, surprisingly good for general English. Don’t use it for technical domains without testing.

My recommendation: Start with text-embedding-3-small unless you’re cost-constrained or need to self-host. If you’re building something domain-specific (legal, medical, code), test bge-m3 — the longer context window alone often wins on document-heavy knowledge bases. For custom domain embeddings, we’ve covered the process of building domain-specific embedding models from scratch.

Step 3: Chunk and Embed Your Knowledge Base

Chunking strategy is where most RAG implementations quietly fail. Too large and the retrieved chunk buries the relevant sentence in noise. Too small and you lose context the LLM needs.

A practical default: 512 tokens per chunk, 50-token overlap, split on sentence boundaries when possible. Adjust based on your document type — code snippets need different treatment than prose.

from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
import tiktoken

# Load the embedding model (swap for OpenAI/Cohere if preferred)
model = SentenceTransformer("BAAI/bge-m3")
EMBEDDING_DIM = 1024  # bge-m3 output dimension

# Init Qdrant in-memory (pass url=... instead for a persistent/hosted instance)
client = QdrantClient(":memory:")
client.create_collection(
    collection_name="knowledge_base",
    vectors_config=VectorParams(size=EMBEDDING_DIM, distance=Distance.COSINE),
)

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Token-aware chunking with overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end == len(tokens):
            break  # stop here, or the overlap step emits a tiny duplicate tail chunk
        start += chunk_size - overlap
    return chunks

def embed_and_index(documents: list[dict]):
    """
    documents: list of {"text": str, "metadata": dict}
    """
    points = []
    for doc in documents:
        chunks = chunk_text(doc["text"])
        embeddings = model.encode(chunks, normalize_embeddings=True)  # normalize for cosine
        for chunk, embedding in zip(chunks, embeddings):
            points.append(PointStruct(
                id=str(uuid.uuid4()),
                vector=embedding.tolist(),
                payload={"text": chunk, **doc["metadata"]},
            ))
    client.upsert(collection_name="knowledge_base", points=points)
    print(f"Indexed {len(points)} chunks")

# Example usage
docs = [
    {"text": "Your long document text here...", "metadata": {"source": "doc1.pdf", "category": "policy"}},
]
embed_and_index(docs)

One thing the docs don’t make obvious: always set normalize_embeddings=True when you use cosine distance. Without it you’re comparing raw magnitudes, not directions, and your similarity scores become unreliable.
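A quick pure-Python illustration of why this matters: once vectors are unit-length, the dot product is the cosine similarity, so scores stay in a predictable range regardless of vector magnitude.

```python
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

a, b = [3.0, 4.0], [6.0, 8.0]           # same direction, different magnitudes
print(dot(a, b))                         # 50.0 -- raw magnitudes dominate
print(dot(normalize(a), normalize(b)))   # 1.0 -- true cosine similarity
```

Unnormalized, two chunks pointing the same semantic direction can score wildly differently just because one is longer.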

Step 4: Run Semantic Queries

def semantic_search(query: str, top_k: int = 5, score_threshold: float = 0.65) -> list[dict]:
    """
    Returns ranked chunks above the score threshold.
    Lower the threshold if you're getting zero results; raise it to cut noise.
    """
    query_embedding = model.encode([query], normalize_embeddings=True)[0]
    
    # Note: newer qdrant-client versions prefer query_points(); search() still works here
    results = client.search(
        collection_name="knowledge_base",
        query_vector=query_embedding.tolist(),
        limit=top_k,
        score_threshold=score_threshold,  # Qdrant filters below this cosine similarity
        with_payload=True,
    )
    
    return [
        {
            "text": r.payload["text"],
            "score": r.score,
            "source": r.payload.get("source", "unknown"),
        }
        for r in results
    ]

# Test it
hits = semantic_search("What is the refund policy for annual subscriptions?")
for h in hits:
    print(f"[{h['score']:.3f}] {h['source']}: {h['text'][:120]}...")

The score_threshold parameter is doing real work here. Setting it to 0.0 means you always return something, even if it’s completely unrelated — which is how you end up feeding garbage context to your LLM. A score below 0.6 usually means the knowledge base doesn’t contain the answer, and you should tell the user that rather than hallucinating.

Step 5: Tune Retrieval Quality

Getting the pipeline working is step one. Getting it working well requires measurement and iteration.

Add a reranker

Embedding similarity captures semantic proximity but doesn’t distinguish between “document mentions the word” and “document actually answers the question.” A cross-encoder reranker fixes this. Cohere’s reranker costs ~$1 per thousand searches, which is worth it on user-facing products. The open-source alternative is cross-encoder/ms-marco-MiniLM-L-6-v2.

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 3) -> list[dict]:
    """
    Takes semantic search candidates, reranks by relevance to query.
    """
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    
    # Attach reranker score and sort
    for i, candidate in enumerate(candidates):
        candidate["rerank_score"] = float(scores[i])
    
    return sorted(candidates, key=lambda x: x["rerank_score"], reverse=True)[:top_n]

# Usage: first retrieve more candidates, then rerank
query = "What is the refund policy for annual subscriptions?"
candidates = semantic_search(query, top_k=15, score_threshold=0.5)
final_results = rerank(query, candidates, top_n=3)

Hybrid search (keyword + semantic)

Pure semantic search misses exact matches — product codes, names, specific version numbers. Combine BM25 keyword search with vector search using Qdrant’s built-in sparse vector support, or run a simple BM25 pass first. Retrieval quality typically improves 10-20% on domain-specific corpora when you add keyword matching.
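A common way to combine the two result lists is reciprocal rank fusion (RRF). This is a generic sketch of the merge step, not Qdrant’s built-in hybrid API — it assumes you already have a BM25-ranked and a vector-ranked list of chunk IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each item scores sum(1 / (k + rank)) across lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # exact-match winners
vector_hits = ["doc1", "doc5", "doc3"]  # semantic winners
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# ['doc1', 'doc3', 'doc5', 'doc7']
```

Items that rank well in both lists rise to the top; the constant k dampens the influence of any single list’s top position.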

Measure before you tune

Pick 20-30 representative queries, annotate the correct chunks manually, then compute recall@k. If recall@5 is below 70%, your chunking or embedding model is the bottleneck. If it’s above 80% but your agent still hallucinates, the failure has moved to prompt construction rather than retrieval. For systematic evaluation of your full agent pipeline, the LLM output quality evaluation guide covers metrics and A/B testing approaches worth reading.
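The metric itself is a few lines once you have labels. A minimal sketch, assuming each eval item pairs the ranked chunk IDs your pipeline returned with the hand-annotated gold chunk IDs for that query:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Toy eval set: (ranked retrieval results, gold chunk IDs) per query
eval_set = [
    (["c1", "c4", "c2"], {"c1", "c2"}),  # both gold chunks in top 3 -> 1.0
    (["c9", "c3", "c8"], {"c7"}),        # gold chunk missed entirely -> 0.0
]
avg = sum(recall_at_k(r, g, k=3) for r, g in eval_set) / len(eval_set)
print(f"recall@3 = {avg:.2f}")  # recall@3 = 0.50
```

Run this once before you touch chunk sizes or thresholds, and again after each change — otherwise you’re tuning blind.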

Step 6: Integrate With Your Claude Agent

import anthropic

claude = anthropic.Anthropic()

def rag_query(user_question: str) -> str:
    # Retrieve and rerank
    candidates = semantic_search(user_question, top_k=15, score_threshold=0.55)
    
    if not candidates:
        return "I don't have information about that in my knowledge base."
    
    top_chunks = rerank(user_question, candidates, top_n=3)
    
    # Build context block
    context = "\n\n---\n\n".join(
        f"Source: {c['source']}\n{c['text']}" for c in top_chunks
    )
    
    response = claude.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system="""You are a helpful assistant. Answer questions using ONLY the provided context.
If the context doesn't contain the answer, say so explicitly — do not guess.""",
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {user_question}"
            }
        ]
    )
    
    return response.content[0].text

# Test
print(rag_query("What are the cancellation terms for enterprise plans?"))

The system prompt instruction to use “ONLY the provided context” matters. Without it, Claude will helpfully fill gaps with its training data — which looks right but may be completely wrong for your domain. This is especially important if you’re building agents for compliance-sensitive use cases. For agents where structured responses matter, the guide on structured JSON output from Claude covers how to get consistent formats from your retrieval results.

Common Errors

Error 1: Score threshold too high, zero results returned

Symptom: your agent always falls back to “I don’t have information about that.” Start by logging the raw scores for every query. If you’re seeing max scores of 0.55-0.60 on questions you know the knowledge base covers, your embedding model and document content are in different semantic spaces. Fix: either fine-tune the model on domain vocabulary, or lower the threshold and compensate with a strong reranker. Also check that you’re not comparing normalized vs. unnormalized vectors — a common silent bug.
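A quick way to gather that evidence is a diagnostic pass over scores logged with the threshold disabled. This is a hypothetical helper operating on already-collected scores, not part of the pipeline above:

```python
def diagnose_scores(query_results: dict[str, list[float]], floor: float = 0.6) -> list[str]:
    """Flag queries whose best raw similarity never clears the floor."""
    suspect = []
    for query, scores in query_results.items():
        best = max(scores) if scores else 0.0
        print(f"{best:.3f}  {query}")
        if best < floor:
            suspect.append(query)
    return suspect

# Raw top scores logged for queries the knowledge base definitely covers
logged = {
    "refund policy for annual plans": [0.58, 0.55, 0.51],
    "enterprise cancellation terms": [0.72, 0.68, 0.61],
}
print(diagnose_scores(logged))  # ['refund policy for annual plans']
```

If most known-answerable queries land in the suspect list, the problem is model-data mismatch, not the threshold.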

Error 2: Chunks too large, retrieval score is high but context is wrong

Symptom: scores look good (0.75+) but the returned chunk contains the answer buried in 400 words of unrelated content. The LLM then synthesizes across everything and generates a plausible-sounding wrong answer. Fix: reduce chunk size to 256-384 tokens and re-index. Yes, this doubles your index size; it’s usually worth it.

Error 3: Duplicate chunks diluting context budget

Symptom: you retrieve top-5 chunks but three of them are near-identical overlapping windows from the same paragraph. Fix: add a deduplication pass after retrieval — compute pairwise similarity between candidates and drop anything above 0.92 similarity to an already-selected chunk before passing to the LLM.

def deduplicate_chunks(chunks: list[dict], sim_threshold: float = 0.92) -> list[dict]:
    """Remove near-duplicate chunks from retrieval results."""
    if not chunks:
        return chunks
    
    texts = [c["text"] for c in chunks]
    embeddings = model.encode(texts, normalize_embeddings=True)
    
    selected = [0]  # Always keep the top result
    for i in range(1, len(chunks)):
        is_duplicate = False
        for j in selected:
            sim = float(embeddings[i] @ embeddings[j])  # dot product of normalized = cosine
            if sim > sim_threshold:
                is_duplicate = True
                break
        if not is_duplicate:
            selected.append(i)
    
    return [chunks[i] for i in selected]

What to Build Next

The natural extension here is query expansion: before embedding the user’s question, use Claude to generate 2-3 alternative phrasings of it, embed all variants, retrieve candidates for each, then merge and deduplicate. This typically improves recall by 15-25% on short, ambiguous queries — the kind users actually type. It adds one Claude API call per search (roughly $0.001 at Haiku pricing), which is almost always worth it for user-facing products. Pair this with a proper agent benchmarking framework to measure whether each change actually moves the needle before shipping it.
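The merge-and-deduplicate step of that idea can be sketched independently of the Claude call. Assuming each variant’s retrieval returns scored hits shaped like the semantic_search results above, merging keeps each chunk’s best score across variants:

```python
def merge_variant_results(results_per_variant: list[list[dict]], top_k: int = 5) -> list[dict]:
    """Merge hits from multiple query phrasings, keeping each chunk's best score."""
    best: dict[str, dict] = {}
    for hits in results_per_variant:
        for hit in hits:
            key = hit["text"]  # in practice, dedupe on a stable chunk ID
            if key not in best or hit["score"] > best[key]["score"]:
                best[key] = hit
    return sorted(best.values(), key=lambda h: h["score"], reverse=True)[:top_k]

variant_a = [{"text": "refunds within 30 days", "score": 0.81}]
variant_b = [{"text": "refunds within 30 days", "score": 0.77},
             {"text": "annual plans renew automatically", "score": 0.71}]
merged = merge_variant_results([variant_a, variant_b])
print([h["score"] for h in merged])  # [0.81, 0.71]
```

The duplicate chunk surfaces under two phrasings but is counted once, at its strongest score, so variants widen recall without crowding the context window.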

If you’re thinking about infrastructure for this at scale — batching embeddings overnight for large corpora, managing vector store costs, picking the right serverless backend to serve it — the Claude batch processing guide covers the architecture for handling 10K+ documents efficiently.

Bottom Line: Which Setup for Which Team

Solo founder, tight budget: bge-m3 locally + Qdrant in-memory (or free tier), cross-encoder/ms-marco for reranking. Zero ongoing embedding cost, good quality. Expect ~1-2s latency per query on CPU.

Small team, production SaaS: text-embedding-3-small + Qdrant cloud + Cohere reranker. Reliable, fast, predictable costs. At 100K queries/month you’re looking at ~$50-80 total for embedding + reranking.

Enterprise, domain-specific corpus: Fine-tune bge-m3 on your domain data, host on your own GPU, hybrid BM25 + vector search, Cohere or custom reranker. Higher upfront investment, meaningfully better retrieval on specialized vocabulary. The semantic search embeddings you generate from domain-specific fine-tuning routinely outperform general models by 15-30% on in-domain recall.

The pipeline is less complex than it looks. The hard part is measurement — you need ground truth labels to know if your tuning is actually helping. Build that eval set early, before you start tweaking.

Frequently Asked Questions

What is the best embedding model for RAG in 2024?

For most teams, OpenAI’s text-embedding-3-small is the best default — it’s fast, cheap ($0.02/million tokens), and performs well on general English content. For domain-specific or multilingual corpora, BAAI/bge-m3 is competitive and has a larger context window (8192 tokens) that helps on long documents. Always benchmark on your own data before committing.

How do I improve retrieval quality when my RAG agent keeps returning wrong chunks?

Start by logging similarity scores for failing queries to diagnose whether it’s a threshold issue or a model mismatch. Add a cross-encoder reranker as a second pass — this alone often fixes relevance problems without changing your embedding model. Also check your chunk size; 512 tokens is a reasonable max, but dropping to 256 frequently helps precision.

What chunk size should I use for embedding documents?

512 tokens with 50-token overlap is a solid starting point for prose. For technical documentation or code, 256 tokens tends to work better because the relevant answer is usually concentrated. Always split on sentence or paragraph boundaries rather than raw token counts — cutting mid-sentence degrades embedding quality noticeably.

How do I prevent my Claude agent from hallucinating when search returns poor results?

Set a score threshold (0.6-0.65 cosine similarity is a reasonable floor) and explicitly return “I don’t have this information” when nothing clears it. Add a system prompt instruction telling Claude to use only the provided context and to say so if the answer isn’t there. Don’t rely on Claude’s judgment alone — enforce the constraint at the retrieval layer first.

Can I use semantic search embeddings with a local vector database?

Yes. Qdrant runs in-memory (for dev) or as a local Docker container (for production) with no cloud dependency. For smaller knowledge bases under ~100K chunks, in-process libraries like ChromaDB or even FAISS work fine. The tradeoff is operational overhead vs. managed service cost — Qdrant Cloud’s free tier handles most small-to-medium RAG deployments without any self-hosting.

What is the difference between semantic search and keyword search for RAG?

Keyword search (BM25) matches exact or stemmed terms — it excels at product codes, proper nouns, and precise technical terms but misses paraphrased questions entirely. Semantic search matches by meaning, catching synonyms and rephrased queries but sometimes missing exact-match specifics. Hybrid search combines both, and typically outperforms either alone by 10-20% on real-world knowledge base queries.

Put this into practice

Try the Search Specialist agent — ready to use, no setup required.

Browse Agents →

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
