Most Claude agents fail not because the model is bad at reasoning, but because they're working with stale, generic, or hallucinated knowledge. If you've ever watched an agent confidently answer a product-specific question with completely wrong details, you already know the problem. The fix is giving your agent a semantic search layer built on embeddings and a vector database: a searchable knowledge store that lets it retrieve accurate, domain-specific information before generating a response. This article shows you exactly how to build that layer, end to end, with working code you can drop into a real project.
Why Keyword Search Breaks for Agent Knowledge Retrieval
Before jumping to the implementation, it’s worth being precise about why traditional search fails here. Keyword search (BM25, Elasticsearch, plain SQL LIKE queries) matches tokens. If your user asks “what’s the refund window for annual subscriptions?” and your docs say “yearly plan cancellation period,” a keyword index returns nothing useful. The semantics match, the tokens don’t.
Embeddings solve this by converting text into dense numerical vectors where semantic similarity maps to geometric proximity. Sentences that mean the same thing end up near each other in vector space, regardless of vocabulary. This is what makes retrieval-augmented generation (RAG) actually work in production — not just the generation step, but the retrieval step that feeds it.
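To make "geometric proximity" concrete, here is a minimal sketch of cosine similarity, the same distance metric the vector database will use later in this walkthrough. The toy 3-dimensional vectors are invented for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three phrases (values invented for illustration)
refund_window = [0.9, 0.1, 0.2]
cancellation_period = [0.85, 0.15, 0.25]   # different words, same meaning
pizza_recipe = [0.1, 0.9, 0.1]             # unrelated topic

print(cosine_similarity(refund_window, cancellation_period))  # high, ~0.99
print(cosine_similarity(refund_window, pizza_recipe))         # much lower, ~0.24
```

This is why "refund window" and "cancellation period" can match even though they share no tokens: the embedding model places them near each other, and nearest-neighbor search finds them.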
The architecture we’re building has three phases:
- Indexing: chunk your documents, embed each chunk, store vectors in a vector database
- Retrieval: embed the user query, find the nearest chunks by cosine similarity
- Augmentation: inject retrieved chunks into Claude’s context before generating
Choosing Your Embedding Model
You have three realistic options depending on your constraints.
OpenAI text-embedding-3-small
This is what I default to for most production RAG systems. It’s 1536 dimensions, costs $0.02 per million tokens, and has strong performance across English and multilingual tasks. For a typical knowledge base of 1,000 chunked documents (averaging 200 tokens each), your entire indexing run costs roughly $0.004. That’s not a typo — embeddings are cheap.
Cohere embed-english-v3.0
Slightly better on retrieval benchmarks for English-only use cases, and Cohere’s compression options (int8 quantization) can cut storage costs significantly. Worth evaluating if your retrieval precision is critical and you’re English-only.
Local models via sentence-transformers
If you’re running air-gapped, dealing with sensitive data, or just don’t want API dependency, all-MiniLM-L6-v2 is a solid 384-dimension model that runs fast on CPU. Quality is noticeably lower than the API options, but for internal tooling it often clears the bar.
My verdict: use OpenAI text-embedding-3-small unless you have a specific reason not to. The cost is negligible, the quality is excellent, and the API is reliable.
Picking a Vector Database
This is where most tutorials give you a list of six options and leave you to figure it out. Here’s the actual breakdown for agent use cases:
Pinecone
Managed, fast, zero ops overhead. The free tier gives you one index with 100k vectors — enough to prototype. Paid plans start at $70/month for production workloads. The API is clean and the Python client is well-maintained. If you’re a solo founder or small team who doesn’t want to babysit infrastructure, this is the right call.
Qdrant
Open source, self-hostable, and surprisingly production-ready. The filtering syntax is excellent — you can filter by metadata before the vector search, which matters when you’re building multi-tenant systems or topic-scoped retrieval. Run it locally with Docker for development, deploy to their cloud or your own instance for production. This is my current preference for anything where I need metadata filtering or want to avoid vendor lock-in.
pgvector
If you’re already running Postgres (and most of us are), pgvector is a compelling option for smaller indexes — up to roughly 100k vectors before performance degrades meaningfully. It adds a vector column type and approximate nearest neighbor search via HNSW indexes. No new infrastructure, no new billing, familiar ops model. The limitation is scale and query speed at large volume.
For this walkthrough I’ll use Qdrant running locally via Docker, with OpenAI embeddings. The code is easy to swap for Pinecone if that’s your preference.
Building the Index: Step-by-Step Implementation
Step 1: Environment Setup
# Start Qdrant locally
docker run -p 6333:6333 qdrant/qdrant
# Install dependencies
pip install qdrant-client openai tiktoken langchain
Step 2: Document Chunking
Chunking strategy matters more than most people realize. Too large and your chunks dilute the signal with irrelevant content. Too small and you lose context. For most knowledge bases, 512 tokens with 50-token overlap is a reasonable starting point. The overlap prevents answers from being split across chunk boundaries.
from langchain.text_splitter import RecursiveCharacterTextSplitter
def chunk_documents(documents: list[str]) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        length_function=len,  # swap for tiktoken counter if you want token-exact splits
    )
    chunks = []
    for doc in documents:
        chunks.extend(splitter.split_text(doc))
    return chunks
# Example usage
raw_docs = [
"Your product documentation, support articles, or internal knowledge goes here...",
# Add more documents
]
chunks = chunk_documents(raw_docs)
print(f"Generated {len(chunks)} chunks")
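If you want chunk_size to count tokens rather than characters, you can pass a custom length_function that wraps a tokenizer's encode call. The sketch below uses a crude whitespace tokenizer as a stand-in so it runs with no dependencies; in practice you would pass tiktoken.get_encoding("cl100k_base").encode as the encoder, which matches the tokenizer OpenAI's embedding models use:

```python
from typing import Callable

def make_token_length_fn(encode: Callable[[str], list]) -> Callable[[str], int]:
    """Wrap a tokenizer's encode() into the length_function signature the splitter expects."""
    def token_len(text: str) -> int:
        return len(encode(text))
    return token_len

# Crude stand-in encoder for illustration only. For real token counts use:
#   import tiktoken
#   encode = tiktoken.get_encoding("cl100k_base").encode
whitespace_encode = str.split

token_len = make_token_length_fn(whitespace_encode)
print(token_len("what is the refund window for annual subscriptions"))  # 8
```

Then pass `length_function=token_len` to RecursiveCharacterTextSplitter and the 512/50 settings become token counts instead of character counts.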
Step 3: Generating Embeddings
from openai import OpenAI
import time
client = OpenAI() # reads OPENAI_API_KEY from environment
def embed_chunks(chunks: list[str], batch_size: int = 100) -> list[list[float]]:
    """
    Embed chunks in batches to avoid rate limits.
    text-embedding-3-small: $0.02 per 1M tokens
    """
    embeddings = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=batch
        )
        batch_embeddings = [item.embedding for item in response.data]
        embeddings.extend(batch_embeddings)
        # Basic rate limit buffer — remove if you're on a high-tier key
        if i + batch_size < len(chunks):
            time.sleep(0.1)
    return embeddings
embeddings = embed_chunks(chunks)
print(f"Embedding dimension: {len(embeddings[0])}") # Should be 1536
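The fixed 0.1s sleep is a blunt instrument. If you hit hard rate limits during a large indexing run, a retry wrapper with exponential backoff is more robust. This is a generic sketch: the delay values are arbitrary starting points, and catching bare Exception is for brevity. In production you would catch your client's specific error class (openai.RateLimitError for the OpenAI SDK) rather than everything:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(fn: Callable[[], T], max_retries: int = 5, base_delay: float = 0.5) -> T:
    """Call fn, retrying on failure with exponentially increasing delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries, surface the error
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
    raise RuntimeError("unreachable")

# Usage inside the batching loop would look like:
# response = with_backoff(lambda: client.embeddings.create(model="text-embedding-3-small", input=batch))
```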
Step 4: Storing Vectors in Qdrant
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid
qdrant = QdrantClient(host="localhost", port=6333)
COLLECTION_NAME = "agent_knowledge"
VECTOR_DIM = 1536 # matches text-embedding-3-small
# Create collection (idempotent check)
existing = [c.name for c in qdrant.get_collections().collections]
if COLLECTION_NAME not in existing:
    qdrant.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.COSINE),
    )

# Upsert points with metadata
points = [
    PointStruct(
        id=str(uuid.uuid4()),
        vector=embeddings[i],
        payload={
            "text": chunks[i],
            "chunk_index": i,
            # Add any metadata: source doc, category, date, etc.
        }
    )
    for i in range(len(chunks))
]
qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
print(f"Indexed {len(points)} chunks into Qdrant")
Wiring Retrieval into Your Claude Agent
With the index built, retrieval is the easy part. You embed the incoming query, search for the top-k nearest chunks, and inject them into Claude’s system prompt as context.
import anthropic
claude = anthropic.Anthropic() # reads ANTHROPIC_API_KEY
def retrieve_context(query: str, top_k: int = 5) -> list[str]:
    """Embed query and fetch most relevant chunks from Qdrant."""
    query_embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=[query]
    ).data[0].embedding
    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_embedding,
        limit=top_k,
        with_payload=True,
    )
    return [hit.payload["text"] for hit in results]

def ask_agent(user_query: str) -> str:
    """Full RAG pipeline: retrieve → inject → generate."""
    context_chunks = retrieve_context(user_query)
    context_block = "\n\n---\n\n".join(context_chunks)

    system_prompt = f"""You are a knowledgeable assistant. Answer the user's question
using ONLY the information in the context below. If the answer isn't in the context,
say so — do not fabricate details.

<context>
{context_block}
</context>"""

    response = claude.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_query}]
    )
    return response.content[0].text
# Test it
answer = ask_agent("What is the cancellation policy for annual subscriptions?")
print(answer)
A few things worth noting about this implementation. First, the explicit instruction to not fabricate details is non-negotiable — without it, Claude will occasionally “help” by filling gaps with plausible-sounding nonsense. Second, top_k=5 is a starting point; tune it based on your chunk size and how much context your queries typically need. Going too high dilutes the signal and burns tokens.
What Breaks in Production (And How to Handle It)
Retrieval quality degrades silently
The most dangerous failure mode is when retrieval returns plausible-but-wrong chunks and Claude generates a confident wrong answer. Build evaluation into your pipeline: maintain a small test set of queries with expected answers and monitor retrieval precision on a schedule. Tools like RAGAS can automate this.
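A retrieval check does not need a framework to start. A minimal recall-style metric (does the expected chunk appear in the top-k results?) can be computed from a hand-labeled test set. The queries, chunk IDs, and fake retriever below are invented for illustration; in the real pipeline, retrieve_ids would call your Qdrant search and return point IDs:

```python
from typing import Callable

def recall_at_k(test_set: list[dict], retrieve_ids: Callable[[str], list[str]], k: int = 5) -> float:
    """Fraction of test queries whose expected chunk id appears in the top-k retrieved ids."""
    hits = 0
    for case in test_set:
        retrieved = retrieve_ids(case["query"])[:k]
        if case["expected_chunk_id"] in retrieved:
            hits += 1
    return hits / len(test_set)

# Toy labeled test set and a fake retriever, for illustration only
test_set = [
    {"query": "refund window for annual plans", "expected_chunk_id": "billing-3"},
    {"query": "how do I reset my password", "expected_chunk_id": "auth-1"},
]
fake_retriever = lambda q: ["billing-3", "billing-1"] if "refund" in q else ["faq-9", "faq-2"]

print(recall_at_k(test_set, fake_retriever))  # 0.5 (one of two queries hit)
```

Run a metric like this on a schedule and alert when it drops; that is how you catch silent degradation before your users do.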
Chunk boundary problems
Critical information — pricing tiers, policy numbers, specific steps — often spans what becomes a chunk boundary. Add overlap (the 50-token setting above helps), and consider semantic chunking via LLM-based splitting for high-stakes documents.
Index drift
Your knowledge base changes. Set up a pipeline to re-embed and upsert updated documents. Qdrant’s upsert is idempotent with stable IDs, so use a deterministic ID scheme based on document path + chunk index rather than random UUIDs. That way re-indexing updates existing vectors rather than duplicating them.
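One way to get stable IDs is to derive a UUID deterministically from the document path and chunk index with the standard library's uuid5. Qdrant accepts UUID strings as point IDs, so the same chunk always maps to the same point and re-indexing overwrites in place. The namespace choice here is arbitrary; any fixed namespace works, as long as you never change it:

```python
import uuid

# Any fixed namespace works; changing it later would orphan existing vectors.
NAMESPACE = uuid.NAMESPACE_URL

def stable_point_id(doc_path: str, chunk_index: int) -> str:
    """Same document path + chunk index always yields the same UUID."""
    return str(uuid.uuid5(NAMESPACE, f"{doc_path}#{chunk_index}"))

print(stable_point_id("docs/billing.md", 0))
print(stable_point_id("docs/billing.md", 0))  # identical: upsert updates, never duplicates
```

Swap this in for the `str(uuid.uuid4())` call in the indexing code above and re-runs become idempotent.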
Latency
Two API calls (embed query + Claude completion) plus a vector search adds up. In practice, Qdrant local search runs in under 10ms for indexes under 500k vectors, and the embedding call adds 100–200ms. The Claude call dominates. If latency matters, cache embeddings for frequently repeated queries.
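Query-embedding caching can be as simple as an in-memory dict keyed on the query string. This sketch takes the embed call as a constructor parameter so the cache logic is independent of any particular client; the embed_fn wiring and the fake embedder are assumptions for illustration:

```python
from typing import Callable

class EmbeddingCache:
    """In-memory cache: repeated queries skip the embedding API call entirely."""

    def __init__(self, embed_fn: Callable[[str], list[float]]):
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.misses = 0  # API calls actually made

    def get(self, query: str) -> list[float]:
        if query not in self._cache:
            self.misses += 1
            self._cache[query] = self._embed_fn(query)
        return self._cache[query]

# Fake embedder for illustration; in practice wire this to the real API call, e.g.
# lambda q: client.embeddings.create(model="text-embedding-3-small", input=[q]).data[0].embedding
fake_embed = lambda q: [float(len(q))]

cache = EmbeddingCache(fake_embed)
cache.get("refund policy")
cache.get("refund policy")  # second call served from memory
print(cache.misses)  # 1
```

For multi-process deployments you would back this with Redis or similar instead of a local dict, but the shape of the logic stays the same.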
When to Use This Architecture
Use this when: you have domain-specific knowledge that Claude doesn’t have baked in — internal docs, product specs, support history, legal documents, proprietary research. Also use it when accuracy matters enough that hallucination is a real risk to your product.
Skip it when: your queries are genuinely general-purpose and Claude’s training covers them well. Adding a RAG layer to answer “what is Python?” is overhead with no benefit.
Solo founders and small teams: start with Pinecone’s free tier and OpenAI embeddings. You can be up and running in a day, and the operational overhead is zero.
Teams with existing Postgres infrastructure: pgvector is worth a look if your knowledge base stays under roughly 100k chunks. One less service to operate.
Anyone handling sensitive data: run Qdrant self-hosted and use a local embedding model like all-MiniLM-L6-v2. You lose some quality but nothing leaves your infrastructure.
The combination of semantic search, embeddings, and a vector database is not the most glamorous part of building AI agents, but it is consistently the part that determines whether the agent is actually useful or just impressive in demos. Get the retrieval layer right and Claude does the rest.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

