If your AI agent is doing keyword search to find relevant context, you’re leaving most of its potential on the table. Agents that rely on exact-match retrieval fail the moment a user phrases something differently than the document author did. Semantic search embeddings solve this by converting text into dense vectors that encode meaning — so “cardiac arrest” matches “heart attack” without any manual synonym mapping. This guide walks through building a production-ready vector search system for your agent’s knowledge base, from choosing an embedding model to querying at scale, with working code throughout.
How Vector Embeddings Actually Work (The Part Most Guides Skip)
An embedding model takes a piece of text and outputs a fixed-length array of floats — typically 768 to 3072 dimensions depending on the model. Semantically similar texts produce vectors that are geometrically close in that high-dimensional space. When you search, you embed the query using the same model, then find stored vectors that are nearest to it.
The similarity metric you almost always want is cosine similarity, not Euclidean distance. Cosine similarity measures the angle between vectors rather than the straight-line distance between them, which makes it robust to differences in text length. A single sentence and a full paragraph about the same topic can still score as highly similar under cosine similarity, even though their raw vector magnitudes differ.
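To make the geometry concrete, here is a minimal pure-Python sketch (the vectors are made up, not real embeddings) showing why cosine similarity ignores magnitude while Euclidean distance does not:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Angle-based similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy stand-ins: a "sentence" vector and a "paragraph" vector pointing in the
# same direction but with different magnitudes.
sentence = [0.2, 0.4, 0.1]
paragraph = [0.6, 1.2, 0.3]  # same direction, 3x the magnitude

print(cosine_similarity(sentence, paragraph))   # ≈ 1.0: identical direction
print(euclidean_distance(sentence, paragraph))  # large, despite the same "meaning"
```

Real embedding vectors are never exact scalar multiples of each other, but the principle holds: cosine ranks by direction, not by length.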
One thing the documentation consistently undersells: documents and queries must be embedded with the same model. If you index your documents with OpenAI’s text-embedding-3-small and then query with Cohere’s embed model, you’ll get garbage results. Pick your model before you index, and treat changing it as a schema migration — you’ll need to re-embed everything.
Choosing Your Embedding Model
Here are the models worth considering in production right now:
- OpenAI text-embedding-3-small: 1536 dimensions, $0.02 per million tokens. Good all-around performance, easy integration, widely supported. My default for most projects.
- OpenAI text-embedding-3-large: 3072 dimensions, $0.13 per million tokens. Meaningful accuracy improvement on retrieval benchmarks, but 6.5x the cost. Worth it for high-stakes retrieval where recall matters more than spend.
- Cohere embed-v3: Purpose-built for retrieval with a separate query vs. document mode. Genuinely better on search tasks than OpenAI at comparable pricing. The separate query/document encoding is a real architectural advantage — use it.
- BGE-M3 (open source): Runs on your own hardware, supports 100+ languages, competitive with commercial models. If you’re processing sensitive data or have high volume that makes API costs painful, this is the path.
For embedding 1 million typical support tickets (~500 tokens each) with text-embedding-3-small, you’re looking at roughly $10. That’s a one-time cost to index; queries are cheap because you’re only embedding a short user query each time.
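The arithmetic behind that estimate, so you can plug in your own corpus size and the current per-token price (the figures below are the ones assumed above):

```python
# Back-of-envelope embedding cost for a one-time indexing run.
num_documents = 1_000_000
avg_tokens_per_doc = 500
price_per_million_tokens = 0.02  # text-embedding-3-small, USD, at time of writing

total_tokens = num_documents * avg_tokens_per_doc
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens
print(f"~${cost_usd:.2f} to embed the corpus once")  # ~$10.00
```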
Setting Up Your Vector Database
You have three realistic options depending on where you are in the product lifecycle:
- Pinecone: Fully managed, generous free tier (100k vectors), excellent Python SDK. Easiest to get started. Gets expensive past ~1M vectors on production plans.
- pgvector: PostgreSQL extension. If you’re already on Postgres, this is often the right answer — you get vector search alongside your relational data in one query. Performance drops on very large datasets without careful indexing.
- Qdrant: Open source, can self-host or use their cloud. Better performance than pgvector at scale, more control than Pinecone. My recommendation for teams who want managed convenience but are nervous about Pinecone lock-in.
Building the Indexing Pipeline
Here’s a complete indexing pipeline using OpenAI embeddings and Qdrant. This handles chunking, embedding, and upsert with metadata — the metadata part is critical for filtering later.
from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct
import uuid

client = OpenAI()
qdrant = QdrantClient("localhost", port=6333)

COLLECTION_NAME = "knowledge_base"
EMBEDDING_MODEL = "text-embedding-3-small"
VECTOR_DIM = 1536

# Create the collection once. Note: recreate_collection drops any existing
# collection with this name, so run it only during initial setup.
qdrant.recreate_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.COSINE),
)

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Simple token-approximate chunking with overlap to preserve context at boundaries."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk = " ".join(words[start:end])
        chunks.append(chunk)
        start += chunk_size - overlap  # slide with overlap
    return chunks

def embed_texts(texts: list[str]) -> list[list[float]]:
    """Batch embed — OpenAI supports up to 2048 inputs per request."""
    response = client.embeddings.create(input=texts, model=EMBEDDING_MODEL)
    return [item.embedding for item in response.data]

def index_document(doc_id: str, text: str, metadata: dict):
    chunks = chunk_text(text)
    embeddings = embed_texts(chunks)
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=embedding,
            payload={
                "doc_id": doc_id,
                "chunk_index": i,
                "text": chunk,
                **metadata,  # e.g. {"source": "support_kb", "category": "billing"}
            },
        )
        for i, (chunk, embedding) in enumerate(zip(chunks, embeddings))
    ]
    qdrant.upsert(collection_name=COLLECTION_NAME, points=points)
    print(f"Indexed {len(points)} chunks for doc {doc_id}")

# Example usage
index_document(
    doc_id="doc_001",
    text="Your long document text goes here...",
    metadata={"source": "help_center", "category": "billing", "language": "en"},
)
The overlap parameter on chunking is one of those things that looks like a minor detail but meaningfully affects retrieval quality. Without it, a query about information that spans a chunk boundary will match neither chunk well. 64-word overlap on 512-word chunks is a reasonable starting point — tune it based on how your content is structured.
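To see the overlap at work, here is the same sliding-window logic as chunk_text above, run on a toy input with small numbers so the shared words are visible (chunk_size=10 and overlap=3 are purely illustrative, not recommended settings):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Sliding-window chunker: each chunk repeats the tail of the previous one."""
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap  # slide with overlap
    return chunks

doc = " ".join(f"w{i}" for i in range(20))  # "w0 w1 ... w19"
chunks = chunk_text(doc, chunk_size=10, overlap=3)

# Chunk 1 repeats the last 3 words of chunk 0 (w7 w8 w9), so content that
# straddles the boundary still appears whole in at least one chunk.
print(chunks[0])  # w0 ... w9
print(chunks[1])  # w7 ... w16
```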
Querying: From User Input to Ranked Results
The query side is simpler than indexing but has its own traps. Here’s a search function that handles filtering by metadata alongside semantic similarity — something you’ll need the moment your knowledge base covers more than one domain.
from qdrant_client.models import Filter, FieldCondition, MatchValue

def semantic_search(
    query: str,
    top_k: int = 5,
    filter_metadata: dict | None = None,
) -> list[dict]:
    """
    Returns ranked chunks with scores and metadata.
    Scores are cosine similarity — higher is more similar. Text embeddings
    typically score in [0, 1], though cosine's full range is [-1, 1].
    """
    # Embed the query using the same model as the index
    query_embedding = embed_texts([query])[0]

    # Build optional metadata filter
    qdrant_filter = None
    if filter_metadata:
        qdrant_filter = Filter(
            must=[
                FieldCondition(key=k, match=MatchValue(value=v))
                for k, v in filter_metadata.items()
            ]
        )

    results = qdrant.search(
        collection_name=COLLECTION_NAME,
        query_vector=query_embedding,
        limit=top_k,
        query_filter=qdrant_filter,
        with_payload=True,
    )

    return [
        {
            "text": r.payload["text"],
            "score": r.score,
            "doc_id": r.payload["doc_id"],
            "metadata": {k: v for k, v in r.payload.items() if k not in ("text", "doc_id")},
        }
        for r in results
    ]

# Example: search only within billing docs
results = semantic_search(
    query="How do I get a refund for a failed payment?",
    top_k=5,
    filter_metadata={"category": "billing"},
)
for r in results:
    print(f"Score: {r['score']:.3f} | {r['text'][:100]}...")
A score above 0.80 is typically a strong semantic match. Between 0.65 and 0.80 is relevant but not highly specific. Below 0.65 you’re often getting noise — consider returning a “no relevant context found” signal to your agent rather than feeding it low-confidence chunks.
Reranking: The Step That Most Implementations Skip
Raw vector similarity is a first-pass filter, not a final ranking. The top-5 by cosine similarity aren’t necessarily the 5 most useful chunks for your agent. Adding a reranker dramatically improves precision and is often the highest-ROI improvement you can make after basic retrieval is working.
Cohere’s Rerank API is the easiest drop-in. You retrieve more candidates (say, top 20) from the vector database, then rerank them to get the best 5. The reranker is a cross-encoder — it looks at the query and each document together, which is more accurate than embedding them independently but too slow to run against your entire index.
import cohere

co = cohere.Client("your-cohere-api-key")

def search_with_rerank(query: str, top_k: int = 5) -> list[dict]:
    # Step 1: Retrieve more candidates than you need
    candidates = semantic_search(query, top_k=20)

    # Step 2: Rerank using the cross-encoder
    rerank_response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_k,
    )

    # Step 3: Return reranked results with new scores
    return [
        {
            **candidates[r.index],
            "rerank_score": r.relevance_score,  # replaces cosine score for ranking
        }
        for r in rerank_response.results
    ]
Reranking with Cohere costs $1 per 1000 searches (at current pricing). If your agent handles 10k queries a day, that’s $10/day — factor this into your unit economics. For lower-volume use cases, it’s essentially free. For high-volume pipelines, consider caching reranked results for repeated queries or using a self-hosted reranker like cross-encoder/ms-marco-MiniLM-L-6-v2 via HuggingFace.
Common Failure Modes in Production
Things that work fine in demos and break in production:
- Chunk size mismatch: If your agent’s context window is tight, long chunks mean fewer documents fit in context. If your chunks are too short, individual chunks lack enough context to be useful in isolation. 256-512 tokens is usually right; adjust based on how your content reads when extracted.
- Stale embeddings: When a document updates, you need to delete old chunks by doc_id and re-index. Build this into your document pipeline from day one — retrofitting it is painful.
- Query-document distribution mismatch: Your users ask questions; your documents contain answers. These can live in different parts of embedding space. Cohere’s separate query/document embedding mode partially addresses this. For OpenAI models, prefixing documents with “Document: ” and queries with “Query: ” during indexing and search can help.
- No score threshold: Without a minimum score cutoff, your agent will confidently use irrelevant context. Always filter results below your threshold and handle the empty-result case explicitly.
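A minimal sketch of the prefixing idea from the distribution-mismatch point above (the exact prefix strings are a convention, not an OpenAI requirement; measure whether it helps on your data before committing):

```python
def as_document(text: str) -> str:
    """Prefix applied to every chunk at indexing time."""
    return f"Document: {text}"

def as_query(text: str) -> str:
    """Matching prefix applied to user queries at search time."""
    return f"Query: {text}"

# Route both sides through the same embedding call, e.g.:
#   embed_texts([as_document(chunk) for chunk in chunks])   # indexing
#   embed_texts([as_query(user_question)])                  # querying
print(as_query("how do I get a refund?"))
```

The important part is symmetry: whatever transformation you apply at indexing time must be mirrored at query time, or you reintroduce the same distribution mismatch you were trying to fix.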
Connecting Retrieval to Your Agent
The retrieval pipeline becomes a tool that your agent calls. With Claude or GPT-4, the pattern is straightforward: define a search_knowledge_base tool, let the model decide when to invoke it, and inject the returned chunks as context before generating the response.
def build_agent_context(query: str) -> str:
    """Retrieve relevant chunks and format them for injection into the agent prompt."""
    results = search_with_rerank(query, top_k=5)

    # Filter low-confidence results
    strong_results = [r for r in results if r.get("rerank_score", r["score"]) > 0.5]
    if not strong_results:
        return "No relevant context found in knowledge base."

    context_blocks = []
    for i, r in enumerate(strong_results, 1):
        context_blocks.append(
            f"[Source {i} | doc_id: {r['doc_id']}]\n{r['text']}"
        )
    return "\n\n---\n\n".join(context_blocks)
Pass the output of this function into your agent’s system prompt or user message. Citation tracking (the doc_id in the format string) is non-negotiable for production agents — you need to be able to audit what context influenced a given response.
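Exposing retrieval as a tool means giving the model a schema it can call. Here is a sketch in the Anthropic tool-use format (the tool name, description, and the optional category parameter are choices for this article's pipeline, not requirements):

```python
# Tool definition the agent can invoke; arguments map onto semantic_search's
# query and filter_metadata parameters from the pipeline above.
search_tool = {
    "name": "search_knowledge_base",
    "description": (
        "Search the internal knowledge base for passages relevant to the "
        "user's question. Use this before answering product questions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language search query",
            },
            "category": {
                "type": "string",
                "description": "Optional metadata filter, e.g. 'billing'",
            },
        },
        "required": ["query"],
    },
}
```

When the model invokes the tool, route its arguments into build_agent_context (or directly into search_with_rerank) and return the formatted chunks as the tool result.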
What to Build First Based on Your Situation
Solo founder / MVP: Start with Pinecone’s free tier and text-embedding-3-small. Skip reranking initially. Get something working with a score threshold of 0.75 and tune from there. You can add reranking when you have real user queries to evaluate against.
Small team with an existing Postgres stack: pgvector with text-embedding-3-small keeps your infrastructure simple and your retrieval queries composable with your existing data. Add Cohere reranking once you’re past 10k daily queries.
High-volume or data-sensitive: Self-host Qdrant, use BGE-M3 for embeddings, and run a local cross-encoder for reranking. Higher ops overhead but zero per-query API costs and full data control. The embedding quality is genuinely competitive with commercial options.
The core principle behind production semantic search embeddings is simple: retrieve broadly with vectors, then rank precisely with a cross-encoder, and always give your agent a graceful path when retrieval returns nothing useful. Get those three right and your agent will find relevant context from a million documents faster than a human could scan a single page.
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

