Sunday, April 5

By the end of this tutorial, you’ll have a working RAG pipeline that ingests PDFs, chunks and embeds them, stores vectors in ChromaDB, and connects to a Claude agent that retrieves relevant context before answering questions. Every code snippet runs — this is the exact architecture I’d use to build a Claude-powered RAG pipeline for a production knowledge base on a tight deadline.

  1. Install dependencies — Set up Python environment with PyMuPDF, ChromaDB, and Anthropic SDK
  2. Parse and chunk PDFs — Extract text from PDFs and split into retrievable chunks
  3. Generate and store embeddings — Embed chunks with OpenAI’s text-embedding-3-small and persist to ChromaDB
  4. Build the retrieval function — Query ChromaDB for top-k semantically relevant chunks
  5. Wire Claude to the retriever — Inject retrieved context into the Claude API call with a grounded system prompt
  6. Add a simple query loop — Wrap everything in a CLI you can actually use

Step 1: Install Dependencies

You need four packages: anthropic for Claude, chromadb for local vector storage, pymupdf (imported as fitz) for PDF parsing, and openai for embeddings. I’m using OpenAI’s embedding API here because text-embedding-3-small costs roughly $0.00002 per 1K tokens — embedding a 200-page PDF typically runs under $0.05. Anthropic doesn’t yet expose a dedicated embeddings endpoint, so this is the standard production choice.
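That cost estimate is easy to sanity-check. A quick back-of-envelope in Python — the ~500 tokens per page figure is my assumption, not a measurement:

```python
# Back-of-envelope check on the embedding cost claim above.
PRICE_PER_1K_TOKENS = 0.00002  # text-embedding-3-small, per the article

def embedding_cost(num_tokens: int) -> float:
    """Dollar cost of embedding num_tokens with text-embedding-3-small."""
    return num_tokens / 1000 * PRICE_PER_1K_TOKENS

# A 200-page PDF at an assumed ~500 tokens/page is ~100K tokens:
cost = embedding_cost(200 * 500)
print(f"${cost:.4f}")  # well under the $0.05 ceiling quoted above
```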

pip install anthropic chromadb pymupdf openai python-dotenv

Create a .env file:

ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-...

Step 2: Parse and Chunk PDFs

PDF parsing is where most tutorials cut corners and where most pipelines fail in production. PyMuPDF is significantly faster than PyPDF2 and handles multi-column layouts better. The chunking strategy matters more than most people realise — too small and you lose context, too large and you waste tokens in the prompt.

I use 512-word chunks with 64-word overlap (word counts are a rough but serviceable proxy for tokens here, and they’re what the code below actually splits on). That overlap prevents answers from falling through chunk boundaries. For technical documents with lots of tables or code, increase overlap to 128.

import fitz  # PyMuPDF
import os
from typing import List, Dict

def parse_pdf(pdf_path: str) -> str:
    """Extract full text from a PDF file."""
    doc = fitz.open(pdf_path)
    text = ""
    for page in doc:
        text += page.get_text()
    doc.close()
    return text

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[Dict]:
    """
    Split text into overlapping chunks.
    Returns list of dicts with 'text' and 'chunk_index'.
    """
    words = text.split()
    chunks = []
    start = 0
    
    while start < len(words):
        end = start + chunk_size
        chunk_words = words[start:end]
        chunks.append({
            "text": " ".join(chunk_words),
            "chunk_index": len(chunks)
        })
        # Move forward by chunk_size minus overlap
        start += chunk_size - overlap
    
    return chunks

# Parse a directory of PDFs
def ingest_pdfs(pdf_dir: str) -> List[Dict]:
    all_chunks = []
    for filename in os.listdir(pdf_dir):
        if filename.endswith(".pdf"):
            path = os.path.join(pdf_dir, filename)
            text = parse_pdf(path)
            chunks = chunk_text(text)
            # Tag each chunk with its source file
            for chunk in chunks:
                chunk["source"] = filename
            all_chunks.extend(chunks)
            print(f"Ingested {filename}: {len(chunks)} chunks")
    return all_chunks

Step 3: Generate and Store Embeddings

ChromaDB handles both storage and similarity search locally, which is ideal for prototyping and small-to-medium knowledge bases (under ~100K chunks). For larger deployments or multi-instance setups, you’ll want a managed vector DB — our comparison of Pinecone, Qdrant, and Weaviate for production RAG covers those tradeoffs in detail.

import chromadb
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

openai_client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_store")

def get_embedding(text: str) -> List[float]:
    """Embed a single string using text-embedding-3-small."""
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

def build_vector_store(chunks: List[Dict], collection_name: str = "knowledge_base"):
    """Embed all chunks and persist to ChromaDB."""
    collection = chroma_client.get_or_create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}  # cosine similarity for text
    )
    
    # Process in batches to avoid rate limits
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        
        # One API call per batch — the embeddings endpoint accepts a list of
        # inputs, which is faster and gentler on rate limits than per-chunk calls
        response = openai_client.embeddings.create(
            input=[c["text"] for c in batch],
            model="text-embedding-3-small"
        )
        embeddings = [d.embedding for d in response.data]
        ids = [f"chunk_{i + j}" for j in range(len(batch))]
        documents = [c["text"] for c in batch]
        metadatas = [{"source": c["source"], "chunk_index": c["chunk_index"]} for c in batch]
        
        collection.add(
            embeddings=embeddings,
            documents=documents,
            metadatas=metadatas,
            ids=ids
        )
        print(f"Stored batch {i // batch_size + 1} / {(len(chunks) + batch_size - 1) // batch_size}")
    
    print(f"Vector store built: {collection.count()} chunks indexed")
    return collection

Run ingestion once. After that, ChromaDB loads from disk — no re-embedding on restart.

if __name__ == "__main__":
    chunks = ingest_pdfs("./pdfs")
    collection = build_vector_store(chunks)

Step 4: Build the Retrieval Function

Retrieval is straightforward: embed the query, fetch the top-k most similar chunks. The only decision is how many chunks to return. I default to 5. More than 8 and you’re padding Claude’s context with noise; fewer than 3 and you risk missing a relevant passage. Tune this based on your chunk size and the complexity of expected queries.

def retrieve_context(query: str, collection, top_k: int = 5) -> List[Dict]:
    """
    Retrieve the most relevant chunks for a given query.
    Returns list of dicts with 'text', 'source', and 'distance'.
    """
    query_embedding = get_embedding(query)
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    
    retrieved = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0]
    ):
        retrieved.append({
            "text": doc,
            "source": meta["source"],
            "distance": dist  # lower = more similar in cosine space
        })
    
    return retrieved

Step 5: Wire Claude to the Retriever

This is where the pipeline comes together. The retrieved chunks become the grounding context in the system prompt. Never dump raw chunks directly into the user message — it confuses the turn structure and degrades response quality. Put context in the system prompt where it belongs.

Claude 3.5 Haiku (claude-3-5-haiku-20241022) is the right model here for most use cases: a fraction of a cent per 1K input tokens, fast, and it handles retrieval-augmented tasks well. Switch to Sonnet if your questions require multi-step reasoning across chunks. For a deeper look at how grounding affects answer quality and hallucination rates, see our guide on reducing LLM hallucinations in production.

import anthropic

anthropic_client = anthropic.Anthropic()

def build_system_prompt(context_chunks: List[Dict]) -> str:
    """Construct a grounded system prompt from retrieved chunks."""
    context_str = "\n\n---\n\n".join([
        f"[Source: {c['source']}]\n{c['text']}"
        for c in context_chunks
    ])
    
    return f"""You are a precise knowledge assistant. Answer questions based strictly on the provided context documents.

If the answer is not found in the context, say "I don't have that information in the provided documents" — do not speculate or use prior knowledge.

Always cite the source filename when referencing information.

CONTEXT DOCUMENTS:
{context_str}"""

def ask_claude(query: str, collection) -> str:
    """Full RAG pipeline: retrieve context, then query Claude."""
    # Step 1: Retrieve relevant chunks
    chunks = retrieve_context(query, collection, top_k=5)
    
    # Step 2: Build grounded system prompt
    system_prompt = build_system_prompt(chunks)
    
    # Step 3: Call Claude with context
    response = anthropic_client.messages.create(
        model="claude-3-5-haiku-20241022",  # fast and cheap for RAG tasks
        max_tokens=1024,
        system=system_prompt,
        messages=[
            {"role": "user", "content": query}
        ]
    )
    
    return response.content[0].text

The system prompt design here matters significantly. If you want to go deeper on prompt architecture for agents, our breakdown of high-performance Claude system prompts covers the structural patterns that make a real difference.

Step 6: Add a Simple Query Loop

def main():
    # Load the persisted collection (no re-embedding needed)
    collection = chroma_client.get_or_create_collection(
        name="knowledge_base",
        metadata={"hnsw:space": "cosine"}
    )
    
    if collection.count() == 0:
        print("No documents indexed. Run ingestion first.")
        return
    
    print(f"Loaded {collection.count()} chunks. Ask anything (ctrl+c to quit):\n")
    
    while True:
        try:
            query = input("You: ").strip()
            if not query:
                continue
            answer = ask_claude(query, collection)
            print(f"\nClaude: {answer}\n")
        except KeyboardInterrupt:
            print("\nExiting.")
            break

if __name__ == "__main__":
    main()

Common Errors and How to Fix Them

Error 1: ChromaDB returns wrong results for obvious queries

Usually a chunking issue. If your PDFs have headers, footers, or page numbers mixed into body text, those artifacts corrupt the chunks and pollute your embeddings. Fix: add a basic cleaning step after page.get_text().

import re

def clean_text(text: str) -> str:
    # Remove excessive whitespace and page artifacts
    text = re.sub(r'\n{3,}', '\n\n', text)
    text = re.sub(r'Page \d+ of \d+', '', text)
    return text.strip()
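To show where the cleaner hooks in, here’s a sketch that applies it per page before joining — clean_pages is a hypothetical helper of mine, not part of PyMuPDF; cleaning per page keeps footer artifacts from bleeding across page boundaries:

```python
import re
from typing import List

def clean_text(text: str) -> str:
    # Same cleaning pass as above
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"Page \d+ of \d+", "", text)
    return text.strip()

def clean_pages(pages: List[str]) -> str:
    """Clean each extracted page (the page.get_text() results), drop pages
    that clean down to nothing, and join the remainder."""
    cleaned = [clean_text(p) for p in pages]
    return "\n\n".join(p for p in cleaned if p)
```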

Error 2: OpenAI rate limit errors during ingestion

If you’re ingesting hundreds of PDFs, you’ll hit the embeddings API rate limit (particularly on tier-1 accounts: 1M TPM). Add exponential backoff or use the tenacity library. Alternatively, batch processing patterns can help you structure high-volume ingestion jobs properly.

import time

def get_embedding_with_retry(text: str, max_retries: int = 3) -> List[float]:
    for attempt in range(max_retries):
        try:
            return get_embedding(text)
        except Exception as e:
            if "rate_limit" in str(e).lower() and attempt < max_retries - 1:
                wait = 2 ** attempt  # 1s, then 2s before the final attempt
                print(f"Rate limited, waiting {wait}s...")
                time.sleep(wait)
            else:
                raise

Error 3: Claude ignores the context and answers from training data

This happens when your system prompt is too permissive or when retrieved chunks are so noisy that Claude weighs them as low-quality. Two fixes: tighten the instruction (“Do not use any knowledge outside the provided documents”) and filter retrieved chunks by distance threshold — discard anything with a cosine distance above 0.4.

def retrieve_context(query: str, collection, top_k: int = 5, max_distance: float = 0.4):
    # ... (same as before)
    # Filter out low-relevance chunks
    retrieved = [r for r in retrieved if r["distance"] <= max_distance]
    return retrieved

Architecture Decisions That Matter

A few choices that separate a throwaway prototype from something you’d actually run in production:

  • Chunk size: 512 words works for prose-heavy documents. For technical specs or legal text with dense terminology, drop to 256 with 32-word overlap.
  • Embedding model: text-embedding-3-small is the right default — 1536 dimensions, fast, cheap. text-embedding-3-large costs 5x more with marginal gains for most RAG use cases. See our guide on semantic search and embedding tuning for benchmark numbers.
  • When to move off ChromaDB: Once you’re above ~500K chunks or need multi-tenancy, switch to a managed vector DB. Local ChromaDB is single-process and won’t handle concurrent writes from multiple workers.
  • Framework question: For this scale, plain Python beats LangChain. The abstraction cost isn’t worth it until you need complex chain orchestration. Our LangChain vs LlamaIndex vs plain Python comparison walks through exactly when each makes sense.

What to Build Next

Add a reranking step between retrieval and generation. ChromaDB’s HNSW index does approximate nearest-neighbour search, which means the top-5 results aren’t always the 5 most semantically relevant — they’re just fast approximations. Drop in a cross-encoder reranker (Cohere’s Rerank API costs $1 per 1K searches, or run cross-encoder/ms-marco-MiniLM-L-6-v2 locally) after retrieval to reorder candidates before feeding them to Claude. In my testing on technical documentation, reranking reduced “I don’t have that information” false negatives by around 30% — the right chunks were already in the top-10, just not consistently in the top-5.
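A minimal sketch of the reranking step — rerank and score_fn are my names, and in practice score_fn would wrap a cross-encoder’s predict call (e.g. sentence-transformers’ CrossEncoder scoring (query, passage) pairs); it’s kept pluggable here so the sketch stays dependency-free:

```python
from typing import Callable, Dict, List

def rerank(query: str, chunks: List[Dict],
           score_fn: Callable[[str, str], float], keep: int = 5) -> List[Dict]:
    """Re-score retrieved chunks with a stronger relevance model and
    keep the best `keep`. Higher score = more relevant."""
    scored = sorted(chunks, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return scored[:keep]
```

The flow becomes: retrieve top-10 from ChromaDB, rerank, then pass the top-5 to build_system_prompt.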

Bottom Line: When to Use This Architecture

Solo founder or small team: This stack (ChromaDB + OpenAI embeddings + Claude Haiku) is production-ready for knowledge bases under 50K pages. Total cost for a typical SaaS support bot handling 10K questions/month sits around $15-25/month at current pricing. Start here.

Budget-conscious builder: You can replace OpenAI embeddings with a local model like all-MiniLM-L6-v2 via sentence-transformers (free, ~80% of the retrieval quality) to cut the embedding cost entirely. The Claude API call is where most cost accumulates anyway.

Enterprise or high-volume: Swap ChromaDB for Qdrant or Pinecone, add a reranker, and build a monitoring layer so you can track retrieval quality over time. The fundamentals of the Claude RAG pipeline stay identical — the plumbing around them scales up.

Frequently Asked Questions

How many PDFs can this pipeline handle before ChromaDB becomes a bottleneck?

ChromaDB’s local persistent mode handles roughly 500K–1M vectors comfortably on a standard machine with 16GB RAM. A typical 50-page PDF produces around 200–300 chunks, so you’re looking at capacity for 1,500–5,000 documents before you need to consider a managed vector database. The main constraint is query latency, not storage — expect sub-100ms queries up to ~200K chunks, degrading after that.

Can I use Claude’s own embeddings instead of OpenAI?

Anthropic doesn’t currently offer a dedicated embeddings API endpoint. The standard production approach is to use OpenAI’s text-embedding-3-small for embedding and Claude for generation — they’re separate steps and there’s no coupling requirement. Alternatively, you can run a local embedding model like all-MiniLM-L6-v2 for zero embedding cost.

How do I handle PDFs with tables, images, or scanned pages?

PyMuPDF handles native PDF tables reasonably well but will skip embedded images. For image-heavy or scanned PDFs, you need an OCR layer — pytesseract for open-source or AWS Textract/Google Document AI for production accuracy. Scanned PDFs where text isn’t selectable will return empty strings with PyMuPDF, which is a silent failure — always validate that extracted text length is non-trivial after parsing.
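A minimal guard for that silent failure — validate_extraction and its 200-character threshold are illustrative choices of mine, not library features:

```python
def validate_extraction(text: str, min_chars: int = 200) -> None:
    """Fail loudly when a PDF extracts to (almost) no text — the usual
    symptom of a scanned document that needs an OCR pass instead."""
    extracted = len(text.strip())
    if extracted < min_chars:
        raise ValueError(
            f"Extracted only {extracted} characters — this PDF is likely "
            "scanned; route it through OCR before chunking."
        )
```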

What’s the difference between this approach and just using Claude’s 200K context window directly?

Stuffing entire documents into the context window works for one-off queries but doesn’t scale: you pay for every token on every call (a 200K-token context on Sonnet costs ~$0.60 per query), latency increases significantly, and Claude’s performance degrades at very long contexts. RAG keeps per-query cost low by only sending the 5–10 most relevant chunks, typically 500–2000 tokens of context instead of hundreds of thousands.
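The arithmetic behind those figures, assuming Sonnet-class input pricing of $3 per million tokens (verify current rates before relying on this):

```python
SONNET_INPUT_PER_MTOK = 3.00  # assumed input price, USD per million tokens

def query_cost(context_tokens: int) -> float:
    """Input-token cost of a single query at the assumed rate."""
    return context_tokens / 1_000_000 * SONNET_INPUT_PER_MTOK

full_stuffing = query_cost(200_000)  # ~$0.60 per query
rag_context = query_cost(2_000)      # ~$0.006 per query, ~100x cheaper
```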

How do I update the knowledge base when PDFs change or new ones are added?

For new documents, run the ingestion function on just the new files and add chunks to the existing collection — ChromaDB handles incremental writes. For updated documents, you need to delete the old chunks by source filename before re-ingesting: collection.delete(where={"source": "old_file.pdf"}). Track document hashes in a simple SQLite table to detect changes and trigger re-ingestion automatically.
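A sketch of that change-detection idea using only the standard library — the table name and helper functions are mine, not part of ChromaDB:

```python
import hashlib
import sqlite3

def file_hash(path: str) -> str:
    """Content hash of a file, used to detect changed PDFs."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            h.update(block)
    return h.hexdigest()

def needs_reingest(db_path: str, pdf_path: str) -> bool:
    """True if the PDF is new or its content changed since last ingestion.
    Records the current hash whenever a change is detected."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS doc_hashes (path TEXT PRIMARY KEY, hash TEXT)"
    )
    current = file_hash(pdf_path)
    row = conn.execute(
        "SELECT hash FROM doc_hashes WHERE path = ?", (pdf_path,)
    ).fetchone()
    changed = row is None or row[0] != current
    if changed:
        conn.execute(
            "INSERT OR REPLACE INTO doc_hashes VALUES (?, ?)", (pdf_path, current)
        )
        conn.commit()
    conn.close()
    return changed
```

When needs_reingest returns True for an existing file, delete its old chunks first (collection.delete(where={"source": filename})) and re-run ingestion on just that file.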


Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

