Sunday, April 5

Generic embedding models are trained on everything — Wikipedia, Common Crawl, GitHub, and a million other sources. That’s great for general semantic search. It’s not so great when your knowledge base is full of medical billing codes, semiconductor fabrication specs, or internal legal contracts. If your RAG pipeline’s retrieval accuracy feels stuck at “good enough but not great,” the problem is often that the embedding model doesn’t actually understand your domain. Custom embedding models fix this, and training one is far more accessible than most developers assume.

This guide walks through the actual process: fine-tuning a base embedding model on your domain data, generating synthetic training pairs when you don’t have labeled data, and validating that your model is genuinely better — not just scoring higher on a benchmark that doesn’t matter.

Why Generic Embeddings Fail at Domain-Specific Retrieval

The failure mode is subtle. Generic models like text-embedding-ada-002 or all-MiniLM-L6-v2 understand semantic similarity in a broad sense. Ask them to compare “myocardial infarction” with “heart attack” and they’ll nail it. But ask them to rank the relevance of internal procurement documents where “PO” means “Purchase Order” — not “Post Office” or anything else — and they’ll occasionally get it wrong in ways that hurt your application.

The real problem is domain vocabulary and query intent. In a generic embedding space, your domain’s terminology sits in a neighborhood populated by every other use of those terms across the internet. Fine-tuning collapses the distance between semantically related concepts in your domain and expands it between superficially similar but contextually different ones.
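To make that concrete, here's a toy numpy sketch of the geometry involved. The vectors are invented for illustration — real embeddings have hundreds of dimensions — but the cosine-similarity math is exactly what your vector database ranks by:

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity — the score most vector stores rank results by."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented 3-d stand-ins for real embeddings (illustration only)
purchase_order = np.array([0.9, 0.1, 0.2])    # "Purchase Order" in domain space
po_generic     = np.array([0.5, 0.5, 0.5])    # "PO" averaged over every internet sense
po_finetuned   = np.array([0.85, 0.15, 0.2])  # "PO" after domain fine-tuning

print(cosine_sim(purchase_order, po_generic))    # moderate — ambiguous neighborhood
print(cosine_sim(purchase_order, po_finetuned))  # near 1.0 — collapsed onto the domain sense
```

Fine-tuning is, in effect, learning the transformation that moves "PO" from the generic position to the domain-specific one.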

Concretely: a company I worked with had a legal document search product. Switching from ada-002 to a fine-tuned bge-base-en-v1.5 lifted their precision@5 from 61% to 79% on internal evaluation queries. Same documents, same retrieval pipeline, different embeddings.

Choosing a Base Model Worth Fine-Tuning

Don’t fine-tune from scratch. You’ll spend weeks and thousands of dollars to produce something worse than a model that already exists. Start from a strong base and adapt it.

Models that work well as starting points

  • BGE family (BAAI) — bge-base-en-v1.5 and bge-large-en-v1.5 are consistently strong baselines. The large version is 335M parameters; fine-tuning it takes about 2–4 hours on an A100 for a dataset of 50k pairs. These models respond well to fine-tuning in my experience.
  • E5 family (Microsoft) — e5-base-v2 is solid and slightly faster to fine-tune than BGE. Good choice if inference latency matters more than raw accuracy.
  • nomic-embed-text-v1 — Apache 2.0 license, 137M parameters, genuinely competitive with models 3x its size. My first choice for cost-sensitive deployments.
  • GTE (Alibaba) — gte-large regularly tops MTEB. Worth trying if you have budget for the compute.

I’d skip fine-tuning OpenAI’s embedding API — you can’t access the weights. Stick to open-source models you can actually control.

The Data Problem: Generating Synthetic Training Pairs

The most common objection to fine-tuning is “I don’t have labeled data.” You don’t need pre-labeled data. You need query-document pairs where the document is a relevant result for the query. You can generate these synthetically with an LLM.

The synthetic data generation pipeline

The approach is straightforward: take chunks from your knowledge base, prompt an LLM to generate realistic questions that the chunk would answer, and use those as positive pairs. Optionally generate hard negatives — questions where a similar-looking but wrong document is the distractor.

from anthropic import Anthropic
import json

client = Anthropic()

def generate_training_pairs(document_chunk: str, num_questions: int = 5) -> list[dict]:
    """
    Generate synthetic query-document pairs for embedding fine-tuning.
    Returns list of {"query": str, "positive": str} dicts.
    """
    prompt = f"""You are generating training data for a search system.

Given this document chunk, generate {num_questions} realistic search queries
that this chunk would be the best answer to. Queries should vary in phrasing
and specificity — include both short keyword-style queries and longer natural
language questions.

Document chunk:
{document_chunk}

Return a JSON array of strings, each being a query. No other text."""

    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # Haiku is fast and cheap for bulk generation
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    
    queries = json.loads(response.content[0].text)
    
    # Each query paired with the source chunk as positive document
    return [{"query": q, "positive": document_chunk} for q in queries]


# Process your knowledge base chunks
all_pairs = []
for chunk in knowledge_base_chunks:  # your chunked documents
    pairs = generate_training_pairs(chunk, num_questions=5)
    all_pairs.extend(pairs)

print(f"Generated {len(all_pairs)} training pairs")
# 1000 chunks × 5 questions = 5000 pairs, costs ~$0.15 at current Haiku pricing

At current Claude Haiku pricing (roughly $0.25/million input tokens), generating 5,000 training pairs from a 1,000-chunk knowledge base costs about $0.15–0.40 depending on average chunk length. That’s not a typo. The economics here are excellent.
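If you want to sanity-check that math for your own corpus, here's a back-of-envelope estimator. The per-token prices and the 4-chars-per-token rule of thumb are assumptions — verify current pricing before relying on the number:

```python
def estimate_generation_cost(
    num_chunks: int,
    avg_chunk_chars: int,
    price_per_mtok_input: float = 0.25,   # assumed $/M input tokens — check current pricing
    price_per_mtok_output: float = 1.25,  # assumed $/M output tokens
    output_tokens_per_chunk: int = 150,   # ~5 short queries returned as JSON
) -> float:
    """Back-of-envelope cost in dollars for synthetic pair generation (4 chars ≈ 1 token)."""
    input_tokens = num_chunks * (avg_chunk_chars / 4 + 100)  # chunk + prompt overhead
    output_tokens = num_chunks * output_tokens_per_chunk
    return (input_tokens * price_per_mtok_input + output_tokens * price_per_mtok_output) / 1_000_000

print(f"${estimate_generation_cost(1000, 1200):.2f}")  # → $0.29
```

Even if the assumed prices are off by 2x, the total stays in pocket-change territory.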

Adding hard negatives to improve discrimination

Positive pairs alone will improve recall but won’t sharpen precision. Hard negatives — documents that look relevant but aren’t — teach the model to discriminate. The simplest approach is to use BM25 or your existing embedding model to find the top-10 nearest neighbors for each query, then label the non-matching ones as negatives.

from sentence_transformers import SentenceTransformer
import numpy as np

def mine_hard_negatives(
    queries: list[str],
    positives: list[str],
    corpus: list[str],
    model_name: str = "BAAI/bge-base-en-v1.5",
    top_k: int = 10
) -> list[dict]:
    """
    Mine hard negatives using an existing embedding model.
    Returns triplets: {"query": str, "positive": str, "negative": str}
    """
    model = SentenceTransformer(model_name)
    
    # Normalize so the dot products below are cosine similarities
    corpus_embeddings = model.encode(
        corpus, batch_size=64, show_progress_bar=True, normalize_embeddings=True
    )
    query_embeddings = model.encode(queries, batch_size=64, normalize_embeddings=True)
    
    triplets = []
    
    for query, positive, q_emb in zip(queries, positives, query_embeddings):
        # Find nearest neighbors in full corpus
        scores = np.dot(corpus_embeddings, q_emb)
        top_indices = np.argsort(scores)[::-1][:top_k]
        
        for idx in top_indices:
            candidate = corpus[idx]
            # Use as hard negative only if it's not the actual positive
            if candidate != positive:
                triplets.append({
                    "query": query,
                    "positive": positive,
                    "negative": candidate  # semantically close but wrong
                })
                break  # one hard negative per query is enough to start
    
    return triplets

Fine-Tuning with sentence-transformers

The sentence-transformers library handles most of the complexity. You’re looking at under 100 lines to go from training pairs to a fine-tuned model.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
from sentence_transformers.evaluation import InformationRetrievalEvaluator
import random

# Load your base model
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# Convert training data to InputExamples
# Using MultipleNegativesRankingLoss — best performer for retrieval tasks
train_examples = [
    InputExample(texts=[pair["query"], pair["positive"]])
    for pair in all_pairs
]

# If you mined hard negatives, MultipleNegativesRankingLoss accepts triplets too —
# the explicit negative is used in addition to the in-batch negatives:
# InputExample(texts=[triplet["query"], triplet["positive"], triplet["negative"]])

train_dataloader = DataLoader(
    train_examples, 
    shuffle=True, 
    batch_size=32  # 32-64 works well; increase if VRAM allows
)

# MultipleNegativesRankingLoss treats other items in the batch as negatives
# This is why larger batch sizes matter — more negatives per step
train_loss = losses.MultipleNegativesRankingLoss(model=model)

# Set up evaluation on a held-out set (10-15% of your data)
# InformationRetrievalEvaluator expects: queries dict, corpus dict, relevant_docs dict
evaluator = InformationRetrievalEvaluator(
    queries=eval_queries,       # {query_id: query_text}
    corpus=eval_corpus,         # {doc_id: doc_text}
    relevant_docs=eval_relevant # {query_id: set of relevant doc_ids}
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    evaluator=evaluator,
    epochs=3,                    # 3-5 epochs is usually enough; watch for overfitting
    warmup_steps=100,
    evaluation_steps=500,
    output_path="./my-domain-embeddings",
    save_best_model=True,        # saves the checkpoint with best eval score
    show_progress_bar=True
)
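To see why the batch-size comment matters, here's a minimal numpy sketch of the in-batch objective MultipleNegativesRankingLoss optimizes — my own illustration, not the library's code. A batch of query/positive pairs yields a score matrix whose diagonal holds the true pairs, and every off-diagonal entry serves as a free negative:

```python
import numpy as np

def mnrl_loss(query_embs: np.ndarray, pos_embs: np.ndarray, scale: float = 20.0) -> float:
    """
    In-batch softmax cross-entropy: row i's positive is column i;
    all other columns in row i act as negatives.
    """
    # (batch, batch) similarity matrix; embeddings assumed L2-normalized
    scores = scale * (query_embs @ pos_embs.T)
    # log-softmax over each row, keep the diagonal (the true pair)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
q = rng.normal(size=(32, 16)); q /= np.linalg.norm(q, axis=1, keepdims=True)
p = q + 0.1 * rng.normal(size=(32, 16)); p /= np.linalg.norm(p, axis=1, keepdims=True)

print(mnrl_loss(q, p))  # low loss: each query is closest to its own positive
```

Doubling batch_size from 32 to 64 gives each query 63 in-batch negatives per step instead of 31, which is why larger batches tend to help this loss.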

On an A100 (roughly $1–2/hour on Lambda Labs or RunPod), training 5,000 pairs for 3 epochs takes about 15–20 minutes. Total cost: under $1. On a T4 it’ll take longer — maybe 45 minutes — but still costs well under $5 for a typical dataset size.
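Those timings are easy to sanity-check with arithmetic. The seconds-per-step figure below is an assumption (it varies with GPU, model size, and how often evaluation runs) — measure your own:

```python
def estimate_training_minutes(
    num_pairs: int,
    epochs: int = 3,
    batch_size: int = 32,
    sec_per_step: float = 2.0,  # assumed A100 figure incl. periodic evaluation
) -> float:
    """Back-of-envelope wall-clock estimate for fine-tuning."""
    steps = (num_pairs // batch_size) * epochs
    return steps * sec_per_step / 60

print(round(estimate_training_minutes(5000), 1))  # → 15.6
```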

Validating That You Actually Improved Things

This is where most tutorials skip the hard part. Don’t trust training loss or generic MTEB scores to validate a domain-specific model. Build evaluation metrics that reflect your actual use case.

Build a domain evaluation set before you train

Set aside 100–200 query-document pairs that you don’t use in training. These should ideally come from real user queries if you have them, or from a second round of LLM generation on a separate subset of documents. Use these to compute:

  • Recall@K — what fraction of relevant documents appear in the top-K results? For RAG, Recall@5 and Recall@10 are the numbers that matter.
  • MRR (Mean Reciprocal Rank) — how highly ranked is the first relevant result? Directly maps to RAG answer quality.
  • Precision@5 — of the 5 chunks fed to the LLM, how many are actually relevant? High precision means less noise in your context window.

def evaluate_retrieval(model, eval_pairs: list[dict], corpus: list[str], k: int = 5) -> dict:
    """
    Compute Recall@K, MRR, and Precision@K for a retrieval model.
    eval_pairs: list of {"query": str, "positive": str}
    """
    # Normalize so np.dot below ranks by cosine similarity
    query_embeddings = model.encode([p["query"] for p in eval_pairs], normalize_embeddings=True)
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True)
    
    recall_hits = 0
    mrr_sum = 0.0
    precision_sum = 0.0
    
    for i, pair in enumerate(eval_pairs):
        scores = np.dot(corpus_embeddings, query_embeddings[i])
        top_k_indices = np.argsort(scores)[::-1][:k]
        top_k_docs = [corpus[idx] for idx in top_k_indices]
        
        # Recall: did the positive appear in top-K?
        if pair["positive"] in top_k_docs:
            recall_hits += 1
            rank = top_k_docs.index(pair["positive"]) + 1
            mrr_sum += 1.0 / rank
        
        # Precision: fraction of top-K that are relevant
        # (simplified: 1 if positive in top-k, else 0, divided by k)
        precision_sum += (1 / k) if pair["positive"] in top_k_docs else 0
    
    n = len(eval_pairs)
    return {
        f"recall@{k}": recall_hits / n,
        "mrr": mrr_sum / n,
        f"precision@{k}": precision_sum / n
    }

# Compare before and after
base_model = SentenceTransformer("BAAI/bge-base-en-v1.5")
finetuned_model = SentenceTransformer("./my-domain-embeddings")

print("Base model:", evaluate_retrieval(base_model, eval_pairs, corpus))
print("Fine-tuned:", evaluate_retrieval(finetuned_model, eval_pairs, corpus))

If you’re not seeing at least a 5–10% improvement in Recall@5 on your domain eval set, either your training data doesn’t capture real query patterns or you don’t have enough pairs. Add more synthetic data from different chunks before assuming the approach doesn’t work.

Deployment and Integration

Once fine-tuned, your model is a standard sentence-transformers model. Drop it into any pipeline that accepts SentenceTransformer objects or ONNX models.

For production, export to ONNX for 2-3x faster inference with no accuracy loss:

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

# Export the fine-tuned model to ONNX — optimum handles the conversion
ort_model = ORTModelForFeatureExtraction.from_pretrained(
    "./my-domain-embeddings",
    export=True
)
ort_model.save_pretrained("./model-onnx-export")

# The tokenizer has to ship alongside the ONNX weights
AutoTokenizer.from_pretrained("./my-domain-embeddings").save_pretrained("./model-onnx-export")

# Note: ORTModelForFeatureExtraction returns token-level outputs — reapply the
# same pooling and normalization the sentence-transformers wrapper performed

For vector database integration — Weaviate, Qdrant, Chroma, Pinecone with custom vectors — you’re just computing embeddings with your fine-tuned model and inserting the vectors. Nothing in your RAG pipeline needs to change except swapping the embedding function.
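As a sketch of what "swapping the embedding function" means in practice, here's the seam made explicit. The bag-of-words encoder is a toy stand-in so the example runs anywhere — in production you'd pass SentenceTransformer("./my-domain-embeddings").encode instead:

```python
import numpy as np
from typing import Callable

def build_retriever(encode: Callable[[list[str]], np.ndarray], corpus: list[str]):
    """Wrap any encode function into a top-k retriever — the encode
    callable is the only thing that changes when you swap models."""
    corpus_embs = encode(corpus)
    corpus_embs = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)

    def retrieve(query: str, k: int = 5) -> list[str]:
        q = encode([query])[0]
        q = q / np.linalg.norm(q)
        top = np.argsort(corpus_embs @ q)[::-1][:k]
        return [corpus[i] for i in top]

    return retrieve

# Toy stand-in encoder: bag-of-words over a growing vocabulary (illustration only)
VOCAB: dict[str, int] = {}

def dummy_encode(texts: list[str]) -> np.ndarray:
    out = np.zeros((len(texts), 64))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            VOCAB.setdefault(tok, len(VOCAB))
            out[i, VOCAB[tok]] += 1.0
    return out

docs = ["purchase order form", "post office hours", "invoice template"]
retrieve = build_retriever(dummy_encode, docs)
print(retrieve("purchase order", k=1))  # → ['purchase order form']
```

The vector database doesn't care where the vectors came from — the client inserts whatever embeddings you hand it.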

When Custom Embedding Models Are (and Aren’t) Worth It

This approach earns its complexity when:

  • Your domain has specialized vocabulary not well-represented in general training data (legal, medical, engineering, finance)
  • Your retrieval precision matters more than recall — i.e., a wrong context chunk causes real problems
  • You’re running at scale where a 0.5ms/query latency win from a smaller fine-tuned model adds up
  • You have at least 500–1,000 documents to generate training data from

Skip it when your knowledge base is small (<100 documents), your domain is reasonably general, or you’re still iterating on your retrieval architecture. Validate that retrieval is actually your bottleneck before investing in custom embedding models — sometimes the problem is chunking strategy or prompt design.

Bottom line by reader type: If you’re a solo founder with a focused vertical product and >500 knowledge base documents, the $1–5 of training cost and 2–3 hours of setup time are among the highest-ROI improvements you can make to RAG accuracy. If you’re a team with heterogeneous knowledge bases across different domains, train one model per domain rather than one combined model — the specialization is the point.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
