Sunday, April 5

If you’ve spent more than a week seriously building with LLMs, you’ve already hit the moment where the OpenAI bill lands and you start Googling “self-hosted Llama.” The open source vs proprietary LLM decision isn’t really a philosophical one — it’s an infrastructure and unit economics question, and getting it wrong will either kill your margins or waste three months of engineering time. This article is about making that call based on real numbers, not vibes.

I’m going to walk through the actual total cost of ownership (TCO), latency profiles, reliability characteristics, and the hidden operational costs that most comparisons conveniently skip. By the end, you’ll have a framework to make this decision for your specific workload — whether that’s an agent pipeline running 50k requests a day or a B2B SaaS feature that calls an LLM on every user action.

The Real Cost Picture: API Pricing vs Self-Hosting

Let’s start with numbers. At current pricing (verify before you build — these shift regularly):

  • GPT-4o: ~$2.50 per 1M input tokens, ~$10.00 per 1M output tokens
  • Claude 3.5 Sonnet: ~$3.00 per 1M input tokens, ~$15.00 per 1M output tokens
  • Claude 3 Haiku: ~$0.25 per 1M input tokens, ~$1.25 per 1M output tokens
  • GPT-4o mini: ~$0.15 per 1M input tokens, ~$0.60 per 1M output tokens
  • Llama 3.1 70B (self-hosted on AWS g5.12xlarge, quantized — the fp16 weights alone are ~140 GB, more than the instance’s 4× A10G / 96 GB): ~$16/hr for the instance, no per-token cost

That AWS instance runs roughly 2-3M tokens per hour depending on batch size and context length. So at 2M tokens/hour, you’re paying about $0.008 per 1K tokens total — which sounds great until you factor in that the instance needs to be running continuously to handle traffic spikes, you need at least two for redundancy, and someone has to maintain it.
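To see why idle capacity dominates, here is a quick sketch of the effective per-1K-token cost at different utilization levels. The numbers are illustrative, matching the figures above; your throughput will vary with batch size and context length.

```python
# Back-of-envelope: effective self-hosted cost per 1K tokens.
# Instances bill 24/7 whether or not they are serving traffic.

def self_hosted_cost_per_1k(instance_hourly: float,
                            tokens_per_hour_at_capacity: float,
                            utilization: float,
                            replicas: int = 2) -> float:
    """Cost per 1K tokens when instances run continuously but only a
    fraction of their throughput capacity is actually used."""
    effective_tokens = tokens_per_hour_at_capacity * utilization * replicas
    hourly_cost = instance_hourly * replicas
    return hourly_cost / (effective_tokens / 1_000)

# One g5.12xlarge fully saturated at 2M tokens/hr:
print(self_hosted_cost_per_1k(16.00, 2_000_000, 1.0, replicas=1))  # 0.008

# A more realistic 30% average utilization across two replicas:
print(self_hosted_cost_per_1k(16.00, 2_000_000, 0.30))  # ~0.027
```

At typical real-world utilization the headline $0.008/1K more than triples, which is the number you should actually compare against API pricing.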

Where Self-Hosting Math Actually Works

The crossover point where self-hosting beats a proprietary API depends heavily on which model you’re displacing. Against Sonnet-class pricing, the setup below breaks even somewhere in the high tens of millions of tokens per day sustained; against Haiku-class pricing, the same hardware may never break even on infrastructure cost alone. Below that kind of volume, you’re almost certainly overpaying in engineering time.

Here’s a quick back-of-envelope comparison for a product doing 5M tokens/day:


# Token cost comparison: 5M tokens/day, 30-day month
# Assuming 60% input, 40% output ratio

daily_tokens = 5_000_000
input_ratio = 0.60
output_ratio = 0.40

# Claude 3.5 Sonnet
sonnet_input_cost = (daily_tokens * input_ratio / 1_000_000) * 3.00
sonnet_output_cost = (daily_tokens * output_ratio / 1_000_000) * 15.00
sonnet_daily = sonnet_input_cost + sonnet_output_cost
sonnet_monthly = sonnet_daily * 30
print(f"Claude Sonnet monthly: ${sonnet_monthly:.2f}")  # ~$1,170/month

# Claude Haiku
haiku_input_cost = (daily_tokens * input_ratio / 1_000_000) * 0.25
haiku_output_cost = (daily_tokens * output_ratio / 1_000_000) * 1.25
haiku_daily = haiku_input_cost + haiku_output_cost
haiku_monthly = haiku_daily * 30
print(f"Claude Haiku monthly: ${haiku_monthly:.2f}")  # ~$97.50/month

# Self-hosted Llama 70B (2x g5.12xlarge for redundancy)
instance_cost_hourly = 16.00 * 2  # two instances
self_hosted_monthly = instance_cost_hourly * 24 * 30
print(f"Self-hosted monthly (infra only): ${self_hosted_monthly:.2f}")  # ~$23k/month

That output tells an interesting story, just not the one self-hosting advocates expect. At 5M tokens/day, even Sonnet comes in around $1.2k/month, a fraction of the ~$23k infrastructure bill, and Haiku is two orders of magnitude cheaper still. At this volume the GPUs sit mostly idle: you’re paying for round-the-clock capacity while using a sliver of it. Self-hosting only starts to compete once your sustained volume approaches what those instances can actually serve, and that’s before counting engineering overhead.
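Under the same illustrative assumptions (60/40 input/output split, two g5.12xlarge instances), you can solve for the break-even volume directly: the daily token count at which the API bill matches the fixed infrastructure bill.

```python
# Break-even volume: daily tokens where API spend equals self-hosted infra spend.
# Same illustrative assumptions as the comparison above.

infra_daily = 16.00 * 2 * 24  # $768/day for two instances

def blended_rate_per_million(input_price: float, output_price: float,
                             input_ratio: float = 0.60,
                             output_ratio: float = 0.40) -> float:
    """Blended $/1M tokens given per-direction pricing and a traffic mix."""
    return input_price * input_ratio + output_price * output_ratio

sonnet_rate = blended_rate_per_million(3.00, 15.00)  # $7.80 per 1M tokens
haiku_rate = blended_rate_per_million(0.25, 1.25)    # $0.65 per 1M tokens

print(f"Break-even vs Sonnet: {infra_daily / sonnet_rate:.0f}M tokens/day")  # 98M
print(f"Break-even vs Haiku: {infra_daily / haiku_rate:.0f}M tokens/day")    # 1182M
```

Two instances at 2-3M tokens/hour top out around 100-144M tokens/day, so against Haiku-class pricing this particular hardware never reaches break-even. The self-hosting case rests on displacing frontier-priced tokens at serious volume.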

Latency: Where Proprietary APIs Have a Real Edge

The managed API providers have invested heavily in inference infrastructure. Anthropic’s P50 latency on Haiku for a 500-token completion is typically under 800ms, and time-to-first-token on streaming is often under 300ms. These numbers are achievable because they’re running purpose-built inference clusters at scale.

Self-hosted Llama on a single g5.12xlarge? You’re looking at 1.5-3 seconds for comparable completions, with significantly worse tail latency. The P95 and P99 numbers are where self-hosting really shows its pain — a GPU hiccup or memory fragmentation issue and you’re suddenly looking at 10-15 second completions that blow up your user-facing SLA.
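If you’re measuring this yourself, look at percentiles rather than averages. A minimal sketch using only the standard library; the latency samples are simulated to illustrate how a small fraction of GPU hiccups dominates the tail:

```python
# Tail latency is what breaks SLAs: look at P95/P99, not the mean.
import random
import statistics

random.seed(42)
# Simulated completion latencies (seconds): mostly ~2s, with a 2% tail of
# 8-15s "GPU hiccup" outliers.
latencies = ([random.gauss(2.0, 0.4) for _ in range(980)]
             + [random.uniform(8.0, 15.0) for _ in range(20)])

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
pcts = statistics.quantiles(latencies, n=100)
p50, p95, p99 = pcts[49], pcts[94], pcts[98]
print(f"mean={statistics.mean(latencies):.2f}s  "
      f"P50={p50:.2f}s  P95={p95:.2f}s  P99={p99:.2f}s")
```

The mean barely moves, but the P99 lands squarely in the outlier region, which is exactly the failure mode that blows up a user-facing SLA.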

When Latency Actually Matters

For async background jobs — document processing, batch summarization, nightly report generation — latency barely matters. A 3-second completion on a job that runs in the background is completely fine. Self-hosting makes more sense here.

For real-time user-facing features — chat interfaces, inline suggestions, agent steps the user is waiting on — latency is product quality. I’d default to managed APIs here unless you have serious infrastructure investment and a dedicated ML platform team.

Reliability and the Operational Tax

This is the cost that never appears in the benchmarks. When you self-host, you own:

  • Model loading and cold-start management
  • GPU memory management and OOM crashes
  • Load balancing across replicas
  • Monitoring inference health (not just server health)
  • Handling model version upgrades without downtime
  • Dependency hell between CUDA versions, torch, and the serving framework

The managed API providers give you 99.9%+ uptime SLAs (check the actual SLA, not the marketing page). Anthropic and OpenAI both have status pages that are honestly more reliable than most companies’ self-hosted infrastructure. Unless you have a platform engineering team that has done GPU cluster operations before, you will spend more time fighting your inference stack than building your product.
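As one concrete example of monitoring inference health rather than server health, a probe can exercise the model itself and fail on slow or degenerate output, not just on hard errors. This is a sketch, not a production health check; the request function is injected so it works with any OpenAI-compatible client, and the commented wiring assumes a hypothetical endpoint.

```python
# Liveness probe for *inference* health: time a tiny completion and
# sanity-check the output, instead of only pinging the process.
import time

def inference_healthy(send_probe, max_latency_s: float = 5.0) -> bool:
    """send_probe() should issue a tiny completion and return its text,
    raising on any transport or server error."""
    start = time.monotonic()
    try:
        text = send_probe()
    except Exception:
        return False  # connection refused, timeout, 5xx, OOM crash, etc.
    latency = time.monotonic() - start
    # Fail on slow or empty output, not just on exceptions.
    return latency <= max_latency_s and bool(text and text.strip())

# Hypothetical wiring against an OpenAI-compatible endpoint:
# client = OpenAI(base_url="http://your-gpu-instance:8000/v1", api_key="unused")
# probe = lambda: client.chat.completions.create(
#     model="meta-llama/Meta-Llama-3.1-8B-Instruct",
#     messages=[{"role": "user", "content": "Reply with the word OK."}],
#     max_tokens=5, timeout=5,
# ).choices[0].message.content
# healthy = inference_healthy(probe)
```

A GPU that is up but returning empty strings after a memory fault will pass a TCP health check and fail this one, which is the distinction that matters.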

A Minimal Self-Hosted Setup Worth Knowing

If you do go the self-hosting route, vLLM is the right serving framework. It handles PagedAttention, continuous batching, and multi-GPU correctly. Here’s a minimal production-ish setup:


# Install vLLM (pin the version; updates break things, and Llama 3.1 support landed in 0.5.3)
pip install vllm==0.5.4

# Serve Llama 3.1 8B (fits on a single A10G)
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000

# Call it with the OpenAI-compatible client
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpu-instance:8000/v1",
    api_key="not-needed-but-required-by-client"
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize this contract: ..."}],
    max_tokens=512,
    temperature=0.1
)

print(response.choices[0].message.content)

The OpenAI-compatible endpoint means you can swap between self-hosted and managed APIs with a one-line change to base_url. This is worth designing for from day one — don’t couple your code to a specific provider’s SDK in a way that makes switching painful.
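One way to keep that swap to a config change is to resolve the endpoint from environment variables. A minimal sketch; the variable names (LLM_BASE_URL, LLM_API_KEY) are illustrative conventions, not a standard.

```python
# Provider selection via configuration, not code. Pass the result straight
# to OpenAI(**llm_client_config()) or any OpenAI-compatible client.
import os

def llm_client_config() -> dict:
    return {
        # Unset -> the client falls back to the official OpenAI endpoint.
        "base_url": os.environ.get("LLM_BASE_URL"),
        # Self-hosted vLLM ignores the key but the client requires one.
        "api_key": os.environ.get("LLM_API_KEY", "unused-for-self-hosted"),
    }

# Demo: point at the self-hosted endpoint with a single env var.
os.environ["LLM_BASE_URL"] = "http://your-gpu-instance:8000/v1"
print(llm_client_config()["base_url"])  # http://your-gpu-instance:8000/v1
```

The same binary then flips between self-hosted and managed providers per environment, which also makes A/B cost experiments trivial.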

Quality Differences That Actually Show Up in Production

Let’s be honest about where the capability gap still exists. For most structured tasks — classification, extraction, summarization with a defined format, RAG retrieval synthesis — Llama 3.1 70B is legitimately competitive with GPT-4o. The gap has closed substantially in the last 12 months.

Where proprietary models still clearly win:

  • Complex multi-step reasoning — agent chains where errors compound over many steps
  • Ambiguous instruction following — when users write poorly-structured prompts and you need the model to recover gracefully
  • Code generation with context — large codebase awareness and complex refactoring tasks
  • Nuanced long-document analysis — anything requiring synthesis across 50k+ tokens with subtle relationships

For a customer support classifier or an extraction pipeline pulling structured data from invoices, Llama 3.1 8B will work fine. For an autonomous coding agent or a research synthesis tool, you want Claude 3.5 Sonnet or GPT-4o.
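In practice this usually means routing by task type rather than picking one model globally: the cheapest model that clears each task’s quality bar. A sketch where the routing table entries are placeholders to tune against your own evals:

```python
# Route each task class to the cheapest model that passes its quality bar.
# Model names and task categories below are illustrative placeholders.
ROUTING_TABLE = {
    "classification":     "llama-3.1-8b",       # self-hosted, cheap
    "invoice_extraction": "llama-3.1-8b",
    "support_summary":    "claude-3-haiku",     # managed, fast
    "agent_step":         "claude-3-5-sonnet",  # errors compound: pay for quality
    "code_refactor":      "gpt-4o",
}

def pick_model(task_type: str, default: str = "claude-3-haiku") -> str:
    """Fall back to a safe mid-tier default for unrecognized task types."""
    return ROUTING_TABLE.get(task_type, default)

print(pick_model("classification"))  # llama-3.1-8b
print(pick_model("unknown_task"))    # claude-3-haiku
```

A routing layer like this is also where the hybrid strategy below lives: you can move a single row from a managed model to a self-hosted one without touching the calling code.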

Data Privacy and Compliance: The Non-Negotiable Cases for Self-Hosting

There are workloads where the open source vs proprietary LLM decision is made for you by compliance requirements, not economics. If you’re processing:

  • PHI under HIPAA without a BAA (Anthropic and OpenAI both offer BAAs, but read the data retention terms)
  • Data subject to EU data residency requirements where US-based API providers don’t have the right regional infrastructure
  • Classified or highly sensitive IP where any third-party data processing is prohibited by contract

…then you’re self-hosting regardless of cost. Azure OpenAI with a private endpoint and no logging can satisfy some of these requirements while keeping frontier model quality, which is worth knowing as a middle path.

A Decision Framework Based on Your Actual Situation

Here’s how I’d make this call based on team type and workload:

Solo Founder or Small Team (<5 engineers)

Use managed APIs, full stop. The operational cost of self-hosting will consume more of your capacity than the money you save. Start with Claude Haiku or GPT-4o mini for high-volume tasks, Sonnet or GPT-4o for quality-critical tasks. Optimize with prompt engineering before reaching for infrastructure changes. You can always migrate later once you have the volume to justify it.

Series A+ with a Platform Team

Hybrid makes sense here. Use managed APIs as your default, self-host for workloads that are clearly high-volume and low-sensitivity (batch jobs, internal tooling, non-user-facing pipelines). Run a proper TCO analysis every quarter — the model pricing landscape shifts fast enough that what made sense six months ago may not today.

Enterprise with Compliance Requirements

The calculus is different. Evaluate Azure OpenAI or AWS Bedrock before raw self-hosting — you get managed infrastructure, compliance certifications, and frontier models without full DIY ops. If you need true air-gap, then vLLM on your own VPC with Llama 3.1 70B or Mistral Large is the practical path.

High-Volume Commodity Workloads (>50M tokens/day)

At this scale, self-hosting almost certainly wins on cost — but only if you staff it correctly. You need at least one engineer who has done GPU cluster ops before, proper autoscaling, and a clear runbook for the operational failure modes. The TCO break-even is real at this volume, but so is the operational complexity.

The open source vs proprietary LLM choice ultimately comes down to one question: what is your engineering team’s time actually worth relative to your token volume? For most teams at most stages, managed APIs are the right default — and the smarter play is to optimize your prompts and model selection within that constraint rather than prematurely reaching for self-hosting. Save the infrastructure investment for when you have the volume and the team to justify it.

Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.
