Saturday, March 21

Once your agent hits production and starts making real decisions - routing tickets, generating reports, calling external APIs - you will immediately wish you’d instrumented it properly from day one. Logs vanish, token costs spike unexpectedly, and tracing a bad output back to the exact prompt that caused it becomes a multi-hour archaeology project. The right LLM observability platform turns those investigations from guesswork into a five-minute task. The wrong one just adds another dashboard nobody checks.

I’ve run all three of these tools - Helicone, LangSmith, and Langfuse - on real agent workloads ranging from a single-model summarisation pipeline to a multi-step ReAct agent making 20+ LLM calls per user session. What follows is an honest breakdown of where each one shines, where each one frustrates, and which type of builder should pick which tool.

What You Actually Need From LLM Observability

Before comparing platforms, it’s worth being specific about what “observability” means for agents versus simple API calls. A single chat.completions.create call is easy: log the input, log the output, track the tokens. An agent is harder because you need:

  • Trace hierarchy - which tool calls, retrieval steps, and LLM calls belong to a single user-facing action
  • Cost attribution - not just total spend, but cost per trace, per user, per feature
  • Prompt versioning - so you can tie a bad output to the exact prompt version that produced it
  • Latency breakdown - where is the time actually going: retrieval, inference, or your own code?
  • Evaluation hooks - the ability to score outputs, flag regressions, and run evals against historical data

All three platforms claim to cover all of this. The differences are in depth, ergonomics, and cost at scale.
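Concretely, a trace is just a tree of timed, costed spans. Here’s a minimal sketch in plain Python - all names (`Span`, `total_cost`) are hypothetical illustrations, not any vendor’s schema - showing how per-trace cost attribution falls out of the hierarchy:

```python
from dataclasses import dataclass, field

# Hypothetical minimal shape of what an observability platform stores per trace.
@dataclass
class Span:
    name: str                       # e.g. "llm_call", "retrieval", "tool:search"
    latency_ms: float
    cost_usd: float = 0.0           # non-LLM spans usually cost nothing
    children: list["Span"] = field(default_factory=list)

def total_cost(span: Span) -> float:
    """Cost attribution: roll child costs up to the user-facing trace."""
    return span.cost_usd + sum(total_cost(c) for c in span.children)

trace = Span("handle_user_request", 2400.0, children=[
    Span("retrieval", 300.0),
    Span("llm_call:plan", 800.0, cost_usd=0.002),
    Span("tool:order_lookup", 150.0),
    Span("llm_call:answer", 1100.0, cost_usd=0.004),
])

print(f"trace cost: ${total_cost(trace):.3f}")  # → trace cost: $0.006
```

Everything the bullets above describe is a query over this tree: cost per user is a group-by on trace metadata, latency breakdown is comparing sibling spans, and prompt versioning is one more field on the LLM spans.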

Helicone: Lowest-Friction Logging for OpenAI-Compatible APIs

Helicone works by acting as a proxy between your code and the LLM provider. You change one URL and add two headers - that’s the entire integration for the basic case. If you’re calling OpenAI, Anthropic, or any OpenAI-compatible endpoint, you’re logging in under two minutes.

import openai

client = openai.OpenAI(
    api_key="sk-...",
    base_url="https://oai.helicone.ai/v1",  # Helicone proxy
    default_headers={
        "Helicone-Auth": "Bearer sk-helicone-...",
        "Helicone-Property-UserId": "user_123",  # custom metadata
    }
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarise this document..."}]
)

Every call is automatically captured with latency, token counts, cost estimates, and any custom properties you pass in headers. The dashboard is genuinely clean: filtering by user, model, or date range is fast, and the cost breakdowns are accurate.

Where Helicone Struggles

The proxy architecture is also its main weakness. For multi-step agents, you get a flat list of LLM calls with no native trace hierarchy. You can group calls by session ID using custom headers, but you’re building the correlation logic yourself. If your agent makes 15 LLM calls across three different tools, Helicone shows you 15 separate rows; you have to mentally stitch them together or write your own grouping.

There’s a Sessions feature now that partially addresses this, but it’s not as mature as LangSmith’s or Langfuse’s trace trees. Evaluations are limited. Prompt management exists but feels bolted on.
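If you do stay on Helicone for an agent, the grouping boils down to sending consistent session headers with every call. A sketch, assuming the header names from Helicone’s Sessions feature at the time of writing (verify them against current docs before relying on them):

```python
import uuid

def session_headers(session_id: str, path: str, session_name: str = "agent-run") -> dict:
    """Build per-request headers that let Helicone group related calls.

    Header names follow Helicone's Sessions feature at the time of writing;
    check the current docs, as they may have changed.
    """
    return {
        "Helicone-Session-Id": session_id,
        "Helicone-Session-Path": path,          # e.g. "/agent/retrieval"
        "Helicone-Session-Name": session_name,
    }

session = str(uuid.uuid4())
# One header set per agent step; Helicone stitches them into one session view.
step1 = session_headers(session, "/agent/plan")
step2 = session_headers(session, "/agent/tool/order_lookup")
# Pass per call: client.chat.completions.create(..., extra_headers=step1)
```

The `extra_headers` argument on the OpenAI Python client is how you attach these without rebuilding the client per request.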

Helicone Pricing (as of mid-2025)

Free tier: 10,000 requests/month. Pro starts at $20/month for 1M requests. At that price, it’s cheap enough that cost is rarely the deciding factor. Enterprise pricing is custom. The proxy approach does add ~10-30ms latency, which matters if you’re building latency-sensitive applications but is irrelevant for most batch or async workloads.

Best for: Solo founders or small teams who want cost tracking and basic logging without any SDK overhead. If you’re calling one or two models and don’t need deep trace trees, Helicone is the fastest path to visibility.

LangSmith: The Deepest Integration for LangChain Shops

LangSmith is LangChain’s native observability layer, and if you’re already using LangChain or LangGraph, the integration is near-zero effort: you set two environment variables and every chain, agent step, tool call, and retrieval is automatically traced.

import os
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls__..."
os.environ["LANGCHAIN_PROJECT"] = "my-production-agent"

# Everything below is automatically traced
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_react_agent

llm = ChatOpenAI(model="gpt-4o")
# ... your agent setup ...
agent_executor.invoke({"input": "What's the status of order #4821?"})

That’s it. Open LangSmith and you’ll see a full trace tree: the top-level agent invocation, each tool call with its input and output, each LLM call with the exact prompt, the token counts, and the latency at every level. The trace UI is genuinely excellent; you can click into any node, see the raw messages, and immediately understand what happened.

LangSmith’s Evaluation Workflow Is the Real Differentiator

Where LangSmith pulls ahead is the evaluation pipeline. You can take a set of production traces, create a dataset from them, run a new prompt version against that dataset, and get a comparison view, all without leaving the platform. For iterating on agent prompts, this workflow is hard to beat.

from langsmith import Client

client = Client()

# Create a dataset from production traces
dataset = client.create_dataset("order-status-evals")
client.create_examples(
    inputs=[{"input": "What's the status of order #4821?"}],
    outputs=[{"output": "Order #4821 is in transit, expected Friday."}],
    dataset_id=dataset.id
)

# Run evaluation against a new prompt version
from langchain_openai import ChatOpenAI
from langsmith.evaluation import evaluate, LangChainStringEvaluator

def my_app(inputs):
    # your updated agent here
    return {"output": agent_executor.invoke(inputs)["output"]}

results = evaluate(
    my_app,
    data=dataset.name,
    evaluators=[LangChainStringEvaluator("cot_qa")],  # built-in LangChain evaluator
)

LangSmith Limitations

The catch is framework lock-in. If you’re not using LangChain, instrumentation requires manual wrapping with the SDK, which is noticeably more verbose than Helicone’s proxy approach. The free tier is 5,000 traces/month, which disappears fast if you’re running a multi-step agent (a 10-step agent run counts as 10+ traces). Paid plans start at $39/month for higher limits, scaling by usage; check the current pricing page, as this has changed several times.
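It’s worth doing the free-tier arithmetic before committing. A back-of-envelope sketch, using the article’s counting (one trace per step) and illustrative traffic numbers, not measurements:

```python
# How fast a multi-step agent burns through a 5k-trace/month free tier.
# All numbers below are illustrative assumptions.
steps_per_run = 10      # LLM/tool steps per agent invocation
runs_per_day = 100      # user-facing invocations per day
free_tier = 5_000       # traces/month

traces_per_day = steps_per_run * runs_per_day   # 1,000
days_until_cap = free_tier / traces_per_day     # 5.0

print(f"{traces_per_day} traces/day -> free tier exhausted in {days_until_cap:.0f} days")
```

At even modest traffic, a multi-step agent exhausts the tier in under a week, which is why trace volume, not sticker price, is the number to model.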

I’ve also found the LangSmith UI can get slow when loading traces with hundreds of nested steps. Not a dealbreaker, but noticeable.

Best for: Teams already on LangChain/LangGraph who need serious eval capabilities and don’t mind the ecosystem coupling. The zero-config tracing and evaluation workflow justify the subscription for any team doing regular prompt iteration.

Langfuse: The Open-Source Option With Enterprise Depth

Langfuse is the most flexible of the three because it’s open-source: you can self-host it on your own infrastructure, which immediately solves data privacy concerns that enterprise customers have with the other two. The cloud version is also solid if you don’t want to manage infra.

The data model is framework-agnostic. You instrument using Traces, Spans, and Generations: a clear hierarchy that maps cleanly to how agents actually work regardless of whether you’re using LangChain, LlamaIndex, raw API calls, or a custom framework.

from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="https://cloud.langfuse.com"
)

@observe()  # automatically creates a trace for each function call
def run_agent(user_input: str):
    # Nested @observe decorators create child spans
    context = retrieve_context(user_input)
    response = generate_response(user_input, context)  # another @observe-decorated step
    return response

@observe(name="retrieval")
def retrieve_context(query: str):
    # This becomes a span inside the parent trace
    langfuse_context.update_current_observation(
        metadata={"retrieved_chunks": 5, "index": "product-docs"}
    )
    return vector_store.similarity_search(query)  # vector_store configured elsewhere

The @observe decorator approach feels clean and stays close to normal Python; you’re not rewriting your agent around an SDK’s abstractions.

Langfuse’s Prompt Management and Evaluation

Langfuse has arguably the most mature prompt management of the three. You can version prompts, deploy new versions without code deploys, and track which prompt version was active when a specific trace was generated. For teams where prompt iteration is frequent, this is valuable.

Evaluations can be triggered automatically (using model-based scoring) or manually (human review queue). The annotation workflow for building eval datasets from production traffic is solid. It’s not quite as tightly integrated as LangSmith’s eval loop, but it’s more than capable for most teams.
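A model-based score is ultimately just a name, a value, and an optional comment attached to a trace. The sketch below uses a deliberately simple heuristic as a stand-in for an LLM judge; the commented-out `langfuse.score(...)` call reflects the v2 Python SDK shape at the time of writing, so verify it against current docs:

```python
# Placeholder evaluator: in practice you'd call an LLM judge here.
# The keyword heuristic is purely illustrative.
def score_response(question: str, answer: str) -> dict:
    grounded = any(tok in answer.lower() for tok in ("order", "transit", "delivered"))
    return {
        "name": "groundedness",
        "value": 1.0 if grounded else 0.0,
        "comment": "placeholder heuristic; swap in an LLM judge",
    }

score = score_response(
    "What's the status of order #4821?",
    "Order #4821 is in transit, expected Friday.",
)
# Attach it to a trace (v2 SDK shape at time of writing; verify current docs):
# langfuse.score(trace_id=trace_id, **score)
```

Running this automatically on a sample of production traces is how the regression-flagging described above works in practice: scores accumulate per trace, and you alert on the rolling average.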

Langfuse Limitations

Self-hosting means you own the ops burden: Postgres, ClickHouse for analytics, and the application itself. The Docker Compose setup is straightforward, but at scale it becomes real infrastructure work. The cloud version removes this but reintroduces the data-leaving-your-infra concern.

The UI is functional but feels less polished than LangSmith’s in some areas: the trace explorer is good, but dashboard customisation is limited. The Python SDK is reliable; the JS SDK has historically lagged behind in features.

Langfuse Pricing

Self-hosted is free (MIT licensed). Cloud free tier: 50,000 observations/month, significantly more generous than the competition. Pro is $59/month for 1M observations. For data-sensitive workloads, self-hosted Langfuse is the only reasonable choice among these three.

Best for: Teams with data privacy requirements, companies that want self-hosting, and builders using frameworks other than LangChain. Also the best default for anyone running open-source models through Ollama or vLLM.

Head-to-Head: The Metrics That Actually Matter

Capability              Helicone       LangSmith                Langfuse
Setup time              ~2 minutes     ~5 minutes (LangChain)   ~10 minutes
Agent trace hierarchy   Limited        Excellent                Excellent
Cost tracking           Excellent      Good                     Good
Evaluations             Basic          Excellent                Very good
Self-hosting            No             No                       Yes (MIT)
Framework agnostic      Yes (proxy)    Partial                  Yes
Free tier               10k req/mo     5k traces/mo             50k obs/mo

The Verdict: Which LLM Observability Platform Fits Your Situation

Choose Helicone if you’re a solo founder or early-stage team who wants cost visibility and basic logging with zero integration effort. It’s the fastest path from zero to “I can see what my app is spending.” Don’t pick it as your long-term solution if you’re building complex multi-step agents; you’ll outgrow the flat log view quickly.

Choose LangSmith if you’re already using LangChain or LangGraph and you run regular prompt experiments. The automatic tracing and evaluation workflow are genuinely best-in-class for that stack. Be honest with yourself about the trace volume: a busy multi-step agent can burn through the free tier in days, and the paid tier is a real recurring cost.

Choose Langfuse if you have data privacy requirements, you’re building on a non-LangChain stack, or you want the most control over your observability infrastructure. The self-hosted option is the only true answer for regulated industries or teams that can’t route production data through third-party proxies. The generous cloud free tier also makes it the best starting point for teams who want real agent tracing without paying on day one.

For most teams building production agents today, I’d default to Langfuse: the combination of framework agnosticism, mature trace hierarchy, decent eval tooling, and the self-hosting option covers the most ground. LangSmith is the exception if you’re deep in the LangChain ecosystem and evaluation speed matters. Langfuse simply carries less vendor lock-in risk and a more realistic pricing curve as you scale.

Editorial note: API pricing, model capabilities, and tool features change frequently; always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links, and we may earn a commission if you sign up, at no extra cost to you.
