01 — DEFINITION
What is LLM observability?

LLM observability is the ability to understand what your large language model is doing, why it's doing it, and whether it's doing it well — while it's running in production.

The formal definition: it's the process of instrumenting LLM applications to collect structured data (metrics, traces, logs) about inputs, outputs, latency, token usage, and downstream behavior — then making that data queryable and actionable.

But here's the part most definitions skip: LLMs are non-deterministic. The same prompt can produce different outputs. That single fact breaks every assumption traditional application monitoring was built on.

💡
NOTE

"Observability" comes from control theory — a system is observable if you can infer its internal state from its outputs. For LLMs, the "internal state" is opaque by design. Observability is how you compensate for that opacity.

A complete LLM observability setup lets you answer questions like:

  • Why did this prompt return garbage output on Tuesday at 3pm?
  • How many tokens did we burn last week, and on which features?
  • Is our retrieval step actually finding relevant context, or just noise?
  • Which user flows are generating the most hallucinations?
  • Did our prompt change last Wednesday improve or hurt response quality?

Without observability, you're guessing at all of the above.

02 — CONTEXT
Why traditional APM fails for LLMs

You might already have Datadog, New Relic, or Prometheus running. They're great tools. They will not help you monitor an LLM application properly. Here's why:

Traditional APM vs LLM Observability

Traditional APM

  • Deterministic outputs — same input → same output
  • Binary success/failure (HTTP 200 vs 500)
  • Performance = speed + uptime
  • No concept of "output quality"
  • Traces follow fixed execution paths
  • No per-request cost tracking needed
  • Errors are clear (stack traces, exceptions)
VS

LLM Observability

  • Non-deterministic — same prompt → different outputs
  • Output can be grammatically correct but factually wrong
  • Quality = relevance, factual accuracy, coherence
  • Quality evaluation is a first-class concern
  • Traces span prompt → retrieval → generation → re-ranking
  • Token cost per request is critical (it's your AWS bill)
  • "Silent failures" — plausible-sounding wrong answers

The most dangerous failure mode in LLM production is the silent failure: the model returns a 200 OK with a confident, fluent, completely wrong answer. Your APM sees green. Your users are getting misinformation. You have no idea.

That's the problem LLM observability is built to solve.

03 — IMPORTANCE
Why LLM observability matters in 2026

[Headline stats: the share of LLM failures that are silent [1] · average cost reduction with token tracking · MTTR improvement with proper traces]

Three specific reasons this matters right now:

1. You're paying per token — and it adds up fast

GPT-4o charges ~$5 per million input tokens. Claude Opus is $15. If you're running a RAG pipeline that sends 3,000-token prompts for every user query, and you have 10,000 daily active users, you're burning through tokens fast. Without observability, you have no visibility into which features are expensive, which prompts are bloated, or which retrieval chunks are redundant.
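
To make that concrete, here's a back-of-envelope sketch; the traffic numbers and prices below are assumptions, so swap in your own:

PYTHON cost_estimate.py
# Hypothetical traffic and pricing; adjust to your provider and usage
INPUT_PRICE_PER_M = 5.00       # $ per 1M input tokens (GPT-4o class)
OUTPUT_PRICE_PER_M = 15.00     # $ per 1M output tokens (assumed)
INPUT_TOKENS_PER_REQ = 3_000   # RAG prompt with retrieved context
OUTPUT_TOKENS_PER_REQ = 400    # assumed average completion length
REQUESTS_PER_DAY = 10_000      # e.g. 10k DAU at one query each

daily_cost = REQUESTS_PER_DAY * (
    INPUT_TOKENS_PER_REQ / 1e6 * INPUT_PRICE_PER_M
    + OUTPUT_TOKENS_PER_REQ / 1e6 * OUTPUT_PRICE_PER_M
)
print(f"~${daily_cost:,.0f}/day, ~${daily_cost * 30:,.0f}/month")
# ~= $210/day, ~= $6,300/month under the assumptions above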

2. Hallucinations don't throw exceptions

When a SQL query fails, you get an error. When an LLM confidently fabricates a legal clause, a medical dosage, or a product spec — you get a 200 OK. The only way to catch this is output evaluation: either automated (LLM-as-judge, assertion checks) or via user feedback signals — both of which require an observability layer to collect and route.
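
Assertion checks are the cheapest place to start. A minimal sketch with a couple of hypothetical domain rules (yours will be different):

PYTHON assertion_checks.py
import re

def run_assertions(answer: str, context: str) -> list[str]:
    """Cheap, deterministic output checks (hypothetical rules, for illustration)."""
    failures = []
    # Rule 1: refuse empty or truncated answers
    if len(answer.strip()) < 20:
        failures.append("answer_too_short")
    # Rule 2: numbers quoted in the answer should appear in the retrieved context
    for number in re.findall(r"\d[\d,.]*", answer):
        if number not in context:
            failures.append(f"ungrounded_number:{number}")
    # Rule 3: block phrases your product must never emit
    if re.search(r"as an ai language model", answer, re.IGNORECASE):
        failures.append("boilerplate_disclaimer")
    return failures

# Any non-empty list is a "silent failure" made loud; log it against the trace.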

3. LLM apps are increasingly multi-step

A modern RAG agent might do: query rewriting → vector search → reranking → generation → post-processing → tool calls. Any step can fail silently. Without distributed tracing across all those steps, you have no way to know which node in the chain is degrading your quality.

TIP

If you're already logging prompts and responses to a database, you have the raw material for LLM observability. The difference is structure, aggregation, and making that data queryable — which is what proper tooling does.
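
As a rough illustration of what that structure can look like, one record per LLM exchange with the fields you'll want to filter on later (the field names here are just an example, not a standard):

PYTHON structured_log.py
import json
import time
import uuid

def log_llm_exchange(prompt: str, response: str, model: str,
                     prompt_tokens: int, completion_tokens: int,
                     latency_ms: float, feature: str) -> dict:
    """One structured record per LLM call; field names are illustrative."""
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "feature": feature,                # which product surface made the call
        "prompt": prompt,                  # redact PII before logging in production
        "response": response,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
    }
    print(json.dumps(record))              # or write to your database / log pipeline
    return record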

04 — FUNDAMENTALS
The three pillars: metrics, traces, logs

LLM observability, like traditional observability, rests on three data types. But each has LLM-specific meaning:

Metrics — aggregated numbers over time

Latency percentiles, token consumption per day, error rates, hallucination rate, TTFT (time to first token), user thumbs-up/down ratio. These are your dashboards — the signals that tell you whether the system is healthy at a glance.

Traces — the execution path of a single request

A trace for an LLM request spans every step: input received → prompt constructed → retrieval triggered → chunks fetched → LLM called → response parsed → returned. Traces tell you where time and tokens were spent on a specific request and let you drill into failures.

Logs — raw structured records of events

Every prompt sent, every response received, every retrieved chunk, every tool call. Logs are the ground truth — unsampled, timestamped, filterable. They're what you reach for during incident investigation when metrics tell you something is wrong but not exactly what.

A mature LLM observability setup collects all three and links them: a metric spike points you to a trace, a trace links to the logs of that specific exchange.

⚠️
WARNING

Logging raw prompts and responses raises data privacy and compliance considerations. If users send PII, it ends up in your logs. Make sure you have a redaction or anonymization strategy before you log at full fidelity in production.

05 — METRICS
Key LLM observability metrics to track

These are the metrics that actually matter — not the generic list you'll find everywhere, but the ones that show up when something goes wrong.

Latency metrics

  • ⏱️ TTFT: Time To First Token. The latency a user perceives before streaming starts. Critical for UX — even if total latency is high, low TTFT feels fast. (User-perceived speed)
  • 📊 TPS: Tokens Per Second (generation speed). Determines how fast streaming completions arrive. Degrades under load — track p50, p95, p99. (Throughput)
  • 🕐 End-to-end latency: Total request time including retrieval, reranking, and generation. This is what SLAs are measured against. Break it down by stage. (SLA metric)
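
A rough way to measure TTFT and generation speed yourself, sketched against the OpenAI streaming API (the timing logic is the point, not the specific client):

PYTHON measure_ttft.py
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
first_token_at = None
pieces = []

# Stream the completion and note when the first content chunk arrives
stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "What is MLflow used for?"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        pieces.append(chunk.choices[0].delta.content)
total_s = time.perf_counter() - start

ttft_s = (first_token_at - start) if first_token_at else total_s
print(f"TTFT: {ttft_s * 1000:.0f} ms | end-to-end: {total_s * 1000:.0f} ms")
# Crude TPS estimate: ~4 characters per token if you don't re-tokenize the output
gen_s = max(total_s - ttft_s, 1e-6)
print(f"~TPS: {len(''.join(pieces)) / 4 / gen_s:.1f}")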

Cost metrics

  • 🪙 Input tokens/request: How many prompt tokens your application sends per call. This is where cost bloat hides — long system prompts, redundant context, noisy retrieved chunks. (Cost driver)
  • 💸 Cost per request: Calculated from input+output tokens × model price. Track by feature, user segment, and time of day to find optimization opportunities. (Unit economics)
  • 📈 Daily token burn rate: Total tokens consumed across all requests per day. Set alerts here. A bug that loops your chain will show up as a spike here before your bill does. (Budget alert)
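
A minimal sketch of per-request cost from the usage block the API returns; the prices are assumptions, so look up your model's current rates:

PYTHON cost_per_request.py
from openai import OpenAI

# Assumed prices in $ per 1M tokens; swap in your model's actual rates
PRICES = {"gpt-4o-mini": {"input": 0.15, "output": 0.60}}
MODEL = "gpt-4o-mini"

client = OpenAI()
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "What is MLflow used for?"}],
)

usage = response.usage  # prompt_tokens, completion_tokens, total_tokens
cost = (usage.prompt_tokens / 1e6 * PRICES[MODEL]["input"]
        + usage.completion_tokens / 1e6 * PRICES[MODEL]["output"])
print(f"{usage.prompt_tokens} in / {usage.completion_tokens} out -> ${cost:.6f}")
# Log cost next to the trace so it can be aggregated by feature and user segment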

Quality metrics

  • 🎯 Faithfulness: In RAG, does the answer stay grounded in the retrieved context? Unfaithful answers are hallucinations. Measured via LLM-as-judge or assertion frameworks. (RAG critical)
  • 📋 Relevance score: Is the answer relevant to the question asked? Separate from correctness. A factually accurate answer about the wrong thing still fails. (Quality signal)
  • 👍 User feedback rate: Explicit thumbs up/down, ratings, or correction events. The highest-signal quality metric — users are telling you directly when the model failed them. (Ground truth)

💡
NOTE

Quality metrics are the hardest to collect automatically. Start with user feedback signals (explicit) and error rate (implicit — users who retry or abandon). Then layer in automated evaluation once you have a baseline.

06 — TOOLS
Best LLM observability tools in 2026

Honest breakdown. I've tested all of these. No affiliate links, no vendor bias — just what each tool actually does well and where it falls short.

Langfuse
Open Source
Self-hostable, developer-first LLM tracing. Best open-source option if you want full data control and a clean SDK.
  • Self-hostable via Docker (free)
  • SDKs for Python, JS, LangChain, LlamaIndex
  • Prompt management + version tracking
  • Dataset + evaluation workflows
Arize Phoenix
Open Source
ML observability platform with strong LLM support. Great for teams already using Arize for traditional ML model monitoring.
  • OpenInference tracing standard
  • Embedding drift & cluster visualization
  • Built-in evals (hallucination, toxicity)
  • Works offline / local
Helicone
Free Tier
Proxy-based approach — zero SDK changes required. Add one header and you get instant logging, cost tracking, and rate limiting.
  • One-line integration (proxy URL swap)
  • Real-time cost dashboard
  • Request caching (save money)
  • 10k req/month free
W&B Weave
Free Tier
Weights & Biases' LLM observability layer. Best if you're already using W&B for experiment tracking — tight integration.
  • Native W&B integration
  • Automatic function tracing via decorator
  • Evaluation pipelines built-in
  • Free for individual use
OpenTelemetry
Open Standard
Build your own observability pipeline using the vendor-neutral OTel standard. Maximum flexibility, maximum setup effort.
  • Vendor-neutral (ship to any backend)
  • OpenLLMetry SDK for LLM spans
  • Works with Jaeger, Tempo, Datadog
  • Right choice for complex enterprise infra
Datadog LLM Obs
Paid
Enterprise-grade. If your team is already deep in Datadog and has the budget, their LLM Observability product integrates cleanly with existing dashboards.
  • Unified with existing Datadog APM
  • Auto-instrumentation for OpenAI/Anthropic
  • Cluster analysis for prompt patterns
  • Expensive — not for early-stage

Quick comparison

  • Langfuse — open source, self-hostable, evals built-in. Best for most teams; the best OSS default.
  • Arize Phoenix — open source, runs locally, evals built-in. Best for embedding analysis and ML teams.
  • Helicone — free tier, proxy-based. Best for cost tracking and the fastest setup.
  • W&B Weave — free tier, evals built-in. Best for W&B users who want experiment correlation.
  • OpenTelemetry — open standard, vendor-neutral. Best for enterprise, multi-backend infra.
  • Datadog LLM Obs — paid. Best for existing Datadog shops and enterprise.
RECOMMENDATION

Start with Langfuse. Open source, self-hostable with Docker in 5 minutes, clean Python SDK, covers 90% of what you need. Graduate to OpenTelemetry when you have multiple services that need unified tracing across a complex infra stack.

07 — TUTORIAL
How to implement LLM observability in Python

Enough theory. Here's how you actually do it. We'll implement tracing with Langfuse — the best open-source option — and show the full flow from a simple LLM call to a RAG pipeline with spans, scores, and cost tracking.

Step 1: Set up Langfuse (self-hosted via Docker)

BASH terminal
# Clone and start Langfuse locally
git clone https://github.com/langfuse/langfuse.git
cd langfuse
docker compose up -d

# Langfuse UI will be at http://localhost:3000
# Create a project and grab your API keys

# Install the Python SDK
pip install langfuse openai

Step 2: Basic LLM call with full tracing

PYTHON basic_tracing.py
from langfuse import Langfuse
from langfuse.openai import openai  # drop-in replacement

# The langfuse.openai drop-in reads LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY,
# LANGFUSE_HOST from the environment; the client below is for manual calls (e.g. flush)
langfuse = Langfuse(
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
    host="http://localhost:3000"  # or https://cloud.langfuse.com
)

# This single import swap gives you automatic tracing
# of every OpenAI call: prompt, response, tokens, latency, cost
response = openai.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful MLOps assistant."},
        {"role": "user", "content": "What is MLflow used for?"}
    ],
    # Optional: tag this trace for filtering in the UI
    name="mlops-qa",
    metadata={"feature": "chat", "user_id": "u_123"}
)

print(response.choices[0].message.content)
# All trace data is now visible in Langfuse UI — zero extra code needed

That's the minimal setup. The import swap is the key — from langfuse.openai import openai patches the OpenAI client and captures everything automatically. Token counts, cost, latency, the full prompt and response.

Step 3: Custom spans for multi-step pipelines

PYTHON rag_pipeline_traced.py
from langfuse import Langfuse
from langfuse.openai import openai
from langfuse.decorators import langfuse_context, observe

langfuse = Langfuse()

# @observe creates a span for this function automatically
@observe()
def retrieve_chunks(query: str, top_k: int = 5) -> list:
    """Simulated vector store retrieval"""
    # In production: call your Chroma / Pinecone / Weaviate here
    chunks = [
        {"text": "MLflow is an open source platform for ML lifecycle management...", "score": 0.92},
        {"text": "MLflow Tracking logs parameters, metrics, and artifacts...", "score": 0.87},
    ]
    # Log retrieval metadata to the span
    langfuse_context.update_current_observation(
        input=query,
        output=chunks,
        metadata={"top_k": top_k, "chunk_count": len(chunks)}
    )
    return chunks

@observe()
def build_prompt(query: str, chunks: list) -> str:
    """Assemble the final prompt from query + retrieved context"""
    context = "\n\n".join([c["text"] for c in chunks])
    return f"""Answer using only the context below.

Context:
{context}

Question: {query}
Answer:"""

@observe()  # The root trace — wraps the whole pipeline
def rag_answer(query: str) -> str:
    # Step 1: retrieve — traced as a child span
    chunks = retrieve_chunks(query)
    
    # Step 2: build prompt — traced as a child span
    prompt = build_prompt(query, chunks)
    
    # Step 3: generate — traced via patched OpenAI client
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    answer = response.choices[0].message.content
    
    # Step 4: score the output quality (0-1 scale)
    langfuse_context.score_current_trace(
        name="answer_quality",
        value=1.0,  # replace with your eval logic
        comment="Auto-scored: retrieval found relevant chunks"
    )
    
    return answer

# Run it
result = rag_answer("What is MLflow used for?")
print(result)

# Flush traces before script exits
langfuse.flush()

Step 4: Automated quality scoring (LLM-as-judge)

PYTHON llm_judge_eval.py
import json

from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()   # used below to post judge scores back to traces
raw_client = OpenAI()   # unpatched — don't trace the judge calls

def evaluate_faithfulness(question: str, context: str, answer: str) -> tuple[float, str]:
    """
    LLM-as-judge: score whether the answer is faithful to the retrieved context.
    Returns (score, reason), with score from 0.0 (hallucination) to 1.0 (fully grounded).
    """
    judge_prompt = f"""You are evaluating an AI assistant's answer for faithfulness.

RETRIEVED CONTEXT:
{context}

QUESTION: {question}

ANSWER: {answer}

Task: Score whether the answer is ONLY based on the retrieved context (not hallucinated).
Respond with JSON only: {{"score": 0.0-1.0, "reason": "brief explanation"}}
0.0 = completely hallucinated | 1.0 = fully grounded in context"""

    response = raw_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"}
    )
    
    result = json.loads(response.choices[0].message.content)
    return result["score"], result["reason"]

# Then post the score back to Langfuse for any trace
# (question, context, answer come from the RAG pipeline run you're evaluating)
score, reason = evaluate_faithfulness(question, context, answer)

langfuse.score(
    trace_id="your-trace-id",  # from langfuse_context.get_current_trace_id()
    name="faithfulness",
    value=score,
    comment=reason
)

Step 5: Add user feedback signals

PYTHON user_feedback.py
# When a user clicks 👍 or 👎, post feedback to Langfuse
# Store the trace_id alongside your response when you serve it

from langfuse import Langfuse

langfuse = Langfuse()

def handle_user_feedback(trace_id: str, thumbs_up: bool, comment: str = None):
    """Record user feedback against the trace that generated the response"""
    langfuse.score(
        trace_id=trace_id,
        name="user_feedback",
        value=1 if thumbs_up else 0,
        comment=comment
    )

# Example: in your FastAPI endpoint
# @app.post("/feedback")
# async def feedback(trace_id: str, positive: bool, comment: str = None):
#     handle_user_feedback(trace_id, positive, comment)
#     return {"status": "recorded"}

After this implementation, your Langfuse dashboard will show: every trace, its constituent spans (retrieval, prompt build, generation), token counts, latency breakdown by step, faithfulness scores, and user feedback — all correlated.

PRO TIP

Get the current trace_id inside any @observe-decorated function with langfuse_context.get_current_trace_id(). Store this in your response payload so you can link user feedback back to the exact trace.

08 — ADVANCED
RAG observability: what's different

Retrieval-Augmented Generation pipelines have unique failure modes that generic LLM observability doesn't capture. Here are the RAG-specific metrics to track:

  • Context precision: Are retrieved chunks actually relevant to the query? Good range: > 0.8. Bad signal: low → noisy retrieval, poor embedding.
  • Context recall: Did retrieval find all the chunks needed to answer? Good range: > 0.75. Bad signal: low → answer is incomplete, missing context.
  • Faithfulness: Is the answer grounded in the retrieved context? Good range: > 0.85. Bad signal: low → hallucination, ignoring retrieved data.
  • Answer relevance: Does the answer address what was actually asked? Good range: > 0.8. Bad signal: low → model is answering the wrong question.
  • Retrieval latency: Time spent in the vector search step. Good range: < 200 ms. Bad signal: high → index needs optimization or scaling.
  • Chunk token count: Avg tokens per retrieved chunk. Good range: 200–600. Bad signal: too high → inflated cost, diluted signal.

The RAG failure you're not watching for

The most common undetected RAG failure is context stuffing: your retrieval returns chunks that look semantically similar to the query but don't contain the actual answer. The model then either hallucinates or returns a plausible-sounding non-answer. Context precision catches this — track it per query, and set an alert if it drops below 0.6 for more than 5% of requests.

PYTHON rag_metrics.py
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Collect your RAG pipeline outputs
data = {
    "question":  ["What is MLflow used for?"],
    "answer":    ["MLflow is used for experiment tracking..."],
    "contexts":  [["MLflow is an open source platform...", "MLflow Tracking logs..."]],
    "ground_truth": ["MLflow manages the ML lifecycle including tracking..."]
}

dataset = Dataset.from_dict(data)

# Run RAGAS evaluation — gives you all 4 RAG metrics at once
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)

print(result)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88,
#  'context_precision': 0.94, 'context_recall': 0.81}

# Then post these scores to Langfuse for the corresponding trace
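
To make the context-stuffing alert rule from above checkable, here's a rough sketch; it assumes you can pull recent context_precision scores (from RAGAS runs or your tracing backend) into a plain list:

PYTHON context_precision_alert.py
def context_precision_alert(scores: list[float],
                            threshold: float = 0.6,
                            max_bad_fraction: float = 0.05) -> bool:
    """Fire if more than 5% of recent requests scored context precision below 0.6."""
    if not scores:
        return False
    bad_fraction = sum(1 for s in scores if s < threshold) / len(scores)
    return bad_fraction > max_bad_fraction

# Example: scores pulled for the last hour of traffic (however you store them)
recent_scores = [0.91, 0.88, 0.42, 0.95, 0.87, 0.55, 0.90]
if context_precision_alert(recent_scores):
    print("ALERT: context precision degraded; check retrieval and embeddings")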

09 — PITFALLS
Common mistakes to avoid

Logging everything without a retention policy

Storing every raw prompt and response forever will balloon your storage costs fast. Set a 30-90 day retention window, sample high-volume low-value traces (e.g., 1 in 10 for healthy routine calls), and keep 100% of error traces and scored traces.
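
A minimal sketch of that sampling policy, assuming you decide per request whether to keep the full trace:

PYTHON trace_sampling.py
import random

def should_keep_trace(is_error: bool, has_score: bool, sample_rate: float = 0.1) -> bool:
    """Keep 100% of error and scored traces, sample the healthy routine ones."""
    if is_error or has_score:
        return True
    return random.random() < sample_rate

# Gate your full-fidelity logging call with this before writing prompt/response to storage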

Treating latency as the only quality signal

Fast bad answers are worse than slow good ones. Latency is table stakes. Build quality metrics from day one, even if it's just a user thumbs-up/down — don't let "it's fast" become your proxy for "it's working."

Adding observability as an afterthought

If you retrofit tracing into a production system with no span structure, you'll get a flat blob of logs with no actionable signal. Instrument at the architecture level — define your spans (retrieval, generation, eval) from the first prototype.

Not separating judge calls from production traces

If you're using an LLM to evaluate your LLM's outputs, those evaluation calls must use an unpatched client. Otherwise you'll get recursive tracing, inflated token counts, and meaningless cost data.

Ignoring PII in logs

Users will send email addresses, names, addresses, medical info into your LLM app. Log at full fidelity in dev/staging. In production, run a PII redaction pass before writing traces to storage. This is not optional if you're handling EU users (GDPR).
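
A rough regex-based redaction pass as a starting point (a sketch only; a dedicated PII detection library is more robust in production):

PYTHON pii_redaction.py
import re

# Illustrative patterns; extend for phone numbers, IBANs, national IDs, etc.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "IP_ADDRESS": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact(text: str) -> str:
    """Replace common PII patterns before the text is written to trace storage."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(redact("Contact me at ada@example.com from 192.168.0.1"))
# -> "Contact me at <EMAIL> from <IP_ADDRESS>"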

10 — FAQ
Frequently asked questions

What's the difference between LLM monitoring and LLM observability?

Monitoring tracks predefined metrics (latency, error rate) and alerts when they cross thresholds. Observability is broader — it's the ability to ask arbitrary questions about your system's behavior from its outputs, including things you didn't anticipate when you set up the system. In practice: monitoring tells you something is wrong, observability helps you figure out why and what.

Can I use Prometheus and Grafana for LLM observability?

Yes, for system-level metrics (latency, throughput, error rate, token counts). You can expose these via a /metrics endpoint and scrape with Prometheus. But you'll still need a purpose-built tool like Langfuse or Phoenix for prompt/response tracing, RAG-specific metrics, and quality evaluation — Prometheus doesn't understand the semantic content of LLM outputs.
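
A minimal sketch of that approach with the prometheus_client library, assuming you wrap each LLM call and record its outcome:

PYTHON prometheus_metrics.py
import time
from prometheus_client import Counter, Histogram, start_http_server

LLM_REQUESTS = Counter("llm_requests_total", "LLM requests", ["model", "status"])
LLM_TOKENS = Counter("llm_tokens_total", "Tokens consumed", ["model", "direction"])
LLM_LATENCY = Histogram("llm_request_seconds", "End-to-end LLM latency", ["model"])

def record_call(model: str, prompt_tokens: int, completion_tokens: int,
                latency_s: float, ok: bool) -> None:
    """Update Prometheus metrics for one LLM call."""
    LLM_REQUESTS.labels(model=model, status="ok" if ok else "error").inc()
    LLM_TOKENS.labels(model=model, direction="input").inc(prompt_tokens)
    LLM_TOKENS.labels(model=model, direction="output").inc(completion_tokens)
    LLM_LATENCY.labels(model=model).observe(latency_s)

if __name__ == "__main__":
    start_http_server(9100)   # exposes /metrics for Prometheus to scrape
    record_call("gpt-4o-mini", 3000, 400, 1.8, ok=True)
    time.sleep(60)            # keep the endpoint up long enough for a scrape in this demo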

How do you detect hallucinations automatically?

Three main approaches: (1) Faithfulness scoring — use an LLM judge to check if the answer is grounded in retrieved context. (2) Assertion checks — programmatic rules for your domain (e.g., "answer must not contain any date before 2020"). (3) Semantic similarity — compare the answer embedding to the context embedding; low similarity suggests the answer went "off-context." None of these are perfect. Start with LLM-as-judge faithfulness scoring and user feedback signals together.

Is LLM observability the same as MLOps?

MLOps is the broader practice of operationalizing machine learning — including training pipelines, experiment tracking, model deployment, and monitoring. LLM observability is a specific subset focused on monitoring LLM-powered applications in production. It overlaps with MLOps but has different tooling and concerns (token costs, prompt management, output quality evaluation vs. model drift, retraining pipelines).

What's the cheapest way to start with LLM observability?

Self-host Langfuse via Docker (free). Use the Python SDK with the OpenAI import swap (5 lines of code). You'll have full tracing, token tracking, and a queryable UI for $0. Your only cost is the server running Langfuse — a $5/month DigitalOcean droplet or your local machine is enough for early-stage projects.

Does LLM observability work with open-source models (Llama, Mistral)?

Yes. Langfuse and Phoenix work with any model via their generic SDK (you manually log inputs/outputs). For models served via vLLM or Ollama with an OpenAI-compatible API, the OpenAI import swap works directly. Token cost tracking requires you to calculate costs manually (or use a provider that reports usage in their API response).

11 — CONCLUSION
Where to go from here

LLM observability isn't optional at production scale. The "it works in testing" mindset breaks fast when real users send unexpected inputs, when retrieval quality degrades silently, when a token-hungry prompt pattern starts inflating your inference bill.

The stack to start with: Langfuse for tracing, RAGAS for RAG quality metrics, and user feedback signals for ground truth. That combination gives you 80% of what you need with maybe a day of implementation work.

Don't build the perfect observability system before shipping. Instrument as you build. Add quality metrics when you have baseline data to compare against. The value compounds.

🔗
NEXT STEPS

Set up Langfuse locally → instrument one LLM call → check the trace in the UI. That's the first 20 minutes. Everything else follows from having that first trace visible.

Ayub Shah
ML Engineering Student · MLOps Engineer
Computer engineering student on the path to becoming an ML Engineer. Testing every MLOps tool so you don't have to — no vendor bias, just hands-on code. Reach me at ayubshah014@gmail.com

References

  1. Dong, L., Lu, Q., & Zhu, L. (2024). AgentOps: Enabling Observability of LLM Agents. arXiv. arxiv.org/abs/2411.05285
  2. Es, S., et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv. arxiv.org/abs/2309.15217
  3. Langfuse Documentation. Open Source LLM Engineering Platform. langfuse.com/docs
  4. OpenTelemetry Semantic Conventions for LLM systems. opentelemetry.io/docs/specs/semconv/gen-ai/
  5. Vesely, K., & Lewis, M. (2024). Real-Time Monitoring and Diagnostics of Machine Learning Pipelines. Journal of Systems and Software, 185, 111136.