Tutorial · 7 min read

How to Monitor Your LLM App in Production (Step-by-Step)

You shipped your LLM app. Users are using it. But do you actually know what's happening? Most teams have zero visibility into their AI app once it's live. This guide shows you how to fix that in under 30 minutes.

Why Traditional Monitoring Isn't Enough for LLM Apps

Your existing monitoring stack (Datadog, New Relic, Prometheus) tracks the basics: uptime, HTTP response codes, server CPU. But AI applications need fundamentally different monitoring:

  • Output quality: Is the AI giving good, accurate answers?
  • Hallucination detection: Is it confidently making things up?
  • Cost per interaction: Are some queries 100x more expensive than others?
  • User satisfaction: Are users happy with what they're getting?

A 200 OK response doesn't mean your AI gave a good answer. You need observability that understands AI-specific signals. Here's how to set it up.

Step 1: Instrument Your LLM Application

The foundation of LLM monitoring is event logging. Every time your app makes an LLM call, you need to capture the input, output, and relevant metadata.

app.py
import phospho
import openai

# Initialize phospho (get your API key from the dashboard)
phospho.init(api_key="ph_your_key")

def handle_user_message(user_input, session_id):
    # Your existing LLM call
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": user_input}]
    )

    output = response.choices[0].message.content

    # Log to phospho — this is the key line
    phospho.log(
        input=user_input,
        output=output,
        session_id=session_id,
        metadata={
            "model": "gpt-4",
            "tokens_in": response.usage.prompt_tokens,
            "tokens_out": response.usage.completion_tokens,
        }
    )

    return output

That's it for instrumentation. The phospho.log() call captures the full interaction with metadata. It's non-blocking, so it won't slow down your app.
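The one piece you still own is the session_id. A common approach is to mint one id per conversation thread and reuse it for every turn, so session replay can stitch the full journey together. A minimal sketch (the session_for helper and the conversation-key format are illustrative, not part of the phospho SDK):

```python
import uuid

# One session_id per conversation thread, reused across turns,
# so the dashboard can group all messages of a conversation.
def new_session_id():
    return f"session_{uuid.uuid4().hex}"

sessions = {}  # conversation_key -> session_id (in-memory for the sketch)

def session_for(conversation_key):
    # Reuse the existing id for follow-up messages in the same thread.
    return sessions.setdefault(conversation_key, new_session_id())

sid_first = session_for("user42:thread1")
sid_followup = session_for("user42:thread1")
assert sid_first == sid_followup  # same thread, same session
```

In a real app you'd keep this mapping in your session store rather than a module-level dict.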

Step 2: Set Up Your Monitoring Dashboard

Once events start flowing in, your Phospho dashboard automatically surfaces the key metrics you need:

  1. Request Volume: Events over time, hourly and daily trends. Spot usage spikes and adoption patterns.
  2. Quality Scores: Automatic quality scoring for every response. See average scores and trends.
  3. Cost Breakdown: Token usage and API cost per interaction, per user, per model.
  4. Session Replay: Full conversation flows. Click any session to see the complete user journey.
  5. User Feedback: Aggregated thumbs up/down scores. See which interactions users love or hate.

No configuration needed. The dashboard is ready the moment your first event arrives.
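You can also sanity-check the cost numbers yourself from the token counts you're already logging. A minimal sketch, with placeholder per-1K-token prices (check your provider's current pricing; these figures are illustrative, not live rates):

```python
# Rough per-interaction cost estimate from logged token counts.
# Prices are illustrative placeholders, not current OpenAI rates.
PRICE_PER_1K = {
    "gpt-4": {"in": 0.03, "out": 0.06},
}

def estimate_cost(model, tokens_in, tokens_out):
    p = PRICE_PER_1K[model]
    return (tokens_in / 1000) * p["in"] + (tokens_out / 1000) * p["out"]

# A 1200-token prompt with a 300-token completion:
print(round(estimate_cost("gpt-4", 1200, 300), 4))
```

Running this kind of estimate per user or per feature is how you spot the queries that are 100x more expensive than the rest.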

Set up LLM monitoring in 5 minutes

Stop guessing what's happening in your AI app. Get real-time visibility with Phospho.

Get Phospho Pro — $49/mo

Step 3: Configure Alerts and Thresholds

Proactive monitoring means catching issues before your users report them. Set up alerts for:

| Alert Type | Threshold | Why It Matters |
| --- | --- | --- |
| Quality drop | < 0.7 score (1h avg) | AI is giving worse answers |
| Cost spike | > 2x daily average | Runaway costs or abuse |
| Latency increase | > 5s p95 response time | Users waiting too long |
| Error rate | > 5% of requests | LLM API issues |
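These thresholds translate directly into code. A toy sketch of three of the checks, using made-up sample data (the function names and event shapes are illustrative, not a Phospho API):

```python
from statistics import mean

def quality_alert(scores_last_hour, threshold=0.7):
    # Fires when the rolling 1h average quality score drops below threshold.
    return mean(scores_last_hour) < threshold

def cost_spike_alert(cost_today, daily_average, factor=2.0):
    # Fires when today's spend exceeds 2x the daily average.
    return cost_today > factor * daily_average

def error_rate_alert(errors, total, max_rate=0.05):
    # Fires when more than 5% of requests errored.
    return total > 0 and errors / total > max_rate

assert quality_alert([0.9, 0.5, 0.6]) is True   # 1h avg ~0.67, below 0.7
assert cost_spike_alert(25.0, 10.0) is True     # more than 2x the average
assert error_rate_alert(3, 100) is False        # 3% is under the 5% bar
```

Hook checks like these up to whatever pager or Slack webhook your team already uses.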

Step 4: Analyze Sessions and Improve

The real power of LLM monitoring comes from turning data into action. Here's the workflow teams use daily:

  1. Review daily metrics — Check the dashboard for quality trends, cost patterns, and volume changes.
  2. Investigate low-quality sessions — Use session replay to understand exactly where the AI went wrong.
  3. Identify patterns — Look for common failure modes: specific query types, user segments, or times of day.
  4. Iterate on prompts — Use insights to improve system prompts, add guardrails, or adjust model selection.
  5. Verify improvements — Compare quality scores before and after changes to prove impact.
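The last step can be as simple as a before/after comparison of the quality scores you're already collecting. A sketch with made-up sample scores:

```python
from statistics import mean

def quality_delta(before, after):
    # Positive delta means the prompt change improved average quality.
    return round(mean(after) - mean(before), 3)

scores_before = [0.62, 0.70, 0.66, 0.58]  # week before the prompt change
scores_after = [0.74, 0.81, 0.78, 0.75]   # week after (sample data)

print(quality_delta(scores_before, scores_after))
```

For a rigorous comparison you'd want similar traffic mixes in both windows, but even this rough delta is enough to catch a regression.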

The ROI of LLM Monitoring

Teams using production monitoring for their LLM applications consistently report:

  • 30% reduction in LLM API costs
  • 2x faster debugging of production issues
  • 50% fewer user complaints about AI quality

At $49/month, Phospho pays for itself after finding a single cost optimization or catching one quality issue before users do.

Common Mistakes to Avoid

  • Only monitoring latency and ignoring response quality
  • Waiting for user complaints instead of proactively detecting issues
  • Not tracking costs at the per-user and per-feature level
  • Ignoring session-level context (individual messages don't tell the full story)
  • Using generic APM tools that don't understand LLM-specific signals
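The per-user cost point is easy to fix once every logged event carries a user identifier. A minimal sketch of rolling events up to per-user spend (the event shape is illustrative):

```python
from collections import defaultdict

# Illustrative logged events: one row per LLM call, with its cost.
events = [
    {"user": "alice", "cost": 0.04},
    {"user": "bob",   "cost": 0.90},
    {"user": "alice", "cost": 0.06},
]

cost_by_user = defaultdict(float)
for e in events:
    cost_by_user[e["user"]] += e["cost"]

# One of bob's queries costs more than all of alice's combined --
# exactly the kind of outlier per-user tracking surfaces.
print(dict(cost_by_user))
```

The same roll-up works per feature or per model once those fields are in your log metadata.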

Don't wait for users to complain

Get visibility into your LLM application today. Set up monitoring in under 5 minutes and start shipping with confidence.

Get Phospho Pro — $49/mo Early Access

Founding member pricing locked in forever.