Self-Host Langfuse for LLM Observability

Updated 2026-03-06

Overview

If you run autonomous AI workflows in production, you need observability beyond app logs. Langfuse gives you prompt traces, model call histories, latency metrics, and evaluation hooks in one place.
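
Conceptually, a trace is the unit Langfuse stores: one workflow run grouping its model calls with latency and token metadata. A self-contained sketch of that shape in Python (field names here are illustrative, not the actual Langfuse schema):

```python
from dataclasses import dataclass, field

@dataclass
class Generation:
    """One model call inside a trace: model, latency, token counts."""
    model: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int

@dataclass
class Trace:
    """One end-to-end workflow run, grouping its model calls."""
    workflow: str
    generations: list[Generation] = field(default_factory=list)

    def total_tokens(self) -> int:
        # Total tokens across all model calls in this run
        return sum(g.prompt_tokens + g.completion_tokens for g in self.generations)

t = Trace(workflow="ticket-triage")
t.generations.append(Generation("gpt-4o-mini", 820.0, 512, 64))
t.generations.append(Generation("gpt-4o-mini", 640.0, 300, 40))
print(t.total_tokens())  # 916
```

Aggregating records like this per workflow is what the dashboards in the deployment notes below are built from.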

Docker Compose

services:
  langfuse:
    image: langfuse/langfuse:latest  # pin a specific version in production
    restart: unless-stopped
    depends_on: [postgres]
    ports:
      - "3000:3000"
    environment:
      DATABASE_URL: postgresql://langfuse:langfuse@postgres:5432/langfuse
      NEXTAUTH_URL: https://observability.example.com
      NEXTAUTH_SECRET: change-me  # generate with: openssl rand -base64 32
      SALT: change-me
  postgres:
    image: postgres:16
    environment: { POSTGRES_USER: langfuse, POSTGRES_PASSWORD: langfuse, POSTGRES_DB: langfuse }
    volumes: ["pg_data:/var/lib/postgresql/data"]
volumes:
  pg_data:
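
With the compose file in place, a typical bring-up and smoke test looks like this (Langfuse exposes a public health endpoint; verify the exact path against the docs for the version you deploy):

```shell
# start the stack in the background
docker compose up -d

# wait for startup, then confirm the app is serving
curl -fsS http://localhost:3000/api/public/health
```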

Reverse Proxy (Caddy)

observability.example.com {
  reverse_proxy 127.0.0.1:3000
}

Reverse Proxy (nginx)

server {
  listen 443 ssl;
  server_name observability.example.com;

  # certificate paths below assume certbot defaults; adjust for your setup
  ssl_certificate     /etc/letsencrypt/live/observability.example.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/observability.example.com/privkey.pem;

  location / {
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto https;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_pass http://127.0.0.1:3000;
  }
}

Cost and Hardware

For small teams, 1 vCPU and 2 GB RAM is a practical baseline. If you retain large trace volumes, storage growth will dominate cost, so define retention policy early.
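
A back-of-envelope sizing helper makes the retention tradeoff concrete. The per-trace size below is an assumption to replace with your own measurement:

```python
def storage_gb(traces_per_day: int, avg_trace_kb: float, retention_days: int) -> float:
    """Rough trace storage footprint over a retention window, in GB."""
    total_kb = traces_per_day * avg_trace_kb * retention_days
    return total_kb / (1024 * 1024)

# 20k traces/day at ~30 KB each, retained 90 days -> roughly 51.5 GB
print(round(storage_gb(20_000, 30.0, 90), 1))  # 51.5
```

Run the numbers before picking a retention window; doubling retention doubles the dominant cost term.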

Deployment Notes

Instrument one critical workflow first and confirm trace completeness before broad rollout. Add dashboards for median latency, error rate, and token usage per workflow so regressions are visible within hours.
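
The three dashboard metrics reduce to simple aggregations over trace records. A sketch with illustrative record fields:

```python
from statistics import median

traces = [
    {"workflow": "triage", "latency_ms": 900,  "ok": True,  "tokens": 600},
    {"workflow": "triage", "latency_ms": 1500, "ok": False, "tokens": 750},
    {"workflow": "triage", "latency_ms": 1100, "ok": True,  "tokens": 640},
]

def workflow_stats(traces: list[dict]) -> dict:
    """Median latency, error rate, and mean token usage for one workflow."""
    latencies = [t["latency_ms"] for t in traces]
    errors = sum(1 for t in traces if not t["ok"])
    return {
        "median_latency_ms": median(latencies),
        "error_rate": errors / len(traces),
        "avg_tokens": sum(t["tokens"] for t in traces) / len(traces),
    }

# median 1100 ms, ~33% errors, ~663 tokens per trace
print(workflow_stats(traces))
```

Computed per workflow on a schedule, these are the numbers a regression will move first.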

For retention, split “hot” and “archive” windows. Keep recent traces searchable for daily debugging and move older traces to cheaper storage on a schedule. This keeps the system responsive while preserving analysis history for incident reviews and model-quality audits.
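
The hot/archive split is just a partition by trace age. A minimal sketch, with the 14-day hot window as an example value to tune:

```python
from datetime import datetime, timedelta, timezone

def partition(traces: list[dict], now: datetime, hot_days: int = 14):
    """Split traces into a searchable hot window and an archive set by age."""
    cutoff = now - timedelta(days=hot_days)
    hot = [t for t in traces if t["ts"] >= cutoff]
    archive = [t for t in traces if t["ts"] < cutoff]
    return hot, archive

now = datetime(2026, 3, 6, tzinfo=timezone.utc)
traces = [
    {"id": "a", "ts": now - timedelta(days=2)},   # recent: stays hot
    {"id": "b", "ts": now - timedelta(days=40)},  # old: moves to archive
]
hot, archive = partition(traces, now)
print([t["id"] for t in hot], [t["id"] for t in archive])  # ['a'] ['b']
```

Run a job like this on a schedule and ship the archive set to object storage.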

Also define ownership. Someone should be accountable for schema updates, trace tag hygiene, and evaluation rubric changes. Without that, observability quality decays quickly and teams stop trusting the dashboards.

Maintenance Checklist

Run a weekly maintenance pass so observability stays trustworthy. Review failed traces, slow spans, and token spikes by workflow. A short recurring review prevents silent regressions from becoming expensive incidents.
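
The token-spike part of that review can be a comparison of the latest week against a rolling baseline. A sketch, with the 1.5x threshold as an arbitrary illustration:

```python
def token_spikes(weekly_usage: dict[str, list[int]], factor: float = 1.5) -> list[str]:
    """Flag workflows whose latest week exceeds the mean of prior weeks by `factor`."""
    flagged = []
    for workflow, weeks in weekly_usage.items():
        if len(weeks) < 2:
            continue  # no baseline yet
        baseline = sum(weeks[:-1]) / len(weeks[:-1])
        if weeks[-1] > factor * baseline:
            flagged.append(workflow)
    return flagged

usage = {
    "triage":    [10_000, 11_000, 24_000],  # latest week more than doubles baseline
    "summarize": [5_000, 5_200, 5_100],
}
print(token_spikes(usage))  # ['triage']
```

Anything this flags goes to the top of the weekly review agenda.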

Keep your prompts versioned and tagged by release. When output quality shifts, you need to answer one question fast: what changed in prompts, model, tools, or data? Version history makes rollback and root-cause analysis realistic.
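
A minimal version registry makes the "what changed?" question answerable. The class below is an illustrative sketch, not Langfuse's prompt-management API:

```python
class PromptRegistry:
    """Append-only store of prompt versions tagged by release."""

    def __init__(self):
        self._versions: dict[str, list[tuple[str, str]]] = {}

    def register(self, name: str, release: str, text: str) -> None:
        self._versions.setdefault(name, []).append((release, text))

    def latest(self, name: str) -> tuple[str, str]:
        return self._versions[name][-1]

    def rollback(self, name: str) -> tuple[str, str]:
        """Drop the newest version and return the one before it."""
        self._versions[name].pop()
        return self._versions[name][-1]

reg = PromptRegistry()
reg.register("triage", "v1.0", "Classify the ticket...")
reg.register("triage", "v1.1", "Classify the ticket and cite evidence...")
print(reg.rollback("triage")[0])  # v1.0
```

Tagging each registered version with the release that shipped it is what ties output shifts back to a specific change.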

Set retention and sampling rules intentionally. High-cardinality traces on noisy internal tasks can flood storage while adding little insight. Keep full traces for critical paths and sample aggressively for low-impact flows.
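
Deterministic, hash-based sampling keeps the keep/drop decision reproducible per trace ID, so retries and replays sample consistently. The critical-path set and rate below are example values to tune:

```python
import hashlib

CRITICAL = {"checkout", "triage"}  # workflows that always keep full traces

def keep_trace(workflow: str, trace_id: str, sample_rate: float = 0.05) -> bool:
    """Always keep critical workflows; hash-sample the rest at `sample_rate`."""
    if workflow in CRITICAL:
        return True
    # map the trace ID to a stable value in [0, 1) and compare to the rate
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate

print(keep_trace("checkout", "abc"))  # True
```

Because the decision depends only on the trace ID, the same run is either fully sampled or fully dropped across services.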