Production AI Institute — vendor-neutral certification for AI practitioners
PSF Deep Dive · D4 Observability

AI Observability: What to Log, How Long to Keep It, and When to Alert

Observability is the difference between knowing your AI system failed and knowing why it failed, when it started, and which users were affected. PSF-4 defines the minimum logging, alerting, and retention requirements for production AI systems — this guide shows you how to implement them.

The Observability Gap in AI Systems

Traditional software observability (metrics, logs, traces) is necessary but not sufficient for AI systems. A model can be returning valid 200 responses while silently degrading in quality — drifting off-distribution, hallucinating more frequently, producing outputs that are technically correct but semantically wrong. Infrastructure monitoring doesn't catch this. PSF-4 requires you to monitor the intelligence layer, not just the infrastructure layer.

PSF-4's minimum requirements fall into three areas, each covered below: a structured logging schema for every model invocation, a baseline set of alert thresholds, and drift detection on output quality.

The Minimum Logging Schema

Every model invocation should produce a structured log entry with at minimum:

{
  "run_id": "uuid",
  "timestamp": "2026-04-30T14:32:00Z",
  "model": "gpt-4o-2024-08-06",
  "model_version_hash": "sha256:abc123",
  "input_hash": "sha256:def456",  // hash only — not the raw input if PHI possible
  "output_hash": "sha256:ghi789",
  "latency_ms": 1243,
  "input_tokens": 847,
  "output_tokens": 312,
  "total_cost_usd": 0.0043,
  "validation_passed": true,
  "confidence_score": 0.91,
  "user_id_hash": "sha256:jkl012",  // pseudonymised
  "workflow_id": "invoice-approval-v3",
  "environment": "production",
  "tags": ["billing", "high-value"]
}
PHI/PII note: Never include raw model inputs or outputs in your logging schema if they may contain personal data. Log hashes for correlation; store full traces in a separate, access-controlled trace store with appropriate retention policies.
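As a sketch, an entry like the one above can be assembled at call time with the identifying fields hashed before anything is written. The helper below is illustrative, not part of PSF-4: `build_log_entry` and its wiring are assumptions, and fields such as `model_version_hash`, `validation_passed`, `confidence_score`, and `tags` are omitted for brevity.

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone

def sha256_tag(value: str) -> str:
    # Hash for correlation; the raw value is never logged.
    return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()

def build_log_entry(model, raw_input, raw_output, latency_ms,
                    input_tokens, output_tokens, total_cost_usd,
                    user_id, workflow_id, environment="production"):
    """Assemble one structured entry per model invocation.
    Raw input, output, and user ID never leave this function -- only hashes do."""
    return {
        "run_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "input_hash": sha256_tag(raw_input),
        "output_hash": sha256_tag(raw_output),
        "latency_ms": latency_ms,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "total_cost_usd": total_cost_usd,
        "user_id_hash": sha256_tag(user_id),
        "workflow_id": workflow_id,
        "environment": environment,
    }

entry = build_log_entry("gpt-4o-2024-08-06", "example input", "example output",
                        1243, 847, 312, 0.0043, "user-42", "invoice-approval-v3")
print(json.dumps(entry, indent=2))
```

One caveat on pseudonymisation: an unsalted hash of a low-entropy identifier (like a sequential user ID) can be reversed by brute force, so in practice a keyed hash (e.g. HMAC with a secret key) is the safer choice for `user_id_hash`.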

Alert Thresholds: The Four You Must Have

Error rate > 2% over a 5-minute window
Action: PagerDuty/on-call alert
Why: a sudden error spike signals a model API issue, a validation-failure spike, or an upstream data problem

Output validation failure rate > 5% over a 10-minute window
Action: engineering alert + auto-fallback trigger
Why: a model drifting off schema signals a prompt regression, a model version change, or a distribution shift

Latency p99 > 3× baseline for 5 minutes
Action: infrastructure alert
Why: latency spikes cause timeout cascades in downstream systems

Cost per hour > 200% of the 7-day rolling average
Action: engineering + finance alert
Why: a cost anomaly signals runaway loops, abuse, or accidentally oversized prompts
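The sliding-window logic behind the first threshold fits in a few lines. The class below is an illustrative sketch using the 2% / 5-minute values from the table; the class name and wiring are assumptions, not any particular alerting library's API.

```python
import time
from collections import deque

class ErrorRateAlert:
    """Sliding-window check: fire when more than 2% of calls
    in the last 5 minutes errored."""

    def __init__(self, threshold=0.02, window_seconds=300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, is_error) pairs

    def record(self, is_error, now=None):
        """Record one model call; return True if the alert should fire."""
        now = time.time() if now is None else now
        self.events.append((now, is_error))
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()  # evict calls outside the window
        errors = sum(1 for _, e in self.events if e)
        return errors / len(self.events) > self.threshold

alert = ErrorRateAlert()
for i in range(100):
    alert.record(False, now=1000.0 + i)     # healthy traffic
for _ in range(3):
    fired = alert.record(True, now=1100.0)  # burst of errors
print(fired)  # 3 errors / 103 calls = ~2.9%, above the 2% threshold
```

In production the `True` return would feed whatever pages on-call, and the same structure works for the validation-failure threshold with `threshold=0.05, window_seconds=600`.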

Drift Detection: The Alert You Don't Have But Need

The most dangerous AI failure mode is silent drift — the model starts returning subtly worse outputs without triggering any infrastructure alert. Error rates stay low. Latency is normal. But quality is degrading.

Three practical drift detection approaches:

1. Statistical comparison of logged per-call metrics (confidence scores, token counts, validation-failure rates) against a baseline window.
2. Embedding-space monitoring: track the distance between recent input/output embeddings and a reference distribution.
3. Periodic re-evaluation against a fixed golden dataset, scored automatically or by human review.
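A statistical comparison of recent metrics against a baseline can be implemented cheaply on fields you already log. The function below is a minimal, stdlib-only sketch of the Population Stability Index applied to a logged per-call metric such as `confidence_score`; the bucket count and the 0.1 / 0.25 interpretation bands are common rules of thumb, not PSF-mandated values.

```python
import math

def psi(expected, actual, buckets=10):
    """Population Stability Index between a baseline sample and a recent
    sample of one logged metric. Rough interpretation: < 0.1 stable,
    0.1-0.25 worth investigating, > 0.25 significant drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / buckets or 1.0  # guard against a constant baseline

    def histogram(sample):
        counts = [0] * buckets
        for x in sample:
            i = min(int((x - lo) / width), buckets - 1)
            counts[max(i, 0)] += 1  # clamp values outside the baseline range
        # Smooth empty buckets so the log term stays finite.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [0.90, 0.91, 0.92, 0.89, 0.93, 0.88, 0.94, 0.90, 0.91, 0.92]
recent = [0.71, 0.74, 0.70, 0.73, 0.72, 0.69, 0.75, 0.70, 0.72, 0.71]
print(psi(baseline, baseline))  # identical samples: 0.0
print(psi(baseline, recent))    # clear downward shift: well above 0.25
```

Run on a schedule (say, each day's scores against a trailing 7-day baseline), this catches the silent-drift case above: no errors, normal latency, but a confidence distribution that has quietly moved.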

Observability Tool Selection

Langfuse (open source / cloud)
Strengths: full trace capture, prompt versioning, dataset management, cost tracking per model
Limitations: self-hosted complexity; no native alerting
PSF: Strong D4, Partial D3 (PHI in traces)

LangSmith (LangChain ecosystem)
Strengths: native LangChain integration, run tree visualisation, dataset/eval support, annotation queues
Limitations: LangChain-centric; expensive at scale
PSF: Strong D4; configure trace retention

Arize Phoenix (open source)
Strengths: embedding drift detection, RAG retrieval tracing, OpenTelemetry native, hallucination detection
Limitations: less mature UI; community support only
PSF: Strong D4 + D2 (drift/quality)

Helicone (cloud proxy)
Strengths: zero-code integration, cost analytics, rate limiting, caching
Limitations: proxy adds latency; limited trace depth
PSF: Good for D4 cost/volume metrics

Related guides

PSF D4: Observability canonical guide
Observability tools comparison (Langfuse vs Arize vs Helicone)
Vector DB comparison — D4 observability scoring
PSF-compliant stack recipes
D2 Output Validation deep dive