Published: 2026-04-29 · License: CC BY 4.0
Domain: PSF-4 — Observability & Monitoring
Observability & Monitoring
Observability is the property of a system that allows you to understand its internal state from its external outputs. For AI systems in production, observability is not optional — it is the mechanism by which you know whether the system is working, degrading, drifting, or failing. Without it, you are operating blind.
What AI Observability Must Cover
Every model call should be logged: input (or a hash of it), output (or a sample), model version, latency, token counts, cost, and any routing or filtering decisions. This is the raw material for all other observability.
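The fields above can be sketched as a structured log record. This is a minimal illustration, not a mandated schema: the class and field names (`InferenceLogRecord`, `route`, `cost_usd`, etc.) are assumptions for the example.

```python
import hashlib
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class InferenceLogRecord:
    """One record per model call; field names are illustrative."""
    request_id: str
    model_version: str      # exact pinned version, not a floating alias
    input_hash: str         # hash when the full input cannot be stored
    output_sample: str      # truncated output for quick inspection
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    route: str              # routing/filtering decision, if any
    timestamp: float

def make_record(prompt: str, output: str, model_version: str,
                latency_ms: float, prompt_tokens: int,
                completion_tokens: int, cost_usd: float,
                route: str = "default") -> str:
    """Serialise one inference call as a structured JSON log line."""
    rec = InferenceLogRecord(
        request_id=hashlib.sha256(f"{time.time()}{prompt}".encode()).hexdigest()[:16],
        model_version=model_version,
        input_hash=hashlib.sha256(prompt.encode()).hexdigest(),
        output_sample=output[:200],
        latency_ms=latency_ms,
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        cost_usd=cost_usd,
        route=route,
        timestamp=time.time(),
    )
    return json.dumps(asdict(rec))
```

Emitting one JSON line per call keeps the raw material queryable by any downstream log pipeline.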
Define a quality metric for your system and score every output (or a statistically valid sample of outputs) against it. Quality scoring can be automated (LLM-as-judge, rule-based), human-reviewed, or a combination. The score must trend visibly over time.
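A toy rule-based scorer and a per-day aggregation show the shape of the pipeline. This is a sketch under stated assumptions: the scoring rules, term lists, and function names are illustrative, and a real system would combine rules with an LLM-as-judge or human review.

```python
from statistics import mean

def rule_based_score(output: str, required_terms: list[str],
                     banned_terms: list[str]) -> float:
    """Toy quality score in [0, 1]: rewards required terms,
    penalises banned ones."""
    text = output.lower()
    hits = sum(t.lower() in text for t in required_terms)
    misses = sum(t.lower() in text for t in banned_terms)
    base = hits / len(required_terms) if required_terms else 1.0
    return max(0.0, base - 0.5 * misses)

def daily_trend(scores_by_day: dict[str, list[float]]) -> dict[str, float]:
    """Aggregate per-day means so the score can trend visibly."""
    return {day: round(mean(s), 3) for day, s in scores_by_day.items()}
```

The important property is not the scorer itself but that its output is aggregated and plotted continuously, so a decline is visible before users report it.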
Monitor the statistical properties of inputs and outputs over time. Population stability index (PSI), KL divergence, or simpler distribution statistics can detect when the data your system sees is shifting away from what it was designed for.
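A minimal PSI implementation for a one-dimensional feature, assuming equal-width bins over the baseline's range (binning strategy and the epsilon for empty bins are implementation choices, not part of the metric's definition):

```python
import math

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a baseline sample (expected)
    and a live sample (actual). PSI > 0.2 is a common threshold for
    requiring investigation."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def frac(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        n = len(sample)
        # small epsilon avoids log(0) for empty bins
        return [max(c / n, 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

An unchanged distribution scores near zero; a shifted one climbs quickly past the 0.2 investigation threshold.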
P50/P95/P99 latency, tokens per request, cost per request, and error rates should be visible in real time. Cost spikes often precede quality degradation — they are a leading indicator, not just an operations metric.
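The tail-latency percentiles above can be computed with a simple nearest-rank method. This is one common definition of percentile among several; production systems typically use streaming sketches (e.g. t-digest) rather than sorting raw samples.

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in [0, 100]) over a latency sample."""
    ranked = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

def latency_summary(samples: list[float]) -> dict[str, float]:
    """The three tail percentiles PSF-4 expects to be visible."""
    return {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
```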
Monitoring without alerting is archaeology. Define threshold-based alerts for: error rate, latency P99, quality score drop, drift PSI breach, and cost spike. Route alerts to named on-call owners, not generic inboxes.
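The routing requirement can be sketched as a threshold evaluator that pairs each metric with a named owner. The metric names, thresholds, and owner labels below are illustrative assumptions, not PSF-4-mandated values.

```python
def evaluate_alerts(metrics: dict[str, float],
                    thresholds: dict[str, float],
                    owners: dict[str, str]) -> list[str]:
    """Return one alert per breached threshold, routed to a named
    on-call owner rather than a generic inbox."""
    alerts = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} > {limit} -> page {owners[name]}")
    return alerts
```

The design point is that every threshold has an owner at configuration time; an alert that cannot be routed should fail loudly, not silently.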
Every inference log must include the model version (including provider-side model versions, which can change silently for hosted models). Unexplained quality changes are almost always correlated with model version changes.
The Silent Degradation Problem
The most dangerous AI production failure mode is silent degradation — the system continues to respond, error rates remain low, but output quality is declining. This is invisible without active quality monitoring. A system that returns plausible-sounding but increasingly incorrect outputs will not trigger any infrastructure alert. Only a quality scoring pipeline — running continuously and trending over time — will catch it. This is why PSF-4 mandates quality scoring, not just infrastructure health monitoring.
Logging Architecture for AI Systems
AI inference logs have different characteristics from application logs. They are larger (full prompt and response text), more sensitive (may contain PII, confidential data, or proprietary system prompts), and more valuable for retrospective analysis. Log architecture for AI systems must address:
- structured format (JSON, not free text)
- PII handling (redaction at the logging layer, not just at the application layer)
- retention policy (long enough for model comparison, short enough to meet data protection obligations)
- access controls (inference logs should not be universally readable)
- searchability (you need to find the exact prompt and response for any given incident)
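Redaction at the logging layer can be sketched as a filter applied to every record before it is written. The patterns below cover only emails and phone numbers and are an illustration; real deployments need far broader coverage (names, addresses, account IDs) and should treat pattern lists as a living artefact.

```python
import re

# Deliberately narrow patterns for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b")

def redact(text: str) -> str:
    """Redact obvious PII patterns before the log record is written,
    so unredacted text never reaches storage."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```

Because the filter runs inside the logging layer, an application bug that forgets to sanitise its own output does not leak PII into long-retention storage.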
PSF-4 Compliance Checklist
Provider-Side Model Changes
A specific PSF-4 risk that catches many teams off-guard: hosted model providers (OpenAI, Anthropic, Google, etc.) update their models without always providing prominent advance notice. A model logged as GPT-4-turbo-2024-04-09 is a specific version; GPT-4-turbo without a pinned version means you may be running a different model today than you were last month. PSF-4 requires version pinning where the API supports it, and quality monitoring rigorous enough to detect silent version changes where it does not.
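One cheap guardrail is to lint configured model names for a dated version suffix before deployment. The heuristic below assumes the `name-YYYY-MM-DD` convention used in the example above; providers use other pinning schemes, so this check would need adapting per API.

```python
import re

# Assumes dated suffixes like "gpt-4-turbo-2024-04-09".
PINNED = re.compile(r".+-\d{4}-\d{2}-\d{2}$")

def is_pinned(model_name: str) -> bool:
    """Heuristic: treat a model name as pinned only if it carries a
    dated version suffix; bare aliases can change provider-side
    without notice."""
    return bool(PINNED.match(model_name))
```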
AIDA Exam Tips for PSF-4
- PSF-4 questions test whether you know the difference between infrastructure monitoring (latency, errors) and AI-specific monitoring (quality scoring, drift). Infrastructure monitoring alone is not PSF-4 compliance.
- Silent degradation scenario: a system looks healthy on all infrastructure metrics but users are complaining about quality. The PSF-4 answer is a quality scoring pipeline — not more infrastructure alerts.
- Drift detection questions: know that PSI (Population Stability Index) is the standard metric for detecting input distribution shift. A PSI > 0.2 is a common threshold for requiring investigation.
- Version tracking: if a scenario describes unexplained quality changes after a model provider update, the PSF-4 failure is lack of model version pinning + no version tracking in inference logs.
- Cost spike questions: in PSF-4 context, a sudden cost spike is a signal to investigate quality, not just to optimise spend. It often indicates prompt injection (longer, adversarial prompts) or model routing failure.