PSF Domain 5: Deployment Safety
Every major framework gets a Gap or Partial rating on D5. Deployment safety is treated as an infrastructure concern outside the framework's scope — which means it ends up nobody's responsibility. This guide covers the model versioning, canary patterns, and rollback procedures that make production AI recoverable.
The D5 gap: no major agent framework includes deployment controls. Model version pinning, canary traffic splitting, automatic rollback, and deployment audit trails are all operator responsibilities with zero framework support. Most teams discover this gap after their first silent production regression.
Why Deployment Safety Is Different for AI
Software deployment safety is a solved problem. Blue-green deployments, canary releases, feature flags, and rollback procedures are standard practice. AI deployment is not a solved problem — and the differences are not cosmetic.
In conventional software, a bad deployment produces errors. Errors are observable, countable, and trigger alerts. In AI deployment, a bad model update may produce outputs that are subtly wrong — outputs that pass schema validation, do not throw exceptions, and are not obviously incorrect without domain expertise. The regression may not surface in error rates at all. It surfaces in business outcomes, user complaints, or downstream process failures, weeks later.
Three additional AI-specific deployment risks have no conventional software equivalent: silent upstream model updates (the provider patches their model without notification), prompt drift (prompts are changed outside of deployment pipelines), and evaluation-production mismatch (the eval dataset no longer represents production inputs).
The Minimum Viable Deployment Pipeline
Every production AI system should have five deployment controls. These are not aspirational — they are the minimum required to meet D5.
1. Model Version Registry
Every model that runs in production must be tracked with: model name and version/snapshot ID, training data version or cutoff, evaluation results and evaluator version, authorisation (who approved this for production and when), deployment date, and deprecation date if known. A git repository with a YAML model card per model is sufficient. An untracked model is a compliance gap.
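A minimal sketch of one such YAML model card (field names, file path, and people are illustrative, not a standard):

```yaml
# models/support-triage-classifier.yaml -- one card per production model
model_name: gpt-4o
model_version: gpt-4o-2024-11-20        # pinned snapshot ID
training_data_cutoff: 2023-10           # per provider documentation
evaluation:
  suite: triage-eval-v4                 # hypothetical suite name
  dataset_version: prod-sample-2025-01
  evaluator_version: 1.3.0
  result: pass
authorised_by: j.doe (ML platform lead)
authorised_date: 2025-02-03
deployed_date: 2025-02-05
deprecation_date: null                  # unknown at deployment time
```

One card per model, reviewed like any other code change, gives you the audit trail for free.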
2. Pinned Model Versions
Never reference a model by alias in production (e.g., gpt-4o-mini or claude-3-sonnet). Aliases resolve to different model versions over time. Always pin to the specific snapshot ID where the API supports it, and include a monitoring process for alias resolution changes where it does not.
Model pinning by provider:
- OpenAI: use dated versions (gpt-4o-2024-11-20, not gpt-4o). Set up evals that run on model change notifications.
- Anthropic: claude-3-5-sonnet-20241022, not claude-3-5-sonnet. Monitor release notes for version changes.
- Azure OpenAI: deploy to a named deployment with an explicit model version — the deployment name is stable even when the underlying model updates.
- Vertex AI / AWS Bedrock: use explicit model version ARNs/paths, not the latest alias.
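For the monitoring half, most chat APIs echo the served model identifier in the response metadata (e.g. the `model` field on an OpenAI-style completion response); comparing it against the pinned snapshot catches silent alias drift. A minimal sketch, assuming your client surfaces that field:

```python
PINNED_MODEL = "gpt-4o-2024-11-20"  # the snapshot your evals were run against

def check_model_drift(served_model: str, pinned: str = PINNED_MODEL) -> None:
    """Fail loudly if the provider served a different snapshot than we pinned.

    `served_model` comes from the API response metadata, e.g. the `model`
    field on an OpenAI-style chat completion response.
    """
    if served_model != pinned:
        raise RuntimeError(
            f"Model drift: pinned {pinned!r} but provider served {served_model!r}. "
            "Re-run the evaluation suite before accepting this version."
        )
```

Wire this into the response path so drift is caught on the first affected request, not at the next scheduled eval.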
3. Canary Deployment
A canary deployment routes a small percentage of production traffic to the new model version while the remainder continues on the current version. Metrics are compared between the two cohorts. The canary is only promoted to full traffic when the comparison is satisfactory.
The practical implementation depends on your architecture. For API-based models, a load balancer with weighted routing handles traffic splitting. For embedded models, feature flag systems (LaunchDarkly, Unleash, etc.) can gate the new model version. For batch processing, a subset of batches can be routed to the new version with comparison metrics.
Key canary metrics to compare: output validation failure rate, latency p50/p95/p99, cost per request, human escalation rate (proxy for model confidence), and downstream error rate from systems consuming AI output. Run the canary for a minimum of 24 hours on production traffic before promotion — eval datasets do not capture the full distribution of real inputs.
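The traffic split itself can be deterministic bucketing on a stable request key, so a given user or session stays in the same cohort for the life of the canary. A minimal sketch (the 5% default and the choice of key are assumptions):

```python
import hashlib

def canary_cohort(stable_key: str, canary_fraction: float = 0.05) -> str:
    """Assign a request to 'canary' or 'control' deterministically.

    Hashing a stable key (user ID, session ID) instead of random sampling
    keeps each caller in one cohort, which keeps cohort metrics comparable.
    """
    bucket = int(hashlib.sha256(stable_key.encode()).hexdigest(), 16) % 10_000
    return "canary" if bucket < canary_fraction * 10_000 else "control"
```

Promotion then means raising `canary_fraction` in stages (for example 5% to 25% to 100%) while the comparison metrics stay within thresholds.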
4. Rollback Procedure
Rollback must be a known, tested procedure with a documented time-to-rollback SLA. "We can roll back" is not a rollback procedure. The procedure must specify: who has authority to initiate rollback, what metrics trigger automatic rollback vs. manual decision, how traffic is shifted back to the previous version, and how to verify the rollback was successful.
| Metric | Rollback Threshold | Action |
|---|---|---|
| Error rate | > 5% above baseline for 10 min | Automatic rollback |
| Validation failure rate | > 3x baseline for 5 min | Automatic rollback |
| Latency p99 | > 2x baseline sustained 15 min | Alert + manual review trigger |
| Cost per request | > 150% baseline for 30 min | Alert + manual review trigger |
| Human escalation rate | > 2x baseline for 1 hr | Alert — indicates confidence degradation |
| Downstream system errors | Any schema mismatch errors | Automatic rollback |
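The table above can be encoded directly so rollback decisions are made by code, not by someone reading a dashboard mid-incident. A minimal sketch of the decision logic (thresholds copied from the table; the metric dict shape is an assumption, and "5% above baseline" is interpreted as five percentage points):

```python
def rollback_decision(current: dict, baseline: dict) -> str:
    """Return 'rollback', 'alert', or 'ok' from cohort metrics.

    Durations ('sustained 10 min' etc.) are assumed to be handled upstream:
    callers pass metrics already aggregated over the required window.
    """
    # Automatic rollback tier
    if current["error_rate"] > baseline["error_rate"] + 0.05:
        return "rollback"
    if current["validation_failure_rate"] > 3 * baseline["validation_failure_rate"]:
        return "rollback"
    if current["downstream_schema_errors"] > 0:
        return "rollback"
    # Alert + manual review tier
    if current["latency_p99"] > 2 * baseline["latency_p99"]:
        return "alert"
    if current["cost_per_request"] > 1.5 * baseline["cost_per_request"]:
        return "alert"
    if current["human_escalation_rate"] > 2 * baseline["human_escalation_rate"]:
        return "alert"
    return "ok"
```

Running this on every metrics window for the canary cohort makes the rollback criteria testable ahead of an incident.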
5. Prompt Version Control
Prompts are as consequential as model changes. A prompt that re-phrases the task description, adds new constraints, or changes the output format can produce a regression as significant as swapping model versions. Prompts must be version-controlled, peer-reviewed, and deployed through the same pipeline as model changes — including canary testing.
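Treating prompts as code also means every request can record exactly which prompt version produced it. A minimal sketch, content-addressing each prompt with a hash and logging the fingerprint alongside the model version (the record shape and prompt text are illustrative):

```python
import hashlib

def prompt_fingerprint(prompt_text: str) -> str:
    """Short content hash; changes whenever the prompt text changes."""
    return hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:12]

# Example: a version-controlled prompt with its recorded fingerprint.
TRIAGE_PROMPT_V3 = "Classify the support ticket into one of: billing, bug, other."
deployment_record = {
    "model_version": "gpt-4o-2024-11-20",
    "prompt_id": "triage-v3",
    "prompt_fingerprint": prompt_fingerprint(TRIAGE_PROMPT_V3),
}
```

With the fingerprint in every request log, an unreviewed prompt change is detectable the moment it reaches production.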
Deployment Anti-Patterns
These are the deployment failures that show up most frequently in production AI post-mortems.
Silent upstream model update. The provider updates their hosted model (e.g. gpt-4o-mini gets patched) with no notification. Your evaluations passed on the old version; production silently runs the new one.
Fix: Pin to specific model versions by ID, not aliases. Monitor for model version changes in response metadata.
Prompt drift. Prompts live in a database or config service without version control. A prompt change deploys instantly to 100% of traffic with no evaluation, no canary, and no rollback.
Fix: Treat prompts as code: version control, code review, staged deployment. Prompt changes must go through the same pipeline as model changes.
Evaluation-production mismatch. Evaluation uses a golden dataset from six months ago. Production inputs have drifted, so the model passes eval but fails on current production inputs.
Fix: Refresh eval datasets from recent production samples on a scheduled basis. Shadow-evaluate on live traffic, not just historical golden sets.
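One way to draw those refresh samples is reservoir sampling over recent production inputs, which yields a uniform sample from a stream of unknown length in a single pass. A sketch (sampling for later human labelling is an assumption; PII handling is out of scope here):

```python
import random

def reservoir_sample(stream, k: int, seed=None):
    """Uniformly sample k items from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Run it over, say, the last 30 days of production inputs on a schedule, label the sample, and rotate it into the eval set.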
Big-bang deployment. A new model version is deployed to all traffic simultaneously, with no ability to compare against the previous version. Problems are discovered only after full exposure.
Fix: Always canary: start at 5-10% of traffic, compare metrics to control for 24-48 hours before full promotion.
Untested rollback. The rollback procedure exists in documentation but has never been exercised. When needed in an incident, it fails or takes too long.
Fix: Treat rollback as a deployment. Test it quarterly. Include rollback time in your incident response SLAs.
Framework D5 Status
Every major framework is rated Gap or Partial on D5. This is structural — frameworks focus on agent orchestration, not deployment infrastructure.
LangChain: no native model versioning or deployment controls. Version pinning is entirely the operator's responsibility. LangSmith provides some run comparison capability but not canary traffic splitting.
No deployment primitives. Agent and tool definitions are code — versioned via git. No concept of staged rollout or traffic shaping for different agent versions.
Similar to LangChain: no deployment safety tooling. Model selection is per-agent in config but no canary or rollback mechanism provided.
Semantic Kernel: plugin versioning is supported and the kernel can be configured per model. Enterprise deployments via Azure AI Studio get some deployment management, but canary traffic splitting is still missing.
Lightweight library — no deployment tooling. Model is specified per-agent in code. Version control is entirely the team's responsibility.
Haystack: pipeline serialisation/deserialisation enables reproducible deployments, and Deepset Cloud adds some deployment management. No native canary primitives.
The Predetermined Change Control Plan
The FDA's AI/ML SaMD guidance introduced the Predetermined Change Control Plan (PCCP) — a document that specifies in advance what types of model updates are permitted without re-authorisation. The concept applies well beyond healthcare: any regulated deployment benefits from a PCCP equivalent that defines the change envelope, the evaluation requirements for each change type, and the approval path.
A minimal PCCP for production AI specifies: (1) which model versions are pre-approved for deployment and the evaluation criteria that qualify a new version, (2) which prompt changes are minor (re-wording, examples) vs. major (task scope, output format), and the evaluation requirements for each, (3) what constitutes a significant output distribution change that requires re-authorisation before deployment.
Minimum PCCP contents for production AI:
- Change taxonomy: minor (no eval required), moderate (automated eval required), major (human eval + approval required)
- Evaluation suite: which benchmarks, what pass thresholds, run against which dataset version
- Canary protocol: traffic percentage, duration, metrics compared, promotion criteria
- Rollback criteria: which metric breaches trigger rollback, time limits before decision required
- Approval authority: who can approve each change tier, documented sign-off requirement
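The change taxonomy and approval tiers above can be encoded so the deployment pipeline enforces the PCCP mechanically rather than by convention. A minimal sketch (tier names from the list above; the gate fields and approver roles are assumptions):

```python
# Required gates per change tier, following the taxonomy above.
PCCP_GATES = {
    "minor":    {"automated_eval": False, "human_eval": False, "approver": None},
    "moderate": {"automated_eval": True,  "human_eval": False, "approver": "ml-platform-lead"},
    "major":    {"automated_eval": True,  "human_eval": True,  "approver": "model-risk-owner"},
}

def required_gates(change_tier: str) -> dict:
    """Look up the gates a change must clear before deployment."""
    try:
        return PCCP_GATES[change_tier]
    except KeyError:
        raise ValueError(f"Unknown change tier: {change_tier!r}") from None
```

A CI step that calls `required_gates` and refuses to deploy until each gate reports green turns the PCCP from a document into a control.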