Production AI Institute — vendor-neutral certification for AI practitioners

Verify a credential Membership For organisations Contact

CPAP · Practitioner

Study Guide: Certified Production AI Practitioner

This guide covers all 8 PSF domains tested in the CPAP exam at practitioner depth. Each domain includes key concepts, a worked scenario, and the reasoning approach the exam expects. CPAP scenarios are more complex than AIDA — they test your ability to navigate real production trade-offs, not just recall correct principles.

Take the exam — $297 →Exam blueprint →

Exam at a glance

Questions

20 drawn from a 50-question bank

Pass mark

15 correct (75%)

Time limit

45 minutes

Retake cooldown

72 hours

Fee

$297

Credential

Digital certificate + verifiable registry listing

How CPAP scenarios differ from AIDA

AIDA questions ask: what is the right principle to apply here? CPAP questions ask: you're in a real production situation with competing pressures — what is the correct decision and why does it take priority over the alternatives?

CPAP distractors are specifically designed to look correct. They will be things a reasonable engineer might do, things that are partially correct, or things that address a real concern but miss the primary issue. The exam rewards structured reasoning and correct prioritisation, not recognition of keywords.

PSF-1

Input Governance & Prompt Security

15–20% of exam

Key concepts

Prompt injection taxonomy: direct, indirect, payload injection
Input validation layers: schema, semantic, length, rate
Adversarial input detection without false positive overload
Content filtering policy design for multi-tenant systems
System prompt confidentiality and leak prevention
Input logging strategy: privacy vs. security trade-off

Worked scenario

A B2B SaaS product embeds customer-uploaded PDFs into LLM context for question answering. During red-teaming, you discover that a specially crafted PDF can override system instructions. The product team wants to strip all text over 500 tokens from uploaded documents.

Expert reasoning

Token-length stripping is a blunt mitigation that doesn't address the root cause and will break legitimate use cases. The correct approach is: (1) treat all document content as untrusted user input — never inject it into privileged instruction context; (2) use a retrieval architecture where document content is always in user/context position, never system position; (3) add semantic anomaly detection for instruction-like patterns in uploaded content. Stripping long content is a secondary defence, not a primary one.

PSF-2

Output Validation & Reliability

15–20% of exam

Key concepts

Structured output enforcement: schemas, constrained decoding, retry loops
Hallucination detection: grounding checks, citation verification, factuality scores
Confidence thresholds and abstention strategies
Output length and format validation before downstream consumption
Multi-model validation for high-consequence decisions
Graceful degradation when validation fails

Worked scenario

Your AI system generates contract clause summaries that legal teams review. An audit finds that 3% of summaries contain subtle inaccuracies that passed human review because reviewers trusted the AI. Leadership wants to set a higher model temperature to get 'more natural' summaries.

Expert reasoning

This is a reliability problem, not a style problem — increasing temperature makes it worse. The audit finding suggests over-trust in AI output, not a quality gap in the writing style. Correct remediation: (1) add a grounding check that verifies each summary claim against the source clause text; (2) implement a confidence-based flagging system that highlights low-certainty summaries for closer human review; (3) brief the legal team on the specific failure mode found so they know what to look for. Temperature is a content quality lever, not a reliability lever.

PSF-3

Human Oversight & Autonomy Design

10–15% of exam

Key concepts

Autonomy level selection: classification framework (L0–L4)
Review trigger design: consequence-based, confidence-based, novelty-based
Override mechanism implementation and audit logging
HITL throughput planning: don't create review bottlenecks that bypass themselves
Staged automation: how to move from L1 to L3 safely
Disclosing AI nature to affected users (when and how)

Worked scenario

Your customer service AI handles 10,000 queries per day and escalates 8% to human agents. The head of operations proposes reducing escalations to 2% by raising the AI's confidence threshold. The head of compliance says this will reduce human oversight. Who is right?

Expert reasoning

Both are partially right, but the compliance concern is more important to address properly. Raising the confidence threshold doesn't reduce oversight — it changes the selection of what gets reviewed. If the 2% that gets escalated is the genuinely ambiguous 2%, that's arguably better oversight than reviewing a random 8%. The key questions are: (1) is confidence score actually a good proxy for cases that need human review, or are there systematic failure modes the score doesn't capture? (2) what happens to the escalation cases that are now auto-resolved — does the system have a feedback loop to catch errors? Oversight design should be consequence-driven and outcome-tested, not percentage-driven.

PSF-4

Observability & Monitoring

10–15% of exam

Key concepts

The four AI observability layers: request, model, outcome, system
Latency SLO definition for AI endpoints (p50/p95/p99)
Quality drift detection: embedding drift, output distribution shift
Alert design: what requires PagerDuty, what goes to a dashboard
Sampling strategy for high-volume AI logs
Feedback loop instrumentation: capturing ground truth for model quality

Worked scenario

Three months after deploying a document classification AI, accuracy has dropped from 94% to 87% with no model changes. Your monitoring setup tracks latency, error rate, and API costs. What was missing from your observability setup and how would you have caught this earlier?

Expert reasoning

The missing layer is output quality monitoring. Tracking infrastructure metrics (latency, errors, cost) tells you nothing about model quality degradation. The correct approach requires: (1) a ground truth feedback loop — a sample of classifications verified by humans each week, creating a quality time series; (2) input distribution monitoring — tracking embedding drift or feature statistics on incoming documents to detect when the document population has shifted from training data; (3) an output distribution monitor — tracking the proportion of each class over time to flag unexpected shifts. The 87% accuracy was invisible for 3 months because none of these were in place.

PSF-5

Data Governance & Privacy

10–15% of exam

Key concepts

PII detection and redaction in AI pipelines
Training data lineage and consent documentation
Data minimisation: what AI systems should not retain
Cross-border data transfer constraints (GDPR, SCCs, adequacy decisions)
Right to erasure: implications for models trained on personal data
Data retention policy for AI inference logs

Worked scenario

Your RAG system indexes internal HR documents including employee performance reviews. An employee requests deletion of their data under GDPR Article 17. Legal confirms the request is valid. What are the full scope of actions required?

Expert reasoning

GDPR erasure for RAG systems is substantially more complex than database deletion. Required actions: (1) delete the source document from the document store; (2) delete all associated chunks from the vector database — this requires knowing which vector embeddings map to that document; (3) invalidate any cached responses that may have included that employee's data; (4) if any fine-tuned models were trained on that data, assess whether the personal data is 'memorised' — if so, retraining or model deletion may be required; (5) document the erasure with a completion certificate. The vector store step is the one most teams miss — embeddings are derived data containing personal information and must be deleted.

PSF-6

Incident Response

10% of exam

Key concepts

AI incident taxonomy: quality failures, safety failures, capability failures, availability failures
Severity classification for AI-specific incidents
Containment options: rate limit, rollback, kill switch, human takeover
Post-mortem structure for AI incidents (distinct from software post-mortems)
Communication strategy: users, regulators, leadership
Preventing recurrence: feedback into model and governance, not just ops

Worked scenario

Your AI-powered loan pre-approval system has been running for 6 months. An internal audit discovers that it has been approving applications at a significantly higher rate for one demographic group, with no business justification. How do you respond?

Expert reasoning

This is a Severity 1 AI incident — it's a potential discriminatory outcome with regulatory and legal exposure. Immediate actions: (1) suspend automated pre-approvals within the hour — do not wait for root cause analysis; (2) notify legal and compliance leadership immediately; (3) pull the full 6-month decision log for forensic analysis — you need the demographic breakdown, approval rates, and the model features driving decisions; (4) do not delete or modify any data — this is potential evidence. Investigation: determine whether the disparity stems from training data, feature selection, or a proxy variable (postcode, loan size) correlating with demographic. Resolution: decisions made under the biased model may need to be reviewed and potentially reversed. Regulatory notification may be required under applicable law.

PSF-7

Model & Vendor Risk

10% of exam

Key concepts

Third-party model dependency mapping
Capability change risk: what happens when the model provider updates silently
Version pinning strategy for LLM APIs
Vendor exit planning: what would you need to migrate to an alternative?
Supply chain integrity: fine-tuning data, adapters, embeddings
SLA and uptime contractual requirements for production AI dependencies

Worked scenario

Your production system uses GPT-4 via the OpenAI API. OpenAI announces that GPT-4 will be deprecated in 90 days and recommends migrating to GPT-4o. Your system has 18 months of prompt engineering optimised specifically for GPT-4's output patterns. What is the correct production approach to this migration?

Expert reasoning

This is a vendor-driven forced migration — the 90-day timeline is aggressive for a production system with significant prompt investment. Correct approach: (1) do not migrate directly to production — run GPT-4o in parallel with GPT-4 across a sample of production traffic immediately; (2) build a regression test suite from your existing prompt evaluations — every known edge case and critical output pattern must be tested against GPT-4o; (3) identify which prompts are most sensitive to model behaviour changes (structured output prompts, few-shot examples, chain-of-thought chains) and prioritise those for re-optimisation; (4) plan for a phased rollout — not a cutover — with fallback capability. The 90-day window should be used for testing, not building. The system should be ready to traffic-shift at day 60 with 30 days of runway for issues.

PSF-8

Regulatory & Ethics Compliance

5–10% of exam

Key concepts

EU AI Act risk classification for your specific system
NIST AI RMF governance, map, measure, manage functions
Bias assessment methodologies (demographic parity, equalised odds)
Explainability requirements: when are they legally mandated?
Accountability chain documentation: who is responsible for AI decisions?
AI system registration and transparency obligations

Worked scenario

Your company is deploying an AI-assisted CV screening tool across EU member states. Legal has said this is 'just a productivity tool' and not regulated. You are the AI deployment lead. What is your response?

Expert reasoning

This assessment is likely incorrect. Under the EU AI Act, AI systems used in employment, worker management, and access to employment are classified as high-risk (Annex III, point 4). 'CV screening' is explicitly an employment access decision. High-risk classification triggers significant obligations: (1) conformity assessment before deployment; (2) registration in the EU AI Act database; (3) technical documentation and logging requirements; (4) human oversight requirement for decisions affecting individuals; (5) transparency obligation to candidates that AI was involved. The 'just a productivity tool' framing does not change the legal classification — the determining factor is the decision domain, not the degree of automation. You should escalate this to legal with the specific Annex III reference and request a formal classification review before deployment.

Ready to sit the exam?

20 scenario questions · 45 minutes · Immediate result · $297

Take the CPAP exam →