This guide covers all 8 PSF domains tested in the CPAP exam at practitioner depth. Each domain includes key concepts, a worked scenario, and the reasoning approach the exam expects. CPAP scenarios are more complex than AIDA — they test your ability to navigate real production trade-offs, not just recall correct principles.
AIDA questions ask: what is the right principle to apply here? CPAP questions ask: you're in a real production situation with competing pressures — what is the correct decision and why does it take priority over the alternatives?
CPAP distractors are specifically designed to look correct. They will be things a reasonable engineer might do, things that are partially correct, or things that address a real concern but miss the primary issue. The exam rewards structured reasoning and correct prioritisation, not recognition of keywords.
A B2B SaaS product embeds customer-uploaded PDFs into LLM context for question answering. During red-teaming, you discover that a specially crafted PDF can override system instructions. The product team wants to strip all text over 500 tokens from uploaded documents.
Token-length stripping is a blunt mitigation that doesn't address the root cause and will break legitimate use cases. The correct approach is: (1) treat all document content as untrusted user input — never inject it into privileged instruction context; (2) use a retrieval architecture where document content is always in user/context position, never system position; (3) add semantic anomaly detection for instruction-like patterns in uploaded content. Stripping long content is a secondary defence, not a primary one.
Your AI system generates contract clause summaries that legal teams review. An audit finds that 3% of summaries contain subtle inaccuracies that passed human review because reviewers trusted the AI. Leadership wants to set a higher model temperature to get 'more natural' summaries.
This is a reliability problem, not a style problem — increasing temperature makes it worse. The audit finding suggests over-trust in AI output, not a quality gap in the writing style. Correct remediation: (1) add a grounding check that verifies each summary claim against the source clause text; (2) implement a confidence-based flagging system that highlights low-certainty summaries for closer human review; (3) brief the legal team on the specific failure mode found so they know what to look for. Temperature is a content quality lever, not a reliability lever.
Your customer service AI handles 10,000 queries per day and escalates 8% to human agents. The head of operations proposes reducing escalations to 2% by raising the AI's confidence threshold. The head of compliance says this will reduce human oversight. Who is right?
Both are partially right, but the compliance concern is more important to address properly. Raising the confidence threshold doesn't reduce oversight — it changes the selection of what gets reviewed. If the 2% that gets escalated is the genuinely ambiguous 2%, that's arguably better oversight than reviewing a random 8%. The key questions are: (1) is confidence score actually a good proxy for cases that need human review, or are there systematic failure modes the score doesn't capture? (2) what happens to the escalation cases that are now auto-resolved — does the system have a feedback loop to catch errors? Oversight design should be consequence-driven and outcome-tested, not percentage-driven.
Three months after deploying a document classification AI, accuracy has dropped from 94% to 87% with no model changes. Your monitoring setup tracks latency, error rate, and API costs. What was missing from your observability setup and how would you have caught this earlier?
The missing layer is output quality monitoring. Tracking infrastructure metrics (latency, errors, cost) tells you nothing about model quality degradation. The correct approach requires: (1) a ground truth feedback loop — a sample of classifications verified by humans each week, creating a quality time series; (2) input distribution monitoring — tracking embedding drift or feature statistics on incoming documents to detect when the document population has shifted from training data; (3) an output distribution monitor — tracking the proportion of each class over time to flag unexpected shifts. The 87% accuracy was invisible for 3 months because none of these were in place.
Your RAG system indexes internal HR documents including employee performance reviews. An employee requests deletion of their data under GDPR Article 17. Legal confirms the request is valid. What are the full scope of actions required?
GDPR erasure for RAG systems is substantially more complex than database deletion. Required actions: (1) delete the source document from the document store; (2) delete all associated chunks from the vector database — this requires knowing which vector embeddings map to that document; (3) invalidate any cached responses that may have included that employee's data; (4) if any fine-tuned models were trained on that data, assess whether the personal data is 'memorised' — if so, retraining or model deletion may be required; (5) document the erasure with a completion certificate. The vector store step is the one most teams miss — embeddings are derived data containing personal information and must be deleted.
Your AI-powered loan pre-approval system has been running for 6 months. An internal audit discovers that it has been approving applications at a significantly higher rate for one demographic group, with no business justification. How do you respond?
This is a Severity 1 AI incident — it's a potential discriminatory outcome with regulatory and legal exposure. Immediate actions: (1) suspend automated pre-approvals within the hour — do not wait for root cause analysis; (2) notify legal and compliance leadership immediately; (3) pull the full 6-month decision log for forensic analysis — you need the demographic breakdown, approval rates, and the model features driving decisions; (4) do not delete or modify any data — this is potential evidence. Investigation: determine whether the disparity stems from training data, feature selection, or a proxy variable (postcode, loan size) correlating with demographic. Resolution: decisions made under the biased model may need to be reviewed and potentially reversed. Regulatory notification may be required under applicable law.
Your production system uses GPT-4 via the OpenAI API. OpenAI announces that GPT-4 will be deprecated in 90 days and recommends migrating to GPT-4o. Your system has 18 months of prompt engineering optimised specifically for GPT-4's output patterns. What is the correct production approach to this migration?
This is a vendor-driven forced migration — the 90-day timeline is aggressive for a production system with significant prompt investment. Correct approach: (1) do not migrate directly to production — run GPT-4o in parallel with GPT-4 across a sample of production traffic immediately; (2) build a regression test suite from your existing prompt evaluations — every known edge case and critical output pattern must be tested against GPT-4o; (3) identify which prompts are most sensitive to model behaviour changes (structured output prompts, few-shot examples, chain-of-thought chains) and prioritise those for re-optimisation; (4) plan for a phased rollout — not a cutover — with fallback capability. The 90-day window should be used for testing, not building. The system should be ready to traffic-shift at day 60 with 30 days of runway for issues.
Your company is deploying an AI-assisted CV screening tool across EU member states. Legal has said this is 'just a productivity tool' and not regulated. You are the AI deployment lead. What is your response?
This assessment is likely incorrect. Under the EU AI Act, AI systems used in employment, worker management, and access to employment are classified as high-risk (Annex III, point 4). 'CV screening' is explicitly an employment access decision. High-risk classification triggers significant obligations: (1) conformity assessment before deployment; (2) registration in the EU AI Act database; (3) technical documentation and logging requirements; (4) human oversight requirement for decisions affecting individuals; (5) transparency obligation to candidates that AI was involved. The 'just a productivity tool' framing does not change the legal classification — the determining factor is the decision domain, not the degree of automation. You should escalate this to legal with the specific Annex III reference and request a formal classification review before deployment.
20 scenario questions · 45 minutes · Immediate result · $297
Take the CPAP exam →