Production AI Institute — vendor-neutral certification for AI practitioners
PSF Deep Dive · Domain 3 · April 2026

PSF Domain 3: Data Protection

Domain 3 (D3) is a Gap for every major agent framework — without exception. This is not a coincidence. It reflects a fundamental architectural truth: data protection cannot be built into a general-purpose orchestration library. It must be built into the application. This guide explains why, maps the threat surface, and documents the implementation path.

Read time: 17 min · PSF version: v1.1 · Licence: CC BY 4.0 (citable)

Why every framework gets a D3 Gap

In our assessment series covering eleven agent frameworks and tool integration layers, every single one received a Gap rating for PSF Domain 3. LangChain, CrewAI, AutoGen, Semantic Kernel, Haystack, DSPy, Pydantic AI, Flowise, LangFlow, Composio, the Cursor SDK — Gap, every time. This warrants explanation.

Data protection requires knowing what data you have, where it came from, who consented to what processing, where it can be stored, and how to delete it when required. These properties are domain-specific. A general-purpose agent framework cannot know whether the strings passing through its pipeline are medical records, financial transactions, children's data, or marketing copy. Therefore, no general-purpose framework can implement data protection — by definition.

This does not mean frameworks are deficient. It means data protection is the application layer's responsibility, always, without exception. The PSF Gap rating is a signal to practitioners, not a criticism of the tool.

What PSF Domain 3 requires

Data classification

Data processed by the system is classified by sensitivity level (public, internal, confidential, restricted) and type (PII, financial, health, legal).

Consent chain integrity

For any data subject, the system can demonstrate that processing is lawful — consent was obtained, a legitimate interest applies, or another legal basis exists.

Data residency controls

Data is stored and processed only in jurisdictions consistent with its origin and the applicable regulatory regime.

Minimisation and purpose limitation

The system processes the minimum data necessary for each task, and does not use data for purposes beyond those it was collected for.

Subject access and erasure

The system can respond to data subject requests — to retrieve all data held about a person, or to delete it — within the regulatory timeframe.
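The classification requirement above can be sketched as a small data model. The names here (`Sensitivity`, `DataType`, `ClassifiedRecord`, `may_leave_jurisdiction`) are illustrative assumptions, not part of the PSF itself:

```python
from dataclasses import dataclass, field
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

class DataType(Enum):
    PII = "pii"
    FINANCIAL = "financial"
    HEALTH = "health"
    LEGAL = "legal"

@dataclass
class ClassifiedRecord:
    record_id: str
    sensitivity: Sensitivity
    data_types: set[DataType] = field(default_factory=set)

    def may_leave_jurisdiction(self, allowed: set[DataType]) -> bool:
        # A record may cross a residency boundary only if every data
        # type it carries is permitted in the target jurisdiction.
        return self.data_types <= allowed
```

Tagging every record at ingestion with a structure like this is what makes the residency and minimisation requirements below enforceable rather than aspirational.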

The vector store problem

RAG systems introduce a data protection risk that has no equivalent in traditional software: PII gets embedded into vectors and stored in a vector database. Once embedded, the relationship between the original text and the vector is indirect — you cannot simply search for a person's name to find every vector that encodes their data. And when a deletion request arrives, you cannot simply delete a row.

The risks are compounded by a common RAG architecture decision: embedding entire documents rather than paragraphs. A document containing one person's name and address also contains information about others. Embedding it means all that data is now jointly represented in a single vector. Deleting it means deleting information that may be legitimately needed for other purposes.

There is no elegant solution to the vector store deletion problem under current technology. The practical approaches are:

  1. Metadata filtering — store the document ID alongside every vector; on deletion request, filter by document ID and delete all associated chunks
  2. Hard delete by chunk — maintain a mapping of document ID to chunk IDs and delete them individually from the vector store
  3. Re-indexing — for high-compliance requirements, the safest approach is to be able to re-index from scratch with the offending document removed
  4. Data minimisation at ingestion — do not embed PII-containing text unless it is necessary for the retrieval use case
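Approaches 1 and 2 both reduce to maintaining a reliable mapping from document ID to chunk-level vector IDs at ingestion time. A minimal sketch, using an in-memory stand-in — `VectorStore` and its methods are hypothetical, not any particular product's API:

```python
class VectorStore:
    """In-memory stand-in for a real vector store client."""
    def __init__(self):
        self.vectors = {}      # chunk_id -> embedding
        self.doc_chunks = {}   # document_id -> [chunk_id, ...]

    def upsert(self, document_id: str, chunk_id: str, embedding: list[float]):
        self.vectors[chunk_id] = embedding
        self.doc_chunks.setdefault(document_id, []).append(chunk_id)

    def delete_document(self, document_id: str) -> int:
        """Erase every chunk derived from a document; returns chunks removed.
        This is the operation a subject-erasure request must trigger."""
        chunk_ids = self.doc_chunks.pop(document_id, [])
        for cid in chunk_ids:
            self.vectors.pop(cid, None)
        return len(chunk_ids)
```

In production the mapping must be persisted with the index, and deletion should be verified (for example, by querying for the removed chunk IDs afterwards) rather than assumed.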

Regulatory requirements by jurisdiction

| Regulation | Key D3 obligation | What it means for AI systems |
| --- | --- | --- |
| GDPR / UK GDPR | Lawful basis, data minimisation, erasure | Every processing operation needs a lawful basis. LLM inference over personal data is processing. RAG retrieval is processing. Embeddings are processing. |
| CCPA / CPRA | Right to deletion, opt-out of sale | California residents can request deletion of personal information. If your RAG system indexed their data, you must be able to remove it. |
| HIPAA | PHI handling, BAA requirement | Using any LLM provider to process protected health information requires a Business Associate Agreement. Most providers offer BAAs; verify before deploying. |
| EU AI Act | Data governance for high-risk AI | High-risk AI systems must have documented data governance practices covering training and operational data quality, provenance, and bias management. |
| DORA (financial services) | Data integrity and audit trail | Financial services AI must maintain data integrity and a complete audit trail for AI-assisted decisions that affect financial stability or customer outcomes. |

PII detection and de-identification tooling

The first implementation question is: how do you know what PII is in your data pipeline? The answer is a PII detection layer before storage and, where possible, before model processing.

Microsoft Presidio

Open-source PII detection and anonymisation. Supports 50+ entity types, extensible with custom recognisers. Best choice for self-hosted deployments. Integrates with any framework.

AWS Comprehend

Managed PII detection via API. Covers standard entity types with no infrastructure overhead. Adds API latency and sends data to AWS — verify data residency requirements first.

spaCy NER

Named entity recognition for structured PII extraction. Fast, self-hosted, customisable. Requires training for domain-specific entity types (medical codes, financial identifiers, etc.).

Azure AI Language

PII detection via Azure. Strong integration with Semantic Kernel deployments. Same data residency considerations as other Azure services — but favourable for Azure-committed deployments.
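As an illustration of what these tools do — not a substitute for them; production systems should use Presidio, a managed service, or a trained NER model — a naive regex-based pass might look like this. The patterns and entity names are deliberately simplified assumptions:

```python
import re

# Deliberately simplified patterns; real detectors (Presidio, Comprehend)
# combine NER models, checksums, and contextual validation.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\+?\d[\d\s-]{7,}\d\b"),
}

def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs found in the input."""
    hits = []
    for entity, pattern in PATTERNS.items():
        hits.extend((entity, m.group()) for m in pattern.finditer(text))
    return hits

def redact(text: str) -> str:
    """Replace each detected span with a typed placeholder before storage."""
    for entity, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity}>", text)
    return text
```

The important architectural point is placement: `redact` (or its real equivalent) runs before text reaches the vector store or a model call, not after.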

The consent chain pattern

A consent chain is the documented record of: what data was collected, from whom, under what legal basis, for what purposes, and with what retention period. In AI systems, this extends to: what data was processed by which models, in which jurisdictions, via which providers.

The minimum viable consent chain for a production AI system includes:

  1. A data inventory — what types of personal data the system processes
  2. A processing record — per-operation log of what data was processed, when, by which model call
  3. A legal basis register — the lawful basis for each category of processing
  4. A retention schedule — how long each data type is held before deletion
  5. A deletion procedure — how subject erasure requests trigger deletion across all stores including vector databases, conversation logs, and observability traces
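The first three elements can be sketched as an append-only processing log keyed by data subject. The names (`ProcessingEvent`, `LegalBasis`, `ProcessingRecord`) are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class LegalBasis(Enum):
    CONSENT = "consent"
    CONTRACT = "contract"
    LEGITIMATE_INTEREST = "legitimate_interest"
    LEGAL_OBLIGATION = "legal_obligation"

@dataclass(frozen=True)
class ProcessingEvent:
    """One row in the processing record (element 2 above)."""
    subject_id: str
    data_category: str   # ties back to the data inventory (element 1)
    basis: LegalBasis    # ties back to the legal basis register (element 3)
    model: str           # which model call processed the data
    jurisdiction: str    # where the processing occurred
    at: datetime

class ProcessingRecord:
    def __init__(self):
        self._events: list[ProcessingEvent] = []

    def log(self, event: ProcessingEvent):
        self._events.append(event)

    def for_subject(self, subject_id: str) -> list[ProcessingEvent]:
        """Supports subject access requests: everything done with their data."""
        return [e for e in self._events if e.subject_id == subject_id]
```

A structure like this is also what makes the retention schedule (element 4) actionable: without per-event timestamps and categories, there is nothing for a retention job to act on.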

Most observability platforms (LangSmith, Langfuse, Arize) store full conversation traces including all inputs and outputs. This is necessary for D4 compliance but creates a D3 obligation: those traces contain personal data and are subject to deletion requests. Configure trace retention policies explicitly and verify that the platform supports per-trace deletion.
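Because personal data ends up in several stores at once, erasure has to be an orchestrated cascade, not a single delete. A sketch, where each store registers a hypothetical `delete_subject`-style callable (real platforms differ; the uniform interface below is an assumption):

```python
from typing import Callable

def erase_subject(subject_id: str,
                  stores: dict[str, Callable[[str], bool]]) -> dict[str, bool]:
    """Run an erasure request against every registered store and report
    per-store success, so partial failures are visible and retryable."""
    results = {}
    for name, delete_fn in stores.items():
        try:
            results[name] = delete_fn(subject_id)
        except Exception:
            results[name] = False  # never let one store abort the cascade
    return results
```

In practice the registered stores would include the vector database, the conversation log, and the observability trace store, and any `False` result would be queued for retry and audit.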

D3 pre-deployment checklist

  1. Data inventory completed — all personal data categories documented (Required)
  2. Legal basis established for each processing category (Required)
  3. PII detection is active before data is stored in any vector store (Required)
  4. LLM provider Data Processing Agreement / BAA signed and appropriate to data type (Required)
  5. Data residency of all stores (vector DB, trace store, conversation log) is documented and compliant (Required)
  6. Subject erasure procedure exists and has been tested end-to-end (Required)
  7. Trace/log retention policies are configured explicitly (not default-forever)
  8. Vector store deletion procedure tested with a real document
  9. Data minimisation reviewed — are you ingesting more data than the use case requires?
  10. Cross-border transfer mechanisms in place if provider is in a different jurisdiction

Related guides

  1. Financial Services AI Playbook
  2. PSF D1: Input Governance
  3. PSF D7: Security guide
  4. Framework PSF comparison matrix