Production AI Institute — vendor-neutral certification for AI practitioners
PSF Deep Dive · Domain 3 · April 2026

PSF Domain 3: Data Protection

Domain 3 (D3) is a Gap for every major agent framework — without exception. This is not a coincidence. It reflects a fundamental architectural truth: data protection cannot be built into a general-purpose orchestration library. It must be built into the application. This guide explains why, maps the threat surface, and documents the implementation path.

Read time: 17 min · PSF version: v1.1 · Licence: CC BY 4.0 (citable)

Why every framework gets a D3 Gap

In our assessment series covering eleven agent frameworks and tool integration layers, every single one received a Gap rating for PSF Domain 3. LangChain, CrewAI, AutoGen, Semantic Kernel, Haystack, DSPy, Pydantic AI, Flowise, LangFlow, Composio, the Cursor SDK — Gap, every time. This warrants explanation.

Data protection requires knowing what data you have, where it came from, who consented to what processing, where it can be stored, and how to delete it when required. These properties are domain-specific. A general-purpose agent framework cannot know whether the strings passing through its pipeline are medical records, financial transactions, children's data, or marketing copy. Therefore, no general-purpose framework can implement data protection — by definition.

This does not mean frameworks are deficient. It means data protection is the application layer's responsibility, always, without exception. The PSF Gap rating is a signal to practitioners, not a criticism of the tool.

What PSF Domain 3 requires

Data classification

Data processed by the system is classified by sensitivity level (public, internal, confidential, restricted) and type (PII, financial, health, legal).

Consent chain integrity

For any data subject, the system can demonstrate that processing is lawful — consent was obtained, a legitimate interest applies, or another legal basis exists.

Data residency controls

Data is stored and processed only in jurisdictions consistent with its origin and the applicable regulatory regime.

Minimisation and purpose limitation

The system processes the minimum data necessary for each task, and does not use data for purposes beyond those it was collected for.

Subject access and erasure

The system can respond to data subject requests — to retrieve all data held about a person, or to delete it — within the regulatory timeframe.
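The classification requirement above can be sketched as a small data model. The names here (`Sensitivity`, `DataType`, `ClassifiedRecord`, `may_leave_jurisdiction`) are illustrative assumptions, not part of the PSF itself:

```python
from dataclasses import dataclass, field
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

class DataType(Enum):
    PII = "pii"
    FINANCIAL = "financial"
    HEALTH = "health"
    LEGAL = "legal"

@dataclass
class ClassifiedRecord:
    record_id: str
    sensitivity: Sensitivity
    data_types: set[DataType] = field(default_factory=set)

    def may_leave_jurisdiction(self, allowed: set[DataType]) -> bool:
        # A record may cross a residency boundary only if every data
        # type it carries is permitted in the target jurisdiction.
        return self.data_types <= allowed
```

Tagging every record at ingestion with a structure like this is what makes the residency and minimisation requirements below enforceable rather than aspirational.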

The vector store problem

RAG systems introduce a data protection risk that has no equivalent in traditional software: PII gets embedded into vectors and stored in a vector database. Once embedded, the relationship between the original text and the vector is indirect — you cannot simply search for a person's name to find every vector that encodes their data. And when a deletion request arrives, you cannot simply delete a row.

The risks are compounded by a common RAG architecture decision: embedding entire documents rather than paragraphs. A document containing one person's name and address also contains information about others. Embedding it means all that data is now jointly represented in a single vector. Deleting it means deleting information that may be legitimately needed for other purposes.

There is no elegant solution to the vector store deletion problem under current technology. The practical approaches are:

  1. Metadata filtering — store the document ID alongside every vector; on deletion request, filter by document ID and delete all associated chunks
  2. Hard delete by chunk — maintain a mapping of document ID to chunk IDs and delete them individually from the vector store
  3. Re-indexing — for high-compliance requirements, the safest approach is to be able to re-index from scratch with the offending document removed
  4. Data minimisation at ingestion — do not embed PII-containing text unless it is necessary for the retrieval use case
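Approaches 1 and 2 both reduce to maintaining a reliable mapping from document ID to chunk-level vector IDs at ingestion time. A minimal sketch, using an in-memory stand-in — `VectorStore` and its methods are hypothetical, not any particular product's API:

```python
class VectorStore:
    """In-memory stand-in for a real vector store client."""
    def __init__(self):
        self.vectors = {}      # chunk_id -> embedding
        self.doc_chunks = {}   # document_id -> [chunk_id, ...]

    def upsert(self, document_id: str, chunk_id: str, embedding: list[float]):
        self.vectors[chunk_id] = embedding
        self.doc_chunks.setdefault(document_id, []).append(chunk_id)

    def delete_document(self, document_id: str) -> int:
        """Erase every chunk derived from a document; returns chunks removed.
        This is the operation a subject-erasure request must trigger."""
        chunk_ids = self.doc_chunks.pop(document_id, [])
        for cid in chunk_ids:
            self.vectors.pop(cid, None)
        return len(chunk_ids)
```

In production the mapping must be persisted with the index, and deletion should be verified (for example, by querying for the removed chunk IDs afterwards) rather than assumed.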

Regulatory requirements by jurisdiction

| Regulation | Key D3 obligation | What it means for AI systems |
| --- | --- | --- |
| GDPR / UK GDPR | Lawful basis, data minimisation, erasure | Every processing operation needs a lawful basis. LLM inference over personal data is processing. RAG retrieval is processing. Embeddings are processing. |
| CCPA / CPRA | Right to deletion, opt-out of sale | California residents can request deletion of personal information. If your RAG system indexed their data, you must be able to remove it. |
| HIPAA | PHI handling, BAA requirement | Using any LLM provider to process protected health information requires a Business Associate Agreement. Most providers offer BAAs; verify before deploying. |
| EU AI Act | Data governance for high-risk AI | High-risk AI systems must have documented data governance practices covering training and operational data quality, provenance, and bias management. |
| DORA (financial services) | Data integrity and audit trail | Financial services AI must maintain data integrity and a complete audit trail for AI-assisted decisions that affect financial stability or customer outcomes. |

PII detection and de-identification tooling

The first implementation question is: how do you know what PII is in your data pipeline? The answer is a PII detection layer before storage and, where possible, before model processing.

Microsoft Presidio

Open-source PII detection and anonymisation. Supports 50+ entity types, extensible with custom recognisers. Best choice for self-hosted deployments. Integrates with any framework.

AWS Comprehend

Managed PII detection via API. Covers standard entity types with no infrastructure overhead. Adds API latency and sends data to AWS — verify data residency requirements first.

spaCy NER

Named entity recognition for structured PII extraction. Fast, self-hosted, customisable. Requires training for domain-specific entity types (medical codes, financial identifiers, etc.).

Azure AI Language

PII detection via Azure. Strong integration with Semantic Kernel deployments. Same data residency considerations as other Azure services — but favourable for Azure-committed deployments.
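As an illustration of what these tools do — not a substitute for them; production systems should use Presidio, a managed service, or a trained NER model — a naive regex-based pass might look like this. The patterns and entity names are deliberately simplified assumptions:

```python
import re

# Deliberately simplified patterns; real detectors (Presidio, Comprehend)
# combine NER models, checksums, and contextual validation.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\+?\d[\d\s-]{7,}\d\b"),
}

def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (entity_type, matched_text) pairs found in the input."""
    hits = []
    for entity, pattern in PATTERNS.items():
        hits.extend((entity, m.group()) for m in pattern.finditer(text))
    return hits

def redact(text: str) -> str:
    """Replace each detected span with a typed placeholder before storage."""
    for entity, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity}>", text)
    return text
```

The important architectural point is placement: `redact` (or its real equivalent) runs before text reaches the vector store or a model call, not after.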

The consent chain pattern

A consent chain is the documented record of: what data was collected, from whom, under what legal basis, for what purposes, and with what retention period. In AI systems, this extends to: what data was processed by which models, in which jurisdictions, via which providers.

The minimum viable consent chain for a production AI system includes:

  1. A data inventory — what types of personal data the system processes
  2. A processing record — per-operation log of what data was processed, when, by which model call
  3. A legal basis register — the lawful basis for each category of processing
  4. A retention schedule — how long each data type is held before deletion
  5. A deletion procedure — how subject erasure requests trigger deletion across all stores including vector databases, conversation logs, and observability traces
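The first three elements can be sketched as an append-only processing log keyed by data subject. The names (`ProcessingEvent`, `LegalBasis`, `ProcessingRecord`) are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum

class LegalBasis(Enum):
    CONSENT = "consent"
    CONTRACT = "contract"
    LEGITIMATE_INTEREST = "legitimate_interest"
    LEGAL_OBLIGATION = "legal_obligation"

@dataclass(frozen=True)
class ProcessingEvent:
    """One row in the processing record (element 2 above)."""
    subject_id: str
    data_category: str   # ties back to the data inventory (element 1)
    basis: LegalBasis    # ties back to the legal basis register (element 3)
    model: str           # which model call processed the data
    jurisdiction: str    # where the processing occurred
    at: datetime

class ProcessingRecord:
    def __init__(self):
        self._events: list[ProcessingEvent] = []

    def log(self, event: ProcessingEvent):
        self._events.append(event)

    def for_subject(self, subject_id: str) -> list[ProcessingEvent]:
        """Supports subject access requests: everything done with their data."""
        return [e for e in self._events if e.subject_id == subject_id]
```

A structure like this is also what makes the retention schedule (element 4) actionable: without per-event timestamps and categories, there is nothing for a retention job to act on.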

Most observability platforms (LangSmith, Langfuse, Arize) store full conversation traces including all inputs and outputs. This is necessary for D4 compliance but creates a D3 obligation: those traces contain personal data and are subject to deletion requests. Configure trace retention policies explicitly and verify that the platform supports per-trace deletion.
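Because personal data ends up in several stores at once, erasure has to be an orchestrated cascade, not a single delete. A sketch, where each store registers a hypothetical `delete_subject`-style callable (real platforms differ; the uniform interface below is an assumption):

```python
from typing import Callable

def erase_subject(subject_id: str,
                  stores: dict[str, Callable[[str], bool]]) -> dict[str, bool]:
    """Run an erasure request against every registered store and report
    per-store success, so partial failures are visible and retryable."""
    results = {}
    for name, delete_fn in stores.items():
        try:
            results[name] = delete_fn(subject_id)
        except Exception:
            results[name] = False  # never let one store abort the cascade
    return results
```

In practice the registered stores would include the vector database, the conversation log, and the observability trace store, and any `False` result would be queued for retry and audit.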

D3 pre-deployment checklist

  1. Data inventory completed — all personal data categories documented (Required)
  2. Legal basis established for each processing category (Required)
  3. PII detection is active before data is stored in any vector store (Required)
  4. LLM provider Data Processing Agreement / BAA signed and appropriate to data type (Required)
  5. Data residency of all stores (vector DB, trace store, conversation log) is documented and compliant (Required)
  6. Subject erasure procedure exists and has been tested end-to-end (Required)
  7. Trace/log retention policies are configured explicitly (not default-forever)
  8. Vector store deletion procedure tested with a real document
  9. Data minimisation reviewed — are you ingesting more data than the use case requires?
  10. Cross-border transfer mechanisms in place if provider is in a different jurisdiction

Related guides

  1. Financial Services AI Playbook
  2. PSF D1: Input Governance
  3. PSF D7: Security guide
  4. Framework PSF comparison matrix