Why every framework gets a D3 Gap
In our assessment series covering eleven agent frameworks and tool integration layers, every single one received a Gap rating for PSF Domain 3. LangChain, CrewAI, AutoGen, Semantic Kernel, Haystack, DSPy, Pydantic AI, Flowise, LangFlow, Composio, the Cursor SDK — Gap, every time. This warrants explanation.
Data protection requires knowing what data you have, where it came from, who consented to what processing, where it can be stored, and how to delete it when required. These properties are domain-specific. A general-purpose agent framework cannot know whether the strings passing through its pipeline are medical records, financial transactions, children's data, or marketing copy. Therefore, no general-purpose framework can implement data protection — by definition.
This does not mean frameworks are deficient. It means data protection is the application layer's responsibility, always, without exception. The PSF Gap rating is a signal to practitioners, not a criticism of the tool.
What PSF Domain 3 requires
- Data processed by the system is classified by sensitivity level (public, internal, confidential, restricted) and type (PII, financial, health, legal).
- For any data subject, the system can demonstrate that processing is lawful — consent was obtained, a legitimate interest applies, or another legal basis exists.
- Data is stored and processed only in jurisdictions consistent with its origin and the applicable regulatory regime.
- The system processes the minimum data necessary for each task, and does not use data for purposes beyond those it was collected for.
- The system can respond to data subject requests — to retrieve all data held about a person, or to delete it — within the regulatory timeframe.
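The classification requirement above can be made concrete at the application layer. A minimal sketch, with illustrative type names and an illustrative residency policy (not a legal rule or any framework's API):

```python
from dataclasses import dataclass
from enum import Enum

class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    CONFIDENTIAL = "confidential"
    RESTRICTED = "restricted"

class DataType(Enum):
    PII = "pii"
    FINANCIAL = "financial"
    HEALTH = "health"
    LEGAL = "legal"

@dataclass(frozen=True)
class Classification:
    sensitivity: Sensitivity
    data_types: frozenset[DataType]

    def allows_jurisdiction(self, origin: str, storage: str) -> bool:
        # Illustrative policy: restricted data may only be stored in its
        # jurisdiction of origin. Real rules depend on the regulatory regime.
        if self.sensitivity is Sensitivity.RESTRICTED:
            return storage == origin
        return True
```

Tagging every record with such a classification at ingestion is what makes the residency and minimisation checks later in the pipeline enforceable rather than aspirational.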
The vector store problem
RAG systems introduce a data protection risk that has no equivalent in traditional software: PII gets embedded into vectors and stored in a vector database. Once embedded, the relationship between the original text and the vector is indirect — you cannot simply search for a person's name to find every record in which they appear. And when a deletion request arrives, you cannot simply delete a row.
The risks are compounded by a common RAG architecture decision: embedding entire documents rather than paragraphs. A document containing one person's name and address also contains information about others. Embedding it means all that data is now jointly represented in a single vector. Deleting it means deleting information that may be legitimately needed for other purposes.
There is no elegant solution to the vector store deletion problem under current technology. The practical approaches are:
- Metadata filtering — store the document ID alongside every vector; on deletion request, filter by document ID and delete all associated chunks
- Hard delete by chunk — maintain a mapping of document ID to chunk IDs and delete them individually from the vector store
- Re-indexing — rebuild the index from scratch with the offending document removed; slow, but the safest approach for high-compliance requirements
- Data minimisation at ingestion — do not embed PII-containing text unless it is necessary for the retrieval use case
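The first two approaches both depend on maintaining a mapping from source documents to the chunk IDs derived from them. A minimal sketch of that bookkeeping, with illustrative names — a real vector store's delete API will differ:

```python
from collections import defaultdict

class DeletionIndex:
    """Tracks which vector-store chunk IDs came from which source document,
    so an erasure request can be translated into concrete chunk deletions."""

    def __init__(self) -> None:
        self._chunks_by_doc: dict[str, set[str]] = defaultdict(set)

    def register(self, doc_id: str, chunk_id: str) -> None:
        # Call this at ingestion time, once per chunk embedded.
        self._chunks_by_doc[doc_id].add(chunk_id)

    def chunks_for(self, doc_id: str) -> set[str]:
        return set(self._chunks_by_doc.get(doc_id, set()))

    def delete_document(self, doc_id: str, vector_store_delete) -> int:
        """Delete every chunk derived from doc_id, using the store's
        delete callable. Returns the number of chunks deleted."""
        chunk_ids = self._chunks_by_doc.pop(doc_id, set())
        for chunk_id in chunk_ids:
            vector_store_delete(chunk_id)
        return len(chunk_ids)
```

The design point is that this index must be written at ingestion time; retrofitting it onto an existing vector store, where the text-to-vector relationship is already indirect, is far harder.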
Regulatory requirements by jurisdiction
PII detection and de-identification tooling
The first implementation question is: how do you know what PII is in your data pipeline? The answer is a PII detection layer before storage and, where possible, before model processing.
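To illustrate where such a layer sits in the pipeline, here is a deliberately naive regex-based sketch. The patterns are for illustration only — production systems should use a dedicated PII detection library, not hand-rolled regexes:

```python
import re

# Naive patterns, illustration only. Real PII detection needs a proper
# library: these will miss entities and produce false positives.
PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "phone": re.compile(r"\b\+?\d[\d\s().-]{7,}\d\b"),
}

def redact(text: str) -> tuple[str, list[str]]:
    """Replace detected PII with type placeholders before the text is
    stored or sent to a model. Returns (redacted_text, labels_found)."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            found.append(label)
            text = pattern.sub(f"<{label.upper()}>", text)
    return text, found
```

The important property is the placement, not the patterns: redaction happens before storage and before the model call, so downstream stores never hold the raw identifiers.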
- Open-source PII detection and anonymisation. Supports 50+ entity types, extensible with custom recognisers. Best choice for self-hosted deployments. Integrates with any framework.
- Managed PII detection via API. Covers standard entity types with no infrastructure overhead. Adds API latency and sends data to AWS — verify data residency requirements first.
- Named entity recognition for structured PII extraction. Fast, self-hosted, customisable. Requires training for domain-specific entity types (medical codes, financial identifiers, etc.).
- PII detection via Azure. Strong integration with Semantic Kernel deployments. Same data residency considerations as other Azure services — but favourable for Azure-committed deployments.
The consent chain pattern
A consent chain is the documented record of: what data was collected, from whom, under what legal basis, for what purposes, and with what retention period. In AI systems, this extends to: what data was processed by which models, in which jurisdictions, via which providers.
The minimum viable consent chain for a production AI system includes:
- A data inventory — what types of personal data the system processes
- A processing record — per-operation log of what data was processed, when, by which model call
- A legal basis register — the lawful basis for each category of processing
- A retention schedule — how long each data type is held before deletion
- A deletion procedure — how subject erasure requests trigger deletion across all stores including vector databases, conversation logs, and observability traces
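The processing record and the subject-request procedures above can share one structure. A minimal sketch with illustrative field names — not any platform's schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ProcessingRecord:
    """One entry in the per-operation processing log (illustrative fields)."""
    subject_id: str
    data_categories: list[str]   # e.g. ["contact", "health"]
    legal_basis: str             # e.g. "consent", "legitimate_interest"
    model: str                   # which model call processed the data
    jurisdiction: str            # where the processing occurred
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

class ConsentChain:
    def __init__(self) -> None:
        self._records: list[ProcessingRecord] = []

    def log(self, record: ProcessingRecord) -> None:
        self._records.append(record)

    def records_for_subject(self, subject_id: str) -> list[ProcessingRecord]:
        """Supports access requests: everything processed for one person."""
        return [r for r in self._records if r.subject_id == subject_id]

    def erase_subject(self, subject_id: str) -> int:
        """Supports erasure requests. A real system must also trigger
        deletion in vector stores, conversation logs, and traces."""
        before = len(self._records)
        self._records = [r for r in self._records
                         if r.subject_id != subject_id]
        return before - len(self._records)
```

Note that `erase_subject` here only covers the processing log itself; the deletion procedure in the list above is what fans the request out to every other store.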
Most observability platforms (LangSmith, Langfuse, Arize) store full conversation traces including all inputs and outputs. This is necessary for D4 compliance but creates a D3 obligation: those traces contain personal data and are subject to deletion requests. Configure trace retention policies explicitly and verify that the platform supports per-trace deletion.