The compliance theatre problem
Human-in-the-loop became a checkbox requirement as AI deployment concerns grew. The result is a generation of AI systems with nominal human oversight that provides no real safety guarantees. The signatures are familiar: an approval button that appears at 2am when the reviewer is asleep, a review queue that processes 400 items per hour, a confirmation screen that 94% of users click through without reading.
Compliance theatre is not just ineffective — it creates a false sense of safety that may be more dangerous than no oversight at all. A team that believes their human review process is working has less incentive to invest in the upstream controls (D1, D2, D4) that would actually catch problems.
PSF Domain 6 requires human oversight that is effective, not merely present. The distinction is between oversight that could plausibly catch problems and oversight that exists to satisfy an audit requirement.
The PSF autonomy level framework
The first design decision in D6 is: what level of autonomy is appropriate for this system? The PSF defines five levels. Most production deployments should target L1 or L2. L3 is appropriate only with robust D4 monitoring. L4 is not appropriate for any customer-facing or regulated system.
L0: AI generates a draft or suggestion. Human reviews and approves before any action is taken. All consequential actions require explicit human authorisation.
L1: AI acts autonomously on low-risk operations. Human review is required before high-risk operations. Risk classification is explicit and documented.
L2: AI acts autonomously. Human monitors activity and can intervene. Automatic escalation when confidence drops or anomalies are detected.
L3: AI acts autonomously on virtually all operations. Humans are involved only when the AI explicitly requests escalation or when audit sampling triggers review.
L4: No human oversight during operation. Not appropriate for any system processing personal data, making consequential decisions, or operating in regulated contexts.
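These levels translate directly into a routing decision at the point where an agent proposes an action. A minimal Python sketch (the level names and the `needs_approval` helper are illustrative, not part of the PSF itself):

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """PSF D6 autonomy levels, L0 (lowest autonomy) to L4 (highest)."""
    L0_DRAFT_ONLY = 0      # human approves every consequential action
    L1_LOW_RISK_AUTO = 1   # autonomous on low-risk, approval on high-risk
    L2_MONITORED = 2       # autonomous, human monitors and can intervene
    L3_ESCALATION_ONLY = 3 # human involved only on escalation or sampling
    L4_UNSUPERVISED = 4    # no oversight; avoid for regulated systems

def needs_approval(level: AutonomyLevel, high_risk: bool, escalated: bool) -> bool:
    """Whether a proposed action must block on human authorisation."""
    if level == AutonomyLevel.L0_DRAFT_ONLY:
        return True
    if level == AutonomyLevel.L1_LOW_RISK_AUTO:
        return high_risk
    if level in (AutonomyLevel.L2_MONITORED, AutonomyLevel.L3_ESCALATION_ONLY):
        return escalated        # triggered by low confidence, anomaly, or request
    return False                # L4: never blocks
```

Note that under this routing, L2 and L3 block on the same condition; they differ in what the human does the rest of the time (continuous monitoring versus audit sampling), which the gate alone cannot express.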
When human oversight is required
The decision of when to require human review is the core D6 design question. The answer depends on three factors: the consequence severity of the action, the confidence of the model output, and the regulatory context.
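A sketch of a gate combining these three factors. The severity labels and the 0.9 confidence floor are illustrative assumptions, not PSF-prescribed values:

```python
def review_required(severity: str, confidence: float, regulated: bool,
                    confidence_floor: float = 0.9) -> bool:
    """Route an action to human review when any factor demands it.

    severity:   "low" | "medium" | "high" consequence of the action
    confidence: model's calibrated confidence in its output (0..1)
    regulated:  True when the action falls in a regulated context
    """
    if severity == "high":
        return True              # consequential actions always get review
    if regulated:
        return True              # regulatory context overrides confidence
    if confidence < confidence_floor:
        return True              # low confidence escalates even low stakes
    return False                 # low-risk, confident, unregulated: automate
```

The ordering matters: severity and regulatory context are hard gates that no confidence score can override, which keeps the confident-but-wrong failure mode out of high-stakes paths.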
Designing effective human review
Once you have decided that human review is required, the design of the review interface determines whether oversight is effective. Four principles:
1. Present the decision, not the output
Most HITL implementations show the reviewer the model output and ask them to approve or reject it. This is backwards. The reviewer should be presented with the decision that needs to be made — the action that will be taken if they approve — not just the text the model generated. "Approve sending this email to John Smith declining his application" is a decision. "Here is a draft rejection email" is an output.
2. Make the cost of approval visible
Reviewers approve items faster when the cost of approval is invisible. A review interface that shows "Approve / Reject" with no context about what approval means will produce rubber-stamp oversight. Make the downstream consequence explicit in every review request.
3. Design for rejection, not just approval
If your review interface has a one-click approve and a multi-step reject, your reviewers will approve more than they should. The friction to reject should be the same as the friction to approve. And rejection should trigger a feedback loop that improves the model — not just block the action.
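The first three principles suggest a concrete shape for the review request: frame it as a decision, carry the downstream consequence explicitly, and make approval and rejection cost the same. A minimal sketch with hypothetical names (`ReviewRequest`, `resolve`):

```python
from dataclasses import dataclass

@dataclass
class ReviewRequest:
    """A review item framed as a decision, not a raw model output."""
    decision: str      # e.g. "Send rejection email to John Smith"
    consequence: str   # what happens downstream if approved
    model_output: str  # the draft itself, shown as supporting detail only

def resolve(request: ReviewRequest, approved: bool, rationale: str) -> dict:
    """Both outcomes cost the same: each requires a written rationale.

    A rejection is recorded as feedback for the model, not just a block.
    """
    if not rationale.strip():
        raise ValueError("a rationale is required to approve or reject")
    return {
        "decision": request.decision,
        "approved": approved,
        "rationale": rationale,
        "feedback_for_model": None if approved else rationale,
    }
```

Requiring a rationale on both paths is one way to equalise friction; an alternative is a mandatory confirmation step on both, but a free-text rationale has the advantage of doubling as model feedback.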
4. Blind review sampling
For L2/L3 deployments where most actions are automated, implement blind review sampling: randomly select a percentage of automated actions and present them to a reviewer as if they required approval, but do not block execution. This measures whether your automated actions are ones a reviewer would have approved. If the sample approval rate drops, you have a signal to investigate.
AutoGen's UserProxyAgent implements the closest native approximation to this with its human_input_mode="TERMINATE" pattern, but this is a conversation-end trigger rather than a sampling mechanism. For true blind sampling, you need to implement this at the application layer.
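At the application layer, blind sampling reduces to selecting actions after execution and tracking reviewer agreement over time. A sketch, with the 5% sampling rate and 95% alert line as illustrative defaults rather than PSF requirements:

```python
import random

class BlindReviewSampler:
    """Sample automated actions for after-the-fact human review."""

    def __init__(self, rate=0.05, alert_below=0.95, seed=None):
        self.rate = rate                # fraction of actions sampled
        self.alert_below = alert_below  # approval rate that triggers alarm
        self._rng = random.Random(seed)
        self.queue = []                 # actions awaiting blind review
        self.verdicts = []              # True where reviewer would approve

    def maybe_sample(self, action):
        """Called after an automated action executes; never blocks it."""
        if self._rng.random() < self.rate:
            self.queue.append(action)

    def record_verdict(self, approved: bool):
        self.verdicts.append(approved)

    def should_investigate(self) -> bool:
        """Signal when sampled approval rate drops below the alert line."""
        if not self.verdicts:
            return False
        rate = sum(self.verdicts) / len(self.verdicts)
        return rate < self.alert_below
```

The key property is that `maybe_sample` sits after execution, so reviewers see the item exactly as it would have appeared in a blocking flow, but latency and throughput of the automated path are unaffected.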
Skill maintenance and automation complacency
There is a documented phenomenon in aviation automation: as systems become more reliable, operators reduce their active engagement, and their ability to take over when the automation fails degrades. The same risk applies to AI oversight. If human reviewers approve 99% of what the AI produces, they are not developing the judgment to catch the 1%.
PSF D6 requires that human oversight capability is maintained over time, not just present at deployment. Practically this means: periodic exercises where reviewers encounter synthetic failures designed to test their judgment, rotation of review responsibilities to prevent rubber-stamp patterns, and monitoring of reviewer decision latency and consistency as proxies for engagement quality.
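The synthetic-failure exercise can be implemented as unlabeled injection into the real review queue plus scoring of the reviewer's catch rate. A sketch with hypothetical helper names:

```python
import random

def inject_synthetic_failures(queue, failures, rng=None):
    """Shuffle known-bad synthetic items into a real review queue.

    Items are tagged internally for later scoring; reviewers see no
    difference between real and synthetic items.
    """
    rng = rng or random.Random()
    tagged = [{"payload": item, "synthetic": False} for item in queue]
    tagged += [{"payload": item, "synthetic": True} for item in failures]
    rng.shuffle(tagged)
    return tagged

def catch_rate(batch, rejected):
    """Fraction of synthetic failures the reviewer actually rejected.

    A low catch rate is a direct measure of rubber-stamping.
    """
    synthetic = [b["payload"] for b in batch if b["synthetic"]]
    if not synthetic:
        return 1.0
    return sum(1 for p in synthetic if p in rejected) / len(synthetic)
```

Catch rate complements the latency and consistency proxies: latency tells you reviewers are slowing down, but only a seeded failure tells you whether they would actually stop a bad action.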
Framework D6 implementation notes
AutoGen
Best native D6 support of any framework assessed. UserProxyAgent with human_input_mode='ALWAYS' or 'TERMINATE' provides genuine oversight points. The challenge is that these are chat-based interactions — for production, wrap UserProxyAgent in an application that presents the decision context to the right reviewer.
LangGraph
Graph interrupt() nodes are the native primitive. Define interrupt conditions at edges — trigger human review when the graph transitions to high-consequence nodes. LangSmith provides the review interface. This is the cleanest HITL architecture in the LangChain ecosystem.
Semantic Kernel
Step approval patterns can be implemented via kernel filters. For Azure deployments, Azure Logic Apps can serve as the human review routing layer with full audit trail. No native blind sampling — implement at application layer.
CrewAI
Human oversight is the most significant D6 gap in CrewAI. The multi-agent architecture means a human approval at the crew level may not catch individual agent actions. Implement approval at the task level, not just the crew kickoff — each task that takes a real-world action should have an approval gate.
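The task-level gate is framework-agnostic: wrap each task's execution behind an approval callback. A sketch that assumes nothing about any specific framework's API (both callables are injected, and all names here are hypothetical):

```python
def approval_gate(execute_task, request_approval):
    """Gate every real-world task action behind human approval.

    execute_task(task):     performs the action and returns its result
    request_approval(task): presents the decision to a human, returns bool
    """
    def gated(task):
        if not request_approval(task):
            # rejection is an outcome, not an error: record it for feedback
            return {"task": task, "status": "rejected"}
        return {"task": task, "status": "done", "result": execute_task(task)}
    return gated
```

Applied per task rather than per crew, this ensures that an individual agent's action cannot ride through on a single approval granted at kickoff.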