Production AI Institute — vendor-neutral certification for AI practitioners
Part 2: Production Patterns · PSF D5 Deployment Safety · PSF D7 Security · PAI-8 C6 Operational Continuity · PAI-8 C7 Incident Management

Exception Recovery

How agents detect failure and decide whether to retry, escalate, skip, or fail gracefully.

Exception recovery is the pattern that determines whether your agent system is resilient or brittle. Every production AI system will encounter failures: tools that time out, APIs that rate-limit, model calls that return errors, sub-agents that produce unusable outputs. How the system handles these failures is an architectural choice that must be designed, not discovered.

Exception recovery operates at two levels. At the task level, individual operations have defined retry policies, timeout limits, and fallback behaviours. At the workflow level, patterns of failure trigger escalation to human review or system-level responses. The retry policy for any given operation should specify the maximum number of retries, the backoff strategy (linear, exponential, or fixed), the conditions under which retrying is appropriate, and the behaviour when all retries are exhausted. Critically, the system must distinguish between recoverable errors (a temporary API timeout) and unrecoverable errors (a data validation failure that will always fail regardless of retries).
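A task-level retry policy of this shape can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the exception class names, the `run_with_retries` helper, and the injectable `sleep` parameter are all assumptions introduced for the example.

```python
import time

class RecoverableError(Exception):
    """Transient failure, e.g. an API timeout -- retrying may succeed."""

class UnrecoverableError(Exception):
    """Permanent failure, e.g. a validation error -- retrying cannot help."""

def run_with_retries(task, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run `task`, retrying recoverable errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except UnrecoverableError:
            raise  # never retry: the outcome cannot change
        except RecoverableError:
            if attempt == max_retries:
                raise  # retries exhausted: escalate to the caller
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Note that the recoverable/unrecoverable distinction is made by exception type, i.e. at design time: each failure mode is classified when the code is written, not inspected at runtime.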

In practice

A property data service uses an agent to enrich listings with third-party data sources. The exception recovery architecture defines: external API calls get 3 retries with exponential backoff; enrichment failures for non-critical fields are logged and the listing is published with that field empty; enrichment failures for legally required fields (e.g. energy rating) pause the listing publication and route to the operations team with a prefilled resolution form; if more than 5% of listings in a batch fail enrichment within any 10-minute window, an incident alert is raised and batch processing is suspended.
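The batch-level kill switch in this example (suspend processing when more than 5% of listings fail within a 10-minute window) can be sketched as a sliding-window failure-rate monitor. All names here are assumptions, and the injectable clock exists only to make the sketch testable.

```python
from collections import deque
import time

class FailureRateMonitor:
    """Tracks pass/fail events and flags when the failure rate in a
    rolling time window exceeds a threshold (e.g. 5% in 10 minutes)."""

    def __init__(self, threshold=0.05, window_seconds=600, clock=time.monotonic):
        self.threshold = threshold
        self.window = window_seconds
        self.clock = clock
        self.events = deque()  # (timestamp, failed: bool)

    def record(self, failed):
        now = self.clock()
        self.events.append((now, failed))
        # Evict events that have aged out of the window.
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

    def should_suspend(self):
        total = len(self.events)
        failures = sum(1 for _, f in self.events if f)
        return total > 0 and failures / total > self.threshold
```

In a real deployment the monitor's output would feed the incident alert and suspend batch intake; the point of the sketch is that the threshold and window are explicit, configurable values rather than behaviour discovered under load.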

Why it matters

Production AI systems that lack exception recovery architecture behave unpredictably under load and create invisible failures — tasks that appear to have completed but produced incomplete or incorrect outputs. Designing for failure is not pessimism; it is the difference between a system that your organisation can rely on and one that periodically causes incidents without warning.

Framework alignment

PSF Domains
D5 · Deployment Safety · View PSF domain →

PAI-8 Controls
C6 · Operational Continuity · View PAI-8 standard →
C7 · Incident Management · View PAI-8 standard →

Production failure modes

How this pattern fails in practice — and what to watch for.

Retry storms

An external API starts returning errors. The agent retries immediately, without backoff. All concurrent agent instances do the same. The retry traffic exceeds the API's rate limits, causing it to reject all requests. What started as a temporary API issue becomes a permanent service disruption caused by the agent's own retry behaviour.
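A common mitigation for this failure mode (not prescribed by the text above, but widely used) is to add random jitter to each backoff delay, so concurrent agent instances spread their retries out instead of hammering the API in lockstep. The function name and parameters are illustrative.

```python
import random

def jittered_backoff(attempt, base=1.0, cap=30.0):
    """'Full jitter' backoff: a random delay between zero and the
    exponentially growing (but capped) maximum for this attempt."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))
```

Because every instance draws its own delay, the retry traffic arrives smeared across the window rather than in synchronised bursts.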

Silent skip masquerading as success

The exception handler marks a failed task as 'skipped' to allow the workflow to continue. The workflow completes and is logged as successful. The downstream system receives an output with a missing component. No alert is raised. The missing component is only discovered when someone manually inspects the output.
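One defence is to make partial completion a first-class outcome rather than overloading 'success'. The sketch below is a minimal illustration; the `WorkflowResult` class and its field names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowResult:
    status: str = "success"            # "success" | "partial" | "failed"
    missing: list = field(default_factory=list)

    def skip(self, component, reason):
        """Record a skipped component and downgrade the overall status,
        so a skip can never be logged as an unqualified success."""
        self.missing.append((component, reason))
        self.status = "partial"

result = WorkflowResult()
result.skip("energy_rating", "enrichment API returned 404")
# result.status is now "partial"; downstream systems and alerting
# can distinguish it from a genuinely complete run.
```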

Recovery mechanism failure

The exception recovery logic itself encounters an error — perhaps a bug in the escalation routing logic. The system is now in an unknown state: the primary task failed, and the recovery mechanism failed. No one is notified. The task is orphaned.

Implementation checklist

Seven things to verify before deploying this pattern in production.

1. Implement exponential backoff on all external API calls with a defined maximum retry count
2. Distinguish recoverable errors from unrecoverable ones at design time — not at runtime
3. Log all exceptions with full context: what failed, when, after how many retries, and what state the workflow was in
4. Define what constitutes 'complete enough' for partial outputs — make this an explicit business decision
5. Test deliberate failure injection: can you inject failures at each component and verify the recovery behaves as designed?
6. Implement dead-letter queues for tasks that exhaust all recovery options
7. Define human notification thresholds: what volume of failures in what time window triggers an alert?
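Checklist items 6 and 7 can be combined in a small sketch: a dead-letter queue that parks exhausted tasks with their context and notifies a human when its depth crosses a threshold. This is an in-memory illustration only; a production system would use a durable queue, and all names here are assumptions.

```python
class DeadLetterQueue:
    """Parks tasks that exhausted all recovery options (item 6) and
    raises an alert when the queue depth crosses a threshold (item 7)."""

    def __init__(self, alert_threshold=10, notify=print):
        self.items = []
        self.alert_threshold = alert_threshold
        self.notify = notify  # e.g. a pager or incident-channel hook

    def add(self, task_id, error, attempts):
        # Record enough context to diagnose the failure later:
        # what failed, why, and after how many retries (item 3).
        self.items.append(
            {"task_id": task_id, "error": error, "attempts": attempts}
        )
        if len(self.items) >= self.alert_threshold:
            self.notify(
                f"DLQ depth {len(self.items)} >= {self.alert_threshold}"
            )
```

The key design point is that nothing is silently dropped: a task that cannot be recovered is still visible, with its full failure context, and accumulating failures trigger a human-facing alert.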

Certification relevance

Exception recovery is tested in AIDA under D5 Deployment Safety — the exam focuses on retry design, partial failure policies, and the distinction between recoverable and unrecoverable errors. CAIG examines the incident management procedures that exception recovery feeds into. CAIAUD auditors are tested on their ability to identify exception recovery gaps in deployed systems: specifically, where failures can occur without detection.

AIDA — Take the exam →CAIG — Take the exam →CAIAUD — Take the exam →

Related patterns

Part 1 · Core Patterns
Orchestration
A controlling agent that directs sub-agents, manages state, and decides when a task is complete.
Part 3 · Enterprise Patterns
Event-Driven Agents
Agents triggered by events in your systems rather than by direct user prompts.
Part 1 · Core Patterns
Prompt Chaining
Sequential task decomposition where each model output feeds the next input.

Certify your understanding of production AI patterns

The AIDA certification covers all 21 agentic design patterns with a focus on deployment safety, governance, and the PSF. Free to attempt.

Start AIDA — Free → · All 21 patterns