How agents detect failure and decide whether to retry, escalate, skip, or fail gracefully.
Exception recovery is the pattern that determines whether your agent system is resilient or brittle. Every production AI system will encounter failures: tools that time out, APIs that rate-limit, model calls that return errors, sub-agents that produce unusable outputs. How the system handles these failures is an architectural choice that must be designed, not discovered.
Exception recovery operates at two levels. At the task level, individual operations have defined retry policies, timeout limits, and fallback behaviours. At the workflow level, patterns of failure trigger escalation to human review or system-level responses. The retry policy for any given operation should specify the maximum number of retries, the backoff strategy (linear, exponential, or fixed), the conditions under which retrying is appropriate, and the behaviour when all retries are exhausted. Critically, the system must distinguish between recoverable errors (a temporary API timeout) and unrecoverable errors (a data validation error that no amount of retrying will fix).
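A minimal sketch of such a task-level policy, assuming exponential backoff; the error classes, `run_with_retries`, and its parameters are illustrative names, not from any particular framework:

```python
import random
import time

class RecoverableError(Exception):
    """Transient failure (timeout, rate limit): a retry may succeed."""

class UnrecoverableError(Exception):
    """Permanent failure (validation error): no retry will succeed."""

def run_with_retries(task, max_retries=3, base_delay=1.0):
    """Execute `task` under a simple retry policy.

    Unrecoverable errors are raised immediately; recoverable errors are
    retried up to `max_retries` times with exponential backoff, then
    re-raised so the workflow level can escalate.
    """
    for attempt in range(max_retries + 1):
        try:
            return task()
        except UnrecoverableError:
            raise  # retrying cannot fix this; fail fast
        except RecoverableError:
            if attempt == max_retries:
                raise  # retries exhausted: hand off to workflow-level recovery
            # Exponential backoff (1s, 2s, 4s, ...) with jitter so that
            # concurrent agent instances do not retry in lockstep.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
```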
A property data service uses an agent to enrich listings with third-party data sources. The exception recovery architecture defines: external API calls get 3 retries with exponential backoff; enrichment failures for non-critical fields are logged and the listing is published with that field empty; enrichment failures for legally required fields (e.g. energy rating) pause the listing publication and route to the operations team with a prefilled resolution form; if more than 5% of listings in a batch fail enrichment within any 10-minute window, an incident alert is raised and batch processing is suspended.
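Expressed as configuration, that architecture might look like the following sketch; every key and value here is a hypothetical name for one of the policies described above:

```python
ENRICHMENT_RECOVERY_POLICY = {
    "external_api": {
        "max_retries": 3,
        "backoff": "exponential",
    },
    "on_field_failure": {
        # Non-critical fields: log and publish with the field empty.
        "non_critical": {"action": "log_and_publish_empty"},
        # Legally required fields (e.g. energy rating): pause publication
        # and route to the operations team.
        "legally_required": {
            "action": "pause_and_escalate",
            "route_to": "operations",
            "attach": "prefilled_resolution_form",
        },
    },
    "batch_circuit_breaker": {
        # If the failure rate crosses this threshold within the window,
        # raise an incident alert and suspend batch processing.
        "failure_rate_threshold": 0.05,
        "window_minutes": 10,
        "on_trip": ["raise_incident_alert", "suspend_batch_processing"],
    },
}
```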
Production AI systems that lack exception recovery architecture behave unpredictably under load and create invisible failures — tasks that appear to have completed but produced incomplete or incorrect outputs. Designing for failure is not pessimism; it is the difference between a system that your organisation can rely on and one that periodically causes incidents without warning.
How this pattern fails in practice — and what to watch for.
An external API starts returning errors. The agent retries immediately, without backoff. All concurrent agent instances do the same. The retry traffic exceeds the API's rate limits, causing it to reject all requests. What started as a transient API issue becomes a sustained, self-inflicted outage: the agent's own retry traffic keeps the dependency down.
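The standard mitigations are backoff with jitter (as in the earlier sketch) and a circuit breaker that stops calling a failing dependency entirely. A minimal single-process sketch of the latter; a production system would share breaker state across agent instances, for example via a shared store:

```python
import time

class CircuitBreaker:
    """Stop calling a failing dependency instead of retrying into it.

    After `threshold` consecutive failures the circuit opens and all
    calls fail fast for `cooldown` seconds, giving the dependency time
    to recover instead of drowning it in retry traffic.
    """
    def __init__(self, threshold=5, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown elapsed: allow one probe call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```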
The exception handler marks a failed task as 'skipped' to allow the workflow to continue. The workflow completes and is logged as successful. The downstream system receives an output with a missing component. No alert is raised. The missing component is only discovered when someone manually inspects the output.
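One way to prevent this is to make "completed with skips" a distinct, alert-worthy status rather than folding it into success. A hypothetical sketch:

```python
from dataclasses import dataclass, field

@dataclass
class WorkflowResult:
    """Workflow outcome that carries skipped tasks instead of hiding them."""
    outputs: dict = field(default_factory=dict)
    skipped: list = field(default_factory=list)  # ids of tasks skipped by recovery

    @property
    def status(self) -> str:
        # A workflow that skipped tasks is degraded, not successful:
        # logging, alerting, and downstream consumers should all see that.
        return "degraded" if self.skipped else "success"
```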
The exception recovery logic itself encounters an error — perhaps a bug in the escalation routing. The system is now in an unknown state: the primary task failed, and then the recovery mechanism failed too. No one is notified. The task is orphaned.
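A last-resort guard around the recovery path itself can at least make that state visible. A sketch assuming a durable dead-letter queue and an alerting hook, both of which are hypothetical injected dependencies here:

```python
def recover_or_park(task, error, escalate, dead_letter, alert):
    """Run recovery inside its own guard so a recovery bug cannot orphan a task."""
    try:
        escalate(task, error)
    except Exception as recovery_error:
        # The recovery path itself failed. Park the task durably and page
        # a human: a noisy alert is better than a silently orphaned task.
        dead_letter.put({
            "task_id": task.id,
            "primary_error": repr(error),
            "recovery_error": repr(recovery_error),
        })
        alert(f"recovery failed for task {task.id}; parked in dead-letter queue")
```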
Exception recovery is tested in AIDA under D5 Deployment Safety — the exam focuses on retry design, partial failure policies, and the distinction between recoverable and unrecoverable errors. CAIG examines the incident management procedures that exception recovery feeds into. CAIAUD auditors are tested on their ability to identify exception recovery gaps in deployed systems: specifically, where failures can occur without detection.
The AIDA certification covers all 21 agentic design patterns with a focus on deployment safety, governance, and the PSF. Free to attempt.