Production AI Institute · Independent certification for production AI practice
Verify a credential|Contact
HomeResearchLLM Incident Patterns
AnalysisJune 2025CC BY 4.0

Incident Patterns in Production LLM Deployments

An analysis of 47 documented production incidents involving large language models, drawn from anonymised case data contributed by PAI community members. This paper identifies the failure modes that occur most frequently in practice, maps them to PSF domains, and describes the intervention patterns that resolved them.

Key findings

47
production incidents analysed
68%
attributable to PSF-1 or PSF-2 gaps
83%
had no pre-deployment output contract
29%
involved a prompt injection component

The most striking finding is the concentration of incidents in PSF domains 1 and 2. Input governance failures — unvalidated user inputs reaching the model — and output validation failures — model outputs consumed downstream without schema validation — together account for more than two-thirds of the incidents in this dataset. Both are preventable with straightforward engineering controls.

Incident distribution by PSF domain

PSF-1 Input Governance16 incidents (34%)

Prompt injection (13 incidents), unvalidated inputs causing off-topic outputs (3 incidents)

PSF-2 Output Validation16 incidents (34%)

Malformed JSON consumed downstream (8), hallucinated values in structured fields (5), schema drift after model update (3)

PSF-5 Deployment Safety6 incidents (13%)

Silent model version change causing output format regression (4), lack of rollback on failed deployment (2)

PSF-6 Human Oversight5 incidents (11%)

Autonomous action taken beyond intended scope (3), escalation path not reached on time-sensitive decision (2)

PSF-4 Observability2 incidents (4%)

Incident not detected for >48 hours due to absent logging (2)

PSF-8 Vendor Resilience2 incidents (4%)

Provider API outage with no fallback (1), unexpected model deprecation (1)

The prompt injection problem

Prompt injection — the technique by which adversarial content in user inputs or retrieved documents causes the model to deviate from its intended behaviour — appears in 13 of the 47 incidents (28%). This is consistent with broader industry reporting and represents the fastest-growing incident category in the dataset.

In 9 of the 13 injection incidents, the attack vector was retrieved content: documents, emails, or web pages that the system was asked to summarise or process contained embedded instructions. The remaining 4 involved direct user manipulation of the prompt structure.

Pattern:Systems that process third-party content — documents, emails, web pages — require explicit injection defence that treats retrieved content as untrusted data, separate from the system prompt. Systems that do not implement this separation are consistently exploitable.

What resolves incidents fastest

Across the 47 incidents, time-to-resolution correlated most strongly with two factors: whether the system had observable logging (PSF-4) and whether a human escalation path existed (PSF-6). Systems with both resolved incidents in a median of 4.2 hours. Systems with neither had a median resolution time of 31 hours.

The intervention pattern most associated with fast resolution was the existence of a named incident owner — a person whose responsibility included monitoring the system — combined with an alerting threshold configured before deployment. 38 of the 47 incidents did not have both conditions met at the time of the incident.

Methodology and limitations

Incident data was contributed voluntarily by PAI community members under a standard anonymisation protocol. Contributors provided incident timelines, system descriptions (redacted), root cause assessments, and resolution notes. Classification by PSF domain was performed by two reviewers independently, with disagreements resolved by discussion. The dataset is not a random sample; it reflects the types of incident that community members chose to share and is likely biased toward incidents that were ultimately resolved. Severe or ongoing incidents are probably under-represented.

Published by the Production AI Institute, June 2025. Licensed CC BY 4.0.

Related: Production Safety Framework · Seven failure modes (Insights) · Submit an incident report