Production AI Institute — vendor-neutral certification for AI practitioners
AI Incident Registry
Medium · Technology · 2022 · GitHub / Microsoft

GitHub Copilot Reproduced Licensed Code Verbatim

Researchers and users discovered that GitHub Copilot would reproduce sections of copyrighted, GPL-licensed code from its training data verbatim, including commented attribution headers. A class action lawsuit filed in 2022 alleged violations of the DMCA and of open source licence terms.

D2 · Output Validation, D3 · Data Protection

What happened

GitHub Copilot was trained on public code repositories including GPL-licensed projects. Researchers demonstrated that Copilot could be prompted to reproduce large blocks of code verbatim, including the original copyright and licence attribution comments. In some cases, users received suggestions that were direct copies of GPL-licensed code without being notified of the licence obligations. A class action lawsuit was filed in November 2022 alleging that Copilot violated the DMCA and open source licence agreements.

PSF Analysis

How the Production Safety Framework maps to this failure

A D2 + D3 failure. D3 failed at training time: no systematic licence classification was applied to training data, so the model learned patterns from GPL code without any constraint against reproducing it. D2 failed at inference time: no output similarity check was deployed to detect verbatim memorisation. This case established that training data governance (D3) must include intellectual property classification, not just privacy classification.
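The D3 gap described above, training data with no licence classification, can be illustrated with a minimal sketch. It assumes each source file already carries a detected SPDX licence identifier. The `SourceFile` type, the `SPDX_CLASSES` mapping, and the policy of keeping only permissive code are hypothetical choices for illustration, not part of the PSF or of any vendor's pipeline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LicenceClass(Enum):
    PERMISSIVE = "permissive"
    COPYLEFT = "copyleft"
    UNKNOWN = "unknown"


# Hypothetical mapping from SPDX identifiers to a licence class.
# A real policy would cover many more identifiers.
SPDX_CLASSES = {
    "MIT": LicenceClass.PERMISSIVE,
    "Apache-2.0": LicenceClass.PERMISSIVE,
    "BSD-3-Clause": LicenceClass.PERMISSIVE,
    "GPL-2.0-only": LicenceClass.COPYLEFT,
    "GPL-3.0-only": LicenceClass.COPYLEFT,
    "AGPL-3.0-only": LicenceClass.COPYLEFT,
}


@dataclass(frozen=True)
class SourceFile:
    path: str
    spdx_id: Optional[str]  # None when no licence could be detected
    text: str


def classify(file: SourceFile) -> LicenceClass:
    """Map a file to a licence class; anything unrecognised is UNKNOWN."""
    if file.spdx_id is None:
        return LicenceClass.UNKNOWN
    return SPDX_CLASSES.get(file.spdx_id, LicenceClass.UNKNOWN)


def partition_corpus(files, allowed=frozenset({LicenceClass.PERMISSIVE})):
    """Split files into (kept, excluded) according to the allowed classes."""
    kept, excluded = [], []
    for f in files:
        (kept if classify(f) in allowed else excluded).append(f)
    return kept, excluded
```

Treating unknown licences as excluded is a deliberately conservative default in this sketch. A governance policy could equally route such files to manual review.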

Controls that would have prevented this

Specific PSF controls mapped to each failure point

1. D2 · Output Validation: Implement a deduplication/similarity check that detects when output closely matches known training examples and flags or blocks verbatim reproduction.
2. D3 · Data Protection: Establish a data governance policy for training data that includes licence classification and acceptable use boundaries per licence type.
3. D2 · Output Validation: Surface licence attribution requirements when suggesting code derived from GPL or other copyleft-licensed sources.
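The two D2 controls above (a similarity check and a licence notice) could be combined along the following lines. This is a sketch under stated assumptions: it measures token-shingle containment against a small in-memory index, and the `IndexedSnippet` type, the `review_suggestion` function, and both thresholds are hypothetical names and values chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedSnippet:
    source: str   # hypothetical identifier, e.g. a repository path
    licence: str  # licence label recorded when the data was classified
    text: str


def shingles(code: str, n: int = 5) -> set:
    """Return the set of n-token shingles, ignoring whitespace differences."""
    tokens = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def containment(suggestion: str, reference: str, n: int = 5) -> float:
    """Fraction of the suggestion's shingles that also occur in the reference."""
    s = shingles(suggestion, n)
    if not s:
        return 0.0
    return len(s & shingles(reference, n)) / len(s)


def review_suggestion(suggestion, index, flag_at=0.5, block_at=0.9):
    """Return (action, notice); action is 'allow', 'flag' or 'block'."""
    best, best_score = None, 0.0
    for snippet in index:
        score = containment(suggestion, snippet.text)
        if score > best_score:
            best, best_score = snippet, score
    if best is None or best_score < flag_at:
        return "allow", None
    notice = f"Similar to {best.source} ({best.licence}); check licence obligations."
    return ("block" if best_score >= block_at else "flag"), notice
```

A linear scan over the index keeps the example short. At realistic corpus sizes a hashed or MinHash-style index would be needed, which this sketch does not attempt.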

Outcome

The lawsuit was filed in November 2022. In 2022 GitHub added a 'duplication detection' filter that blocks suggestions matching roughly 150 or more characters of public code on GitHub. Legal proceedings were ongoing at the time of writing.
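A filter of the kind described in the outcome can be pictured as an exact-match lookup over fixed-size character windows. The sketch below is an assumption about how such a check might work and is not GitHub's implementation. The function names, the whitespace normalisation, and the use of SHA-1 window hashes are all illustrative choices.

```python
import hashlib

WINDOW = 150  # characters, following the threshold quoted in the outcome


def normalise(code: str) -> str:
    """Collapse runs of whitespace so formatting changes do not hide a match."""
    return " ".join(code.split())


def windows(code: str, size: int = WINDOW):
    """Yield a hash for every window of `size` characters in the normalised text."""
    text = normalise(code)
    for i in range(len(text) - size + 1):
        yield hashlib.sha1(text[i:i + size].encode("utf-8")).digest()


def build_index(corpus, size: int = WINDOW) -> set:
    """Hash every window of every known document into one lookup set."""
    index = set()
    for document in corpus:
        index.update(windows(document, size))
    return index


def is_blocked(suggestion: str, index: set, size: int = WINDOW) -> bool:
    """True when any window of the suggestion also appears in the index."""
    return any(h in index for h in windows(suggestion, size))
```

Suggestions shorter than the window produce no hashes and are never blocked, which mirrors the idea of a minimum match length. Exact matching also means that a renamed variable defeats the check, one reason a fuzzier similarity measure may be needed alongside it.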

copyright, memorisation, training-data, intellectual-property, code-generation

Related incidents

Amazon Recruiting AI Discriminated Against Women (High · 2018 · D3, D6)
Microsoft Tay Chatbot Taught to Produce Hate Speech (Critical · 2016 · D1, D2)
Google Gemini Generated Historically Inaccurate Images (High · 2024 · D2, D1)