In a summary judgment order, the US District Court for the Northern District of California considered the use of copyright works by Anthropic in the context of training large language models (LLMs) that power Anthropic’s AI assistant, Claude, and whether its use constituted a fair use under Section 107 of the Copyright Act.
To summarise, some use was found to be fair use because it was “exceedingly transformative”, and some use (notably use of pirated copies of copyright works) was found not to be fair use and will therefore proceed to be assessed at trial.
Background
Anthropic sought to assemble a central library of books. In doing so, it downloaded pirated copies of copyrighted books from sites on the Internet. It also purchased copyrighted books, removed the bindings, scanned each page, and stored the scans in a searchable digital format. Anthropic then selected books from this central library to train LLMs, including those that power its core AI offering, Claude.
The copyrighted books that Anthropic used included those of the plaintiffs. Their books were used to train the LLMs because they were written “with well-curated facts, well-organised analyses, and captivating fictional narratives”. They were the sort of texts that an editor would approve of, so there was a discernible benefit in using them.
When Anthropic selected a work from its central library to train an LLM, the judge concluded it was copied in four main ways:
- The work was copied from the central library to create a working copy
- The work was then ‘cleaned’ (with a cleaned copy resulting)
- Each cleaned copy was then translated into a tokenised copy, which was then copied repeatedly during the iterative training process
- Each trained LLM itself retained compressed copies of the works it had trained upon (i.e. it had ‘memorised’ them)
Three authors brought a putative class action against Anthropic for copyright infringement, and Anthropic sought summary judgment that its use of their copyright works amounted to fair use.
The court’s decision
To summarise, the court decided:
- The use of the books to train Claude was “exceedingly transformative” and therefore a fair use under Section 107 of the Copyright Act.
- The digitisation of the books Anthropic had purchased was also a fair use because all it did was to replace the print copies it had legitimately acquired with space-saving, searchable digital copies, without adding new copies or redistributing existing copies.
- Use of pirated copies was not a fair use.
The following table is my attempt to summarise the key findings on fair use:
Fair use factor | LLM training | Building central library: purchased copies | Building central library: pirated copies |
Purpose/character of the use | Favours fair use. Authors cannot exclude use of their works for training / learning as such. Anthropic does not reproduce the works’ creative elements. LLMs were not trained on works to replicate/supplant them but to create something different. | Favours fair use. Anthropic destroyed each print copy when replacing it with a digital copy. Anthropic did not provide the digital copies to anyone else. Storage and searchability are not creative aspects of the works; the format change was transformative. | Does not favour fair use. The copies were not authorised. The copies were not put immediately into use. Not every copy was necessary/used for training LLMs. The initial copy was not deleted. |
Nature of copyrighted work | Does not favour fair use. Anthropic accepted that all of the books in issue had been published and contained expressive elements, and this is why Anthropic chose them. | Does not favour fair use (same finding as for LLM training). | Does not favour fair use (same finding as for LLM training). |
Amount/substantiality of portion used | Favours fair use. While the whole of each book was copied, the copies were not made accessible to the public (as outputs) to serve as competing substitutes. Anthropic needed billions of words to train its LLMs. | Favours fair use. Copying the entire book was exactly what was needed to enable Anthropic to store the books in a space-saving, searchable digital format. | Does not favour fair use. Anthropic lacked any entitlement to retain the copies at all. |
Effect of use on potential market for/value of copyright work | Favours fair use. The copies Anthropic made did not and will not displace demand for copies of the Authors’ works. | Neutral. | Does not favour fair use. These copies plainly displace demand for the Authors’ books. |
Do LLMs contain copies of works they are trained on?
The judge’s ‘conclusion’ on what he refers to as the compressed copies of works that are retained in the LLM is, potentially, very significant.
There has been trenchant debate as to whether an LLM contains copies of the works that it is trained on. Generally, this is not understood to mean copies in the traditional sense, in the way that a database might contain copies of works. However, the extent to which the model’s parameters and weights could be considered a ‘copy’ (based on the ability to reconstitute works the underlying LLM has been trained on) is less certain.
The judge did qualify his conclusion by stating that the LLM retained copies “or so the Authors contend and this order takes for granted.” Additionally, the order is for summary judgment only, although the judge did cite expert evidence in support of this conclusion. Unfortunately, that expert evidence is significantly redacted.
So what does this mean? Is it now the case that ‘memorisation’ across all LLMs ought to be regarded as a copy for copyright purposes? Or was the memorisation peculiar to Anthropic’s training method? Perhaps the more pertinent question is whether this issue should have been decided in a summary judgment order at all.