Judge rules Anthropic training on books it purchased was “fair use,” but not for the ones it stole
Anthropic still faces litigation for training its models on millions of pirated texts.
When AI companies like OpenAI, Anthropic, and Meta were racing to build and train new large language models, they scrambled to find enough text to train their systems on. Countless web pages, photos, YouTube videos, Disney movies, Reddit threads, and book texts were slurped up to feed the models, adding billions upon billions of tokens.
Litigation initiated by copyright holders has since shown that the legality of the process was on the minds of some AI company employees. Researchers at Meta, for example, raised concerns while training its Llama model, only to be told that the use of LibGen, a corpus of pirated texts, was approved by “MZ.”
But yesterday, a court decided a case partially in favor of AI companies, with far-reaching consequences for all the companies that were sucking copyrighted material into their models.
A federal judge in the Northern District of California has ruled that Anthropic did not violate the copyrights of authors whose books it purchased and scanned for training.
A group of authors filed the suit against Anthropic last August, alleging that Anthropic had acknowledged training its Claude AI model using “The Pile,” a mass of text shared online that contained millions of copyrighted works, including some written by the plaintiffs.
Judge William Alsup determined that the process of buying, scanning, and ingesting the text for use in training the Claude model was “exceedingly transformative and was a fair use under Section 107 of the Copyright Act,” a key test of the fair use doctrine in intellectual property law.
But what about the “over seven million copies of books” that Anthropic admitted it pirated and did not pay for? The judge ruled that was not fair use and warrants its own trial.
Judge Alsup wrote:
“The downloaded pirated copies used to build a central library were not justified by a fair use. Every factor points against fair use. Anthropic employees said copies of works (pirated ones, too) would be retained ‘forever’ for ‘general purpose’ even after Anthropic determined they would never be used for training LLMs. A separate justification was required for each use. None is even offered here except for Anthropic’s pocketbook and convenience.”
The case is the first of its kind to be decided in the US, and it lays out a potential legal path for AI companies to safely train their models on copyrighted works: purchase them. That said, many other cases are still pending, and many factors remain at play before the industry has clear rules.
But companies that are caught knowingly using pirated, copyrighted works to train AI models may face new legal exposure.
An Anthropic spokesperson told Sherwood News:
“We are pleased that the Court recognized that using ‘works to train LLMs was transformative — spectacularly so.’ Consistent with copyright’s purpose in enabling creativity and fostering scientific progress, ‘Anthropic’s LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different.’”