
Training Your AI Within the Limits of the Law

Can you train your AI on a limited universe of books you digitize, and maybe some books online you don't buy? The Federal District Court opinion in Bartz et al. v. Anthropic PBC, No. 3:24-cv-05417 (N.D. Cal. June 23, 2025), provides direction on the use of copyrighted works to train large language models (LLMs) for generative artificial intelligence (AI). LLMs develop by processing very large volumes of text, and in some systems images, learning statistical patterns in the material. At a higher level, some expect LLMs to lead to AGI, or Artificial General Intelligence, comparable to human intelligence. Recent comparative studies on trained AI indicate superior recall when compared to humans, depending on the universe of materials studied by the LLMs.
Recently, District Court Judge William Alsup issued a split decision in his summary judgment order, identifying some key permitted aspects of AI training.
First, the Court ruled that Defendant Anthropic’s use of the books at issue to train LLMs for the purpose of returning new text outputs is “spectacularly” transformative and therefore a fair use. Second, Anthropic’s digitization of books it purchased in print form for use as part of its central library was fair use because the digital copies were a replacement of the print copies it discarded after digitization. Last, Anthropic’s acquisition and retention of “pirated” copies of books for its central library was not fair use. The pre-trial summary judgment order requested by Anthropic narrows the issue of fair use regarding certain of Anthropic’s uses. Other issues remain for trial, but expect an appeal to the Ninth Circuit Court of Appeals (covering 9 Western states and 2 Pacific Island jurisdictions for Federal issues of law), and a subsequent attempt to obtain a ruling from the US Supreme Court.
Anthropic used millions of copyrighted books to train its Claude LLMs for use with its AI services capable of generating writings that imitate the writing style of humans. In preparation to train its LLMs, Anthropic compiled a “central library” of “all the books in the world” to retain “forever.” Novels and non-fiction titles written by the author plaintiffs in the case and owned by the author plaintiffs (or their companies) were among the sets of books and text in the central library.
Anthropic sourced content for its library in various ways. It downloaded for free millions of pirated copies of books in digital form. In addition, it purchased millions of copyrighted books (some the same as those acquired from pirate sites), removed the bindings, scanned and stored the works in a digitized searchable format, and then discarded the paper originals.
Each work selected for training was copied in four key ways and, as Anthropic admitted, so many times that it would be impractical to estimate. The training copies were not disseminated to the outside world. Instead, each LLM was placed into a public version of Claude, and was combined with software filtering user inputs to the LLM and filtering outputs from the LLM back to the user. Ultimately, the plaintiffs did not allege any infringing copy of their works was or would ever be provided to users by the Claude service. The plaintiffs did, however, claim Anthropic infringed their copyrights by pirating copies of their works for Anthropic’s library and by reproducing their works to train Anthropic’s LLMs.
The Court ruled that use of the books at issue to train Anthropic’s LLMs was “exceedingly transformative” and a fair use under Section 107 of the Copyright Act. Specifically, the Court noted that authors cannot exclude others from using their works to learn, similar to how humans learned before the advent of computers.
Regarding the digitization of the books purchased in print form by Anthropic, the Court concluded this was fair use. Because Anthropic discarded the print copies and saved space through digitization, the “format change” did not relate to one of the exclusive rights granted under the Copyright Act.
The Court arrived at a different conclusion regarding the pirated copies. Because Anthropic never paid for the pirated copies, the Court believed it was clear the pirated copies displaced demand for the authors’ works. Even if the pirated copies would later be used for a transformative purpose, such as training LLMs, the pirating was not fair use.
Matters left for trial include whether making copies of the books in Anthropic’s central library for uses other than LLM training is infringing or a fair use.
Days later, also in the District Court for the Northern District of California, Judge Vince Chhabria granted another AI developer’s motion for summary judgment on the issue of fair use related to training LLMs but, in doing so, applied a different fair use analysis than Judge Alsup. Kadrey v. Meta Platforms, Inc., No. 23-CV-03417-VC, 2025 WL 1752484 (N.D. Cal. June 25, 2025). That ruling foreclosed a group of authors’ copyright infringement claims against an AI developer, which the plaintiffs claimed downloaded their copyrighted books from “shadow libraries” and used their works to train LLMs.
It appears that free-use websites, public libraries, and purchased books that are then copied are all fair game for AI to learn from, but using pirated copies of copyrighted books is infringing conduct. Yet, how many times have you walked into a bookstore, browsed, learned a few things, and walked out, without buying anything? Where do we draw the line? It’s easy to call an action “pirating,” but when we invite the audience in to take a peek, the line may not be crossed.
