News: 1750794760


LLMs can hoover up data from books, judge rules

(2025/06/24)


One of the most tech-savvy judges in the US has ruled that Anthropic is within its rights to scan purchased books to train its Claude AI model, but that pirating content is legally out of bounds.

In training its model, Anthropic bought millions of books, many second-hand, then cut them up and digitized the content. It also downloaded over 7 million pirated books from the Books3 dataset, Library Genesis (Libgen), and the Pirate Library Mirror (PiLiMi), and that was the sticking point for Judge William Alsup of the US District Court for the Northern District of California.

On Monday, he ruled that digitizing a purchased print copy counted as fair use under current US law: because the printed pages were destroyed after they were scanned, no additional copy of the copyrighted work was created. However, Anthropic will face trial over the pirated material.


"As Anthropic trained successive LLMs, it became convinced that using books was the most cost-effective means to achieve a world-class LLM," Alsup [2]wrote [PDF] in Monday's ruling. "During this time, however, Anthropic became 'not so gung ho about' training on pirated books 'for legal reasons.' It kept them anyway."


The case was filed by three authors - Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson - who claimed that Anthropic illegally used their fiction and non-fiction works to train Claude. At least two of each author's books were included in the pirated material used by Anthropic.

Alsup noted that Anthropic hired the former head of partnerships at Google’s book-scanning project, Tom Turvey, who began conversations with publishers about licensing content, as other AI developers have done. But these talks were abandoned in favor of simply buying millions of books, taking the pages out, and scanning them, which the judge ruled was fair use.


"We are pleased that the Court recognized that using 'works to train LLMs was transformative — spectacularly so,'" an Anthropic spokesperson told The Register.

"Consistent with copyright’s purpose in enabling creativity and fostering scientific progress, Anthropic's LLMs trained upon works not to race ahead and replicate or supplant them — but to turn a hard corner and create something different."


On the matter of piracy, however, Alsup noted that in January or February 2021, Anthropic cofounder Ben Mann "downloaded Books3, an online library of 196,640 books that he knew had been assembled from unauthorized copies of copyrighted books — that is, pirated." In June, he downloaded "at least five million copies of books" from Libgen, and in July 2022, another two million copies were downloaded from PiLiMi, both of which Alsup classified as "pirate libraries."

[6]Canadian artist wants Anthropic AI lawsuit corrected

[7]Writers sue Anthropic for feeding 'stolen' copyrighted work into Claude

[8]Anthropic's law firm throws Claude under the bus over citation errors in court filing

[9]Judge orders Feds rehire workers falsely fired for lousy performance

Alsup found that the pirated works weren't necessarily used to train Claude, but that the company had retained them anyway. That could prove legally problematic for the startup, since the judge found they were kept for "Anthropic’s pocketbook and convenience."

"This order grants summary judgment for Anthropic that the training use was a fair use. And, it grants that the print-to-digital format change was a fair use for a different reason," he wrote.

"But it denies summary judgment for Anthropic that the pirated library copies must be treated as training copies. We will have a trial on the pirated copies used to create Anthropic’s central library and the resulting damages, actual or statutory (including for willfulness). That Anthropic later bought a copy of a book it earlier stole off the internet will not absolve it of liability for the theft but it may affect the extent of statutory damages."

Alsup's ruling is mixed news for Anthropic, but he does know his onions. For the [10]last quarter of a century, Alsup has presided over some of the biggest tech trials in history, and his rulings have been backed up by the Supreme Court in some cases.


Alsup, a coder for over two decades (primarily in BASIC), presided over the Oracle-Google trial over fair use of Java code in Android, which led him to dabble in that language. More recently, he [12]sentenced former Google self-driving car engineer Anthony Levandowski to 18 months in prison for stealing proprietary information from Google and taking it to a new startup, Otto, which he later sold to Uber. President Trump pardoned Levandowski in January 2021.

Bartz and Johnson had no comment at the time of going to press. Graeber declined to discuss the ruling. ®





[2] https://regmedia.co.uk/2025/06/24/anthropic.pdf


[6] https://www.theregister.com/2024/08/31/canadian_artist_anthropic_ai_lawsuit/

[7] https://www.theregister.com/2024/08/20/anthropic_claude_copyright/

[8] https://www.theregister.com/2025/05/15/anthopics_law_firm_blames_claude_hallucinations/

[9] https://www.theregister.com/2025/03/14/government_jobs_ruling/

[10] https://www.theregister.com/2001/11/26/via_beats_off_intel_legal/


[12] https://www.theregister.com/2020/08/05/levandowski_prison_sentence/




Nuance shmuance

HuBo

An interesting perspective on fair use, where you need to at least purchase a given work to be allowed to train an AI on it (no pirating allowed -- on the input side). But the bigger question remains of how LLMs and others that [1]bend copyright [2]rules on the [3]output side, sometimes [4]as blatantly as a photocopy machine, will be dealt with.

Let's hope the [5]positive outcomes seen so far do continue to occur there too so as to keep protecting [6]genuine creativity!

[1] https://www.theregister.com/2025/03/11/meta_dmca_copyright_removal_case/

[2] https://www.theregister.com/2025/04/03/openai_copyright_bypass/

[3] https://www.theregister.com/2023/05/03/openai_chatgpt_copyright/

[4] https://arstechnica.com/features/2025/06/study-metas-llama-3-1-can-recall-42-percent-of-the-first-harry-potter-book/

[5] https://www.theregister.com/2025/02/12/thomson_reuters_wins_ai_copyright/

[6] https://www.theregister.com/2025/04/14/miyazaki_ai_and_intellectual_property/

LLMs can hoover up data from books

abend0c4

Given that the courts have already decided that the output of LLMs is not subject to copyright, the optimal solution is obviously for LLMs to train themselves on their own results. Given that it's ultimately inevitable, there's presumably a first-mover advantage. I'd say "Let the race begin!", except I'm sure it already has. Caveat non-emptor.

You never go anywhere without your soul.