How Anthropic Built Claude: Buy Books, Slice Spines, Scan Pages, Recycle the Remains (msn.com)
- Reference: 0180664942
- News link: https://news.slashdot.org/story/26/01/27/146242/how-anthropic-built-claude-buy-books-slice-spines-scan-pages-recycle-the-remains
- Source link: https://www.msn.com/en-us/technology/artificial-intelligence/how-silicon-valley-built-ai-buying-scanning-and-discarding-millions-of-books/ar-AA1V4aZv
The company spent tens of millions of dollars on the effort and hired Tom Turvey, a Google executive who had worked on the legally contested Google Books project two decades earlier. Anthropic bought books in batches of tens of thousands from retailers including Better World Books and World of Books. A vendor document noted the company was seeking to scan between 500,000 and two million books.
Before Project Panama, Anthropic co-founder Ben Mann downloaded books from LibGen, a shadow library of pirated material, over 11 days in June 2021. He later shared a link to the Pirate Library Mirror site with colleagues, writing "this is awesome!!!" Meta employees similarly downloaded books from torrent platforms after approval from Mark Zuckerberg, court filings allege, though one engineer wrote that "torrenting from a corporate laptop doesn't feel right." Anthropic settled for $1.5 billion in August without admitting wrongdoing.
and here i thought they were one of the good ones (Score:1, Insightful)
pretty wild how flagrant the coordinated copyright infringement operation was here. large payments from these deep pocketed oligopolies isn't enough to address the issue. clearly the material they are stealing is worth well in excess of the damages they're paying. we need a way to prevent the crime from happening in the first place. like proactively.
Re:and here i thought they were one of the good one (Score:5, Insightful)
How exactly is training from physical books copyright infringement? This is exactly the kind of use that is allowed under fair use. Once you purchase a physical book, you are allowed to do what you want with it, even scan it to your computer and do data analysis on it or index it, as long as you don't republish it. AI companies are certainly not republishing books they use for training.
Re: (Score:2)
that's a fair point. i didn't consider that. thanks for your comment
Re: (Score:1)
If it was an audio CD and you're profiting from its content, the RIAA would without doubt want to shut that down ASAP.
Re: (Score:2)
Fair-use protections end the moment you start using other people's copyrighted works for profit.
Re: (Score:1)
This is a horseshit argument. Fair use was developed for human use, not use by an entity that is retaining vast amounts of what it processes verbatim. ChatGPT can reproduce about 40% of the Harry Potter series books verbatim. That means they have been reproduced in that data set, and that is being used for profit. It's plagiarism and FAR outside fair use.
Re: (Score:2)
Very true, but then what if a "black box" (LLM) can reproduce the exact works...? Note that as much as you can do anything you want with the paper and ink of the book, the same is not true of the particular composition of the ink molecules on the paper (i.e. the "text", the work). The *COPY*right is about making a copy of the essence of the work.
I do not know how copyright applies to a human who can do the same, i.e. reproduce the exact work from memory.
Re: (Score:2)
The problem is when someone coaxes the model to output a significant portion of a work verbatim.
Yes they can read it in and process as much as they see fit, but if some prompt demonstrates an ability for it to reconstitute the original work, or something that from a human would be called an infringing knock off, what then?
Gallica.fr (Score:5, Informative)
Way back in the mists of time, or about 1992, I worked for the company that scanned the French National Library. You can still see the images we did today - we used pretty much the method they're talking about except we would recombine certain books afterwards. We took off the spines, ran them through an automatic document feeder attached to a high speed scanner (for 1992 anyway), deskewed the images and OCR'd some of them.
One day, a production assistant came to me and said "I don't think we should guillotine this one, what do you think?". I looked at it and...flaming hell, it was the French National Academy of Science's original copy of [1]Principia Mathematica [gallica.bnf.fr] by Isaac Newton. Had we gone ahead and sliced/shredded...Douglas Adams' predictions would have come true, and we'd have been lynched by a rampaging mob of respectable physicists.
Tech - we used a combination of Mac Plus, 486SX, 486DX2 with super-incredible-powerful-specialised graphics cards containing a whole 1Mb of VRAM, and a Netware server so vast it could only be named one thing: Behemoth. I mean, what other name could we have possibly contemplated giving to a machine which had a whole 1Gb available to it...
[1] https://gallica.bnf.fr/ark:/12148/bpt6k3363w?rk=21459;2
Re: (Score:2)
Cool story! I got my first 386SX in 1992, with a whopping 2MB RAM, so 1Gb would have blown my mind... "Imagine how well Prince of Persia will run" with my brain at that time xD
Re: (Score:2)
Oh we played a lot of Prince of Persia. There was a cult of Spaceward Ho! playing as well. Fun anecdote time: Netware required you to 'ack packets, and early shareware versions of Doom had a bug in it that didn't ack. We literally filled the network up playing Doom.
Safe to say that wasn't the official reason we gave to people, and settled for "a restart seemed to fix it all". Oops.
How is this legal? (Score:2)
How come blatant large-scale download of online and offline copyrighted works and reuse in derivative works is OK if "AI" corporations do it, but download by private persons of similar works for reading / viewing is classified as piracy? (OK, besides the "AI" hype that it's somehow not a derivative work, "corporate" may be a hint ...)
Re: (Score:2)
Are you talking about the books that they bought as sources, or the previous approach of downloading a huge torrent including works in copyright?
Re: (Score:2)
And you aren't even profiting from it...they're using it to (hopefully) generate endless profits as well.
They didn't settle because they downloaded (Score:4, Informative)
Alsup ruled that the scanning was format shifting and thus 'fair use'. Alsup ruled that their usage of downloaded works for training was 'fair use'. But Anthropic kept copies of downloaded works 'as a library' - including works they didn't use for the model training. Alsup ruled that that was not fair use. Alsup also said he would not delay the trial while Anthropic appealed (which is something that usually happens), hence why Anthropic settled.
Re: (Score:2)
"Alsup ruled that the scanning was format shifting and thus 'fair use'. Alsup ruled that their usage of downloaded works for training was 'fair use'."
And the first is clearly fair use regardless of prevailing attitudes here. The second certainly could be depending on the terms of the downloads.
"But Anthropic kept copies of downloaded works 'as a library' - including works they didn't use for the model training. Alsup ruled that that was not fair use."
And it is not, but it also isn't inherently illegal. Ca
Settled? (Score:2)
> Anthropic settled for $1.5 billion in August
Settled with who? Who got $1.5 billion dollars?
Re: (Score:2)
ssshhhh. mere details. Aren't you just outraged by the whole thing? Why do you need to know this? be quiet and take in the propaganda silently.
Re: (Score:2)
The authors of the books: [1]https://apnews.com/article/ant... [apnews.com]
[1] https://apnews.com/article/anthropic-authors-copyright-judge-artificial-intelligence-9643064e847a5e88ef6ee8b620b3a44c
Hard to argue with this approach (Score:2)
There is no more clear case of fair use than this: buying physical books. Once a person or business buys a book, they don't get to continue to control what happens to that book.
But...what about AI regurgitating all that copyrighted data? Yes, it does regurgitate the data, but only in ways that are compatible with fair use. AI will never reproduce the entire work, or even large sections of it. At most, a paragraph or two. And this is exactly what humans do under fair use. We are allowed to quote small sectio
Re: (Score:2)
> At most, a paragraph or two.
Researchers have gotten the major models to regurgitate the vast majority of books by starting with a paragraph of the book. The notion that training only tweaks probabilities is complete and utter nonsense. Alsup was 95% wrong in his decision. He was blatantly lied to by Anthropic, and he bought the crap hook, line, and sinker; lock, stock, and barrel.
Re: (Score:2)
What if a human could regurgitate a book (some can)? Would this also be a copyright violation?
Re: (Score:1)
> There is no more clear case of fair use than this: buying physical books. Once a person or business buys a book, they don't get to continue to control what happens to that book.
> But...what about AI regurgitating all that copyrighted data? Yes, it does regurgitate the data, but only in ways that are compatible with fair use. AI will never reproduce the entire work, or even large sections of it. At most, a paragraph or two. And this is exactly what humans do under fair use. We are allowed to quote small sections of the text, as long as we don't reproduce the work in bulk.
> Personally, I think e-books ought to follow the same principle as physical books. You buy it, you can do what you want with it, as long as you don't republish it, and use it only for your own purposes. It should be *yours*.
yeah you guys are right. good call
Re: (Score:2)
Is the book the paper and ink or something more? I am unclear on what to think about it. There is no way to point to any particular place in the LLM that "holds the copy" but if interacted with correctly it can actually reproduce the work. But so do some humans.
Too damn lazy to build scanbots. Fascinating. (Score:1)
This fits how I see the best use of LLMs: As a knowledge base and library you can talk to.
That they just cut open millions of books shows how lazy they were. They could've just built a battery of scanbots and resold the books afterwards. Or just get them from the local library.
There are some real dimwit arseholes running some silicon valley gigs, that's for sure. And TFA proves this once again.
Rainbows End? (Score:2)
It's one of Vernor Vinge's lesser-known books (or at least I've heard less about it), and doing almost *exactly* this was one of the big story arcs.
I really don't understand why this is illegal (Score:2)
Humans can read books, most of them can't memorize them word for word. If a computer is reading a book, and not creating a copy of it, same thing? Not if the LLMs can spit out the work verbatim. But if you give them rails to say, "hey you read this book, don't violate copyright" who cares if they 'read' the book. (granted Anthropic probably has digital copies of all these books to train their models...)
When they recycle books, they recycle people (Score:4, Insightful)
It's obvious that they want to destroy human written institutions developed over millennia and replace with a proprietary AI that only they know the algorithm for. There are a lot of horrible people involved in AI and how they want to make 8 billion people dependent on their machine. "I prompt, therefore I am" is the new human mindset they want.
Re:When they recycle books, they recycle people (Score:4, Insightful)
While not wrong: The machine cannot deliver what they promised and people are getting fed up with it.
What the horrible people want is not exactly a perfect overlap with what is realistically possible.
We will find a new equilibrium.
Re: (Score:2)
> While not wrong: The machine cannot deliver what they promised and people are getting fed up with it.
It was obvious from the beginning that it never could. And it was obvious to the people hyping it, too.
> What the horrible people want is not exactly a perfect overlap with what is realistically possible.
What the horrible people want is almost never what they say they want. What they generally want is to sell a lot of stock, then bail out before the pyramid collapses and the bubble pops. They generally don't care if what they say they want is possible.
> We will find a new equilibrium.
Indeed, but that new equilibrium will be the next bubble, lather, rinse, repeat.
Re: (Score:2)
I mean yes and no... It makes you more productive... While it can't replace a human in most cases, it does make the humans that are using it significantly more productive when using it.
Re: (Score:2)
"It's obvious that they want to destroy human written institutions..."
You nailed it. These toxic big-data corpse want to "own the future", so they must own-the-past. Can't remember who observed that.
Re: (Score:2)
I believe you are remembering a quote from George Orwell's 1984: "He who controls the past controls the future. He who controls the present controls the past."
Re: (Score:2)
They made a huge mistake here. Destroying the book (there are literally scanners for scanning books without destroying the book) means that they no longer have the book they claim to have the right to use.