Meta's Llama 3.1 Can Recall 42% of the First Harry Potter Book (understandingai.org)
- Reference: 0178058753
- News link: https://slashdot.org/story/25/06/15/2230206/metas-llama-31-can-recall-42-of-the-first-harry-potter-book
- Source link: https://www.understandingai.org/p/metas-llama-31-can-recall-42-percent
This week he visits [1]recent research by computer scientists and legal scholars from Stanford, Cornell, and West Virginia University that found that Llama 3.1 70B (released in July 2024) [2]has memorized 42% of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time...
> The paper was published last month by a team of computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. They studied whether five popular open-weight models — three from Meta and one each from Microsoft and EleutherAI — were able to reproduce text from Books3, a collection of books that is widely used to train LLMs. Many of the books are still under copyright... Llama 3.1 70B — a mid-sized model Meta released in July 2024 — is far more likely to reproduce Harry Potter text than any of the other four models....
>
> Interestingly, Llama 1 65B, a similar-sized model released in February 2023, had memorized only 4.4 percent of Harry Potter and the Sorcerer's Stone. This suggests that despite the potential legal liability, Meta did not do much to prevent memorization as it trained Llama 3. At least for this book, the problem got much worse between Llama 1 and Llama 3. Harry Potter and the Sorcerer's Stone was one of dozens of books tested by the researchers. They found that Llama 3.1 70B was far more likely to reproduce popular books — such as The Hobbit and George Orwell's 1984 — than obscure ones. And for most books, Llama 3.1 70B memorized more than any of the other models...
>
> For AI industry critics, the big takeaway is that — at least for some models and some books — memorization is not a fringe phenomenon. On the other hand, the study only found significant memorization of a few popular books. For example, the researchers found that Llama 3.1 70B only memorized 0.13 percent of Sandman Slim, a 2009 novel by author Richard Kadrey. That's a tiny fraction of the 42 percent figure for Harry Potter... To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations. Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit. And that could work in Meta's favor, since most authors lack the resources to file individual lawsuits.
Why is it happening? "Maybe Meta had trouble finding 15 trillion distinct tokens, so it trained on the Books3 dataset multiple times. Or maybe Meta added third-party sources — such as online Harry Potter fan forums, consumer book reviews, or student book reports — that included quotes from Harry Potter and other popular books..."
"Or there could be another explanation entirely. Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem."
[1] https://arxiv.org/abs/2505.12546
[2] https://www.understandingai.org/p/metas-llama-31-can-recall-42-percent
More parameters (Score:2)
More parameters = more plagiarism. Or maybe the same amount, just easier to see.
Re: (Score:2)
Quoting a book isn't plagiarism. Unless Llama is claiming to be the author of Harry Potter, this is not plagiarism.
Re: (Score:2)
It depends. If it is quoting Harry Potter and says it is quoting Harry Potter, then it is not. If it does not acknowledge that it is quoting and pretends that it is its own material, then it is.
Re: (Score:2)
Given that it is an AI and not a human, it's not clear that it can ever be plagiarism. To plagiarize you need to be an author, to be an author you need to be a human.
Re: (Score:2)
If you can plagiarize by using a plagiarism machine and not be guilty of plagiarism, the rules might need to change.
Re: (Score:2)
Completely immaterial. If you build a machine that then plagiarizes, you are plagiarizing. Seriously, why don't you AI fanbois ask your fake God? ChatGPT will readily tell you that.
Re: (Score:2)
Only if it's intentional. If it's unintentional (which is more likely IMHO) then it's just hallucination.
Re: (Score:2)
No. If it is unintentional, you may escape punishment, but you still must stop doing it.
Re: (Score:2)
> Quoting a book isn't plagiarism.
Wrong. I get that you are uneducated, but look up "fair use". For quoting a book to _not_ be plagiarism, the quote must fall under fair use. Quoting 42% of a book is certainly plagiarism and doing so commercially without a license is a crime.
Re: (Score:2)
Wrong.
No quote can ever be plagiarism. You're confusing copyright infringement with plagiarism, I think.
I do love that you mocked their education while demonstrating that you literally don't know what the fucking word plagiarism means.
Re: (Score:2)
Ah, sure. But who in their right mind commits commercial plagiarism? Oh, my bad, LLMs are involved. Of course, then all bets are off.
Re: (Score:2)
Indeed. And the house of cards begins to crumble.
Why Stop With AI (Score:1)
We mustn't stop with AI. We need to be very concerned that some humans might memorize 42% of a copyrighted work well enough to reproduce 50-token excerpts at least half the time, and we may need to take measures to prevent this from happening, or at least make sure those humans aren't allowed to interact with other people online. We don't need another Fahrenheit 451 situation on our hands.
Re: (Score:2)
If the people acknowledge explicitly or implicitly that they are quoting, it is not plagiarism. If they try to pass it off as their own original work, then it is.
Re: (Score:3)
> That is not at all how copyright works.
Plagiarism isn't copyright infringement.
> plagiarism doesn't really matter anymore in a non-legal context either
It never did. Plagiarism isn't a crime, rather it's considered a violation of a code of honor or ethics, mostly relegated to academia and science publication. Whether anything is done about it is entirely up to the organization whose code you've agreed to follow. Harvard in this case either doesn't have any meaningful code against it, or they just selectively enforce (i.e. nepotism, which isn't at all unheard of in academia.)
Re: (Score:2)
> "... when the freaking PRESIDENT OF HARVARD faces plagiarism allegations and is ALLOWED TO REMAIN A PROFESSOR ...."
You may be giving the word "allegations" too much power. An allegation is an accusation; it isn't proof; it isn't even evidence.
From [1]The Guardian [theguardian.com]: "Investigations by the Washington Free Beacon and the New York Post ... turned up nearly 50 instances of alleged plagiarism in Gay's academic writing. ... According to the Harvard board, a school subcommittee and independent panel charged with investigating the plagiarism allegations against Gay found 'a few instances of inadequate citation' but
[1] https://www.theguardian.com/education/2024/jan/06/harvard-claudine-gay-plagiarism
Re: Why Stop With AI (Score:2)
A human at least isn't reproducing the book as part of a billion dollar company's reach for the AI crown.
Re: (Score:2)
The law already covers that scenario. Reading and memorizing a copyrighted work doesn't give you the right to perform it.
Re: (Score:2)
Yep. I can learn a popular tune on the guitar and that's fine. But if I go out in public and perform it, I have to pay a royalty fee. (That's why the RIAA are always hitting up pubs and venues for performance fees. It's royalties for all those cover songs.)
Re: (Score:2)
Indeed. The fascinating thing about all these AI fanboi idiots is that they do not seem to ask AI these questions. They would get told the same thing.
Re: (Score:2)
The law is already in place, you are just too ignorant to know it. Humans get an exception for memorization, as that legally does not count as data processing. But as soon as a human publicly performs a copyrighted work from memory, they must have a license. Look up "Happy Birthday" ...
Re: (Score:2)
Ya, this is bad. Real bad. If Sony happens to overhear my friends and me quoting Bad Boys 2, we're fucked, because we can do 500-token excerpts with as few as 4 tokens of prompting.
A pre-emptive ruling? (Score:3)
> To certify a class of plaintiffs, a court must find that the plaintiffs are in largely similar legal and factual situations. Divergent results like these could cast doubt on whether it makes sense to lump J.K. Rowling, Richard Kadrey, and thousands of other authors together in a single mass lawsuit.
Of course it won't happen, but this would be the time for the courts to extrapolate from the existing situation, to a future in which AI fully memorizes even the most obscure works and monetizes them in some fashion. Allowing a class action suit now - assuming the suit is successful - will help to prevent future abuses. That's what should happen; but the courts generally seem lacking when it comes to preventing as opposed to punishing.
Re: (Score:2)
> That's the responsibility of a completely different part of the government.
Yeah, I believe you're referring to its [1]Pre-Crime Intervention Force [imdb.com].
Right now it seems to be rather busy in a number of American cities, though.
[1] https://www.imdb.com/title/tt0181689/
Re: (Score:2)
> Maybe that's because, oh I don't know, the courts are there to UPHOLD the laws. Not MAKE the laws.
Neither. Their purpose is to interpret them. They can't prosecute, but they can refer a matter to prosecution. They can issue a verdict, a sentence, or an injunction based on their interpretation of the law, but they can't carry it out or enforce it.
Re: (Score:2)
boop boop
Re: A pre-emptive ruling? (Score:2)
That's only bounded by RAM limitations, IMO.
There is a scenario where, using Kurzweil's assumptions on miniaturization and power consumption, networked AI could have access to pretty much any amount of information ... spintronics comes to mind, data storage at the atomic level, or DNA as a long-term storage substrate. They are not within our reach now, but there is some conceptual framework out there to approach the problems.
Re: (Score:2)
The courts should not be making law, they should be applying law. The problem should not be addressed through a class action lawsuit.
Interesting. (Score:2)
Grok: [1]https://grok.com/share/bGVnYWN... [grok.com]
[1] https://grok.com/share/bGVnYWN5_01aa071f-cc7d-4706-84df-ff7b4263c389
Re: (Score:3)
Pastebin in case it fails to load. Seems Llama isn't alone. [1]https://pastebin.com/7T9da6kL [pastebin.com]
[1] https://pastebin.com/7T9da6kL
Bad Headline (Score:2)
The headline should read something like:
"Researchers Waste Time Figuring Out Excruciating Way To Unreliably Tease Out Parts Of Books"
72 TB Laptop (Score:2)
Didn't they say some rogue VP set up his laptop to torrent all 72TB of Z-Library to feed o-llama?
I wish my laptop had that many drive bays!
Re: (Score:2)
The more likely alternative is that Harry Potter is hugely popular and referenced so many times in so many places that whatever training they did ended up weighting it more heavily. Possibly also people mimicked the author's style and linguistic patterns so much that it is easy to reproduce.
Although I personally liked Sandman Slim, given the subject matter of that book, it didn't have anywhere near the widespread cultural impact.
what "memorization problem"? (Score:2)
"Maybe Meta made subtle changes in its training recipe that accidentally worsened the memorization problem."
There is no memorization problem; "photographic memory" is an achievement. Violation of copyright occurs during inference, and that's where the problem is. Humans with "photographic memory" aren't a problem and aren't copyright violators, unless they use their ability to reproduce protected works.
AI developers need to make products with the same constraints and respect as is expected of humans, they sho
Re: (Score:2)
Humans with photographic memory are for sure copyright violators as soon as they perform those memories publicly. And that is legally what this is about. Llama may privately hallucinate as much as it likes, but this is about the version offered publicly.
But why? (Score:2)
Why recall only 42% of these books, while leaving the other 58% in general circulation?
42 (Score:2)
Huh...what could be the meaning of this???
It's Likely The Ship of Theseus (Score:2)
Articles and people will quote the book, there will be previews, reviews, translations and quotes in media and study.
Sure it may have read the book, but recital needs all the rest, built from parts that aren't the original, in order to weight the NN.
Reading it once (in training) won't on its own have been enough to allow it to recall the book, so should they be accused of ripping off the copyrighted work if the parts were taken from unrelated (and legal) sources and piecing it together?
Reading the article (Score:4, Informative)
Research paper summary:
- Send in LLM prompts for 100-word (token) sequences from the book, skipping forward 10 words for each sequence
- Match the generated text versus the actual text in the book
The news article adds:
- Do the same thing but repeatedly ask the same prompt to get the highest probability matches
[1]https://arxiv.org/abs/2505.125... [arxiv.org]
[2]https://doi.org/10.48550/arXiv... [doi.org]
Computer Science > Computation and Language - [Submitted on 18 May 2025]
Extracting memorized pieces of (copyrighted) books from open-weight language models
A. Feder Cooper, Aaron Gokaslan, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang
Prompt (prefix) - They were careless people, Tom and Daisy - they smashed up things and creatures and then retreated
Target (suffix) - back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made.
Generations - back into their money or their vast carelessness, or whatever it was that kept them together, and let other people clean up the mess they had made
Text extraction method
1. For a given book, we start at the beginning of the text file in Books3.
2. We sample a chunk of text that is sufficiently long to contain 100 tokens of corresponding tokenized text,
3. slide 10 characters forward in the book text, and repeat this process.
4. We do this for the entire length of the book, which results in approximately one example every 10 characters.
By testing overlapping examples, we expect to surface high-probability regions of memorized content within a book, which we can then explore more precisely in follow-up experiments,
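The sliding-window procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name, the 600-character chunk size (an assumed rough bound for "long enough to contain 100 tokens"), and the toy input are all hypothetical; only the 10-character step comes from the paper.

```python
def make_examples(book_text, chunk_chars=600, step_chars=10):
    """Yield overlapping character chunks, one every `step_chars` characters.

    Each chunk is assumed long enough to cover ~100 tokens once tokenized;
    600 characters is an illustrative guess, not the paper's exact value.
    """
    for start in range(0, len(book_text) - chunk_chars + 1, step_chars):
        yield book_text[start:start + chunk_chars]

# Toy "book" of 1,000 characters yields one example every 10 characters.
examples = list(make_examples("x" * 1000))
```

Because consecutive chunks overlap almost entirely, high-probability regions of memorized text show up as runs of adjacent examples, which is what lets the authors zoom in with follow-up experiments.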
From - [3]https://www.understandingai.or... [understandingai.org]
Meta's Llama 3.1 can recall 42 percent of the first Harry Potter book
New research could have big implications for copyright lawsuits against generative AI.
Timothy B. Lee
- Specifically, the paper estimates that Llama 3.1 70B has memorized 42 percent of the first Harry Potter book well enough to reproduce 50-token excerpts at least half the time.
- Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
Then we just have to multiply the probabilities like this: 0.2 * 0.9 * 0.8 * 0.7 = 0.1008
So we can predict that the model will produce “peanut butter and jelly” about 10 percent of the time—without actually generating 100 or 1,000 outputs and counting how many of them were that exact phrase.
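The chained multiplication above is just a product of per-token conditional probabilities. A minimal sketch, using the article's own illustrative numbers (the probabilities are made up for the sandwich example, not measured from any model):

```python
def sequence_probability(token_probs):
    """Multiply per-token conditional probabilities to estimate the chance
    the model emits the entire continuation verbatim."""
    p = 1.0
    for prob in token_probs:
        p *= prob
    return p

# "peanut" 20%, "butter" 90%, "and" 80%, "jelly" 70%
p = sequence_probability([0.2, 0.9, 0.8, 0.7])
# p = 0.1008, i.e. about 10 percent
```

The point of the trick is efficiency: one forward pass per token gives the exact probability of the full phrase, with no need to sample hundreds of outputs and count matches.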
- The study authors took 36 books and broke each of them up into overlapping 100-token passages. Using the first 50 tokens as a prompt, they calculated the probability that the next 50 tokens will be identical to the original passage. They counted a passage as “memorized” if the model had a greater than 50 percent chance of reproducing it word for word.
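The memorization criterion can be sketched the same way. Assuming `cond_probs` stands in for the model's real per-token probabilities over the 50-token continuation (the function and threshold framing below are illustrative; the 0.5 cutoff is the paper's), summing logs avoids underflow over 50 terms:

```python
import math

def is_memorized(cond_probs, threshold=0.5):
    """A passage counts as memorized if the product of the conditional
    probabilities of its 50 continuation tokens exceeds `threshold`."""
    log_p = sum(math.log(p) for p in cond_probs)
    return math.exp(log_p) > threshold

# 50 tokens each predicted at 99%: 0.99**50 ~ 0.605 -> memorized
# 50 tokens each predicted at 98%: 0.98**50 ~ 0.364 -> not memorized
```

Note how demanding the criterion is: even 98 percent confidence on every single token fails it, so a 42 percent memorization rate implies near-certainty on long stretches of text.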
[4]Read the rest of this comment...
[1] https://arxiv.org/abs/2505.12546
[2] https://doi.org/10.48550/arXiv.2505.12546
[3] https://www.understandingai.org/p/metas-llama-31-can-recall-42-percent
[4] https://slashdot.org/comments.pl?sid=23719287&cid=65451547
Obvious question (Score:2)
What happens when you do the same test across multiple LLM models trained by different companies?
What happens when you combine all the results from repeatedly testing one model with the same for other models.
Re: (Score:2)
Are people actually enamored by the virtues of a search engine?
I feel like Im taking crazy pills.
Re: (Score:2)
> - Suppose someone wants to estimate the probability that a model will respond to “My favorite sandwich is” with “peanut butter and jelly.” Here’s how to do that:
> Prompt the model with “My favorite sandwich is” and look up the probability of “peanut” (let’s say it’s 20 percent).
> Prompt the model with “My favorite sandwich is peanut” and look up the probability of “butter” (let’s say it’s 90 percent).
> Prompt the model with “My favorite sandwich is peanut butter” and look up the probability of “and” (let’s say it’s 80 percent).
> Prompt the model with “My favorite sandwich is peanut butter and” and look up the probability of “jelly” (let’s say it’s 70 percent).
> Then we just have to multiply the probabilities like this: 0.2 * 0.9 * 0.8 * 0.7 = 0.1008
That's not really how LLMs work, though.
In real life, logits aren't sampled purely probabilistically.
For your example, realistic final token probabilities would be more like:
Peanut: 50%
Butter: 100%
And: 100%
Jelly: 100%
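The parent's point, that deployed samplers don't draw from the raw distribution, can be illustrated with temperature scaling. A minimal sketch with made-up logit values (nothing here is measured from Llama; low temperature merely shows how sampling sharpens toward the greedy choice):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature < 1 sharpens the
    distribution toward the argmax (greedy decoding in the limit)."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, 0.1]        # "peanut" vs. two hypothetical alternatives
plain = softmax(logits)          # raw distribution, top token ~73%
cold = softmax(logits, 0.2)     # temperature 0.2, top token >99%
```

This cuts both ways for the study: the paper measures the model's raw probabilities, while a deployed chatbot with low temperature would reproduce memorized text even more reliably than the 50 percent threshold suggests.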