OpenAI Accused of Training GPT-4o on Unlicensed O'Reilly Books (techcrunch.com)
- Reference: 0176899227
- News link: https://news.slashdot.org/story/25/04/02/0440222/openai-accused-of-training-gpt-4o-on-unlicensed-oreilly-books
- Source link: https://techcrunch.com/2025/04/01/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books/
Researchers analyzed 13,962 paragraph excerpts from 34 O'Reilly books, finding that GPT-4o "recognized" significantly more paywalled content than older models like GPT-3.5 Turbo. The technique, known as a "membership inference attack," tests whether a model can reliably distinguish human-authored texts from paraphrased versions of the same text.
"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors, who include Tim O'Reilly, economist Ilan Strauss, and AI researcher Sruly Rosenblat.
[1] https://ssrc-static.s3.us-east-1.amazonaws.com/OpenAI-Training-Violations-OReillyBooks_Sruly-OReilly-Strauss_SSRC_04012025.pdf
[2] https://techcrunch.com/2025/04/01/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books/
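The summary describes the test only in outline, so the following is a rough sketch of the general idea and not the researchers' actual code. It assumes a multiple-choice setup: each original excerpt is shuffled in among paraphrases, a model is asked to pick the original, and the hit rate is compared with random guessing. The names recognition_rate, chance_level, memorizing_model, and never_right_model are hypothetical, and the "models" are toy stand-ins, not real API calls.

```python
import random
from typing import Callable, List, Sequence, Tuple

# One trial: (original excerpt, list of paraphrased versions). Toy data, not from the paper.
Trial = Tuple[str, List[str]]


def recognition_rate(trials: Sequence[Trial],
                     choose: Callable[[List[str]], int],
                     seed: int = 0) -> float:
    """Fraction of trials in which `choose` picks the original excerpt.

    `choose` is a hypothetical stand-in for querying a model: it receives the
    shuffled options and returns the index it believes is the original text.
    """
    rng = random.Random(seed)  # fixed seed so the shuffling is reproducible
    correct = 0
    for original, paraphrases in trials:
        options = [original, *paraphrases]
        rng.shuffle(options)  # hide the original's position
        if options[choose(options)] == original:
            correct += 1
    return correct / len(trials)


def chance_level(num_paraphrases: int) -> float:
    """Expected hit rate of a model that guesses uniformly at random."""
    return 1.0 / (num_paraphrases + 1)


TRIALS: List[Trial] = [
    ("Use a list when order matters.",
     ["Lists are for ordered data.", "Pick a list if sequence counts.", "Ordered items go in lists."]),
    ("A closure captures its enclosing scope.",
     ["Closures remember outer variables.", "Enclosing scope is kept by closures.", "A closure holds its context."]),
]

# Toy "model" that has seen every original excerpt during training.
SEEN = {original for original, _ in TRIALS}


def memorizing_model(options: List[str]) -> int:
    return next(i for i, text in enumerate(options) if text in SEEN)


def never_right_model(options: List[str]) -> int:
    # Toy lower bound: always picks some option that is not the original.
    return next(i for i, text in enumerate(options) if text not in SEEN)


if __name__ == "__main__":
    print("memorizing:", recognition_rate(TRIALS, memorizing_model))
    print("never right:", recognition_rate(TRIALS, never_right_model))
    print("chance:", chance_level(3))
```

A hit rate well above the chance level on paywalled excerpts, and near chance on text published after the training cutoff, is the kind of signal the summary calls "recognition." It suggests prior exposure but does not by itself show how the text was obtained.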
"Likely" (Score:3)
Maybe they bought a print copy off eBay, scanned the book using a book scanner, and then used it to "train" the computer.
What then?? Cue the end of "software licensing"??
Re: (Score:2)
> Maybe they bought a print copy off eBay, scanned the book using a book scanner, and then used it to "train" the computer.
Sure, and I've got a bridge to sell you - cheap!
Re: (Score:2)
Google needed a fair use ruling for that, OpenAI doesn't have one yet.
Who knew? (Score:2)
O'Reilly still makes books?
Re: Who knew? (Score:2)
Still have a couple of those animal cover books on my shelf, next to [1]https://www.amazon.com/Magic-G... [amazon.com] and [2]https://a.co/d/5AtaSqX [a.co]
[1] https://www.amazon.com/Magic-Garden-Explained-Internals-Release/dp/0130981389
[2] https://a.co/d/5AtaSqX
openai developers... (Score:2)
Like everyone else, OpenAI developers Trained on Unlicensed O'Reilly Books.
So what? Until copyright terms are a fair 5 years, pirate on!
Re: (Score:2)
But unlike many others, they have been caught red-handed and have plenty of money to sue for.
Why should they worry? (Score:3)
They're looking at Facebook and how much trouble they are not getting in for doing it, and realized it's open season now for companies to ignore copyright law if AI is involved.
They were just as likely read from pirated copies (Score:4, Insightful)
People post stuff all over the internet, including from O'Reilly. It's probably hard not to suck up copyrighted info if you're not super careful, and these AI scumsuckers most certainly aren't.
Re: (Score:2)
Could have asked Suchir Balaji if he still lived.
I used to say you can always find someone with an axe to grind, but I didn't anticipate they'd be suicided.