News: 0176899227


OpenAI Accused of Training GPT-4o on Unlicensed O'Reilly Books (techcrunch.com)

(Wednesday April 02, 2025 @04:30AM (msmash) from the secret-sauce dept.)


A [1]new paper [PDF] from the AI Disclosures Project claims OpenAI likely trained its GPT-4o model on [2]paywalled O'Reilly Media books without a licensing agreement. The nonprofit organization, co-founded by O'Reilly Media CEO Tim O'Reilly himself, used a method called DE-COP to detect copyrighted content in language model training data.

Researchers analyzed 13,962 paragraph excerpts from 34 O'Reilly books, finding that GPT-4o "recognized" significantly more paywalled content than older models like GPT-3.5 Turbo. The technique, also known as a "membership inference attack," tests whether a model can reliably distinguish human-authored texts from paraphrased versions.
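The quiz idea behind that test can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the paper's implementation: the function name `quiz_accuracy`, the `choose` callback standing in for a model, and the shuffled multiple-choice layout are all hypothetical. The sketch counts how often the chooser picks the verbatim passage out of a set that also contains paraphrases; accuracy well above chance (1 divided by the number of options) would hint that the passage had been seen before.

```python
import random
from typing import Callable, Sequence


def quiz_accuracy(
    originals: Sequence[str],
    paraphrases: Sequence[Sequence[str]],
    choose: Callable[[Sequence[str]], int],
    seed: int = 0,
) -> float:
    """Return the fraction of quizzes in which `choose` picks the verbatim passage.

    `choose` is a hypothetical stand-in for a language model: it receives the
    shuffled options and returns the index of the one it believes is original.
    """
    if not originals:
        raise ValueError("need at least one passage")
    rng = random.Random(seed)  # fixed seed so the shuffles are repeatable
    correct = 0
    for original, fakes in zip(originals, paraphrases):
        options = [original, *fakes]
        rng.shuffle(options)  # hide the position of the verbatim passage
        if options[choose(options)] == original:
            correct += 1
    return correct / len(originals)
```

A chooser that has memorized every original scores 1.0, while one that guesses at random among four options should land near 0.25 over many passages.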

"GPT-4o [likely] recognizes, and so has prior knowledge of, many non-public O'Reilly books published prior to its training cutoff date," wrote the co-authors, who include O'Reilly, economist Ilan Strauss, and AI researcher Sruly Rosenblat.



[1] https://ssrc-static.s3.us-east-1.amazonaws.com/OpenAI-Training-Violations-OReillyBooks_Sruly-OReilly-Strauss_SSRC_04012025.pdf

[2] https://techcrunch.com/2025/04/01/researchers-suggest-openai-trained-ai-models-on-paywalled-oreilly-books/



They were just as likely read from pirated copies (Score:4, Insightful)

by TheMiddleRoad ( 1153113 )

People post stuff all over the internet, including from O'Reilly. It's probably hard not to suck up copyrighted info if you're not super careful, and these AI scumsuckers most certainly aren't.

Re: (Score:2)

by Pinky's Brain ( 1158667 )

Could have asked Suchir Balaji if he still lived.

I used to say you can always find someone with an axe to grind, but I didn't anticipate they'd be suicided.

"Likely" (Score:3)

by eclectro ( 227083 )

Maybe they bought a print copy off ebay, scanned the book using a book scanner, and then used it to "train" the computer.

What then?? Cue the end of "software licensing"??

Re: (Score:2)

by 93 Escort Wagon ( 326346 )

> Maybe they bought a print copy off ebay, scanned the book using a book scanner, and then used it to "train" the computer.

Sure, and I've got a bridge to sell you - cheap!

Re: (Score:2)

by Pinky's Brain ( 1158667 )

Google needed a fair use ruling for that, OpenAI doesn't have one yet.

Who knew? (Score:2)

by kamapuaa ( 555446 )

O'Reilly still makes books?

Re: Who knew? (Score:2)

by zawarski ( 1381571 )

Still have a couple of those animal cover books on my shelf, next to [1] and [2].

[1] https://www.amazon.com/Magic-Garden-Explained-Internals-Release/dp/0130981389

[2] https://a.co/d/5AtaSqX

openai developers... (Score:2)

by greytree ( 7124971 )

Like everyone else, OpenAI developers trained on unlicensed O'Reilly books.

So what? Until copyright terms are a fair 5 years, pirate on!

Re: (Score:2)

by fph il quozientatore ( 971015 )

But unlike many others, they have been caught red-handed and have plenty of money to sue for.

Why should they worry? (Score:3)

by SeaFox ( 739806 )

They're looking at Facebook and how much trouble it is not getting in for doing it, and have realized it's open season now for companies to ignore copyright law if AI is involved.
