AI Firms Say They Can't Respect Copyright. But A Nonprofit's Researchers Just Built a Copyright-Respecting Dataset (msn.com)
- News link: https://slashdot.org/story/25/06/07/0527212/ai-firms-say-they-cant-respect-copyright-but-a-nonprofits-researchers-just-built-a-copyright-respecting-dataset
- Source link: https://www.msn.com/en-us/news/technology/ai-firms-say-they-can-t-respect-copyright-these-researchers-tried/ar-AA1G96Ji
"A group of more than two dozen AI researchers have found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in public domain. They tested the dataset quality by using it to train a 7 billion parameter language model, which performed about as well as comparable industry efforts, such as Llama 2-7B, which Meta released in 2023."
> [2]A paper published Thursday detailing their effort also reveals that the process was painstaking, arduous and impossible to fully automate. The group built an AI model that is significantly smaller than the latest offered by OpenAI's ChatGPT or Google's Gemini, but their findings appear to represent the biggest, most transparent and rigorous effort yet to demonstrate a different way of building popular AI tools....
>
> As it turns out, the task involves a lot of humans. That's because of the technical challenges of data not being formatted in a way that's machine readable, as well as the legal challenges of figuring out what license applies to which website, a daunting prospect when the industry is rife with [3]improperly licensed data. "This isn't a thing where you can just scale up the resources that you have available" like access to more computer chips and a fancy web scraper, said Stella Biderman [executive director of the nonprofit research institute Eleuther AI]. "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that's just really hard."
>
> Still, the group managed to unearth new datasets that can be used ethically. Those include a set of 130,000 English language books in the Library of Congress, which is nearly double the size of the popular-books dataset Project Gutenberg. The group's initiative also builds on recent efforts to develop more ethical, but still useful, datasets, such as [4]FineWeb from Hugging Face, the open-source repository for machine learning... Still, Biderman remained skeptical that this approach could find enough content online to match the size of today's state-of-the-art models... Biderman said she didn't expect companies such as OpenAI and Anthropic to start adopting the same laborious process, but she hoped it would encourage them to at least rewind back to 2021 or 2022, when AI companies still shared a few sentences of information about what their models were trained on.
>
> "Even partial transparency has a huge amount of social value and a moderate amount of scientific value," she said.
[1] https://www.msn.com/en-us/news/technology/ai-firms-say-they-can-t-respect-copyright-these-researchers-tried/ar-AA1G96Ji
[2] https://bit.ly/common-pile-v0p1-paper
[3] https://www.washingtonpost.com/technology/2023/10/25/data-provenance/
[4] https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
Um... (Score:4, Insightful)
> AI Firms Say They Can't Respect Copyright
Pretty sure it's not really up to them, legally.
> A group of more than two dozen AI researchers have found that they could build a massive eight-terabyte dataset using only text that was openly licensed or in public domain.
So it's really more like "won't" than "can't" ...
Re:Um... (Score:4, Insightful)
>> AI Firms Say They Can't Respect Copyright
> Pretty sure it's not really up to them, legally.
And their response: "Who cares about legality - that's for the courts to settle, after you spend money you don't have suing us."
Re: (Score:3, Insightful)
> Pretty sure it's not really up to them, legally.
In a fair and just world, you would be right. In this world, however, the super-rich are beholden to a different set of rules than the rest of us, and something like AI is just too interesting to allow pesky laws to get in the way (especially laws that are, by and large, only protecting copyrights held by the not-so-rich).
Re: (Score:2)
People like to pretend the "rules" are clear. Just because AI billionaires are criminals does not mean they committed the copyright violations alleged.
Re: (Score:2)
"Pretty sure it's not really up to them, legally."
Nor do they say it.
"So it's really more like "won't" than "can't" ..."
They don't say that either.
There is no reason to tell lies. AI companies are scumbags; that doesn't mean we have to be.
They've got hundreds of billions of dollars (Score:2)
So yeah it's up to them. We are very much a nation of men not laws now so whoever has the most money makes the law in that instant.
Re: (Score:2)
I can confidently predict rsilvergun is going to do exactly what he always does: nothing. He's not the slightest bit interested in doing anything about whatever he's complaining about any more than the intelligentsia of pre-revolution Russia were, but like them, he's only interested in complaining.
Correction (Score:4, Insightful)
AI firms won't pay to respect copyright
On the one hand, I can only hope this leads to revisiting the insanity of copyright law.
On the other, fuck them for double dealing with regards to what ownership actually means ("I'm alright, Jack.").
Training does Respect Copyright (Score:2)
> AI firms won't pay to respect copyright
They do not need to pay. Copyright, as the name says, is the right to copy and distribute something. So long as you purchase a legal copy you are allowed to use it as you wish provided you do not distribute copies.
If I buy a book the copyright holder cannot tell me that I'm only allowed to read 5 pages a day, or that I can't use it to balance a table, prop open a door or even burn it. Similarly, they can't tell me that I'm not allowed to use it to train a machine learning algorithm provided that the a
Re: (Score:2)
Uh-huh.
Take something like music. There are specific licenses for specific uses. We already have a legal framework with regards to sampling. Imagine my dismay that none of these people spoke up then, but now the cost of sampling and the morass of licensing is suddenly an issue.
But tell me, is any of the software copyrighted?
Oh...
Re: (Score:2)
> There are specific licenses for specific uses.
Yes, but only around two things: public performance and copying/distribution, and arguably public performance is a form of distribution.
Re: (Score:2)
Yeah but Meta torrented the entirety of Z-Library, reportedly.
They won't even pay for one copy, even setting aside the issue of trained networks being a derivative work.
Re: (Score:1)
Yes, but that's a different issue. That would be illegal whether it was being used for AI training or not.
Re: (Score:1)
OK, let's consider intelligence, artificial or "real". Let's feed the AI all the books of learning, from Dick and Jane all the way up to a PhD in your choice. One copy each, from your local/high school/college book store. I'm sure the AI people would not object to the cost so far. Then let's set the AI up with a normal-speed internet connection and let it explore for, say, 10 hours per day. OK, so now we have a trained AI. I see no copyright issues here not found in a child genius with a perfect memory. T
They're lying. (Score:4, Insightful)
> AI Firms Say They Can't Respect Copyright.
They say that because they're fucking liars.
They only want to serve their real customers, which is their (potentially future) investors/shareholders - they don't give a shit about anyone else, including those that ever produced the content their models have been trained on (the models which wouldn't have any use without that content existing to begin with).
Re: (Score:3)
> they don't give a shit about anyone else,
So they're like everyone who makes excuses for why they steal music, videos, and software?
Re: (Score:2)
Except in the AI case, it's not clear that it isn't fair use. People now accusing AI training of criminality are worse.
Re: (Score:2)
"They say that because they're fucking liars."
They don't say it, that's just a troll that you are excited to believe. But, yes, they are fucking liars.
Copyrights redefined (Score:1)
Almost everything should already be in the public domain. The same ethos behind patents was supposed to apply to copyrights: a temporary monopoly for the creator so they can live off it for a while, with the work then becoming public for the benefit of the invention/creation for the rest of humanity. The way this got distorted so disparately between patents and copyrights is really an embarrassment. How does it make any ethical sense for copyright to persist 150 years AFTER the death of its author, while patents expire 20 years
Re: (Score:1)
Very simple, really.
Medicine is inherently useful.
The mouse is only useful for making money.
Re: Copyrights redefined (Score:1)
Yeah, making money while stifling creativity. You know, Big Pharma could have lobbied to extend patent terms to 200 years, destroying all generics, in the same way that Disney did. Also, the irony is that Disney freaking plagiarized and used stories in the PUBLIC DOMAIN, and then they freaking closed the door behind them with this absurd law. They were the primary beneficiaries of using other people's work to create something, admittedly, beautiful and original. This is the very thing they are imped
Re: (Score:2)
> Also the irony is that Disney freaking plagiarized and used stories in the PUBLIC DOMAIN...
No, that's not what happened because once something enters the Public Domain, nobody owns it any more, and anybody who wants is free to use it however they want. That's why Disney sticks to stories in the Public Domain so that they don't have to pay royalties.
An AI without the training data .. (Score:3)
AI is only as effective as the data it's trained on — without that data, it's as useful as asking a rock. The claim that no original data is retained internally is misleading. Marketing AI without compensating data creators is, in essence, intellectual property theft.
Re: (Score:2)
"Marketing AI without compensating data creators is, in essence, intellectual property theft."
It is not, marketing is marketing.
And it remains to be seen if training is IP theft, so far the focus has been on copying and storing data, not training with it. They are doing that because it's not clear that training isn't fair use.
Re: (Score:2)
Obviously. And a criminal business model should not only get you shut down. It should get you sent to prison.
Maybe an adversarial approach (Score:2)
Similar to GAN image generation, you can simultaneously train an LLM and a copyright classifier, to minimize the ability to output stuff that violates copyright. It's not really the training that's the problem, but the possibility of spitting it back out again without attribution.
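The adversarial idea above can be illustrated with a deliberately toy sketch: a one-dimensional "model" whose language-model loss pulls its output toward the training data, plus a hand-made Gaussian bump standing in for a copyright classifier that peaks at a memorized value. All the constants (0.8 for the data optimum, 1.0 for the "copyrighted" value, the 0.1 width, the learning rate) are illustrative assumptions, nothing like a real LLM or a trained classifier:

```python
import math

def train(lam, steps=2000, lr=0.05):
    """Gradient descent on a combined objective for a scalar output y.

    lm loss:  (y - 0.8)^2                   -- pulls y toward the training data
    penalty:  lam * exp(-(y - 1.0)^2 / 0.1) -- toy 'copyright classifier',
                                               maximal at the memorized y = 1.0
    """
    y = 0.0
    for _ in range(steps):
        grad_lm = 2.0 * (y - 0.8)
        # derivative of lam * exp(-(y - 1)^2 / 0.1) with respect to y
        grad_pen = lam * math.exp(-(y - 1.0) ** 2 / 0.1) * (-2.0 * (y - 1.0) / 0.1)
        y -= lr * (grad_lm + grad_pen)
    return y

# Without the penalty the model settles at the data optimum (~0.8); with it,
# the optimum is pushed further away from the memorized value 1.0.
print(train(0.0))
print(train(1.0))
```

In a real setup both sides would be networks trained jointly, GAN-style, and the classifier's gradient would discourage verbatim reproduction rather than a single scalar value; the point of the sketch is only that the penalty shifts the optimum away from memorized content.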
Re: (Score:2)
That is true, but a "copyright classifier" contains what? Given the term used, it appears you are suggesting another LLM with imperfect memorization. How do you think that solves any problem?
The hard part is in the doing; what you are suggesting is obvious.
Re: (Score:2)
... "you" ...
Also, it should be mentioned that humans are notorious for inadvertently making these kinds of copyright violations themselves, not necessarily with text because their recall isn't that good, but with music it happens frequently. If you think a detector applied to output is going to solve problems, intuition says you will be disappointed.
But you are right, it's not clear there is copyright violation during training but there certainly is during inferencing. Problem is, it is not AT ALL clear
Re: (Score:2)
It's very clear that training is a copyright violation. FTFY.
Re: Maybe an adversarial approach (Score:2)
Why? How is it fundamentally different from reading copyrighted works in school? In both cases you're adjusting a network using the material, not memorizing it. Except of course sometimes it does. That's what we need to fix.
Re: Maybe an adversarial approach (Score:2)
There are many plagiarism detectors out there. Pick one.
Then don't exist. (Score:3)
If your business is incapable of existing without breaking the law, then the obvious answer is that your business should not exist. How is this even a question? In the past, EVERY company that flouted copyright has been bankrupted, but now, with companies doing it en masse, it's suddenly OK?
I'm calling bullshit on all of these companies. If you want to reform copyright then do it like all the other businesses have, buy a congressman because you aren't special.
Pointless debate (Score:2)
The ability of AI to replace white collar workers is worth trillions. It also decouples the 1% from needing large numbers of consumers and employees to maintain their lifestyles.
The laws will be rewritten to suit the needs of AI because they suit the needs of your ruling class.
And human beings won't do away with their ruling class because they like to pretend that all the chaos and misery in the world is under control.
Re: (Score:2)
> the ability for AI to replace white collar workers is worth trillions.
Ah, yes, it does not look good on that front. More like single-digit percentage efficiency gains. But more stress on the workers, so these may be negative gains in effect. Overall, AI is, again, an abject failure that delivers a minuscule amount of what its proponents claim.
Protection... (Score:2)
I am far more concerned with AI bots pillaging the company data for no other reason than it's there and might be useful to AI. Especially AI embedded in ubiquitous things like Google apps and office 365, that reside inside the network, but reach out, phone home etc.
If you can't respect the law (Score:2)
Then you cannot operate, full stop. Shut them all down until they can obey the laws.
Re: (Score:2)
That is far too friendly. Shut them down, impound their fortunes and imprison the perpetrators.
AI Firms Say They Can't Respect Copyright (Score:3)
> AI Firms Say They Can't Respect Copyright
Then your business model is illegal. Shut it down.
Re: (Score:2)
Indeed. Criminals usually claim they are not criminals and they had no choice and it really is somebody else's fault.
"Respecting copyright" != "Ethically" (Score:4, Insightful)
Copyright itself has been twisted so far from its original intent that I feel little urge to respect it, and little remorse at breaking it. I will respect copyright when it respects me.
Re: (Score:1)
> I will respect copyright when it respects me.
Word.
Re: (Score:1)
Copyright isn't even an issue. The word "use" has been thrown around so many times that many people have come to believe copyright law lets copyright owners control the use of their works. It doesn't. The law only applies to copying, distributing, and public performances. It says nothing about AI training. Maybe it SHOULD cover that. But Congress hasn't passed that law yet. This doesn't even require a Fair Use exemption. The works might have been illegally copied and distributed in order to assemble a tra
Re: (Score:2)
There are major problems with copyright. Like the absurdly long terms that mean a century after a work is written, the author's descendants may still be collecting royalties on it. Or DMCA style laws that abuse copyright for unrelated purposes, like saying you can't repair your own possessions because it would violate a copyright. It's absurd and it needs to be fixed.
But the AI companies don't care about that. They aren't on your side. They aren't fighting those things. The only thing they care about