Creators demand tech giants fess up and pay for all that AI training data

(2025/02/07)

Reference: 1738931352
News link: https://www.theregister.co.uk/2025/02/07/ai_training_data_committee/
Source link:

Governments are allowing AI developers to steal content – both creative and journalistic – for fear of upsetting the tech sector and damaging investment, a UK Parliamentary committee heard this week.

Despite a tech industry figure insisting that the "original sin" of text and data mining had already occurred and that content creators and legislators should move on, a joint committee of MPs heard from publishers and a composer angered by the tech industry's unchecked exploitation of copyrighted material.

The Culture, Media and Sport Committee and Science, Innovation and Technology Committee asked composer Max Richter how he would know if "bad-faith actors" were using his material to train AI models.

"There's really nothing I can do," he told MPs. "There are a couple of music AI models, and it's perfectly easy to make them generate a piece of music that sounds uncannily like me. That wouldn't be possible unless it had hoovered up my stuff without asking me and without paying for it. That's happening on a huge scale. It's obviously happened to basically every artist whose work is on the internet."

Richter, whose work has been used in a number of major film and television scores, said the consequences for creative musicians and composers would be dire.

"You're going to get a vanilla-ization of music culture as automated material starts to edge out human creators, and you're also going to get an impoverishing of human creators," he said. "It's worth remembering that the music business in the UK is a real success story. It's £7.6 billion income last year, with over 200,000 people employed. That is a big impact. If we allow the erosion of copyright, which is really how value is created in the music sector, then we're going to be in a position where there won't be artists in the future."

Speaking earlier, former Google staffer James Smith said much of the damage from text and data mining had likely already been done.

"The original sin, if you like, has happened," said Smith, co-founder and chief executive of Human Native AI. "The question is, how do we move forward? I would like to see the government put more effort into supporting licensing as a viable alternative monetization model for the internet in the age of these new AI agents."

But representatives of publishers were not so sanguine.

Matt Rogerson, director of global public policy and platform strategy at the Financial Times, said: "We can only deal with what we see in front of us and [that is] people taking our content, using it for the training, using it in substitutional ways. So from our perspective, we'll prosecute the same argument in every country where we operate, where we see our content being stolen."

The risk, if the situation continued, was a hollowing out of creative and information industries, he said.

Rogerson said an FT-commissioned study found that 1,000 unique bots were scraping data from 3,000 publisher websites. "We don't know who those bots work with, but we know that they're working with AI companies. On average, publishers have got 15 bots that they're being targeted by each for the purpose of extracting data for AI models, and they're reselling that data to AI platforms for money."

[6]Court docs allege Meta trained its AI models on contentious trove of maybe-pirated content

[7]Judge tosses publishers' copyright suit against OpenAI

[8]Major publishers sue Perplexity AI for scraping without paying

[9]OpenAI to reveal secret training data in copyright case – for lawyers' eyes only

Asked about the "unintended consequences" of creative and information industries being able to see how AI companies get and use their content and be compensated for it, Rogerson said tech companies could take lower margins, but that was something governments seemed reluctant to implement.

"The problem is we can't see who's stolen our content. We're just at this stage where these very large companies, which usually make margins of 90 percent, might have to take some smaller margin, and that's clearly going to be upsetting for their investors. But that doesn't mean they shouldn't. It's just a question of right and wrong and where we pitch this debate. Unfortunately, the government has pitched it in thinking that you can't reduce the margin of these big tech companies; otherwise, they won't build a datacenter."

Sajeeda Merali, Professional Publishers Association chief executive, said that while the AI sector is arguing that transparency over data scraping and ML training data would be commercially sensitive, its real concern is that publishers would then ask for a fair value in exchange for that data.

Meanwhile, publishers were also concerned that if they opted out of sharing data for ML training, they would be penalized in search engine results.

The debate around data used for training LLMs spiked after OpenAI's ChatGPT landed in 2022. The company is valued at around $300 billion. While Microsoft launched a $10 billion partnership with OpenAI, Google and Facebook are among other companies developing their own large language models.

Last year, Dan Conway, CEO of the UK's Publishers Association, [10]told the House of Lords Communications and Digital Committee that large language models were infringing copyrighted content on an "absolutely massive scale," arguing that the Books3 database – which lists 120,000 pirated book titles – had been entirely ingested. ®

[6] https://www.theregister.com/2025/01/10/meta_libgen_allegation/

[7] https://www.theregister.com/2024/11/08/openai_copyright_suit_dismissed/

[8] https://www.theregister.com/2024/10/22/publishers_sue_perplexity_ai/

[9] https://www.theregister.com/2024/09/26/openai_training_data_author_copyright_case/

[10] https://www.theregister.com/2024/04/11/mp_committee_ai_copyright/

Eclectic Man

"... a tech industry figure insisting that the "original sin" of text and data mining had already occurred and that content creators and legislators should move on ..."

Translation: "All your data belong us"

I think Disney should compensate the entire creative world..

neilg

For its complete and utter corruption of "Copyright"

E.g., why should the great-great-great-great-great-grandson of an author make a living vicariously?

Andy 73

I already stole your car, what's the point of going to court about it? Brrrrmmm....

Tarpitting is the way to go

IamAProton

bad data is worse than no data.

T.A.S.S.

Mentat74

Theft As A Service...

That's what this is called...

Data theft on an industrial scale...

Tubz

Using copyrighted work to train AI is no different from piracy. I'd love to see the artists get a High Court order requiring ISPs to block the corporate IP ranges of companies guilty of this crime and refusing to pay compensation, effectively knocking them offline!

There is a technological solution for this...

NapTime ForTruth

I believe the term of art is "Rods from God".

https://www.popsci.com/scitech/article/2004-06/rods-god/

(Icon is literal.)

Doctor Syntax

Despite a tech industry figure insisting that the "original sin" of text and data mining had already occurred and that content creators and legislators should move on.

Not a problem. It's reversible. If you can't pay for what you took just delete the training. All of it. And hose backups.

Here AI...

IGotOut

...eat some Nightshade bitches.

https://nightshade.cs.uchicago.edu/whatis.html

False perceptions by 'creators'

Long John Silver

Implicit in demands by 'creators', and by the massive industry (plus middlemen) marketing digital creations, is an assumption that ideas and digitally encoded products possess substance like physical artefacts and can therefore be 'owned' in the same sense.

Realisation of the specious nature of 'intellectual property', such as enshrined in the Statute of Anne (1709), has been dawning painfully for copyright rentiers during the massive global build up of digital connectivity. Digital artefacts are not containable in locked cabinets. Indistinguishable copies can be made and distributed at negligible cost by anyone possessing a single copy. That is hard reality: a fact of life. Digital sequences cannot be 'stolen' because, unlike with a physical artefact tethered to a unique instantiation, a putative 'owner' cannot be deprived of his master copy. The losses whined about relate to potential income from renting-out sequences under the aegis of an artificially created monopoly; sequences have no intrinsic value, hence they lack scarcity, and therefore there cannot be 'price discovery' in the context of supply-and-demand market economics.

Set aside the concerns of 'entitled' publishers and those of the middlemen, to consider the matter from the point of view of people capable of 'creating' something of lasting cultural worth, there being far fewer of these than the industries 'milking' also-ran talent admit. People who genuinely 'create' are internally 'driven'. Some may not care about recognition, but most value, perhaps crave, admiration and respect from people deemed capable of grasping the creator's particular niche in culture. Truly 'driven' folk seek to devote all their time to creative endeavour and its offshoots (e.g. education). Most need to generate income from beyond their own resources.

Suppose, somebody has a burning desire to establish herself as an author of bodice-ripping yarns. She will devote free time to writing. She will hone her skills and seek constructive criticism from friends, teachers, and other acquaintances. Traditionally, an aspirant writer had to pique the interest of a publishing house. In general, said writer starts off trying to get short stories published in magazines and the like.

Somehow, an ethos of 'entitlement' has arisen, wherein the mere fact of a work being published confers an unquestionable accolade. People humbly buy 'the book'. If the purchaser doesn't like, or understand, the work, then, should the author already be 'known', the buyer most likely is revealing his own inadequacy. Regardless, trying to persuade a bookseller to give reimbursement for a book the customer regards as, by virtue of content, 'unfit for purpose' is an errand for fools.

Bear in mind, by now our author is thoroughly imbued by the attitude of her publisher with respect to 'rights'. Of course, the publisher gets the lion's share of revenue; in days past the publisher has taken an investment risk by ordering a print run. Electronic publication somewhat changes that.

The digital era renders previous practice almost anachronistic. An aspiring writer, perhaps paying for initial advice, can publish directly online. The author must build a following of appreciative readers. That is, to acquire 'reputation'. The author can solicit financial support, e.g. via crowdfunding, for new projects. Perhaps, writing can become a full-time occupation with money set aside for a pension. The writer can interact directly with the growing body of admirers. In the absence of copyright (it is increasingly unenforceable anyway) 'reputation' is her sole asset; it designates her place in the competitive market of skills seeking patronage for her genre of writing. Anyone can distribute digital copies of the bodice-ripping stories. Anyone can take her works, a particular work even, and 'derive' a new version. For example, in a different genre one could rewrite a tale about 'Harry Potter' with a new ending.

But, shall not our authoress be ripped-off, left, right, and centre, by unscrupulous people? Not if 'entitlement to attribution', a key protection against barefaced plagiarism, is given legal backing. Anyone is entitled to do what they like with the thrilling yarns, but whatever they distribute must give clear recognition of origin. The 'Gordian Knot' of copyright is sliced to be replaced by simpler-to-understand legislation based upon the concepts of misrepresentation, fraud, deceit; these tied into principles of civil and criminal law.

“The question is, how do we move forward?”

Rich 2

The way forward is to legislate to force the AI slurpers to remove the stolen data from their models.

And if they can’t do that (and they keep saying they can’t untangle it) then they must delete the WHOLE of their model and its associated data and start again, this time WITHOUT stealing stuff

Yes it will cost them a fortune. My heart bleeds - they shouldn’t have done it in the first place

Why are governments so utterly shit when it comes to dealing with crap like this?

Re: “The question is, how do we move forward?”

Anna Nymous

To rephrase your question: "Why do companies get away with stuff an individual would not get away with?"

It is irrelevant whether or not it is "too hard" to untangle the unlicensed content from their models; its infeasibility should not factor into the decision at all. Either they can continue to use their model, provided they can remove all unlicensed content in its entirety as well as from any up- or downstream data sets they use, and provide third-party verified proof they did so, or, if they can't do all of the aforementioned, then they can't use any product(s) derived from the illicitly used materials. It's that simple.

If we were serious about holding people accountable (but we'd have to be a nation of laws for that, and we aren't), then there would be actual repercussions and accountability for the officers of the company. Isn't that why they are paid the big bucks, because they are 'responsible' or something? If you want to play CEO or be some other senior officer of the company or hold effective power over direction and activities of the company, that's great, but that comes with strings attached: you will be personally liable for the activities of the company. Let's see how much law breaking still happens if we remove that immunity...

Similarly, if "companies are people", let's start executing some of them: corporate death for the entities and prohibitions for any of its officers on being an officer of any company, or having effective control over the direction of a company.

But now watch the courts kowtow to these perpetrators and not only placate them, but ask them for their own suggestions regarding "how would you, as the person being told to appear in front of the court today, like to go about rectifying things? Oh, you suggest a pinky promise to not do it again but keep what you have? That should do, of course...". Anything suggested by these perpetrators as reparations should be immediately dismissed as not enough, because nothing they will suggest would move the needle one [1]angstrom.

[1] https://en.wikipedia.org/wiki/Angstrom

Scheme of things

elsergiovolador

Creators, aka working class, the pleb, excess carbon are just expendable cogs in the machinery of the world.

They should be grateful that their creations will be forever embedded in the AI models that in the future will take over being in charge of the planet.

That poem you wrote, that song you recorded, that angry comment you sent on public forum, the picture you took of your dumb face, this will all become the foundation of what makes the AI.

It's like having children, but much bigger than that. AI will travel to other universes and your little contribution with it.

All the rest is just being salty, because you will not get some cash from it.

Think bigger, think different.

Re: Scheme of things

Anna Nymous

I'm afraid [1]Poe's law has struck here and would appreciate an indicator of your intent.

[1] https://en.wikipedia.org/wiki/Poe%27s_law

where this gets real sticky

Omnipresent

There are only 12 notes in Western music scales, and where does art come from if not taking inspiration from past artists? In short, almost all music melodies have been made, and stolen, and reused at some point.

Then came software that allowed you to dissect and "explode" music into its individual parts, rearrange a few notes with little or no training, and BAM! "Look ma, I'm a musician, no hands!"

It became more about content than listening. It got even stickier with illegal software downloads (ironically, often from Russian websites lol). Then top acts and artists, that were really just content creators, that were young enough to work the web for attention started getting paid to advertise and push the "A.I." inspired agents that would do the stealing and rearranging for you. The kids instantly latched onto that of course. It's the latest and greatest!

All of a sudden these same influencers are getting ripped off by the up and comers and are like "wait! not like that!" lol

The music industry has very good lawyers that can go after content creators using such things for top acts, but not for the smaller indie guys, and the labels want the youth movement. They don't want to isolate their sales base.

A very sticky web indeed. As AC/DC said... "who made who?!"
