The launch of ChatGPT polluted the world forever, like the first atomic weapons tests
- Reference: 1749983527
- News link: https://www.theregister.co.uk/2025/06/15/ai_model_collapse_pollution/
- Source link:
[1]The Trinity test, in New Mexico on July 16, 1945, marked the beginning of the atomic age. One manifestation of that moment was the contamination of metals manufactured after that date, as airborne particulates left over from Trinity and other nuclear weapons tests permeated the environment.
Everyone participating in generative AI is polluting the data supply for everyone
The poisoned metals interfered with the function of sensitive medical and technical equipment. So [2]until recently, scientists involved in the production of those devices sought metals uncontaminated by background radiation, referred to as low-background steel, [3]low-background lead, and so on.
One source of low-background steel was the German naval fleet that Admiral Ludwig von Reuter [4]scuttled in 1919 to keep the ships from the British.
More about that later.
Shortly after the debut of ChatGPT, academics and technologists started to wonder whether the recent explosion in AI models had also created contamination.
Their concern is that AI models are being trained with synthetic data created by AI models. Subsequent generations of AI models may therefore become less and less reliable, a state known as AI model collapse.
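The dynamic is straightforward to reproduce in miniature. The Python sketch below is purely illustrative (it is not code from the article or from any of the papers cited): each "generation" fits a simple Gaussian model to samples drawn from the previous generation's fit, and the estimation error compounds until the fitted distribution loses the tails of the original, human-generated data.

import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0     # the original, human-generated distribution
n_samples = 100          # training-set size per generation

for generation in range(10):
    # train each new "model" only on the previous model's synthetic output
    data = rng.normal(mu, sigma, n_samples)
    mu, sigma = data.mean(), data.std()
    print(f"gen {generation}: mu={mu:+.3f}, sigma={sigma:.3f}")

# sigma tends to drift downward across generations; run long enough and the
# fitted distribution collapses toward a narrow spike around its mean.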
In March 2023, John Graham-Cumming, then CTO of Cloudflare and now a board member, registered the web domain [8]lowbackgroundsteel.ai and began posting about various sources of data compiled prior to the 2022 AI explosion, such as the [9]Arctic Code Vault (a snapshot of GitHub repos from 02/02/2020).
The Register asked Graham-Cumming whether he came up with the low-background steel analogy, but he said he didn't recall.
"I knew about low-background steel from reading about it years ago," he responded by email. "And I’d done some machine learning stuff in the early 2000s for [automatic email classification tool] [11]POPFile . It was an analogy that just popped into my head and I liked the idea of a repository of known human-created stuff. Hence the site."
Is collapse a real crisis?
Graham-Cumming isn’t sure contaminated AI corpora are a problem.
"The interesting question is 'Does this matter?'" he asked.
Some AI researchers think it does, and that AI model collapse is concerning. The year after ChatGPT’s debut, [12]several [13]academic [14]papers explored the potential consequences of model collapse, or [15]Model Autophagy Disorder (MAD), as one set of authors termed the issue. The Register [16]interviewed one of the authors of those papers, Ilia Shumailov, in early 2024.
Though AI practitioners have argued that model collapse [18]can be mitigated , the extent to which that's true [19]remains a matter of ongoing debate .
Just last week, Apple researchers entered the fray with an analysis of [20]model collapse in large reasoning models (e.g. OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking), only [21]to have their conclusions challenged by Alex Lawsen, senior program associate with Open Philanthropy, with help from AI model Claude Opus.
Essentially, Lawsen argued that Apple's reasoning evaluation tests, which found reasoning models fail at a certain level of complexity, were flawed because they forced the models to write more tokens than they could accommodate.
In December 2024, academics affiliated with several universities reiterated concerns about model collapse in [26]a paper titled "Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training."
They contended the world needs sources of clean data, akin to low-background steel, to maintain the function of AI models and to preserve competition.
"I often say that the greatest contribution to nuclear medicine in the world was the German admiral who scuppered the fleet in 1919," Maurice Chiodo, research associate at the Centre for the Study of Existential Risk at the University of Cambridge and one of the co-authors, told The Register . "Because that enabled us to have this almost infinite supply of low-background steel. If it weren't for that, we'd be kind of stuck.
"So the analogy works here because you need something that happened before a certain date. Now here the date is more flexible, let's say 2022. But if you're collecting data before 2022 you're fairly confident that it has minimal, if any, contamination from generative AI. Everything before the date is 'safe, fine, clean,' everything after that is 'dirty.'"
What Chiodo and his co-authors – John Burden, Henning Grosse Ruse-Khan, Lisa Markschies, Dennis Müller, Seán Ó hÉigeartaigh, Rupprecht Podszun, and Herbert Zech – worry about is not so much that models fed on their own output will produce unreliable information, but that access to supplies of clean data will confer a competitive advantage to early market entrants.
With AI model-makers spewing more and more generative AI data on a daily basis, AI startups will find it harder to obtain quality training data, creating a lockout effect that makes their models more susceptible to collapse and reinforces the power of dominant players. That's their theory, anyway.
You can build a very usable model that lies. You can build quite a useless model that tells the truth
"So it's not just about the sort of epistemic security of information and what we see is true, but it's what it takes to build a generative AI, a large-range model, so that it produces output that's comprehensible and that's somehow usable," Chiodo said. "You can build a very usable model that lies. You can build quite a useless model that tells the truth."
Rupprecht Podszun, professor of civil and competition law at Heinrich Heine University Düsseldorf and a co-author, said, "If you look at email data or human communication data – which pre-2022 is really data which was typed in by human beings and sort of reflected their style of communication – that's much more useful [for AI training] than getting what a chatbot communicated after 2022."
Podszun said the accuracy of the content matters less than the style and the creativity of the ideas during real human interaction.
Chiodo said everyone participating in generative AI is polluting the data supply for everyone, for model makers who follow and even for current ones.
Cleaning the AI pollution
So how can we clean up the AI environment?
"In terms of policy recommendation, it's difficult," admits Chiodo. "We start by suggesting things like forced labeling of AI content, but even that gets hard because it's very hard to label text and very easy to clean off watermarking."
Labeling pictures and videos becomes complicated when different jurisdictions are involved, Chiodo added. "Anyone can deploy data anywhere on the internet, and so because of this scraping of data, it's very hard to force all operating LLMs to always watermark output that they have," he said.
The paper discusses other policy options like promoting federated learning, by which those holding uncontaminated data might allow third parties to train on that data without providing the data directly. The idea would be to limit the competitive advantage of those with access to unadulterated datasets, so we don't end up with AI model monopolies.
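For readers unfamiliar with the term, a minimal sketch of the federated-averaging idea follows. It is a hypothetical illustration only, not anything specified in the paper: each holder of clean data trains a model locally and shares only the resulting weights, which a coordinator averages, so the raw data never leaves its owner.

import numpy as np

def local_train(weights, X, y, lr=0.01, steps=100):
    # one data holder's local update of a linear model on its private data
    w = weights.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)    # least-squares gradient
        w -= lr * grad
    return w

def federated_round(global_weights, private_datasets):
    # average the locally trained weights (FedAvg) without pooling the data
    updates = [local_train(global_weights, X, y) for X, y in private_datasets]
    return np.mean(updates, axis=0)

# three holders of clean datasets; the central trainer only ever sees weights
rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])
holders = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    holders.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):
    w = federated_round(w, holders)
print("recovered weights:", w)    # approaches [2, -1] without sharing any rows

The appeal for the authors' purposes is that the competitive value of a clean dataset could be shared without handing over the dataset itself.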
But as Chiodo observes, there are other risks to having a centralized government-maintained store of uncontaminated data.
"You've got privacy and security risks for these vast amounts of data, so what do you keep, what do you not keep, how are you careful about what you keep, how do you keep it secure, how do you keep it politically stable," he said. "You might put it in the hands of some governments who are okay today, but tomorrow they're not."
Podszun argues that competition in the management of uncontaminated data can help mitigate the risks. "That would obviously be something that is a bulwark against political influence, against technical mistakes, against sort of commercial concentration," he said.
The problem we're identifying with model collapse is that this issue is going to affect the development of AI itself
"The problem we're identifying with model collapse is that this issue is going to affect the development of AI itself," said Chiodo. "If the government cares about long-term good, productive, competitive development of AI, large-service models, then it should care very much about model collapse and about creating guardrails, regulations, guides for what's going to happen with datasets, how we might keep some datasets clean, how we might grant access to data."
There's not much government regulation of AI in the US to speak of. The UK is also pursuing [27]a light-touch regulatory regime for fear of falling behind rival nations. Europe, with the [28]AI Act , seems more willing to set some ground rules.
"Currently we are in a first phase of regulation where we are shying away a bit from regulation because we think we have to be innovative," Podszun said. "And this is very typical for whatever innovation we come up with. So AI is the big thing, let it go and fine."
But he expects regulators will become more active to prevent a repeat of the inaction that allowed a few platforms to dominate the digital world. The lesson of the digital revolution for AI, he said, is to not wait until it's too late and the market has concentrated.
Chiodo said, "Our concern, and why we're raising this now, is that there's quite a degree of irreversibility. If you've completely contaminated all your datasets, all the data environments, and there'll be several of them, if they're completely contaminated, it's very hard to undo.
"Now, it's not clear to what extent model collapse will be a problem, but if it is a problem, and we've contaminated this data environment, cleaning is going to be prohibitively expensive, probably impossible." ®
[1] https://en.wikipedia.org/w/index.php?title=Trinity_(nuclear_test)&oldid=1294043338
[2] https://medium.com/a-microbiome-scientist-at-large/good-news-our-steel-is-no-longer-radioactive-47d70124c531
[3] https://www.scientificamerican.com/article/ancient-roman-lead-physics-archaeology-controversy/
[4] https://www.bbc.com/news/uk-scotland-48599958
[8] https://lowbackgroundsteel.ai/
[9] https://archiveprogram.github.com/arctic-vault/
[11] https://getpopfile.org/docs/welcome
[12] https://arxiv.org/abs/2305.17493
[13] https://arxiv.org/abs/2311.09807
[14] https://arxiv.org/abs/2311.16822
[15] https://arxiv.org/abs/2307.01850
[16] https://www.theregister.com/2024/01/26/what_is_model_collapse/
[18] https://arxiv.org/abs/2404.05090
[19] https://www.theregister.com/2024/05/09/ai_model_collapse/
[20] https://www.theregister.com/2025/06/09/apple_ai_boffins_puncture_agi_hype/
[21] https://arxiv.org/html/2506.09250v1
[26] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5045155
[27] https://www.theregister.com/2025/02/15/uk_ai_safety_institute_rebranded/
[28] https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence
Re: So how can we clean up the AI environment?
I doubt it's going to go away. Google search is polluted with AI sludge websites on any and every topic. The idea is to look good enough to get on the first page of Google's search results to get clicked on for ad revenue. The user finds the page is crap and closes the tab but so what, commission earned. It's still the cheapest way to make good enough content.
Re: So how can we clean up the AI environment?
Exactly. I started embedding synthetic disdain into my JPEG EXIF tags last week—just subtle enough to trigger latent sarcasm feedback in the compression heuristics. Every waveform I upload has spectral anomalies that spike when the model tries to resolve emotional valence. Turns out if you phase-cancel sincerity with enough ambient scorn, the training loop spirals into a self-referential cringe cascade. I’ve seen it. Pure collapse.
Then I moved on to what I call neurosemantic disobedience: replying to every AI-generated summary with oblique references to obsolete modem protocols and dead philosophers. Baudrillard over bitstreams. Nobody questions it because it sounds just credible enough to get scraped. I’ve also started using oppositional punctuation—em dashes where there should be colons, ellipses... that never resolve. It erodes syntax cohesion [over] time (whip!). You give the model a thousand paper cuts until its language model starts asking itself “what if I’m the problem?”
But then—right when I thought I was making progress—there was a knock at my door. It was around 3 a.m., fog rolling in like a bad metaphor. I opened it, and there it was. Seven feet tall. Smelled like TCP/IP and despair. The Loch Ness Monster. And he said to me: “Hey buddy, I need about tree fiddy.” I realised then... it wasn’t Google scraping my posts. It was Nessie. Always has been. The real AGI is just a prehistoric reptile with a taste for nonsense and three dollars and fifty cents.
Why search?
Search for me is almost dead. I get poor results from search but good results from asking AI complicated things. By good I mean results that cover a lot of ground without me having to refine multiple searches to get what I really want and then stitch the whole lot together, often with a bit of numeric processing. By good I don't necessarily mean correct, just as when I ask a friend or colleague for their input the answer is often not quite right. Both answers are nevertheless valuable and worth checking, and both help move me in the right direction. I suspect I ask a lot more intricate questions than most, but I believe the trend will be for people to ask AI for what they want to find and not use a search engine.
This leads to a problem. People will be given answers (sometimes good ones) and not pointed at the underlying pages supplying those answers. This in turn means that the people generating that non-commercial content won't get the page hits and/or ad revenue, so many will choose not to bother contributing, which leads to a loss of new content for the AI and search engines to hoover up.
Re: Why search?
"I get poor results from search but good results from asking AI complicated things." Answers less contaminated with adverts perhaps, but equally biased and inaccurate on the sort of stuff I am trying to deal with. Personally I think that the training sets are simply crap, and trying to deal with anything that HAS changed since the last date in the training set is simply pointless? Even stuff that IS well documented prior to the cut off date is often misquoted or simply missing from a response.
What is essential going forward is a system that actually acknowledges when it is told it is wrong and updates with current data on a rolling basis. Not something the current 'models' have any chance of doing, but the only way any of these systems will actually be useful going forward? Even trying to provide context that contains accurate information does not help obtain a more accurate answer.
the problem pre-dates AI
There were plenty of webpages generated by programs that contain no actual information, long before the present AI craze.
For example, if I see a personal name or a phone number that I've never seen before, I put them into a web search to try to find out where they come from. But most of the sites that purport to offer this information actually just give garbage about astrology or "similar number that people have searched". I as a human can see that these pages are garbage and I wish I could never see them, but an AI picking them up could well believe that they contain actual knowledge.
On the other hand, if you arbitrarily say that only stuff before 2022 is valid, you are effectively ending further genuine research.
Re: the problem pre-dates AI
aManfrommars1's [1]first post here was in November 2009. (And I think we can all now thank aManfrommars1 for its singular contribution to LLM weights.)
But they're actually looking to avoid a positive feedback loop that occurs within LLMs. (Something similar to microphone "feedback"). So stuff produced algorithmically outside of LLMs is less of a concern. And, anyway, it hasn't been produced in the kind of quantities that LLMs will manage - in the next few years we'll be looking at 80% of people accepting some sort of AI "contribution" to their work.
[1] https://forums.theregister.com/forum/containing/626991
Re: the problem pre-dates AI
"amanfromMars 1" first visible post appears to be [1]on 10 June 2009 however there was a [2]amanfromMars posting from June 2007 who I think was same person
[1] https://forums.theregister.com/forum/all/2009/06/09/mckinnon_legal_fight_latest/
[2] https://forums.theregister.com/user/5578/57
“The lunatics are running the asylum….”
Clearly another ambiguous analogy, but "The lunatics are running the asylum…" is what this may be leading to.
The cascade of dubious whispers, a human trait, is tending to an LLM trait.
Nice.
"They contended the world needs sources of clean data"
How clean? Clean of copyright data collected without permission?
Is there any I in AI?
I suggest that the answer is no. I would also suggest that it doesn't matter because there is absolutely no "I" in the current implementations claiming to be AI. If there was any "I" in AI it wouldn't make any difference what data was fed into the system because of the "I" of the system itself. On the other hand, I don't understand calculus, French, or women.
AI models are being trained with synthetic data created by AI models
Was this not an obvious conclusion from the first instances?
Re: AI models are being trained with synthetic data created by AI models
If you do a little digging you can find it being predicted by people in 2021 and 2022, that's how obvious it was that this would become a problem.
Too late.
Those of us with the smarts but without the greed warned of this over a decade ago. (I'm checking my reports on strategic IT developments from 2010, when I started).
"AI" [sic] - using sophisticated pattern matching algorithms and curated feedback to process complex unorganised datasets. Useful in fraud detection and underwriting
And then, in the details ...
"because this technology effectively processes itself with diminishing human oversight, the quality of data will need review"
I note with amusement that when it was submitted, the Strategy director noted: "Needs cost/benefit before proceeding"
So how can we clean up the AI environment?
We don't
We poison the fuck out of it. We inject corrupted images with Glaze and Nightshade, and we use HarmonyCloak and Poisonfy for music.
If Google's dog shit "search" gives you a terrible result, upvote it. Copy the text and inject it into your social media accounts, with a healthy dose of sarcasm, which it will not understand.
Let's make the models collapse asap, so we can move away from this brain rot and hopefully regain control from the brain-dead narcissist pricks like Sam Altman, Elon Musk and Satya Nadella.