
Common Crawl Criticized for 'Quietly Funneling Paywalled Articles to AI Developers' (msn.com)

(Saturday November 08, 2025 @11:34PM (EditorDavid) from the hole-in-the-wall dept.)


For more than a decade, the nonprofit Common Crawl "has been scraping billions of webpages to build a massive archive of the internet," [1]notes the Atlantic, making it [2]freely available for research. "In recent years, however, this archive has been put to a controversial purpose: AI companies including OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon have used it to train large language models.

"In the process, my reporting has found, Common Crawl has opened a back door for AI companies to train their models with paywalled articles from major news websites. And the foundation appears to be lying to publishers about this — as well as masking the actual contents of its archives..."

> Common Crawl's website [3]states that it scrapes the internet for "freely available content" without "going behind any 'paywalls.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their LLMs on high-quality journalism for free. Meanwhile, Common Crawl's executive director, Rich Skrenta, has publicly made the case that AI models should be able to access anything on the internet. "The robots are people too," he told me, and should therefore be allowed to "read the books" for free. Multiple news publishers have requested that Common Crawl remove their articles to prevent exactly this use. Common Crawl says it complies with these requests. But my research shows that it does not.

>

> I've discovered that pages downloaded by Common Crawl have appeared in the training data of thousands of AI models. As Stefan Baack, a researcher formerly at Mozilla, [4]has written, "Generative AI in its current form would probably not be possible without Common Crawl." In 2020, OpenAI used Common Crawl's archives to train GPT-3. OpenAI [5]claimed that the program could generate "news articles which human evaluators have difficulty distinguishing from articles written by humans," and in 2022, an iteration on that model, GPT-3.5, became the basis for ChatGPT, kicking off the ongoing generative-AI boom. Many different AI companies are now using publishers' articles to train models that summarize and paraphrase the news, and are deploying those models in ways that [6]steal readers from writers and publishers.

>

> Common Crawl maintains that it is doing nothing wrong. I spoke with Skrenta twice while reporting this story. During the second conversation, I asked him about the foundation archiving news articles even after publishers have asked it to stop. Skrenta told me that these publishers are making a mistake by excluding themselves from "Search 2.0" — referring to the generative-AI products now widely being used to find information online — and said that, anyway, it is the publishers that made their work available in the first place. "You shouldn't have put your content on the internet if you didn't want it to be on the internet," he said. Common Crawl doesn't log in to the websites it scrapes, but its scraper is immune to some of the paywall mechanisms used by news publishers. For example, on many news websites, you can briefly see the full text of any article before your web browser executes the paywall code that checks whether you're a subscriber and hides the content if you're not. Common Crawl's scraper never executes that code, so it gets the full articles.

>

> Thus, by my estimate, the foundation's archives contain millions of articles from news organizations around the world, including The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper's, and The Atlantic.... A search for nytimes.com in any crawl from 2013 through 2022 shows a "no captures" result, when in fact there are articles from NYTimes.com in most of these crawls.
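The excerpt above describes a purely client-side paywall: the full article text arrives in the initial HTML response, and JavaScript hides it afterward for non-subscribers. A scraper that never executes that JavaScript simply keeps what the server sent. Below is a minimal sketch of that behavior, assuming a hypothetical URL and a generic "article p" selector; real sites vary, and some enforce access on the server, in which case the article body is never in the response at all.

    # Minimal sketch: fetch raw HTML without executing any JavaScript.
    # Assumes (hypothetically) that the page ships the full article text in the
    # initial response and only hides it later with client-side paywall code.
    import requests
    from bs4 import BeautifulSoup

    url = "https://news.example.com/some-article"   # hypothetical article URL
    resp = requests.get(url, headers={"User-Agent": "example-crawler/0.1"}, timeout=10)
    resp.raise_for_status()

    # No browser, no script execution: the paywall overlay never runs.
    soup = BeautifulSoup(resp.text, "html.parser")
    paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]
    print("\n\n".join(paragraphs))

A server-side paywall, by contrast, returns only the teaser in that first response, so a fetch like this would see nothing more than a logged-out reader does.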

"In the past year, Common Crawl's CCBot has become [7]the scraper most widely blocked by the top 1,000 websites," the article points out...



[1] https://www.msn.com/en-us/money/news/the-company-quietly-funneling-paywalled-articles-to-ai-developers/ar-AA1PMBHE

[2] https://data.commoncrawl.org/crawl-data/index.html

[3] https://commoncrawl.org/privacy-policy

[4] https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/

[5] https://arxiv.org/pdf/2005.14165

[6] https://www.theatlantic.com/technology/archive/2025/06/generative-ai-pirated-articles-books/683009/?utm_source=msn

[7] https://originality.ai/ai-bot-blocking



Comment Subject: (Score:1)

by Anonymous Coward

> paywall

asking the client not to look != a wall

articles were openly broadcast, news ignored

Broken paywalls (Score:2)

by allo ( 1728082 )

Sites either activate the paywall only some time after publication, or let certain User-Agents and IP ranges through, because they want search engines to list articles you cannot access in the hope that you will buy access. If the crawler is fast enough, it may have caught the page before the paywall was activated, just as Googlebot is meant to.

So if you want a working paywall, use a working paywall, and don't leave it open to bots to spam search engines with inaccessible results.
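For the User-Agent allowlisting allo mentions, the sketch below shows the general shape of such a server-side check, using a hypothetical Flask route and an illustrative allowlist; real publishers typically also verify crawler IP ranges, since the User-Agent header is trivial to spoof.

    # Hypothetical sketch of User-Agent allowlisting: allowlisted crawlers get the
    # full article so it can be indexed, everyone else gets a teaser and a prompt.
    from flask import Flask, request

    app = Flask(__name__)
    ALLOWED_CRAWLERS = ("Googlebot", "Bingbot")          # illustrative allowlist
    ARTICLES = {"example": "Full text of a hypothetical article ..."}

    @app.route("/article/<slug>")
    def article(slug):
        body = ARTICLES.get(slug, "")
        ua = request.headers.get("User-Agent", "")
        if any(bot in ua for bot in ALLOWED_CRAWLERS):
            return body                                   # crawler sees everything
        return body[:80] + " ... [Subscribe to keep reading]"

This is exactly the setup the comment criticizes: the "wall" exists only for clients that identify themselves as ordinary browsers.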

Re: (Score:2)

by Travelsonic ( 870859 )

> It lets them send you the articles in whole so that they are in your possession after the fact which greatly aids their claims in court that you stole the content.

One problem with this hypothesis I see is ... didn't a company that makes porn movies get caught seeding their own torrents - which blew up in their face after they tried taking to court people who torrented said movies?

I can't help but imagine that would make such maneuvers seem less likely to work, given that there are cases of people trying them only to have it blow up in their faces.

Why is everything built on theft (Score:1)

by liqu1d ( 4349325 )

Is this the same in other industries, or is AI alone in being entirely based on the theft of others' property?

Re: (Score:2)

by znrt ( 2424692 )

it's not like someone hacked into their computers to wade through private folders full of their bs. if it is published and accessible on the net then it is free to read. afaik there is no law yet that defines bypassing a paywall as a crime, much less "theft". if you don't want that, simple: don't publish it. if you still do and can't find enough suckers willing to pay for it then cry me a river. btw, you also seem to use a very skewed interpretation of "property".

Re: Why is everything built on theft (Score:2)

by liqu1d ( 4349325 )

Copyright is property no? If not then I am mistaken.

Re: (Score:2)

by znrt ( 2424692 )

copyright is a law that attempts to conflate physical property with "intellectual" property, sadly with some success. it's still not the same. the intent is precisely to elevate accessing someone else's ideas or expressions to some equivalent of "theft". they haven't gone that far yet.

Re: (Score:2)

by h33t l4x0r ( 4107715 )

It's not just AI. If I have a conversation, and later learn that the other person repeated parts of it, I don't punch him in the nose and accuse him of stealing training data. Because I acknowledge that all conversation depends on prior conversations.

As usual, follow the money. (Score:2, Informative)

by Anonymous Coward

Summary missed key point from the original article: "In 2023, after 15 years of near-exclusive financial support from the Elbaz Family Foundation Trust, it received donations from OpenAI ($250,000), Anthropic ($250,000), and other organizations involved in AI development."

[1]https://www.msn.com/en-us/mone... [msn.com]

[1] https://www.msn.com/en-us/money/news/the-company-quietly-funneling-paywalled-articles-to-ai-developers/ar-AA1PMBHE

high quality journalism ... (Score:2)

by znrt ( 2424692 )

> The Economist, the Los Angeles Times, The Wall Street Journal, The New York Times, The New Yorker, Harper's, and The Atlantic....

high quality journalism! X'D

high quality like ofc editordavid cooking another mixed quote salad without referencing sources. the main quote seems to come from one of the atlantic's drones, of course. if anyone feels like paying these vultures to read their drivel: [1]https://www.theatlantic.com/te... [theatlantic.com]

[1] https://www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567/

And 'scientific' papers (Score:2)

by will4 ( 7250692 )

Still wanting to know when the research community will begin downvoting the foundational social-science research papers and excluding them from new citations where that research used what is now considered scientifically flawed and inadmissible methodology: self-reported surveys, tiny sample sizes, outcome-based survey questions, etc.

Can someone in the research area give insight into what happens when a paper is retracted, both to that paper and to the papers that cite it, and then to the second-generation citations?

Bwhahahaha! (Score:3)

by Bodhammer ( 559311 )

"high-quality journalism" - an oxymoron if there ever was!

Translation (Score:2)

by gurps_npc ( 621217 )

Translation 1: Those companies are making a mistake by not giving him what he wants for free, in order for his company to become profitable.

Translation 2: Those companies should not have put stuff on the internet for sale with a paywall if they didn't want people like me to steal it without getting permission.

You can not call Javascript a "paywall" (Score:2)

by TheWho79 ( 10289219 )

If a bot can get the content without logging in, then so can a human. You cannot call content behind a simple JavaScript screen "paywalled".
