AI crawlers haven't learned to play nice with websites

(2025/03/18)

Reference: 1742286970
News link: https://www.theregister.co.uk/2025/03/18/ai_crawlers_sourcehut/
Source link:

SourceHut, an open source git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data.

"SourceHut continues to face disruptions due to aggressive LLM crawlers," the biz [1]reported Monday on its status page. "We are continuously working to deploy mitigations. We have deployed a number of mitigations which are keeping the problem contained for now. However, some of our mitigations may impact end-users."

SourceHut said it had deployed [2]Nepenthes, a tar pit to catch web crawlers that scrape data primarily for training large language models, and noted that doing so might degrade access to some web pages for users.

"We have unilaterally blocked several cloud providers, including GCP [Google Cloud] and [Microsoft] Azure, for the high volumes of bot traffic originating from their networks," the biz said, advising administrators of services that integrate with SourceHut to get in touch to arrange an exception to the blocking.

This is not the first time SourceHut has borne the bandwidth burden of serving unrestrained web requests. The outfit raised similar objections to [6]Google's Go Module Mirror in 2022, likening the traffic overload to a denial of service attack. And other open source projects such as GMP have [7]also faced this problem.

But AI crawlers have been [8]particularly ill-behaved over the past two years as the generative AI boom has played out. OpenAI in August 2023 [9]made it known its web crawlers would respect robots.txt files, a set of directives served by websites to tell crawlers whether they're welcome. Other AI providers made similar commitments.
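
For illustration, the check a well-behaved crawler is expected to make before each fetch can be sketched with Python's standard urllib.robotparser module. The robots.txt body, the example.org URL, and the may_fetch helper are assumptions made for this sketch; they are not taken from any site or crawler mentioned here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: refuse GPTBot, welcome everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.org/wiki/Main_Page"
    for agent in ("GPTBot", "ExampleBrowser"):
        print(agent, may_fetch(ROBOTS_TXT, agent, url))
```

Nothing enforces the answer: the directives are advisory, and a crawler that skips the check or ignores the result faces no technical barrier.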

Nonetheless, reports of abuse continue. Repair website iFixit raised the issue last July when Anthropic's ClaudeBot was [11]accused of excessive crawling.

In December 2024, cloud hosting service Vercel [12]said AI crawlers have become a significant presence. In the preceding month, the biz said, OpenAI's GPTBot generated 569 million requests on its network while Anthropic's Claude accounted for 370 million. Together, these AI crawlers generated a request volume equal to about 20 percent of the 4.5 billion requests from Googlebot, used for Google's search indexing, during the same period.
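
The "about 20 percent" figure follows from the three request counts quoted above. A quick check, using only those numbers (the variable names are ours):

```python
# Monthly request counts as quoted from Vercel.
GPTBOT_REQUESTS = 569_000_000
CLAUDE_REQUESTS = 370_000_000
GOOGLEBOT_REQUESTS = 4_500_000_000

# Combined AI crawler volume, then its size relative to Googlebot's.
ai_total = GPTBOT_REQUESTS + CLAUDE_REQUESTS
share_of_googlebot = ai_total / GOOGLEBOT_REQUESTS

print(f"AI crawler total: {ai_total:,}")
print(f"Relative to Googlebot: {share_of_googlebot:.1%}")
```

The two crawlers come to 939 million requests combined, a little under 21 percent of Googlebot's volume.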

Later that month, Diaspora developer Dennis Schubert also noted a surge in AI bots. In [17]a post, he said that 70 percent of the traffic to his server in the previous 60 days came from LLM training bots.

The Register asked Schubert about this in early January. "Funnily enough, a few days after the post went viral, all crawling stopped," he responded at the time. "Not just on the [18]Diaspora wiki, but on my entire infrastructure. I'm not entirely sure why, but here we are."

The problem didn't entirely go away, he said, because the visibility of his post inspired internet trolls to create their own wiki crawlers that now masquerade as the OpenAI GPTbot.

The result is that log analysis has become more difficult.

"For example, I placed a 'canary' into the [20]robots.txt now, and that now has reached almost a million hits, including hits with the GPTBot user agent string," explained Schubert. "The problem is just that those requests are absolutely not from OpenAI. OpenAI appears to be using Microsoft Azure for their crawlers. But all those canary hits came from AWS IPs and even some US residential ISPs. So it's just assholes trying to be funny spoofing their [user-agent] string."

Meanwhile, [21]reports of ill-behaved AI crawlers continue as do [22]efforts to [23]thwart them. And the spoofing of user-agent strings has also been [24]reported in response to claims that Amazon's Amazonbot has been [25]overloading a developer's server.
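
The mismatch Schubert describes, a GPTBot user-agent string arriving from networks the real crawler does not appear to use, can be checked mechanically. The following is a minimal sketch, not anyone's production filter: the PUBLISHED_RANGES table and the is_spoofed helper are hypothetical, and the addresses are reserved documentation ranges standing in for whatever ranges a crawler operator actually uses.

```python
import ipaddress

# Hypothetical allow-list. 192.0.2.0/24 is a reserved documentation range
# (RFC 5737) used here as a placeholder, not a real crawler network.
PUBLISHED_RANGES = {
    "GPTBot": [ipaddress.ip_network("192.0.2.0/24")],
}


def is_spoofed(user_agent: str, remote_ip: str) -> bool:
    """True if the user agent names a listed bot but the IP is outside its ranges."""
    address = ipaddress.ip_address(remote_ip)
    for bot_name, networks in PUBLISHED_RANGES.items():
        if bot_name.lower() in user_agent.lower():
            return not any(address in network for network in networks)
    return False  # Not claiming to be a listed bot, so nothing to verify.


if __name__ == "__main__":
    claimed = "Mozilla/5.0 (compatible; GPTBot/1.0)"
    print(is_spoofed(claimed, "192.0.2.10"))    # inside the placeholder range
    print(is_spoofed(claimed, "198.51.100.7"))  # outside it
```

A check like this only helps when the operator's address ranges are known and kept current; a stale list would flag the genuine crawler as an impostor.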

According to DoubleVerify, an ad metrics firm, general invalid traffic – aka GIVT, bots that should not be counted as ad views – [26]rose by 86 percent in the second half of 2024 due to AI crawlers.

The firm said, "a record 16 percent of GIVT from known-bot impressions in 2024 were generated by those that are associated with AI scrapers, such as GPTBot, ClaudeBot and AppleBot."

The ad biz also observed that while some bots, such as the Meta AI bot and AppleBot, declare they're out to gather data for training AI, other crawlers serve a mix of purposes, which makes blocking more complicated. For example, disallowing visits from Googlebot, which scours the web for both search and AI, could hinder a site's search visibility.

To avoid that, Google in 2023 [27]implemented a robots.txt token called [28]Google-Extended that sites can use to prevent their web content from being used for training the internet titan's Gemini and Vertex AI services while still allowing those sites to be indexed for search. ®
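
As a rough sketch of what that split looks like from the publisher's side, the rules below disallow the Google-Extended token while leaving Googlebot unrestricted, then resolve both tokens with Python's standard urllib.robotparser. The robots.txt body and the example.org URL are assumptions for illustration. The sketch shows only how the two tokens evaluate against the rules, not how Google applies them internally.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: opt out of Google-Extended, stay open to Googlebot.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://example.org/articles/1"
search_allowed = parser.can_fetch("Googlebot", URL)
training_allowed = parser.can_fetch("Google-Extended", URL)

print(f"Googlebot: {search_allowed}, Google-Extended: {training_allowed}")
```

Under these rules the page remains eligible for search indexing while the training token is refused.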

[1] https://status.sr.ht/issues/2025-03-17-git.sr.ht-llms/

[2] https://zadzmo.org/code/nepenthes/

[6] https://www.theregister.com/2023/01/10/googles_go_sourcehut/

[7] https://www.theregister.com/2023/06/28/microsofts_github_gmp_project/

[8] https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

[9] https://www.theregister.com/2023/08/08/openai_scraping_software/

[11] https://www.theregister.com/2024/07/30/taming_ai_content_crawlers/

[12] https://vercel.com/blog/the-rise-of-the-ai-crawler

[17] https://mailman.nanog.org/pipermail/nanog/2024-April/225407.html

[18] https://wiki.diasporafoundation.org/Main_Page

[20] https://wiki.diasporafoundation.org/robots.txt

[21] https://www.reddit.com/r/selfhosted/comments/1i154h7/openai_not_respecting_robotstxt_and_being_sneaky/

[22] https://xeiaso.net/blog/2025/anubis-update-m02/

[23] https://www.jwz.org/blog/2025/01/exterminate-all-rational-ai-scrapers/

[24] https://news.ycombinator.com/item?id=42751729

[25] https://xeiaso.net/notes/2025/amazon-crawler/

[26] https://doubleverify.com/ai-crawlers-and-scrapers-are-contributing-to-an-increase-in-general-invalid-traffic/

[27] https://blog.google/technology/ai/an-update-on-web-publisher-controls/?utm_source=bensbites&utm_medium=referral&utm_campaign=opt-out-of-google-ai-while-ranking-on-google-search

[28] https://developers.google.com/search/docs/crawling-indexing/google-common-crawlers#google-extended

Silicon valley destroying everything

cookiecutter

These cultish lunatics think everything belongs to them and that they are the smartest people on the planet while at the same time destroying everyone else's work or stealing it.

These morons think they're saving humanity & anyone who is anti AI is an enemy of humanity. Rich morons with too much time on their hands except to argue about the singularity & prepare for the AI God, while the rest of us have to put up with their shite!

AI is little more than cloud/hosting providers ruining the internet in new and unwanted ways

Anonymous Coward

In the old days I had to contend with scrapers using the services of OVH*, Digital Ocean, Hetzner, Microsoft and AWS. Then there was all the spam coming from Linode owned ASNs. These days it's Alibaba Cloud which accounts for 30%+ of traffic, complete with not bothering with robots.txt and fake user agents.

So not much has changed, other than said providers needing to be on every DROP list imaginable. They offer nothing of value, and there's definitely no human traffic coming from those IP ranges.

* I saw a significant drop in traffic around the same time one of their data centers burned down. I didn't feel particularly sorry for them.

Fonant

Yes, AI scraper bots have been hammering my servers too, on and off for many months now. I've managed to block the majority with Apache based on their UserAgent names, and by firewall for the worst offenders.

Go away, bullshit generators are not welcome here!
