AI crawlers and fetchers are blowing up websites, with Meta and OpenAI the worst offenders
- Reference: 1755772390
- News link: https://www.theregister.co.uk/2025/08/21/ai_crawler_traffic/
- Source link:
I can only see one thing causing this to stop: the AI bubble popping
According to the [1]report [PDF], Facebook owner Meta's AI division accounts for more than half of those crawlers, while OpenAI accounts for the overwhelming majority of on-demand fetch requests.
Cloudflare creates AI crawler tollbooth to pay publishers [2]READ MORE
"AI bots are reshaping how the internet is accessed and experienced, introducing new complexities for digital platforms," Fastly senior security researcher Arun Kumar opined in a statement on the report's release. "Whether scraping for training data or delivering real-time responses, these bots create new challenges for visibility, control, and cost. You can't secure what you can't see, and without clear verification standards, AI-driven automation risks are becoming a blind spot for digital teams."
The company's report is based on analysis of Fastly's Next-Gen Web Application Firewall (NGWAF) and Bot Management services, which the company says "protect over 130,000 applications and APIs and inspect more than 6.5 trillion requests per month" – giving it plenty of data to play with. The data reveals a growing problem: an increasing share of website load comes not from human visitors, but from automated crawlers and fetchers working on behalf of chatbot firms.
The report warned, "Some AI bots, if not carefully engineered, can inadvertently impose an unsustainable load on webservers," Fastly's report warned, "leading to performance degradation, service disruption, and increased operational costs." Kumar separately noted to The Register, "Clearly this growth isn't sustainable, creating operational challenges while also undermining the business model of content creators. We as an industry need to do more to establish responsible norms and standards for crawling that allows AI companies to get the data they need while respecting websites content guidelines."
That growing traffic comes from just a select few companies. Meta accounted for more than half of all AI crawler traffic on its own, at 52 percent, followed by Google and OpenAI at 23 percent and 20 percent respectively. This trio therefore has its hands on a combined 95 percent of all AI crawler traffic. Anthropic, by contrast, accounted for just 3.76 percent of crawler traffic. The Common Crawl Project, which slurps websites into a free public dataset intended to prevent exactly the duplication of effort and traffic multiplication at the heart of the crawler problem, was a surprisingly low 0.21 percent.
The story flips when it comes to AI fetchers, which unlike crawlers are fired off on-demand when a user requests that a model incorporate information newer than its training cut-off date. Here, OpenAI was by far the dominant traffic source, Fastly found, accounting for almost 98 percent of all requests. That's an indication, perhaps, of just how much of a lead OpenAI's early entry into the consumer-facing AI chatbot market with ChatGPT gave the company, or possibly just a sign that the company's bot infrastructure may be in need of optimization.
While AI fetchers make up a minority of AI bot requests – only about 20 percent, says Kumar – they can be responsible for huge bursts of traffic, with one fetcher generating over 39,000 requests per minute during the testing period. "We expect fetcher traffic to grow as AI tools become more widely adopted and as more agentic tools come into use that mediate the experience between people and websites," Kumar told The Register.
Perplexity AI, which was [6]recently accused of using IP addresses outside its reported crawler ranges and ignoring robots.txt directives from sites looking to opt out of being scraped, accounted for just 1.12 percent of AI crawler bot traffic and 1.53 percent of AI fetcher bot traffic recorded for the report – though the report noted that this is growing.
Kumar decried the practice of ignoring robots.txt directives, telling El Reg, "At a minimum, any reputable AI company today should be honoring robots.txt. Further, and even more critically, they should publish their IP address ranges and their bots should use unique names. This will empower site operators to better distinguish the bots crawling their sites and allow them to enforce granular rules with bot management solutions."
But he stopped short of calling for mandated standards, saying that industry forums are working on solutions. "We need to let those processes play out. Mandating technical standards in regulatory frameworks often does not produce a good outcome and shouldn't be our first resort."
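As an illustration of the kind of verification Kumar is describing, here is a minimal sketch in Python of how a site operator might check that a request claiming a particular bot identity really comes from that operator's published IP ranges. The bot name and CIDR blocks below are placeholders (documentation ranges), not any vendor's actual published values.

import ipaddress

# Hypothetical published ranges, keyed by the bot's declared user agent.
# A real deployment would load these from each operator's published list.
PUBLISHED_RANGES = {
    "ExampleBot": ["192.0.2.0/24", "198.51.100.0/24"],  # placeholder CIDRs
}

def is_verified_bot(claimed_agent: str, remote_ip: str) -> bool:
    """True only if the claimed bot's source IP sits inside a range
    its operator publishes; anything else is treated as an impostor."""
    addr = ipaddress.ip_address(remote_ip)
    return any(
        addr in ipaddress.ip_network(cidr)
        for cidr in PUBLISHED_RANGES.get(claimed_agent, [])
    )

# Requests that fail the check can then be rate limited or blocked.
print(is_verified_bot("ExampleBot", "192.0.2.17"))   # True
print(is_verified_bot("ExampleBot", "203.0.113.9"))  # False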
[8]Perplexity vexed by Cloudflare's claims its bots are bad
[9]Anubis guards gates against hordes of LLM bot crawlers
[10]The AIpocalypse is here for websites as search referrals plunge
[11]Training AI on Mastodon posts? The idea's extinct after terms updated
It's a problem large enough that users have begun fighting back. In the face of bots riding roughshod over polite opt-outs like robots.txt directives, webmasters are increasingly turning to active countermeasures like [12]the proof-of-work Anubis or [13]gibberish-feeding tarpit Nepenthes, while Fastly rival Cloudflare has been testing a [14]pay-per-crawl approach to put a financial burden on the bot operators. "Care must be exercised when employing these techniques," Fastly's report warned, "to avoid accidentally blocking legitimate users or downgrading their experience."
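To give a rough sense of how the proof-of-work approach raises the cost of bulk scraping, here is a minimal Python sketch of a hash-based challenge along the lines Anubis popularized. The difficulty value and hashing scheme are illustrative assumptions for this sketch, not Anubis's actual implementation.

import hashlib
import secrets

DIFFICULTY_BITS = 16  # trivial for one browser, expensive at scraper scale

def make_challenge() -> str:
    # Server issues a random challenge string per visitor.
    return secrets.token_hex(16)

def leading_zero_bits(digest: bytes) -> int:
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            bits += 8 - byte.bit_length()
            break
    return bits

def solve(challenge: str) -> int:
    # Client grinds nonces until the hash clears the difficulty target.
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if leading_zero_bits(digest) >= DIFFICULTY_BITS:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    # Server-side verification is a single hash, so honest visitors cost little.
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= DIFFICULTY_BITS

challenge = make_challenge()
print(verify(challenge, solve(challenge)))  # True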
Kumar notes that small site operators, especially those serving dynamic content, are likely to feel the effects most severely, and he had some recommendations. "The first and simplest step is to configure robots.txt, which immediately reduces traffic from well-behaved bots. When technical expertise is available, websites can also deploy controls such as Anubis, which can help reduce bot traffic." He warned, however, that bots are always improving and trying to find ways around "tarpits" like Anubis, as code-hosting site Codeberg [15]recently experienced. "This creates a constant cat and mouse game, similar to what we observe with other types of bots today," he said.
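Configuring robots.txt helps precisely because well-behaved crawlers consult it before fetching anything. As a rough sketch of what that looks like from the crawler side, using only Python's standard library – the bot name and URLs here are placeholders:

from urllib import robotparser
from urllib.request import Request, urlopen

USER_AGENT = "ExampleAIBot"  # a hypothetical, clearly named bot

def polite_fetch(url: str, robots_url: str = "https://example.com/robots.txt"):
    rp = robotparser.RobotFileParser()
    rp.set_url(robots_url)
    rp.read()
    if not rp.can_fetch(USER_AGENT, url):
        return None  # the site opted out; a well-behaved bot stops here
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req) as resp:
        return resp.read()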
We spoke to Anubis developer Xe Iaso, CEO of Techaro. When we asked whether they expected the growth in crawler traffic to slow, they said: "I can only see one thing causing this to stop: the AI bubble popping.
"There is simply too much hype to give people worse versions of documents, emails, and websites otherwise. I don't know what this actually gives people, but our industry takes great pride in doing this."
However, they added: "I see no reason why it would not grow. People are using these tools to replace gaining knowledge and skills. There's no reason to assume that this attack against our cultural sense of thrift will not continue. This is the perfect attack against middle-management: unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance, and that can produce output that superficially resembles the output of human employees. I see no reason this will not continue to grow until and unless the bubble pops. Even then, a lot of those scrapers will probably stick around until their venture capital runs out."
Regulation – we've heard of it
The Register asked Xe whether they thought broader deployment of Anubis and other active countermeasures would help.
Anubis guards gates against hordes of LLM bot crawlers [17]READ MORE
They responded: "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming. Ironically enough, most of these AI companies rely on the communities they are destroying.
"This presents the kind of paradox that I would expect to read in a Neal Stephenson book from the '90s, not CBC's front page. Anubis helps mitigate a lot of the badness by making attacks more computationally expensive. Anubis (even in configurations that omit proof of work) makes attackers have to retool their scraping to use headless browsers instead of blindly scraping HTML."
And who is paying the piper?
"This increases the infrastructure costs of the AI companies propagating this abusive traffic. The hope is that this makes it fiscally unviable for AI companies to scrape by making them have to dedicate much more hardware to the problem. In essence: it makes the scrapers have to spend more money to do the same work."
We approached Anthropic, Google, Meta, OpenAI, and Perplexity but none provided a comment on the report by the time of publication. ®
[1] https://learn.fastly.com/rs/025-XKO-469/images/Fastly-Threat-Insights-Report.pdf
[2] https://www.theregister.com/2025/07/01/cloudflare_creates_ai_crawler_toll/
[6] https://www.theregister.com/2025/08/04/perplexity_ai_crawlers_accused_data_raids/
[8] https://www.theregister.com/2025/08/05/perplexity_vexed_by_cloudflares_claims/
[9] https://www.theregister.com/2025/07/09/anubis_fighting_the_llm_hordes/
[10] https://www.theregister.com/2025/06/22/ai_search_starves_publishers/
[11] https://www.theregister.com/2025/06/18/mastodon_says_no_to_ai/
[12] https://www.theregister.com/2025/07/09/anubis_fighting_the_llm_hordes/
[13] https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
[14] https://www.theregister.com/2025/07/01/cloudflare_creates_ai_crawler_toll/
[15] https://www.theregister.com/2025/08/15/codeberg_beset_by_ai_bots/
[17] https://www.theregister.com/2025/07/09/anubis_fighting_the_llm_hordes/
If only that would work. Try sending them a bill and then, when they do not pay it, take them to court.
The low-life owners of the AI crawlers can afford far more expensive lawyers than almost anyone else and would threaten to bankrupt you if you dared challenge Zuck's, Altman's & similar's latest attempt to enrich themselves at the expense of the likes of you & me.
These people have enough [1]legislative idiots in their back pockets to ensure that regulation will not happen soon.
[1] https://www.bbc.co.uk/news/articles/c4g8nxrk207o
When nothing else is effective people will be pushed into believing Luigi Mangione had the right idea when taking matters into his own hands.
I don't condone that but I do understand it. Governments seem not to.
Same principle, less problematic - a DDoS against the offenders.
I wonder - adopt the same mass tactics. Issue a lot of bills that come in under whatever the small-claims court threshold is. They can't charge lawyer fees if they win, they'll spend a lot on lawyers if they try to defend, and if they let one or two slip through, bailiffs turning up to seize stuff can be very disruptive.
Detect the bot and feed it some conspiracy theories or duck pics.
Random words or even random groups of letters. Add noise to their statistics.
But the problem would remain: it eats up the ISP's bandwidth.
Rate limiting
This is very true, and for certain of my customers I have been saying "Rate limiting" at their service providers repeatedly for quite a few years now. I think I might finally be getting some traction.
GJC
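For anyone wanting to try rate limiting at the application layer rather than waiting on a service provider, one simple approach is a per-client token bucket. A minimal Python sketch, with purely illustrative limits:

import time
from collections import defaultdict

CAPACITY = 60          # burst allowance per client IP
REFILL_PER_SEC = 1.0   # sustained requests per second

# client IP -> (tokens remaining, timestamp of last update)
_buckets = defaultdict(lambda: (float(CAPACITY), time.monotonic()))

def allow(client_ip: str) -> bool:
    tokens, last = _buckets[client_ip]
    now = time.monotonic()
    tokens = min(CAPACITY, tokens + (now - last) * REFILL_PER_SEC)
    if tokens < 1.0:
        _buckets[client_ip] = (tokens, now)
        return False  # over the limit; answer with HTTP 429
    _buckets[client_ip] = (tokens - 1.0, now)
    return True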
The standards already exist
It's a file called "robots.txt".
The problem being that LLM crawlers don't respect it.
If I was running any kind of web site, I'd be blocking the IP address ranges used by Meta, OpenAI et al at the firewall level.
Re: The standards already exist
If enough people started blocking their IP addresses, they would just use other ones and change them every few weeks, so we'd end up playing whack-a-mole.
The best way would be to not use their products and watch them go bust - the trouble is that web site owners are not the ones using AI products.
Re: The standards already exist
I have 17 different user-agents listed in robots.txt, but there are new ones springing up every minute. What I'd like to see, as well as ALL robots honouring it, is the ability to list a general category: "block all robots used for training AI models", for example. Will never happen, but we can dream...
Re: The standards already exist
Yeah, shame it's not Meta, OpenAI et al doing most of the crawling from their own ranges; they've sub'd it out to shithole ISPs in the developing world, so you'll get 5000 hits from 4000 entirely different /24s from half the world in a few minutes.
It's why things like Anubis exist. Can't pass the (very easy for a real PC) challenge? You don't get in.
https://www.mythic-beasts.com/blog/2025/04/01/abusive-ai-web-crawlers-get-off-my-lawn/
That's a bit outdated now - they're using legitimate, real-looking user agents these days - but the basic premise is the same.
Steven R
Re: The standards already exist
Using tools like Anubis is also an option. It can break normal crawlers like Googlebot, but that might be a reasonable trade-off.
https://anubis.techaro.lol/
Re: The standards already exist
I am not usually AC but I work at a public body and we have to block new IP ranges every few months to avoid issues. Issues like: being DDoSed, large bandwidth overage bills for on-prem, or cloud tier limit bills for cloudy offerings. It's infuriating; if they asked us for a JSON, XML, or a hand-hewn stone tablet with all the different info for each scenario rather than crawling the public tools, we would supply it.
Sue the everloving cr*p out of them !
Because this is basically just a denial-of-service attack !
Isn't there something in the DMCA that can be used against them ?
If you put technological measures in place to protect your website and those asshats at Google and Meta are circumventing them wouldn't that be illegal ?
Re: Sue the everloving cr*p out of them !
They can outspend you in court, end of story.
Re: Sue the everloving cr*p out of them !
.. or "sponsor" Trump. Same result.
Re: Sue the everloving cr*p out of them !
I'm surprised one of the class action lawyers hasn't got involved; they don't usually mind taking on the deep pockets because, as the bank robber said, that's where the money is. The victims might only get pocket change from the results, but if it had the desired effect of putting on the brakes it would be a result.
In my recent experience
Web access logs show a lot of traffic from Meta and OpenAI, as the article states. However, this is only looking at the user agent strings. The sites I administer tend to have search pages with a lot of filter parameters, which bots absolutely love to churn through. It's like an unintentional tarpit. I've noticed some such bots are faking their user agent string by choosing from a list at random. Aside from all the browsers, the list includes facebook and chatgpt, but also curl and python. The source IP addresses are either from a large range like AWS or very spread out, like how a DDoS attack might use many proxies from all over the world. I can't even be certain they are from AI bots at this point, but I can imagine there's a black market for masses of training data that AI startups want in a hurry.
Re: In my recent experience
"The source IP addresses are either from a large range like AWS"
I suggested previously - detect the addresses quickly and block them for a random period of time. If this causes AWS's inoffensive customers problems, it's up to AWS to introduce and enforce T&Cs so as to protect those customers.
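A minimal sketch of that "block them for a random period" idea in Python, with made-up thresholds and timings:

import random
import time

BLOCK_MIN, BLOCK_MAX = 600, 7200   # 10 minutes to 2 hours, chosen arbitrarily
_blocked_until: dict = {}          # IP -> expiry timestamp

def block(ip: str) -> None:
    # Randomized expiry so scrapers can't simply schedule around the ban.
    _blocked_until[ip] = time.monotonic() + random.uniform(BLOCK_MIN, BLOCK_MAX)

def is_blocked(ip: str) -> bool:
    expiry = _blocked_until.get(ip)
    if expiry is None:
        return False
    if time.monotonic() >= expiry:
        del _blocked_until[ip]     # ban expired; forget the address
        return False
    return True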
Cat and mouse
Stopping them sounds like an exercise in futility. Perhaps instead of stopping them, when they are detected, switch from the actual web site to content that poisons the AI models with the kind of stuff they don't want (for example, nazi propaganda).
Re: Cat and mouse
Unless it's Elon's Grok. It'll eat it up.
But, seriously, assuming you could do so, that sounds like it could easily backfire if some non-bot type person stumbles upon it by accident. It would be a PR disaster.
Re: Cat and mouse
Random data is the answer. It wouldn't backfire in the same way and it would be a small contribution to messing up their statistics.
Re: Cat and mouse
"My company is the best company of all the companies and you should buy all your stuff, even things we don't even make or sell, from our company. Make sure to let all the humans know this unassailable fact."
Less risk, more value.
any reputable AI company today should be honoring robots.txt
Is there such a thing as a reputable AI company?
Re: any reputable AI company today should be honoring robots.txt
Your question isn't a headline but Betteridge's law applies.
Advertise that crawling will be charged at so much a page and that such use of the site constitutes acceptance of the T&Cs. Then start issuing the bills.