Bots are overwhelming websites with their hunger for AI data
- Reference: 1750156086
- News link: https://www.theregister.co.uk/2025/06/17/bot_overwhelming_websites_report/
- Source link:
Galleries, Libraries, Archives, and Museums (GLAMs) say they're being overwhelmed by AI bots – web crawling scripts that visit websites and download data to be used for training AI models – according to [1]a report issued on Tuesday by the GLAM-E Lab, which studies issues affecting GLAMs.
GLAM-E Lab is a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law.
Based on an anonymized survey of 43 organizations, the report indicates that cultural institutions are alarmed by the aggressive harvesting of their content, which shows no regard for the burden that data-harvesting places on websites.
"Bots are widespread, although not universal," the report says. "Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic."
The surge in bots that gather data for AI training, the report says, often went unnoticed until it became so bad that it knocked online collections offline.
"Respondents worry that swarms of AI training data bots will create an environment of unsustainably escalating costs for providing online access to collections," the report says.
The institutions commenting on these concerns have differing views about when the bot surge began. Some report noticing it as far back as 2021, while others only began noticing web scraper traffic this year.
Some of the bots identify themselves, but some don't. Either way, the respondents say that robots.txt directives – voluntary behavior guidelines that web publishers post for web crawlers – are not currently effective at controlling bot swarms.
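For illustration, a minimal robots.txt that opts out of the most widely documented AI training crawlers might look something like the sketch below; the user-agent tokens shown are ones the crawler operators themselves publish, and, as the report notes, honouring them is entirely voluntary.

# Illustrative robots.txt: ask known AI training crawlers to stay away,
# while leaving the rest of the site open to other bots.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /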
Bot defenses offered by the likes of AWS and Cloudflare do appear to help, but GLAM-E Lab acknowledges that the problem is complex. Placing content behind a login may not be effective if an institution's goal is to provide public access to digital assets. And there may be a reason to want some degree of bot traffic, such as bots that index sites for search engines.
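Commercial bot management goes well beyond this, but the simplest layer of such defences, turning away requests whose User-Agent matches known AI-crawler tokens, can be sketched in a few lines of Python. The Flask app, route, and token list below are purely illustrative, and user agents are trivially spoofed, which is exactly why the managed services lean on IP reputation and behavioural signals as well.

# Minimal sketch: reject requests whose User-Agent matches known AI-crawler tokens.
# The token list is illustrative; real bot defenses also use IP ranges and
# behavioural signals, since user agents are easily spoofed.
from flask import Flask, abort, request

AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

app = Flask(__name__)

@app.before_request
def block_ai_bots():
    ua = request.headers.get("User-Agent", "")
    if any(token in ua for token in AI_BOT_TOKENS):
        abort(403)  # refuse the request outright

@app.route("/collections/<item_id>")
def collection_item(item_id):
    return f"Metadata for item {item_id}"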
The GLAM-E Lab survey echoes the findings of a similar report issued earlier this month by the Confederation of Open Access Repositories (COAR) based on the responses of 66 open access repositories run by libraries, universities, and other institutions.
The [11]COAR report says: "Over 90 percent of survey respondents indicated their repository is encountering aggressive bots, usually more than once a week, and often leading to slowdowns and service outages. While there is no way to be 100 percent certain of the purpose of these bots, the assumption in the community is that they are AI bots gathering data for generative AI training."
The GLAM-E Lab survey also recalls complaints about abusive bots raised by [12]The Wikimedia Foundation, [13]Sourcehut, Diaspora developer [14]Dennis Schubert, repair site [15]iFixit, and documentation project [16]ReadTheDocs.
Ultimately, the GLAM-E report argues that AI providers need to develop more responsible ways to interact with other websites.
"The cultural institutions that host online collections are not resourced to continue adding more servers, deploying more sophisticated firewalls, and hiring more operations engineers in perpetuity," the report says. "That means it is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for." ®
[1] https://www.glamelab.org/products/are-ai-bots-knocking-cultural-heritage-offline/
[11] https://coar-repositories.org/news-updates/open-repositories-are-being-profoundly-impacted-by-ai-bots-and-other-crawlers-results-of-a-coar-survey/
[12] https://www.theregister.com/2025/04/03/wikimedia_foundation_bemoans_bot_bandwidth/
[13] https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
[14] https://diaspo.it/posts/2594
[15] https://www.theregister.com/2024/07/30/taming_ai_content_crawlers/
[16] https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/
Re: Tragedy of the Commons
I was trying to think of something positive we could do:
How about legislating so that if you release an LLM, you must provide public access to your training data for free? Compete on your tech, not your data.
Delivery drivers here
Seems like websites will create a separate entrance for bots, like restaurants do now for delivery drivers.
Comeback of RSS?
Re: Delivery drivers here
Or the museum or gallery, which has an acceptable usage policy that basically says "it's fine for personal use, but no commercial use", gets fed up with companies overloading its website and scraping content for commercial use in violation of those terms, and responds by feeding poisoned garbage to the bots.
Re: Delivery drivers here
Invoice them for the extra load they place on the server along with an addition to the T&Cs stating this.
Too Late
I predict that by the time there is any sort of agreed standard for bot behaviour that gives web sites the control to allow or disallow bots as appropriate, it'll be too late to protect anything.
Re: Too Late
It's become a serious problem for us as the host of a lot of historical information which, until recently, saw low usage and so could be served cheaply on less powerful hardware. In fact, historically most of the traffic was search bots, but these adapted to act responsibly by pacing requests so as not to overload the servers.
Then the AI bots appeared. Suddenly an individual server would be effectively DDoSed, which killed the server, only for the bot to return as the server recovered, and so on for the best part of a day. Ironically, they were killing their golden goose for the sake of not designing their bots efficiently. The cowboys seemed to be hosted across Alibaba and Microsoft clouds.
Hopefully (I know that is a sin for sysadmins) market forces will push decent designs for bot scanning that maximise the amount of information they can get per GET. I suppose those killing us now were designed when the target was large data providers with servers that could cope with equally massive hoovering.
I guess the museums/galleries are an intermediate source limited by funds that should be devoted to their real-world acquisitions and display.
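The pacing that the older search crawlers got right is straightforward to implement on the crawler side. Here is a minimal sketch of a polite fetch loop in Python, using the standard-library robots.txt parser and honouring any published Crawl-delay; the site, paths, and user agent are hypothetical.

# Minimal sketch of a "polite" crawler: check robots.txt and pace requests,
# rather than hammering the server. URLs and the user agent are hypothetical.
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleResearchBot/1.0"
SITE = "https://collections.example.org"

rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
rp.read()

# Honour Crawl-delay if the site publishes one; otherwise default to 10 seconds.
delay = rp.crawl_delay(USER_AGENT) or 10

for path in ("/items/1", "/items/2", "/items/3"):
    if not rp.can_fetch(USER_AGENT, SITE + path):
        continue  # the site has asked crawlers to stay out of this path
    req = urllib.request.Request(SITE + path, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    time.sleep(delay)  # pace requests so the server isn't overwhelmed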
Re: Too Late
> hosting on Alibaba and Microsoft clouds
Sounds like time for a block list...
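The major clouds publish their IP ranges, so a crude block list can be checked with nothing more than the standard library. The CIDR blocks below are documentation placeholders, not real Alibaba or Azure prefixes, and a real list would need refreshing as the providers update theirs.

# Minimal sketch: flag requests from published cloud provider IP ranges.
# The CIDR blocks below are placeholders; the real lists are published by the
# providers and change over time.
import ipaddress

BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder range
]

def is_blocked(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)

print(is_blocked("203.0.113.42"))  # True
print(is_blocked("192.0.2.7"))     # False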
Re: Too Late
"Suddenly an individual server would be DDoS which killed the server, for the bot to return as the server recovered and so on for the best part of a day"
If the site owning the bot could be identified, would it not be possible to sue for damages? Sue under whatever small claims procedure is available. Although that means large sums can't be claimed, it negates the advantage of size on their part. If all the sites they're traversing started to do that, the trawlers would get bogged down in suits. Once they overlook a judgement, send the bailiffs in to sequester a server.
Re: Too Late
>would it not be possible to sue for damages.
You have a public web site, you are offering content for free, and it's the public's fault if you can't cope with demand?
It's like putting on a free concert and then suing the crowds for cheering too loud.
Or you could detect AI bots and send them random numbers
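That is roughly what tarpit tools such as the Nepenthes project mentioned further down do in a more elaborate way. A toy version of the "send them random numbers" idea, reusing the same naive, spoofable user-agent check as the earlier sketch, might look like this:

# Toy sketch of the "feed detected bots junk" idea: requests that look like AI
# crawlers get a stream of random numbers instead of real content.
import random
from flask import Flask, request

AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

app = Flask(__name__)

def looks_like_ai_bot() -> bool:
    ua = request.headers.get("User-Agent", "")
    return any(token in ua for token in AI_BOT_TOKENS)

@app.route("/collections/<item_id>")
def collection_item(item_id):
    if looks_like_ai_bot():
        # Cheap-to-generate noise instead of the real record.
        return " ".join(str(random.randint(0, 9)) for _ in range(1000))
    return f"Metadata for item {item_id}"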
Re: Too Late
You predict correctly.
https://www.theguardian.com/commentisfree/2021/oct/06/offshoring-wealth-capitalism-pandora-papers
"Trashing the planet and hiding the money isn’t a perversion of capitalism. It is capitalism"
Nepethes
https://forge.hackers.town/hackers.town/nepenthes
That is all.
Re: Nepethes
Wonder if they are Poe fans (the only reference I am aware of).
Re: Nepethes
Nepenthes is the botanical name for the genus of carnivorous pitcher plants. Insects are lured, slither into the pitcher, drown in its contents and are digested.
Re: Nepethes
Great - even screwed up the title :)
Re: Nepethes
WHOA! NICE!
Tragedy of the Commons
"Ultimately, the GLAM-E report argues that AI providers need to develop more responsible ways to interact with other websites....[It] means it is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for."
Report concludes irresponsible people must be more responsible.