Bots are overwhelming websites with their hunger for AI data
- Reference: 1750156086
- News link: https://www.theregister.co.uk/2025/06/17/bot_overwhelming_websites_report/
- Source link:
Galleries, Libraries, Archives, and Museums (GLAMs) say they're being overwhelmed by AI bots – web crawling scripts that visit websites and download data to be used for training AI models – according to [1]a report issued on Tuesday by the GLAM-E Lab, which studies issues affecting GLAMs.
GLAM-E Lab is a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law.
Based on an anonymized survey of 43 organizations, the report indicates that cultural institutions are alarmed by the aggressive harvesting of their content, which shows no regard for the burden that data-harvesting places on websites.
"Bots are widespread, although not universal," the report says. "Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic."
The surge in bots that gather data for AI training, the report says, often went unnoticed until it became so bad that it knocked online collections offline.
"Respondents worry that swarms of AI training data bots will create an environment of unsustainably escalating costs for providing online access to collections," the report says.
The institutions commenting on these concerns have differing views about when the bot surge began. Some report noticing it as far back as 2021, while others only began noticing web scraper traffic this year.
Some of the bots identify themselves, but some don't. Either way, the respondents say that robots.txt directives – voluntary behavior guidelines that web publishers post for web crawlers – are not currently effective at controlling bot swarms.
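For illustration, a minimal robots.txt that opts out of the most widely documented AI training crawlers might look something like the sketch below; the user-agent tokens shown are ones the crawler operators themselves publish, and, as the report notes, honouring them is entirely voluntary.

# Illustrative robots.txt: ask known AI training crawlers to stay away,
# while leaving the rest of the site open to other bots.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: *
Allow: /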
Bot defenses offered by the likes of AWS and Cloudflare do appear to help, but GLAM-E Lab acknowledges that the problem is complex. Placing content behind a login may not be effective if an institution's goal is to provide public access to digital assets. And there may be a reason to want some degree of bot traffic, such as bots that index sites for search engines.
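Commercial bot management goes well beyond this, but the simplest layer of such defences, turning away requests whose User-Agent matches known AI-crawler tokens, can be sketched in a few lines of Python. The Flask app, route, and token list below are purely illustrative, and user agents are trivially spoofed, which is exactly why the managed services lean on IP reputation and behavioural signals as well.

# Minimal sketch: reject requests whose User-Agent matches known AI-crawler tokens.
# The token list is illustrative; real bot defenses also use IP ranges and
# behavioural signals, since user agents are easily spoofed.
from flask import Flask, abort, request

AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

app = Flask(__name__)

@app.before_request
def block_ai_bots():
    ua = request.headers.get("User-Agent", "")
    if any(token in ua for token in AI_BOT_TOKENS):
        abort(403)  # refuse the request outright

@app.route("/collections/<item_id>")
def collection_item(item_id):
    return f"Metadata for item {item_id}"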
The GLAM-E Lab survey echoes the findings of a similar report issued earlier this month by the Confederation of Open Access Repositories (COAR) based on the responses of 66 open access repositories run by libraries, universities, and other institutions.
The [11]COAR report says: "Over 90 percent of survey respondents indicated their repository is encountering aggressive bots, usually more than once a week, and often leading to slowdowns and service outages. While there is no way to be 100 percent certain of the purpose of these bots, the assumption in the community is that they are AI bots gathering data for generative AI training."
The GLAM-E Lab survey also recalls complaints about abusive bots raised by [12]The Wikimedia Foundation, [13]Sourcehut, Diaspora developer [14]Dennis Schubert, repair site [15]iFixit, and documentation project [16]ReadTheDocs.
Ultimately, the GLAM-E report argues that AI providers need to develop more responsible ways to interact with other websites.
"The cultural institutions that host online collections are not resourced to continue adding more servers, deploying more sophisticated firewalls, and hiring more operations engineers in perpetuity," the report says. "That means it is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for." ®
[1] https://www.glamelab.org/products/are-ai-bots-knocking-cultural-heritage-offline/
[11] https://coar-repositories.org/news-updates/open-repositories-are-being-profoundly-impacted-by-ai-bots-and-other-crawlers-results-of-a-coar-survey/
[12] https://www.theregister.com/2025/04/03/wikimedia_foundation_bemoans_bot_bandwidth/
[13] https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
[14] https://diaspo.it/posts/2594
[15] https://www.theregister.com/2024/07/30/taming_ai_content_crawlers/
[16] https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/
Re: Tragedy of the Commons
I was trying to think of something positive we could do:
How about legislating so that if you release an LLM, you must provide public access to your training data for free? Compete on your tech, not your data.
Delivery drivers here
Seems like websites will create a separate entrance for bots, like restaurants do now for delivery drivers.
Comeback of RSS?
Re: Delivery drivers here
Or the museum or gallery, which has an acceptable usage policy that basically says "it's fine for personal use, but no commercial use", gets fed up with companies overloading its website and scraping content for commercial use in violation of those terms, and responds by feeding poisoned garbage to the bots.
Re: Delivery drivers here
Invoice them for the extra load they place on the server along with an addition to the T&Cs stating this.
Too Late
I predict that by the time there is any sort of agreed standard for bot behaviour that gives web sites the control to allow or disallow bots as appropriate, it'll be too late to protect anything.
Re: Too Late
It's become a serious problem for us as the host of a lot of historical information which, until recently, saw low usage and so could be served cheaply on less powerful hardware. In fact, historically most of the traffic was search bots, but these adapted to act responsibly by pacing requests so as not to overload the servers.
Then the AI bots appeared. Suddenly an individual server would be effectively DDoSed, which killed the server, only for the bot to return as the server recovered, and so on for the best part of a day. Ironically, they were killing their golden goose for the sake of not designing their bots efficiently. The cowboys seemed to be hosted across Alibaba and Microsoft clouds.
Hopefully (I know that is a sin for sysadmins) market forces will push decent designs for bot scanning that maximise the amount of information they can get per GET. I suppose those killing us now were designed when the target was large data providers with servers that could cope with equally massive hoovering.
I guess the museums/galleries are an intermediate source limited by funds that should be devoted to their real-world acquisitions and display.
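The pacing that the older search crawlers got right is straightforward to implement on the crawler side. Here is a minimal sketch of a polite fetch loop in Python, using the standard-library robots.txt parser and honouring any published Crawl-delay; the site, paths, and user agent are hypothetical.

# Minimal sketch of a "polite" crawler: check robots.txt and pace requests,
# rather than hammering the server. URLs and the user agent are hypothetical.
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "ExampleResearchBot/1.0"
SITE = "https://collections.example.org"

rp = urllib.robotparser.RobotFileParser(SITE + "/robots.txt")
rp.read()

# Honour Crawl-delay if the site publishes one; otherwise default to 10 seconds.
delay = rp.crawl_delay(USER_AGENT) or 10

for path in ("/items/1", "/items/2", "/items/3"):
    if not rp.can_fetch(USER_AGENT, SITE + path):
        continue  # the site has asked crawlers to stay out of this path
    req = urllib.request.Request(SITE + path, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        data = resp.read()
    time.sleep(delay)  # pace requests so the server isn't overwhelmed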
Re: Too Late
> hosting on Alibaba and Microsoft clouds
Sounds like time for a block list...
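The major clouds publish their IP ranges, so a crude block list can be checked with nothing more than the standard library. The CIDR blocks below are documentation placeholders, not real Alibaba or Azure prefixes, and a real list would need refreshing as the providers update theirs.

# Minimal sketch: flag requests from published cloud provider IP ranges.
# The CIDR blocks below are placeholders; the real lists are published by the
# providers and change over time.
import ipaddress

BLOCKED_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),   # placeholder range
    ipaddress.ip_network("198.51.100.0/24"),  # placeholder range
]

def is_blocked(client_ip: str) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)

print(is_blocked("203.0.113.42"))  # True
print(is_blocked("192.0.2.7"))     # False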
Re: Too Late
"Suddenly an individual server would be DDoS which killed the server, for the bot to return as the server recovered and so on for the best part of a day"
If the site owning the bot could be identified, would it not be possible to sue for damages? Sue under whatever small claims procedure is available. Although that means large sums can't be claimed, it negates the advantage of size on their part. If all the sites they're traversing started to do that, the trawlers would get bogged down in suits. Once they overlook a judgement, send the bailiffs in to sequester a server.
Re: Too Late
>would it not be possible to sue for damages.
You have a public web site, you are offering content for free, and it's the public's fault if you can't cope with demand?
It's like putting on a free concert and then suing the crowds for cheering too loud.
Or you could detect AI bots and send them random numbers
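That is roughly what tarpit tools such as the Nepenthes project mentioned further down do in a more elaborate way. A toy version of the "send them random numbers" idea, reusing the same naive, spoofable user-agent check as the earlier sketch, might look like this:

# Toy sketch of the "feed detected bots junk" idea: requests that look like AI
# crawlers get a stream of random numbers instead of real content.
import random
from flask import Flask, request

AI_BOT_TOKENS = ("GPTBot", "CCBot", "ClaudeBot", "Bytespider")

app = Flask(__name__)

def looks_like_ai_bot() -> bool:
    ua = request.headers.get("User-Agent", "")
    return any(token in ua for token in AI_BOT_TOKENS)

@app.route("/collections/<item_id>")
def collection_item(item_id):
    if looks_like_ai_bot():
        # Cheap-to-generate noise instead of the real record.
        return " ".join(str(random.randint(0, 9)) for _ in range(1000))
    return f"Metadata for item {item_id}"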
Re: Too Late
You predict correctly.
https://www.theguardian.com/commentisfree/2021/oct/06/offshoring-wealth-capitalism-pandora-papers
"Trashing the planet and hiding the money isn’t a perversion of capitalism. It is capitalism"
Nepethes
https://forge.hackers.town/hackers.town/nepenthes
That is all.
Re: Nepethes
Wonder if they are Poe fans (the only reference I am aware of).
Re: Nepethes
Nepenthes is the botanical name for the genus of carnivorous pitcher plants. Insects are lured, slither into the pitcher, drown in its contents and are digested.
Re: Nepethes
Great - even screwed up the title :)
Re: Nepethes
WHOA! NICE!
Tragedy of the Commons
"Ultimately, the GLAM-E report argues that AI providers need to develop more responsible ways to interact with other websites....[It] means it is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for."
Report concludes irresponsible people must be more responsible.