News: 0177902217

  ARM Give a man a fire and he's warm for a day, but set fire to him and he's warm for the rest of his life (Terry Pratchett, Jingo)

Web-Scraping AI Bots Cause Disruption For Scientific Databases and Journals (nature.com)

(Monday June 02, 2025 @05:22PM (msmash) from the can't-have-nice-things dept.)


Automated web-scraping bots seeking training data for AI models are flooding scientific databases and academic journals [1]with traffic volumes that render many sites unusable. The online image repository DiscoverLife, which contains nearly 3 million species photographs, started receiving millions of daily hits in February this year that slowed the site to the point that it no longer loaded, Nature reported Monday.

The surge has intensified since the release of DeepSeek, a Chinese large language model that demonstrated effective AI could be built with fewer computational resources than previously thought. This revelation triggered what industry observers describe as an "explosion of bots seeking to scrape the data needed to train this type of model." The Confederation of Open Access Repositories reported that more than 90% of 66 surveyed members experienced AI bot scraping, with roughly two-thirds suffering service disruptions. Medical journal publisher BMJ has seen bot traffic surpass legitimate user activity, overloading servers and interrupting customer services.



[1] https://www.nature.com/articles/d41586-025-01661-4



no way (Score:2)

by OrangeTide ( 124937 )

AI is the future and makes all our lives better. Why so much hate?

Re: (Score:2)

by Tablizer ( 95088 )

Maybe via the Broken Window Theory of economics. The anti-scrape-bots will need to use AI to get around the scraper source spoofing tricks, creating a never-ending cat-and-mouse escalation pattern where AI experts on both sides make a buck.

It's like the military-industrial complex: they get rich by encouraging our leaders to moon dictators, and their counterparts on the other side are doing the same.

Really? (Score:2)

by nospam007 ( 722110 ) *

So, like Googlebot, Bingbot, YandexBot, Baiduspider, DuckDuckBot, Sogou Spider, Exabot, MojeekBot, Qwantify, AhrefsBot, SemrushBot, DotBot, Censysbot, PetalBot, Gigabot, MJ12bot, Bytespider (by ByteDance), Applebot (for Siri and Spotlight), NeevaBot (defunct but crawled while active), SeznamBot...

Re: (Score:3)

by serafean ( 4896143 )

No, this is actually different.

They siphon up everything, evading any attempt to restrict them. Search engines have to be wary of indexing useless stuff, these don't.

It's a real problem for internet infrastructure.

"If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they

Re: (Score:2)

by Nonesuch ( 90847 )

Generally those traditional crawlers are well-behaved, and will follow the instructions given in robots.txt, though not all follow suggestions like crawl-delay. And if not, they tend to originate from fixed source IP addresses which can be blocked or throttled by the site operator or their CDN.
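For readers unfamiliar with what "well-behaved" means in practice, here is a minimal sketch of a crawler consulting robots.txt before fetching, using Python's standard urllib.robotparser. The robots.txt content, the "ExampleBot" user agent, and the example.org URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration only; a real crawler would
# download it from the site instead of embedding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /git/
Crawl-delay: 10
"""


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed user-agent name, not a real crawler.
allowed = parser.can_fetch("ExampleBot", "https://example.org/images/1.jpg")
blocked = parser.can_fetch("ExampleBot", "https://example.org/git/blame/file.c")
delay = parser.crawl_delay("ExampleBot")  # seconds, or None if unspecified
```

A polite crawler would skip any URL for which can_fetch returns False and sleep for the crawl delay between requests. Nothing enforces this on the client side, which is the complaint in this thread.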

Back in 2020 the IETF released a draft document, "[1]RateLimit Header Fields for HTTP [github.com]", providing rate-limit headers which well-behaved clients should respect.

[1] https://github.com/ietf-wg-httpapi/ratelimit-headers
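A rough sketch of how a cooperative client might honor such headers. The field names RateLimit-Remaining and RateLimit-Reset (with the reset given in seconds) are assumed from early revisions of the draft; later revisions changed the syntax, so check the current text before relying on this. The function name seconds_to_wait is hypothetical.

```python
from typing import Mapping


def seconds_to_wait(headers: Mapping[str, str]) -> float:
    """Return how long a polite client should pause before its next request.

    Assumes the early-draft field names RateLimit-Remaining and
    RateLimit-Reset (reset expressed as a number of seconds).
    """
    # HTTP field names are case-insensitive, so normalize before lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    try:
        remaining = int(normalized["ratelimit-remaining"])
        reset = float(normalized["ratelimit-reset"])
    except (KeyError, ValueError):
        # Headers absent or malformed: no server guidance to follow.
        return 0.0
    if remaining > 0:
        return 0.0
    # Quota exhausted: wait until the window resets.
    return max(reset, 0.0)
```

As with robots.txt, this only helps against clients that choose to cooperate.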

Re: (Score:2)

by test321 ( 8891681 )

Classical search engines fetch html. These new bots attempt to download the 3 million images in their maximum resolution.

AI and Nostalgia! (Score:2)

by Jhon ( 241832 )

Look! AI has reproduced the Slashdot effect! Something that's been mostly unheard of for at least a decade!

Aww... all the good feels of days gone by. What's old is new again.

LLMs are the worst (Score:1)

by makotech222 ( 1645085 )

My Lemmy instance recently started getting hit with LLM bots trawling for data from legitimate users (most of whom are anti-LLM, so it provides good-quality training data) and it sucks up so much bandwidth. They don't respect robots.txt either, so outside of IP blocks, there's not much we can do. Whatever small productivity boost the programmers get out of using LLMs, perhaps we shouldn't destroy all of civilized society to obtain it. Ban all LLMs, please.

Re: (Score:2)

by serafean ( 4896143 )

Most FLOSS projects have set up Anubis: [1]https://anubis.techaro.lol/ [techaro.lol]

[1] https://anubis.techaro.lol/

Mass holes (Score:1)

by Tablizer ( 95088 )

Why is it so easy in internet- and phone-land to spoof the source? Our infrastructure is focked up; it should just not be that easy. Send cruise missiles up cheaters' asses, send a message. And/or make a better standard.

Re: (Score:2)

by Pinky's Brain ( 1158667 )

I happen to find that an interesting discussion, but it's not relevant here.

They aren't spoofing, they are just not identifying themselves. That's not something you can solve by infrastructure, a large IP pool owner can spam you with requests from a million IPs without any spoofing. Only legal obligations to identify themselves could help, but that would be hard to implement and not without side effects.

Re: (Score:1)

by Tablizer ( 95088 )

> a large IP pool owner can spam you with requests from a million IPs without any spoofing.

Aren't owners required to publicly register their IP blocks?

If there is lots of traffic from a single owner, it can be throttled.
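A minimal sketch of throttling by address block rather than by single IP, assuming a token bucket per /24 prefix as a crude stand-in for "owner". Mapping addresses to their registered owner would need WHOIS/RDAP or routing data, which is not shown here. The class name PrefixThrottle and its parameters are hypothetical.

```python
import ipaddress
from collections import defaultdict


class PrefixThrottle:
    """Token bucket shared by every address in the same network prefix."""

    def __init__(self, rate_per_sec: float, burst: float, prefix_len: int = 24):
        self.rate = rate_per_sec      # tokens added per second
        self.burst = burst            # bucket capacity
        self.prefix_len = prefix_len  # assumption: /24 approximates one owner
        self._tokens = defaultdict(lambda: burst)
        self._last_seen = {}

    def _key(self, ip: str) -> str:
        # strict=False lets a host address stand for its enclosing network.
        return str(ipaddress.ip_network(f"{ip}/{self.prefix_len}", strict=False))

    def allow(self, ip: str, now: float) -> bool:
        """Return True if the request may proceed at time `now` (seconds)."""
        key = self._key(ip)
        elapsed = now - self._last_seen.get(key, now)
        self._last_seen[key] = now
        # Refill for the time elapsed, capped at the bucket capacity.
        self._tokens[key] = min(self.burst, self._tokens[key] + elapsed * self.rate)
        if self._tokens[key] >= 1.0:
            self._tokens[key] -= 1.0
            return True
        return False


throttle = PrefixThrottle(rate_per_sec=1.0, burst=2.0)
first = throttle.allow("203.0.113.7", now=0.0)
second = throttle.allow("203.0.113.99", now=0.0)   # same /24, shares the bucket
third = throttle.allow("203.0.113.200", now=0.0)   # bucket is now empty
other = throttle.allow("198.51.100.1", now=0.0)    # different /24
later = throttle.allow("203.0.113.7", now=1.0)     # one token refilled
```

This does nothing against the scenario raised in the parent comment, where requests are spread across many unrelated blocks.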

Aaron Swartz must be rolling in his tomb... (Score:2)

by dargaud ( 518470 )

AI is doing exactly what he attempted to do, except that he was [1]hounded [wikipedia.org] until he took his own life. And now it's somehow all okay.

[1] https://en.wikipedia.org/wiki/Aaron_Swartz

Dead Internet Theory Again (Score:2)

by Big Hairy Gorilla ( 9839972 )

More bots than peeps.

You can't index it, because it's growing like cancer.... therefore, you can't search it with any authority.... but since most stuff now is generated to get you to land on it for google small text ad farms... i.e. it's all garbage anyways. The source material is garbage, the generated material is garbage.

Most of the traffic is bots. The I-net is finished.

Pack it up and move on folks. Nothing to see here.

So idiotic (Score:2)

by bradley13 ( 1118935 )

There's no reason for millions of queries. How many models are being trained? These are just badly behaved bots, ruining a good thing (open access) for everyone else. Tragedy of the commons.

Re: (Score:2)

by dinfinity ( 2300094 )

"The online image repository DiscoverLife, which contains nearly 3 million species photographs "

How many queries do you think it takes to download all of those?

Scrapers should be classed as malware (Score:2)

by xack ( 5304745 )

Residential ISPs should not allow scrapers to launder through their networks. Microsoft, Apple, Google, antivirus, and router makers should all crack down on scraping botnets the same way they crack down on other malicious traffic.

Registration for "Expensive" content (Score:2)

by drinkypoo ( 153816 )

Stop allowing unregistered users to access even slightly [computationally] expensive content... Anything uncached, really.

Institute a delay and possibly an additional verification requirement before users can view the most expensive content.

Anything everyone can see should be aggressively cached.
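One possible reading of this policy as code, strictly as a sketch: cached pages go to anyone, while a cache miss (the expensive path) is only rendered for registered users. TTLCache, serve, and the status codes chosen are illustrative assumptions, not anything from a real server.

```python
from typing import Callable, Dict, Optional, Tuple


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, now: float) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None or now - entry[0] > self.ttl:
            return None
        return entry[1]

    def put(self, key: str, value: str, now: float) -> None:
        self._store[key] = (now, value)


def serve(path: str, registered: bool, cache: TTLCache,
          render: Callable[[str], str], now: float) -> Tuple[int, str]:
    """Serve cached pages to anyone; render uncached ones only for members."""
    cached = cache.get(path, now)
    if cached is not None:
        return 200, cached
    if not registered:
        # Uncached means expensive: refuse anonymous clients.
        return 403, "registration required"
    body = render(path)
    cache.put(path, body, now)
    return 200, body
```

The trade-off is that anonymous visitors only see a page after a registered user has warmed the cache. A real deployment would more likely pre-render public pages or put a caching proxy in front.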

Pestilence (Score:2)

by dskoll ( 99328 )

These badly-behaved bots, mostly from China, are a scourge on the Internet. I have a self-hosted gitea instance and I had to password-protect it to stop the bots from eating all my bandwidth, even after I banned huge swaths of the IPv4 space.

Proposed Additions to the PDP-11 Instruction Set:

PI      Punch Invalid
POPI    Punch Operator Immediately
PVLC    Punch Variable Length Card
RASC    Read And Shred Card
RPM     Read Programmers Mind
RSSC    reduce speed, step carefully (for improved accuracy)
RTAB    Rewind tape and break
RWDSK   rewind disk
RWOC    Read Writing On Card
SCRBL   scribble to disk - faster than a write
SLC     Search for Lost Chord
SPSW    Scramble Program Status Word
SRSD    Seek Record and Scar Disk
STROM   Store in Read Only Memory
TDB     Transfer and Drop Bit
WBT     Water Binary Tree