Anubis guards gates against hordes of LLM bot crawlers

(2025/07/09)


Anubis is a sort of CAPTCHA test, but flipped: instead of checking visitors are human, it aims to make web crawling prohibitively expensive for companies trying to feed their hungry LLM bots.

It's a clever response to a growing problem: the ever-expanding list of companies who want to sell "AI" bots powered by Large Language Models (LLMs). LLMs are built from a "corpus," a very large database of human-written text. To keep updating the model, an LLM bot-herder needs fresh text for their "corpus."

[1]Anubis is named after the [2]ancient Egyptian jackal-headed god who [3]weighed the heart of the dead, to determine their fitness. To protect websites from AI crawlers, the Anubis software [4]weighs their willingness to do some computation, in what is called a proof-of-work challenge.

A human visitor merely sees a jackal-styled animé girl for a moment, while their browser solves a cryptographic problem. For companies running large-scale bot farms, though, that means the expensive sound of the fans of a whole datacenter spinning up to full power. In theory, when scanning a site is so intensive, the spider backs off.
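
To give a flavor of what the browser is being asked to do, here is a minimal sketch of a generic proof-of-work challenge (illustrative only, not Anubis's actual code): the client brute-forces a nonce until a SHA-256 hash of the server's challenge plus that nonce starts with enough zeroes, using the browser's native WebCrypto.

    // Minimal, generic proof-of-work sketch (not Anubis's real scheme).
    // The server sends { challenge, difficulty }; the browser looks for a nonce
    // such that SHA-256(challenge + nonce) starts with `difficulty` zero hex digits.
    async function sha256Hex(input: string): Promise<string> {
      const data = new TextEncoder().encode(input);
      const digest = await crypto.subtle.digest("SHA-256", data); // native WebCrypto
      return Array.from(new Uint8Array(digest))
        .map((b) => b.toString(16).padStart(2, "0"))
        .join("");
    }

    async function solveChallenge(challenge: string, difficulty: number): Promise<number> {
      const target = "0".repeat(difficulty);
      for (let nonce = 0; ; nonce++) {
        if ((await sha256Hex(challenge + nonce)).startsWith(target)) {
          return nonce; // POSTed back to the server, which re-hashes once to verify
        }
      }
    }

The asymmetry is the point: a single visitor pays a fraction of a second of CPU time, while a crawler fetching millions of pages pays it millions of times over.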

There are existing measures to stop search engines crawling your site, such as a [8]robots.txt file. But as Google's explanation says, just having a robots.txt file doesn't prevent a web spider crawling through the site. It's an honor system, and that's a weakness. If the organization running the scraper chooses not to honor it – or your intellectual property rights – then they can simply take whatever they want, as often as they want.
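
For illustration, asking AI crawlers to stay away takes only a few lines of robots.txt (the user-agent tokens below are the commonly documented ones for OpenAI's and Anthropic's crawlers), but nothing in the protocol forces a bot to read the file, let alone obey it:

    # Advisory only - a scraper is free to ignore this entirely
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /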

Repeat visits are a big problem. It's cheaper to repeatedly scrape largely identical material than it is to store local copies of it – or as Drew DeVault put it, [9]please stop externalizing your costs directly into my face.

It was already a serious problem a year ago, when The Register [11]reported on ClaudeBot crawling a million times in one day. A year later, and despite signing deals, [12]Reddit sued Anthropic over it. It doesn't just affect forums and the like: [13]LWN is facing the problem. Tech manual publishing tool ReadTheDocs [14]reported one crawler downloading 73 terabytes in a month.

The underlying technology is not new. The idea of proof-of-work as an anti-spam measure goes back to Hashcash in 1997, to which The Reg [16]referred back in 2013. In a [17]Hacker News comment, Iaso also gave due credit:

I was inspired by [18]Hashcash , which was proof of work for email to disincentivize spam. To my horror, it worked sufficiently for my git server so I released it as open source. It's now its own project and protects big sites like GNOME's GitLab.

Other comments detail [19]how the proof of work is done, and we appreciated [20]this note:

The second reason is that the combination of Chrome/Firefox/Safari's JIT and webcrypto being native C++ is probably faster than what I could write myself. Amusingly, supporting this means it works on very old/anemic PCs like PowerMac G5 (which doesn't support WebAssembly because it's big-endian).
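
The server's side of that bargain is cheap: checking a submitted nonce is one hash, after which the visitor can be handed a signed pass so they are not challenged again for a while. A minimal Node/TypeScript sketch of that idea follows; the HMAC pass format here is our own simplification, not the token Anubis actually issues.

    // Server-side check of a submitted proof-of-work answer (illustrative sketch;
    // the pass/signing details are our own, not Anubis's actual format).
    import { createHash, createHmac } from "node:crypto";

    const SECRET = process.env.POW_DEMO_SECRET ?? "change-me"; // hypothetical

    function verifyNonce(challenge: string, nonce: number, difficulty: number): boolean {
      const hash = createHash("sha256").update(challenge + nonce).digest("hex");
      return hash.startsWith("0".repeat(difficulty)); // one hash, microseconds of work
    }

    // Issue a signed "pass" the browser can present on later requests,
    // so returning visitors are not challenged again until it expires.
    function issuePass(clientId: string, ttlSeconds: number): string {
      const expires = Math.floor(Date.now() / 1000) + ttlSeconds;
      const payload = `${clientId}.${expires}`;
      const sig = createHmac("sha256", SECRET).update(payload).digest("hex");
      return `${payload}.${sig}`;
    }

    function checkPass(pass: string): boolean {
      const [clientId, expires, sig] = pass.split(".");
      const expected = createHmac("sha256", SECRET)
        .update(`${clientId}.${expires}`)
        .digest("hex");
      return sig === expected && Number(expires) > Date.now() / 1000;
    }

In practice you would also bind the pass to something about the client and compare signatures in constant time, but the shape is the same: one cheap check per request for the site, one expensive search per challenge for the crawler.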

Iaso says that [21]Anubis works, and that post contains an impressive list of users, from UNESCO to the WINE, GNOME and Enlightenment projects. [22]Others agree too. Drew DeVault, quoted above, now uses it to protect his SourceHut code forge.

There are other such measures. [23]Nepenthes is an LLM bot tarpit: it generates endless pages of link-filled nonsense text, trapping bot-spiders. The [24]Quixotic and Linkmaze tools work similarly, while [25]TollBit is commercial.
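
For a taste of the tarpit approach (a toy sketch only; Nepenthes itself goes much further, generating Markov-style babble and deliberately serving it slowly), every page is nothing but links to more machine-generated pages:

    // Toy link-maze tarpit (illustrative only, not Nepenthes or Quixotic).
    import { createServer } from "node:http";

    function randomWord(): string {
      return Math.random().toString(36).slice(2, 8);
    }

    const server = createServer((_req, res) => {
      // Every page links only to further nonexistent pages, so a crawler
      // that follows them indiscriminately never runs out of URLs to fetch.
      const links = Array.from({ length: 20 }, () => {
        const path = `/${randomWord()}/${randomWord()}`;
        return `<li><a href="${path}">${randomWord()} ${randomWord()}</a></li>`;
      }).join("\n");
      res.writeHead(200, { "Content-Type": "text/html" });
      res.end(`<html><body><ul>\n${links}\n</ul></body></html>`);
    });

    server.listen(8080); // mount this under a path disallowed in robots.txt

Anything that ignores the robots.txt telling it to stay out of that path digs itself in ever deeper.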

[26]Automation needed to fight army of AI content harvesters stalking the web

[27]Reddit sues Anthropic for scraping content into the maw of its eternally ravenous AI

[28]Pirate Bay digs itself a new hole: Mining alt-coin in slurper browsers

[29]ChatGPT creates phisher’s paradise by recommending the wrong URLs for major companies

Some observers have suggested using the work performed by the browser to mine cryptocurrency, but that risks being deemed malicious. [30]Coinhive tried it nearly a decade ago, and [31]got blocked as a result. Here, we respect [32]Iaso's response:

It's to waste CPU cycles. I don't want to touch cryptocurrency with a 20 foot pole. I realize I'm leaving money on the table by doing this, but I don't want to alienate the kinds of communities I want to protect.

Others, such as the Reg FOSS desk's favorite internet guru Jamie Zawinski, are [33]less impressed:

I am 100 percent allergic to cutesey kawaii bullshit intermediating me and my readers with some maybe-cryptocurrency nonsense, so fuck to all of the no of that.

His prediction is pessimistic:

Proof of work is fundamentally inflationary, wasteful bullshit that will never work because the attacker can always outspend you.

It is wasteful – that's the point – but then, so is the vast traffic generated by these bot-feeding harvesters. Some would argue that LLM bots themselves are an even vaster waste of resources and energy, and we would not disagree. As such, we're in favor of anything that hinders them. ®



[1] https://anubis.techaro.lol/

[2] https://www.worldhistory.org/Anubis/

[3] https://egypt-museum.com/the-weighing-of-the-heart-ceremony/

[4] https://anubis.techaro.lol/docs/design/how-anubis-works/

[8] https://developers.google.com/search/docs/crawling-indexing/robots/intro

[9] https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html

[11] https://www.theregister.com/2024/07/30/taming_ai_content_crawlers/

[12] https://www.theregister.com/2025/06/05/reddit_sues_anthropic_over_ai/

[13] https://lwn.net/Articles/1008897/

[14] https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse/

[16] https://www.theregister.com/2013/05/23/bitcoin_spam_byzantine_generals/

[17] https://news.ycombinator.com/item?id=43423215

[18] https://en.wikipedia.org/wiki/Hashcash

[19] https://news.ycombinator.com/item?id=43424873

[20] https://news.ycombinator.com/item?id=43423215

[21] https://xeiaso.net/notes/2025/anubis-works/

[22] https://fabulous.systems/posts/2025/05/anubis-saved-our-websites-from-a-ddos-attack/

[23] https://zadzmo.org/code/nepenthes/

[24] https://marcusb.org/hacks/quixotic.html

[25] https://tollbit.com/

[26] https://www.theregister.com/2024/07/30/taming_ai_content_crawlers/

[27] https://www.theregister.com/2025/06/05/reddit_sues_anthropic_over_ai/

[28] https://www.theregister.com/2017/09/19/pirate_bay_bitcoin_mining_script/

[29] https://www.theregister.com/2025/07/03/ai_phishing_websites/

[30] https://www.theregister.com/2017/09/19/pirate_bay_bitcoin_mining_script/

[31] https://www.theregister.com/2017/10/19/malwarebytes_blocking_coin_hive_browser_cryptocurrency_miner_after_user_revolt/

[32] https://news.ycombinator.com/item?id=43424613

[33] https://www.jwz.org/blog/2025/06/under-attack-please-stand-by/



Proof of work is nice and all...

Mentat74

But they're still stealing all of your data and bandwidth...

Poisoning the well seems to be the only effective way to tell them to fark off!

Re: Proof of work is nice and all...

Steven Raith

Except they're not getting the data, because the headless bots that do the scraping can't perform the proof of work in a timely manner (due to using minimal resources, to get as many bots in a hosted instance as possible, I assume), and so never get to reach the page. Regular users using browsers get a cookie set after the first instance, and then they're left alone for however long you configure it to leave them alone for.

We've implemented it on quite a few sites now (mostly higher education - so higher reading age, valuable to LLMs, but can't be put on Cloudflare because we don't own the domain and can't justify several thousand pounds a month on the mystery "enterprise" subscription to get that tickbox in CF) in a white label manner (available if you sponsor the author - which we do) and it's been horrifically eye-opening.

These sites are not hosted on small systems (in many cases, dedicated hosts with decent CPU/RAM/NVME storage, tuned to suit) but when you're getting 800 requests a second for a full stack index search that has to be run through a perl compiler, that's gonna bring pretty much anything down.

And as the sites normally run around the ten hits per second range, performance tuning for the bots' benefit would be... a bit pointless.

800 Anubis requests a second though? Very light, the server barely notices those. Load goes from 40 to under 0.4

On most sites we use it on, >99.99% of traffic that didn't come from their own network or Jisc JANET (which gets an exception obviously) was blocked. Out of hundreds of thousands of requests a day, only a few hundred got through. And no complaints of access problems from the typically very observant clients.

After a week of this, the bots moved on. They've come back since and gone away, and come back, but the site barely notices now.

I don't think people realise the scale of this problem - this level of abuse, which is absolutely a Distributed Denial of Service attack in all but name, genuinely should be criminal.

Steven R

Re: Proof of work is nice and all...

Long John Silver

Thank you for the informative explanation.

Re: Proof of work is nice and all...

Steven Raith

No worries - Anubis has kinda come out of nowhere if you aren't involved in fairly content/text heavy archive-type sites (your Githubs, your documentation systems, your library sites etc) so it all sounds a bit 'too good to be true' - and it's not perfect, but by gum this is one of those cases where you don't want to let perfect get in the way of plenty good enough (for now).

I fully expect there to be an arms race, but I'm struggling to see how the mass-scale AI crawlers, using hundreds of threads per instance, can possibly get past just having to burn huge amounts of resources to do the math to get access to 'my' (well, my clients, but you know...) resources.

I'm quite sure that a few fly-by-night 'AI crawler provider' services will be looking at their AWS bill, the lack of data they've got, and shitting themselves. The ones stupid enough to not be running them off hijacked set top boxes and IoT devices - one suspicion is that someone's bought out one of those suppliers, and is using them to run the traffic. The amount of traffic we've seen coming from domestic ISPs internationally (Brazil, Romania, China etc) would certainly back that up.

Some more background from a decent wee hosting company back from when this started to be A Fucking Problem:

https://www.mythic-beasts.com/blog/2025/04/01/abusive-ai-web-crawlers-get-off-my-lawn/

Steven R

Re: Proof of work is nice and all...

doublelayer

"Except they're not getting the data, because the headless bots that do the scraping can't perform the proof of work in a timely manner (due to using minimal resources, [...] Regular users using browsers get a cookie set after the first instance, and then they're left alone for however long you configure it to leave them alone for."

Which means this will work for as long as it takes for the author of the bots to add a cross-thread cookie jar to their bot. Or in other words, about five minutes after they notice this. Some of the largest bots are run by a place that has lots of programmer-hours to put into their bots and lots of cash to burn on training, meaning they can either defeat this or absorb the cost, whichever they find cheaper.

In the meantime, this will break any user that doesn't run JS by default, it will cause lots of perceived lag in accessing your site, it will be annoying for anyone that doesn't at least keep cookies temporarily, and when the arms race starts, it will make the experience much worse for anyone with low-power hardware like mobile phones because the only way this will work when being actively resisted is by increasing the work that needs to be done.

Re: Proof of work is nice and all...

Jason Bloomberg

genuinely should be criminal.

And perhaps we shouldn't just bring back hanging, but add nuking as well, as many times as it takes.

I hate it when these cunts make me a not nice person.

Go one better

Throatwarbler Mangrove

In the novel Liege-Killer, one of the characters, a synthetic person, is essentially hypnotized into answering an endless series of questions, a trap made possible by his synthetic nature. Perhaps the next step of this countermeasure is to create a problem which a regular human (or non-AI browser) will easily bypass but which an AI will hang up on. What that looks like in particular, I don't know, but I'm sure brighter minds than mine can figure it out.

Re: Go one better

Long John Silver

Many, many, years ago, and long before it became unwatchably 'woke', there was an episode of 'Dr Who' during which The Doctor defeated a malevolent computer by stating a logical antinomy akin to 'the liar paradox' and asking whether the statement was true. The computer did not seek a 'get out' via, for example, the Russell/Whitehead 'theory of types', and it duly went into an endless loop and exploded.

Re: Go one better

doublelayer

The LLM itself is not solving the challenges. The challenges are being completed by the retrieval bot, a normal piece of software which is similar to or even the same as the browsers humans are using. That's why they're hoping that comparative expense will do it; it is almost impossible to come up with something a bot can't do and software used by humans can.

I want some crawlers but not all of them

alain williams

I want the ones that benefit me: the search engines (google, bing, etc); they help people find my pages. These spiders tend to be well behaved and do not overload my machine.

I do not want the ones that just suck my data but bring me no benefit: LLM crawlers. These crawlers do not care what they do to my servers, they grab too many pages per second.

The problem is how to distinguish the two.

Re: I want some crawlers but not all of them

Steven Raith

User agents, typically - the current scourge of AI bots are pretending to be regular browsers. They're quite deliberately not identifying themselves as crawlers.

Initially - a few months ago - they were just using a dumb lookup table of user agents and picking at random, so you'd see Presto 5 on Mac OS 10.2 or Windows XP using Trident, or even Win CE Internet Explorer 4 user agents. Basically, if you set a redirect to block any major browser that was more than, say, two years old, you knocked out 90% of the bot traffic. A rough solution for sure, but if you couldn't use Cloudflare for whatever reason (and there are valid ones, like not owning the domain, if you're hosting for other people) then that at least kept the site up, even if it got in the way sometimes.

But then they caught on and started using more modern browsers, but they can't do crypto challenges like modern browsers on real devices, and that's where this tool comes in - we'd been testing it up to that point, then kinda had no choice but to go live with it, and it worked a treat.

You can put in exceptions to allow IP ranges and user agents through without a challenge, so you can let Google, Bing, etc in without a challenge if you want to be SEO'd, but challenge everything else if you like. I believe geographic/ASN filtering is upcoming too, which will be very handy...

Hope that helps clear that up a bit :-)

Steven R
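
A rough sketch of the exception logic described in that comment: let known good crawlers and your own networks through untouched and challenge everyone else. The user agents and address prefixes below are placeholders, and Anubis's real policy rules are more expressive than this.

    // Decide whether a request should be challenged (illustrative only;
    // the allow-lists below are simplified placeholders).
    type Request = { ip: string; userAgent: string };

    const ALLOWED_AGENTS = [/Googlebot/i, /bingbot/i];    // assumed examples
    const ALLOWED_PREFIXES = ["192.0.2.", "198.51.100."]; // placeholder ranges

    function shouldChallenge(req: Request): boolean {
      if (ALLOWED_PREFIXES.some((p) => req.ip.startsWith(p))) return false;
      if (ALLOWED_AGENTS.some((re) => re.test(req.userAgent))) return false;
      return true; // everyone else proves a little work first
    }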

Re: I want some crawlers but not all of them

Jamie Jones

Install a tar pit.

Put the tarpit URL in robots.txt with a "Disallow" directive for every user agent, and then parse the log files with a cronned script for obvious bot access to that resource.

All good bots will honour robots.txt.

Use this list of IP addresses to block access to the whole server, and others, for X amount of days.

That will stop the majority of AI bots, without screwing up those who don't enable JavaScript, or use something like "w3m"... Who uses text-based browsers these days? Well, often the sort of people who want to access that very git repository you are effectively blocking!

See here: [1]https://zadzmo.org/code/nepenthes/

[1] https://zadzmo.org/code/nepenthes/
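
A minimal sketch of the log-scanning step in that recipe, assuming a combined-format access log and a hypothetical tarpit path; the resulting list would be fed to a firewall or fail2ban-style tool for the temporary ban:

    // Scan an access log for clients that fetched the robots.txt-disallowed
    // tarpit path, and print their IPs for blocking (illustrative sketch;
    // log path, format and tarpit URL are assumptions, adjust to taste).
    import { readFileSync } from "node:fs";

    const LOG_FILE = "/var/log/nginx/access.log"; // hypothetical
    const TARPIT_PATH = "/nepenthes/";            // hypothetical

    const offenders = new Set<string>();
    for (const line of readFileSync(LOG_FILE, "utf8").split("\n")) {
      const ip = line.split(" ")[0]; // common/combined log format starts with the client IP
      if (ip && line.includes(`GET ${TARPIT_PATH}`)) {
        offenders.add(ip);
      }
    }

    // Hand this list to your firewall of choice (iptables, nftables, fail2ban...).
    for (const ip of offenders) {
      console.log(ip);
    }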

Re: I want some crawlers but not all of them

DS999

Search engines are caching the content so they probably aren't making a million requests in a single day like iFixit reported Claude was doing to them.

A better solution would be for Anubis to offer "licenses" to responsible crawlers. So if Google or Bing or whoever wants to crawl a site they can get some sort of cryptographic key that allows bypassing Anubis. So when a site installs Anubis they might have a configuration section that shows all the parties that have been granted licenses and for what, under broad categories. So you have a "search engine" category and unless you want to block them (or block one or two specific ones) you leave the whole category enabled. Maybe there's an "archiver" section so Wayback Machine has its license. There could also be a control panel for AI crawlers - it would obviously default to off since site owners install Anubis to block them but if for example they come to an agreement with OpenAI they can enable OpenAI's license to crawl their site while blocking all the rest.

Anubis is actually a GENIUS solution that not only allows sites to block abusive AI crawlers but would also let them reassert control over their IP instead of giving it away for free. If your site hosts something valuable to AI companies then they would have to compensate you. So instead of letting AIs grab all the information in the Reg's article and comments for free, the Register could demand compensation to let them through the Anubis gate. For a site with valuable enough content, that could become a decent revenue stream and maybe they wouldn't have to rely as much upon ads.

Maybe I'm dreaming on that last part - it would be cheaper for AI companies to start caching everything (and pay the one time price of the Anubis proof of work) than it would for them to actually pay for the IP they're stealing.

Bullwinkle ... Again?!

Anonymous Coward

human visitor merely sees a jackal-styled animé girl

That's already been done in various animes.

And how long until...

Mentat74

The likes of Google / Microsoft figure out that they could use their Chrome / Edge browsers to do the scraping for them?

Or even just bake it into every Android / ChromeOS / Windows device?

That way the compute penalty would be distributed over millions of other people's PCs...

Re: And how long until...

Paul Crawford

Shh! Don't give them any more horrible ideas than they already have!
