Cloudflare builds an AI to lead AI scraper bots into a horrible maze of junk content
(2025/03/21)
- Reference: 1742527875
- News link: https://www.theregister.co.uk/2025/03/21/cloudflare_ai_labyrinth/
- Source link:
Cloudflare has created a bot-busting AI to make life hell for AI crawlers.
The network-taming company built the tool after noticing that almost one percent of all requests to access web content that it can see now come from AI crawler bots. Those bots are probably scraping data that’s gathered up to train AI models.
Website operators can, in theory, block AI crawlers by various means, such as a [1]robots.txt file or web server settings that disallow visits from bots. Some even use CAPTCHAs to test whether visitors to a site are human, or adopt software designed to stymie bots.
In reality, crawler operators ignore the instructions in robots.txt files, or work around CAPTCHAs and web server settings. The result is a lot of unwanted crawler traffic consuming resources, and info fed into training data without creators’ permission – a contentious practice currently being [3]tested in court amid allegations of copyright abuse.
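Part of the reason robots.txt is so easy to ignore is that the check runs inside the crawler, not on the server. The Python sketch below uses the standard library's urllib.robotparser to show what a well-behaved crawler does before fetching a page. The bot name "ExampleAIBot" and the example.com URL are made up for illustration; they do not come from Cloudflare or from any real crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: "ExampleAIBot" is an invented crawler name.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A compliant crawler runs this check itself and then backs off.
    print(is_allowed("ExampleAIBot", "https://example.com/article"))
    print(is_allowed("SomeBrowser", "https://example.com/article"))
```

Nothing forces a crawler to call is_allowed() at all, which is why the file works as a request rather than a barrier.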
No human would go four links deep into a maze of AI-generated nonsense
Cloudflare’s response is to let crawler bots in and use generative AI to create junk content for them to devour in what the company has termed an “AI Labyrinth”.
“When we detect unauthorized crawling, rather than blocking the request, we will link to a series of AI-generated pages that are convincing enough to entice a crawler to traverse them,” [4]explained Cloudflare’s Reid Tatoris, Harsh Saxena, and Luis Miglietti. Cloudflare uses its own serverless Workers to create the content.
The trio wrote that the content is “real looking” but “not actually the content of the site we are protecting, so the crawler wastes time and resources.” The content is also “real and related to scientific facts” because Cloudflare doesn’t want to inadvertently create misinformation.
[7]We did not have Brave clashing with Rupert Murdoch on our 2025 bingo card, but there it is
[8]Creators demand tech giants fess up and pay for all that AI training data
[9]Fining Big Tech isn't working. Make them give away illegally trained LLMs as public domain
[10]Major publishers sue Perplexity AI for scraping without paying
The AI slop is also designed not to mess with sites’ reputations or search engine optimization efforts.
It is, however, designed to act as a deterrent to crawler operators, by keeping their bots busy and thereby increasing the cost of operating content scrapers.
These aren’t the droids you’re looking for
The showrunner of Disney+ Star Wars spinoff series Andor has walked back his plan to place the show’s scripts online.
In an [11]interview with Collider, showrunner Tony Gilroy said he prepared the scripts so they could be posted online, then changed his mind.
“I wanted to do it,” he said. “AI is the reason we're not. I mean, terribly sadly, it's just too much of an X-ray and too easily absorbed.”
“Why help the fucking robots any more than you can?”
“So, it was an ego thing. It was vanity that makes you want to do it, and the downside is real. So, vanity loses.”
Cloudflare thinks this stuff is also a useful tool to detect bot activity.
“No real human would go four links deep into a maze of AI-generated nonsense,” Cloudflare’s trio wrote. “Any visitor that does is very likely to be a bot, so this gives us a brand-new tool to identify and fingerprint bad bots, which we add to our list of known bad actors.”
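The idea in that quote can be sketched in a few lines of Python. This is an illustration only, not Cloudflare's implementation: the "/labyrinth/" path prefix, the visitor IDs, and the choice to count decoy page requests as a stand-in for link depth are all assumptions made for the example. Only the threshold of four comes from the quote.

```python
from collections import defaultdict

DECOY_PREFIX = "/labyrinth/"  # hypothetical path prefix for decoy pages
DEPTH_THRESHOLD = 4           # "four links deep", per the quote above


class DecoyDepthTracker:
    """Flag visitors who request too many decoy pages."""

    def __init__(self, threshold: int = DEPTH_THRESHOLD) -> None:
        self.threshold = threshold
        self.depth = defaultdict(int)  # visitor ID -> decoy pages requested
        self.flagged = set()           # visitor IDs treated as likely bots

    def record(self, visitor_id: str, path: str) -> bool:
        """Record one request; return True if the visitor is now flagged."""
        if path.startswith(DECOY_PREFIX):
            self.depth[visitor_id] += 1
            if self.depth[visitor_id] >= self.threshold:
                self.flagged.add(visitor_id)
        return visitor_id in self.flagged


if __name__ == "__main__":
    tracker = DecoyDepthTracker()
    for page in range(1, 5):
        hit = tracker.record("visitor-a", f"{DECOY_PREFIX}page-{page}")
        print(page, hit)
```

A real system would also need a reliable way to identify a visitor across requests, which is the fingerprinting work the Cloudflare authors describe.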
This sort of thing usually creates an arms race and Cloudflare is already thinking about what it will take to stay ahead.
“In the future, we’ll continue to work to make these links harder to spot and make them fit seamlessly into the existing structure of the website they’re embedded in,” its authors wrote.
Cloudflare customers can enable the AI Labyrinth in their management consoles. ®
[1] https://www.robotstxt.org/
[2] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=2&c=2Z9zyf8ygvuGLPPoY0qhD5AAAAhU&t=ct%3Dns%26unitnum%3D2%26raptor%3Dcondor%26pos%3Dtop%26test%3D0
[3] https://www.theregister.com/2025/03/11/meta_dmca_copyright_removal_case/
[4] https://blog.cloudflare.com/ai-labyrinth/
[5] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44Z9zyf8ygvuGLPPoY0qhD5AAAAhU&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[6] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33Z9zyf8ygvuGLPPoY0qhD5AAAAhU&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0
[7] https://www.theregister.com/2025/03/13/brave_news_corp_content/
[8] https://www.theregister.com/2025/02/07/ai_training_data_committee/
[9] https://www.theregister.com/2024/12/22/ai_poisoned_tree/
[10] https://www.theregister.com/2024/10/22/publishers_sue_perplexity_ai/
[11] https://collider.com/andor-season-2-preview-tony-gilroy-collateral-damage/
[12] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44Z9zyf8ygvuGLPPoY0qhD5AAAAhU&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[13] https://whitepapers.theregister.com/
Re: AI generated content is poison for AI
Scotech
Depends on the crawler. The best ones are built on the back of a traditional search engine crawler's index, which includes cues regarding page keyword rankings and site authority. They can incorporate this data into the training, and from there, it can affect the weightings in the resulting model. So what happens if a highly authoritative site introduces a bucket load of AI slop into the mix?
Should be fun to watch!
Clone wars
Omnipresent
Begun, they have.
Let me off the ride
Winkypop
I’m feeling sick.
Homo.Sapien.Floridanus
bad bots, bad bots
watcha gonna do?
watcha gonna do
when they come for you?
AI generated content is poison for AI
This is rather deep.
AI generated content has been shown to very quickly [1]poison any AI built on it, even if the content itself is perfectly fine. So this strategy not only protects websites from unwanted visitors, it will also help us to more easily recognize the resulting bad chatbots.
However, as much of the content published on the internet is already AI generated, it might not change that much for the crawlers.
But if the expected arms race leads to innovative AI being able to recognize AI generated content, that would itself be a valuable outcome.
[1] https://www.scientificamerican.com/article/ai-generated-data-can-poison-future-ai-models/