Copyright-ignoring AI scraper bots laugh at robots.txt, so the IETF is trying to improve it
- Reference: 1744182130
- News link: https://www.theregister.co.uk/2025/04/09/ietf_ai_preferences_working_group/
Named the AI Preferences Working Group (AIPREF), the group has been asked to develop two things:
- A common vocabulary to express authors' and publishers' preferences regarding use of their content for AI training and related tasks;
- Means of attaching that vocabulary to content on the Internet, either by embedding it in the content or by formats similar to robots.txt, and a standard mechanism to reconcile multiple expressions of preferences.
The AIPREF [1]charter suggests “attaching preferences to content either by including preferences in content metadata or by signaling preferences using the protocol that delivers content” as the ways to get this done.
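The charter leaves the concrete mechanism open, so anything more specific is guesswork at this stage. As a rough sketch of the second option – signaling preferences in the protocol that delivers content – a crawler could check a response header before deciding whether to ingest a page. The header name AI-Pref and its values below are invented placeholders, not anything AIPREF has specified:

```python
# Hypothetical sketch only: "AI-Pref" and its values are invented placeholders,
# not part of any published AIPREF specification.
import urllib.request

def training_preference(url: str) -> str:
    """Fetch a page and report the publisher's (hypothetical) AI-use preference."""
    with urllib.request.urlopen(url) as response:
        # A preference signaled "using the protocol that delivers content"
        # would arrive alongside the body, for example as a response header.
        return response.headers.get("AI-Pref", "unspecified")

# A compliant crawler would skip training on pages marked, say, "train=n".
print(training_preference("https://example.com/article"))
```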
AIPREF co-chair Mark Nottingham [2]thinks those items are needed because the “non-standard signals” that have crept into robots.txt files – an IETF [3]standard that defines syntax for stating whether crawlers are allowed to access web content – aren’t working.
“As a result, authors and publishers lose confidence that their preferences will be adhered to, and resort to measures like blocking their [AI vendors’] IP addresses.”
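For context, RFC 9309 itself only standardises a per-user-agent yes/no on fetching a path, which a well-behaved crawler is supposed to check before downloading anything. A minimal check with Python's standard-library parser, using GPTBot (OpenAI's crawler user agent) as the example, might look like the sketch below; training preferences, and a way to reconcile conflicting signals, are exactly what this model lacks:

```python
# What RFC 9309 already covers: per-user-agent allow/disallow rules.
# "GPTBot" is used here as an example AI crawler user agent.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the site's robots.txt

# A rule-following crawler checks before fetching; the complaint is that this
# yes/no access model says nothing about training, and that some crawlers
# ignore the file altogether.
print(parser.can_fetch("GPTBot", "https://example.com/article"))
```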
Content creators resort to IP blocking because major model-makers did not ask for permission or seek licenses before scraping the internet for content needed to train their AIs.
OpenAI is now [7]lobbying for copyright reform that would allow it to scrape more content without payment.
Copyright-holders are fighting back with [9]lawsuits against those who used copyrighted material to build their models, and by [10]signing licensing deals that see AI players pay to access content.
AI crawlers are also costing publishers money. The Wikimedia Foundation recently [11]complained that the bandwidth it devotes to serving image retrieval requests has risen by 50 percent over the last year, mostly because of AI crawlers downloading material.
The IETF doesn't care about those legal and operational matters: it just wants to build tech that lets people express their preferences, in the hope that scraper operators buy in and ingest only content that creators are happy to have fed into AIs.
[12]Cloudflare builds an AI to lead AI scraper bots into a horrible maze of junk content
[13]AI crawlers haven't learned to play nice with websites
[14]OpenAI's ChatGPT crawler can be tricked into DDoSing sites, answering your queries
[15]Websites clamp down as creepy AI crawlers sneak around for snippets
To get the ball rolling, AIPREF met at the IETF 122 conference in mid-March, and has already developed two draft proposals. One [16]proposes “Short Usage Preference Strings for Automated Processing” and suggests those strings could be used in robots.txt files or HTTP header fields.
The other, from the Common Crawl Foundation, is titled [17]Vocabulary for Expressing Content Preferences for AI Training. It likewise suggests storing preferences in robots.txt files or HTTP header fields, and additionally proposes carrying the vocabulary in HTML <meta> tags.
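Since neither draft's syntax is final, any example here is guesswork. As a rough sketch of where such a preference might live, assuming an invented field name Content-Usage and an invented value train-genai=n:

```python
# Sketch only: "Content-Usage" and "train-genai=n" are invented placeholders,
# not the syntax defined in the AIPREF drafts.
import re
import urllib.request

def robots_txt_preference(site: str) -> str | None:
    """Look for a hypothetical usage-preference line in /robots.txt."""
    with urllib.request.urlopen(f"{site}/robots.txt") as resp:
        for line in resp.read().decode("utf-8", "replace").splitlines():
            if line.lower().startswith("content-usage:"):
                return line.split(":", 1)[1].strip()
    return None

def meta_tag_preference(html: str) -> str | None:
    """Look for the same vocabulary embedded in an HTML <meta> tag."""
    match = re.search(
        r'<meta\s+name=["\']content-usage["\']\s+content=["\']([^"\']+)["\']',
        html,
        re.IGNORECASE,
    )
    return match.group(1) if match else None

print(robots_txt_preference("https://example.com"))
print(meta_tag_preference('<meta name="content-usage" content="train-genai=n">'))
```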
AIPREF is [19]meeting this week, although one planned session appears to have been cancelled.
The Working Group has given itself a deadline of August 2025 to deliver proposals. Participants seem to know that’s a tight deadline and that the group will therefore need to act with some urgency. ®
[1] https://datatracker.ietf.org/doc/charter-ietf-aipref/
[2] https://www.ietf.org/blog/aipref-wg/
[3] https://www.rfc-editor.org/rfc/rfc9309.html
[7] https://www.theregister.com/2025/04/03/openai_copyright_bypass/
[9] https://www.theregister.com/2025/01/10/meta_libgen_allegation/
[10] https://www.theregister.com/2024/05/23/openai_news_corp/
[11] https://www.theregister.com/2025/04/03/wikimedia_foundation_bemoans_bot_bandwidth/
[12] https://www.theregister.com/2025/03/21/cloudflare_ai_labyrinth/
[13] https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/
[14] https://www.theregister.com/2025/01/19/openais_chatgpt_crawler_vulnerability/
[15] https://www.theregister.com/2024/07/22/ai_training_data_shrinks/
[16] https://www.ietf.org/archive/id/draft-thomson-aipref-sup-00.html
[17] https://datatracker.ietf.org/doc/draft-vaughan-aipref-vocab/
[19] https://datatracker.ietf.org/doc/agenda-interim-2025-aipref-01-sessa/
Re: The only thing that will work....
The load is not negligible in some quiet but important applications (e.g. the kind that let you look up information for multiple scenarios), where AI bots can crash the system or burn through pricing tiers on cloud deployments. That leaves organisations, including public bodies, with the bill, on top of having had an AI scrape their content without consent. In many cases with published info like that, if the scrapers asked they could have a CSV, JSON, or XML export, instead of effectively stealing tax money by maxing out HTTP servers.
They ignore robots.txt, so they'll ignore this too.
The only benefit is as evidence in a later lawsuit.
I have a big "PLEASE DO NOT BURGLE ME AND STEAL MY VALUABLE PROPERTY" sign on my front lawn and STILL people keep burgling me. Perhaps if I make the sign bigger, use a fancier font, and surround it with flashing lights, they'll take heed of my preference not to be burgled?
No, no, obviously the problem is that you are not specific enough in your sign. As this standard suggests, your sign should say 'Okay, you complete sociopaths can take all my biscuits and tea, but my Meghan porn DVDs are completely off limits - that would be beyond the pale, you monsters!' And I'm sure they'll nod, 'Right then, we're not complete monsters, innit?'
This is like putting up a sign...
"Dear burglars , Please don't steal my stuff... I will be VERY sad if you steal all of my stuff !!!! So don't steal my stuff OK ?"
And expecting all burglars to honor it...
They don't ignore robots.txt; they actively use it to find content.
It's like putting up a sign saying "don't burgle these rooms" because that's where the loot is.
Yeah, good luck with that.
AI companies are all sociopaths who would happily grind live puppies and kittens into blood meal if they could make five cents on it (I have not seen any counterexamples yet, though I guess they could exist, but would get no VC funding). They're like Elmo and the Angry Toddler and Zuck and will completely ignore anything that relies on any sort of social contract or basic human decency. I know, 'basic human decency' is extreeeeemely thin and parsimonious even at the best of times, but they couldn't even meet the most meager standards for that, like not spitting on a guy dying in a gutter for being a n00b who needs to git gud.
I think this is pretty f#$ing obvious now, so any standard that actually relies on the other party not being techbro child molesters is a complete waste of time. Well, at least it lets you prove that you gave them the option and they just ignored it? If that's the only reason this exists, and you realize that, then I guess it will work just fine, because they will ignore it and you can go 'aha'!
Now this is kind of making me nostalgic for the early days of the internet, when so many dumbasses insisted the internet could just route around censorship as damage and corporations or governments couldn't possibly constrain Good White Men (always white men, of course good) who just wanted to post on usenet while all the women and coloreds took care of the trivial stuff.
Re: Yeah, good luck with that.
Actually, sorry to reply to my own comment, but I did remember that DeepMind has been basically good.
It's a goddamn shame they're owned by Google, which is overwhelmingly evil now, but I think they've still managed to maintain (for now) their balance of doing almost all good things for good reasons. So I wanted to not lump them in with the bastards at Meta, OpenAI, MS, Claude, DeepSeek, etc. I do not think they would grind puppies and kittens into bloodmeal until one of Sundar's incompetent nephews (Just Desi Things TM) demanded it, and maybe not even then!
Make it law that unless there's some kind of explicit consent that says you can, then you can't.
Oh, we already did, but nobody cares and nobody enforces it. So we had to invent random nonsense like robots.txt to try to entice them to obey, which really just tells them what you DON'T want them to access (while telling them exactly where it is), which is completely optional, which most bots just ignore, and which AI bots don't even bother to check.
This is why everything ends up behind paywalls and accounts. Because people think they can just take whatever they see and do whatever they want with it.
Because, honestly, people with enough money CAN just take whatever they see and do whatever they want with it. Justice is of the rich, by the rich, for the rich. If any normal bloke had done the shite Boris, Elmo, Zuck, and the Angry Toddler have, they'd have been in the nick long ago. And no amount of standards will change this.
Thieves steal.
Thieves don't pay attention to locked doors, private property or no trespassing signs, the only thing they pay attention to are the punishments.
And, currently, there are no punishments.
That's going to change but probably not soon enough.
Who in their right mind is going to consent to AI bots using their bandwidth and server resources to ingest content into an LLM, with no benefit in return?
I say blocking their IP ranges is entirely appropriate: robots.txt or other flags might get respected by some AI companies, but lots of them will just scrape anyway. As we have seen, they really don't care about copyright, unless it's their own intellectual property getting stolen, that is.
P*ssing in the wind?
'AI' technology is rapidly escaping into the wild. Its penetration is actively supported by one faction of 'capital' (i.e. major players in IT), and strongly opposed by another faction (i.e. smug rentiers of information). It's a replay of brash capitalists of the 18/19th centuries introducing mechanisation and being pitted against Luddites; in the present case, each side commands considerable wealth, and neither merits sympathy as being 'downtrodden'.
What's happening shall in retrospect be deemed an inevitable consequence of emerging digital technologies, these combined with the global reach of the Internet. Meanwhile, there will be heated legal tussles and much prattling in legislatures.
Continuing the analogy to times past, the 'hungry and imaginative' IT-related corporations shall trample over the time-expired and complacent neo-Luddites. On the one side, huge amounts of capital will be expended, much of it to be lost, whilst giants and start-ups pursue ideas which include many dead-ends. The other side will wither on its hitherto bountiful vine.
There is a glimmer of hope of mankind, as a whole, benefitting. This arises in part from the likelihood of computational power continuing to increase whilst cost and energy consumption decrease or stabilise. That is, the development and training of 'AI' implementations is set to escape the grip of extant financial powers and fall within the remit of small companies, universities, professional associations, charitable institutions, and Internet-connected amateurs. From these intermediaries shall flow finely-tuned, cut-down packages suitable for use on 'domestic-specification' devices. As with all software products, these will reproduce willy-nilly regardless of attempts to package them as proprietary.
Sod off, this is NOT a solution
We little people need copyright to be enforced as effectively as it is for the music labels where a few seconds of something in the background gets a takedown.
That so many government idiots seem to be keen on going the other way because AI bots shit out rainbows means that the end result is likely to be everybody ignoring copyright entirely. I mean, if I get ripped off and published authors get ripped off, what's the point?
Fuck this AI shit right between the eyeballs.
The only thing that will work....
... is a Spamhaus-like blacklist of content-hoarding AI IPs. AI companies are no better than spammers.