News: 0180089081


While Meta Crawls the Web for AI Training Data, Bruce Ediger Pranks Them with Endless Bad Data (bruceediger.com)

(Saturday November 15, 2025 @05:22PM (EditorDavid) from the unfriending dept.)


[1]From the personal blog of interface expert Bruce Ediger:

> Early in March 2025, I noticed that a web crawler with a user agent string of

>

> meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)

>

> was hitting my blog's machine at an unreasonable rate.

>

> I followed the URL and discovered this is what Meta uses to gather premium, human-generated content to train its LLMs. I found the rate of requests to be annoying.

>

> I already have a PHP program that creates the illusion of an [2]infinite website. I decided to answer any HTTP request that had "meta-externalagent" in its user agent string with the contents of a bork.php generated file...

>

> This worked brilliantly. Meta ramped up to requesting 270,000 URLs on May 30 and 31, 2025...

>

> After about 3 months, I got scared that Meta's insatiable consumption of Super Great Pages about condiments, underwear and circa 2010 C-List celebs would start costing me money. So I switched to giving "meta-externalagent" a 404 status code. I decided to see how long it would take one of the highest valued companies in the world to decide to go away.

>

> The answer is 5 months.
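
The two phases described in the quote (serve generated pages, then switch to a 404) can be sketched in Python. This is not the author's bork.php, which the summary does not show; it is a minimal illustration of matching on the user agent string. The names `respond`, `fake_page`, and `WORDS`, the case-insensitive match, and the `mode` switch are all assumptions made for the example.

```python
import random
from typing import Tuple

# Substring taken from the user agent string quoted in the story.
BLOCKED_TOKEN = "meta-externalagent"

# Hypothetical vocabulary; the real generator's word list is not shown.
WORDS = ["condiments", "underwear", "celebrity", "premium", "super", "great", "pages"]


def fake_page(path: str, n_words: int = 40, n_links: int = 5) -> str:
    """Build a filler page whose links lead to more filler pages."""
    # Seeding with the path makes the same URL return the same page,
    # which helps the site look like a stable, very large website.
    rng = random.Random(path)
    text = " ".join(rng.choice(WORDS) for _ in range(n_words))
    links = "".join(
        f'<a href="/{rng.getrandbits(32):08x}">more</a>\n' for _ in range(n_links)
    )
    return f"<html><body><p>{text}</p>\n{links}</body></html>"


def respond(user_agent: str, path: str, mode: str = "bork") -> Tuple[int, str]:
    """Return (status, body) for a request, based on its user agent."""
    if BLOCKED_TOKEN in user_agent.lower():  # case-insensitive match is an assumption
        if mode == "bork":
            return 200, fake_page(path)  # phase one: endless generated pages
        return 404, "Not Found"  # phase two: a 404 for every URL
    return 200, "real content"  # placeholder for the normal site
```

Calling `respond(ua, "/some/path")` with the crawler's user agent returns a generated page; passing `mode="404"` reproduces the later behaviour.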



[1] https://bruceediger.com/posts/goofing-on-meta/

[2] https://bruceediger.com/posts/anti-seo-infinite-website/



media (Score:1)

by wokka1 ( 913473 )

Many, many years ago, back when Napster was all the rage, record labels were searching the web with bots looking for hosted media. There was a movement to put generated lists of MP3s and fake files on our websites. I thought I had an archive of mine, but it seems I removed it too long ago.

This story reminds me of that time. Good on him.

Re: (Score:2)

by NotEmmanuelGoldstein ( 6423622 )

Translation: Rich people can rape you anytime so stop wearing panties.

What you've revealed is, we all need to fight for our safety, if we want fewer rapists.

It's surprising that no-one devotes computing time to punishing bad behaviour: It's why many corporations have built bad-faith web-scrapers.

Put a robots.txt file on the server saying "don't go here". Make the destination trigger a script that populates the directory/destination with randomly generated text files. Serve slop to greedy web-scrap
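
A rough Python sketch of that idea, under stated assumptions: the `/trap/` path, the `VOCAB` word list, and the `populate_trap` helper are hypothetical names invented for illustration, and wiring the helper to a web server so that a request to the disallowed path triggers it is left out.

```python
import random
from pathlib import Path

# A robots.txt that tells well-behaved crawlers to stay out of /trap/.
# The path name is an assumption for this example.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"

# Hypothetical vocabulary for the filler text.
VOCAB = ["lorem", "ipsum", "slop", "filler", "noise", "words", "data"]


def populate_trap(directory: Path, n_files: int = 3, n_words: int = 50) -> list:
    """Write n_files randomly named text files of random words into directory."""
    directory.mkdir(parents=True, exist_ok=True)
    created = []
    for _ in range(n_files):
        # Random 48-bit hex file name, so each visit adds fresh files.
        path = directory / f"{random.getrandbits(48):012x}.txt"
        path.write_text(
            " ".join(random.choice(VOCAB) for _ in range(n_words)),
            encoding="utf-8",
        )
        created.append(path)
    return created
```

Only a crawler that ignores the `Disallow` rule would ever reach the directory, so calling `populate_trap` from the handler for that path feeds generated files to exactly those clients.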

Endless Bad Data (Score:5, Funny)

by PPH ( 736903 )

Isn't Reddit enough?

Re: (Score:2)

by fahrbot-bot ( 874524 )

> Isn't Reddit enough?

We now also have the "improved" under new management [1]BLS [bls.gov] ... :-)

[1] https://www.bls.gov/

Re: (Score:2)

by martin-boundary ( 547041 )

Female AIs like the Bad Datas...

AI companies and their employees are ped0philes (Score:1)

by Anonymous Coward

AI companies keep abusing websites after said websites have removed their consent from being probed by AI companies.

It's a natural conclusion that AI companies, and all their employees, treat children the exact same way IRL.

I decided to see how long it would take ... to go (Score:2)

by Growlley ( 6732614 )

But it didn't; it just changed to meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/burgular

Concerned about bandwidth? Use a tarpit (Score:2)

by Nonesuch ( 90847 )

Back in the day, we used to run "tarpit" SMTP servers which looked like an open mail relay but ACK'd incoming packets only just barely fast enough to keep the remote client from timing out and giving up. The theory was that tying up spammer resources was a net good for the internet, as a sender busy trying to stuff messages through a tarpit was tied up waiting on your acknowledgement, reducing their impact on others.

Similarly, perhaps the right answer here is to limit the number of concurrent connections
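
A minimal Python sketch of a tarpit with a connection cap, using asyncio. It works at the application level, dripping a response one byte at a time, rather than slowing TCP acknowledgements as the old SMTP tarpits did. The constants `MAX_TARPITTED` and `DELAY_SECONDS`, the helper names, and the canned response are assumptions for illustration.

```python
import asyncio

MAX_TARPITTED = 4    # hypothetical cap on concurrent tarpitted connections
DELAY_SECONDS = 5.0  # hypothetical pause between bytes


async def drip(send, payload: bytes, delay: float) -> int:
    """Send payload one byte at a time, sleeping after each byte."""
    sent = 0
    for i in range(len(payload)):
        await send(payload[i:i + 1])
        sent += 1
        await asyncio.sleep(delay)
    return sent


def make_handler(limit: int = MAX_TARPITTED, delay: float = DELAY_SECONDS):
    """Build a connection callback for asyncio.start_server."""
    active = 0  # number of connections currently being tarpitted

    async def handle(reader, writer):
        nonlocal active
        if active >= limit:
            writer.close()  # over the cap: drop the connection immediately
            return
        active += 1
        try:
            async def send(chunk: bytes) -> None:
                writer.write(chunk)
                await writer.drain()

            # A tiny canned response; bandwidth stays near one byte per delay.
            await drip(send, b"HTTP/1.1 200 OK\r\n\r\nplease hold\r\n", delay)
        except ConnectionError:
            pass  # the client gave up
        finally:
            active -= 1
            writer.close()

    return handle
```

To try it, pass the handler to `asyncio.start_server(make_handler(), "127.0.0.1", 8080)` inside a coroutine and call `serve_forever()` on the returned server. The cap keeps the tarpit's own resource use bounded, while each held connection costs only a few bytes per minute.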
