News: 0176749223


BlueSky Proposes 'New Standard' for When Scraping Data for AI Training (techcrunch.com)

(Monday March 17, 2025 @03:34AM (EditorDavid) from the Robots.txt-2.0 dept.)


An anonymous reader shared [1]this article from TechCrunch:

> Social network Bluesky recently [2]published a proposal on GitHub outlining new options it could give users to indicate whether they want their posts and data to be scraped for things like generative AI training and public archiving.

>

> CEO Jay Graber [3]discussed the proposal earlier this week, while on-stage at South by Southwest, but it attracted fresh attention on Friday night, after she [4]posted about it on Bluesky. Some users reacted with alarm to the company's plans, which they saw as a reversal of Bluesky's previous insistence that it [5]won't sell user data to advertisers and [6]won't train AI on user posts.... Graber [7]replied that generative AI companies are "already scraping public data from across the web," including from Bluesky, since "everything on Bluesky is public like a website is public." So she said Bluesky is trying to create a "new standard" to govern that scraping, similar to the robots.txt file that websites use to communicate their permissions to web crawlers...

>

> If a user indicates that they don't want their data used to train generative AI, the proposal says, "Companies and research teams building AI training sets are expected to respect this intent when they see it, either when scraping websites, or doing bulk transfers using the protocol itself."
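
For readers unfamiliar with the robots.txt comparison, here is a minimal sketch of how a well-behaved crawler checks that file before fetching a page, using Python's standard urllib.robotparser. It illustrates the existing robots.txt convention only, not the format of Bluesky's proposal. The crawler name "ExampleAIBot", the rules, and the example.com URL are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. "ExampleAIBot" is an invented crawler
# name, used only to show a site opting one bot out while allowing others.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines directly, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The URL is an assumption for the example; any path on the site would do.
ai_allowed = may_fetch(ROBOTS_TXT, "ExampleAIBot", "https://example.com/post/1")
other_allowed = may_fetch(ROBOTS_TXT, "SomeOtherBot", "https://example.com/post/1")
print(ai_allowed, other_allowed)
```

As with the intent signal described in the proposal, nothing here is enforced technically: the check only matters if the crawler chooses to run it and honor the answer.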

Over on Threads someone had a [8]different wish for our AI-enabled future. "I want to be able to conversationally chat to my feed algorithm. To be able to explain to it the types of content I want to see, and what I don't want to see. I want this to be an ongoing conversation as it refines what it shows me, or my interests change."

"Yeah I want this too," [9]posted top Instagram/Threads executive Adam Mosseri, who said he'd talked about the idea with VC [10]Sam Lessin. "There's a ways to go before we can do this at scale, but I think it'll happen eventually."



[1] https://techcrunch.com/2025/03/15/bluesky-users-debate-plans-around-user-data-and-ai-training/

[2] https://github.com/bluesky-social/proposals/tree/main/0008-user-intents

[3] https://techcrunch.com/2025/03/10/bluesky-is-weighing-a-proposal-that-gives-users-consent-over-how-their-data-is-used-for-ai/

[4] https://bsky.app/profile/jay.bsky.team/post/3lkens3n4w223

[5] https://techcrunch.com/2025/02/03/what-is-bluesky-everything-to-know-about-the-x-competitor/

[6] https://techcrunch.com/2024/11/15/unlike-x-bluesky-says-it-wont-train-ai-on-your-posts/

[7] https://bsky.app/profile/jay.bsky.team/post/3lkeojfh3u223

[8] https://www.threads.net/@benbarry/post/DHCGeBbSpz2

[9] https://www.threads.net/@mosseri/post/DHCHPBeyfpQ

[10] https://www.forbes.com/profile/sam-lessin/



Royalties (Score:3, Insightful)

by Njovich ( 553857 )

The laws are clear, and that's why Google and the like are asking for an exemption from the law: [1]https://www.theverge.com/news/... [theverge.com]

We need a way for Google and others to pay fair royalties to people whose work their AI is trained on, both for the training itself and for the token generation.

Obviously that's a billion people and it won't amount to much per person, but it's ludicrous to just let them take other people's data and profit off of it without any form of compensation.

The argument that 'the Chinese are doing it so we need to do it as well' sounds like begging for a race to the bottom. The Chinese are allowing US films, software and games to be pirated without any action; does that mean we should do that as well?

[1] https://www.theverge.com/news/630079/openai-google-copyright-fair-use-exception

Good article (Score:2, Insightful)

by evanh ( 627108 )

And it's funny how the wording of the exemption requests is all future tense, as if the copyright infringements have not yet occurred.

Tech bros cooperating on a psyop (Score:2, Insightful)

by Pinky's Brain ( 1158667 )

There is zero legal justification for opt-in copyright protection. Either training on scraped data from the public internet is fair use and only a click-through license can protect the content, or it isn't fair use and the content needs no further protection.

BlueSky forcing users to opt in or de facto give up their rights is a betrayal of their users.

public != public domain (Score:5, Insightful)

by martin-boundary ( 547041 )

public != public domain.

But I repeat myself.

Public is public (Score:3)

by bradley13 ( 1118935 )

Do you know how human artists learn their craft? They spend a lot of time looking at work by other artists. Guess what: they don't buy copies or pay royalties for everything that they look at. I don't see why this should be different for AI.

If you put something onto the public internet, it is going to get looked at. If you don't want it looked at, either put it behind a paywall, or don't put it on the internet. It's that easy.
