Bluesky's Open API Means Anyone Can Scrape Your Data for AI Training. It's All Public (techcrunch.com)
- Reference: 0175575233
- News link: https://tech.slashdot.org/story/24/12/01/2125225/blueskys-open-api-means-anyone-can-scrape-your-data-for-ai-training-its-all-public
- Source link: https://techcrunch.com/2024/11/27/blueskys-open-api-means-anyone-can-scrape-your-data-for-ai-training/
"Shortly after the article's publication, the dataset was removed from Hugging Face," the article notes, with the scraper at Hugging Face [3]posting an apology. "While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake." But TechCrunch [4]noted the incident's real lesson: "Bluesky's open API means anyone can scrape your data for AI training," calling it a timely reminder that everything you post on Bluesky is public.
> Bluesky might not be [5]training AI systems on user content as other social networks [6]are doing, but there's little stopping third parties from doing so...
>
> Bluesky said that it's looking at ways to enable users to communicate their consent preferences externally, [but] the company [7]posted: "Bluesky won't be able to enforce this consent outside of our systems. It will be up to outside developers to respect these settings. We're having ongoing conversations with engineers & lawyers and we hope to have more updates to share on this shortly!"
Mashable notes Bluesky's response to 404Media: that Bluesky is like a website, and "Just as robots.txt files don't always prevent outside companies from crawling those sites, the same applies here."
So "While many commentators said that data collection should be opt in, others argued that Bluesky data is publicly available anyway and so the dataset is fair use," [8]according to SiliconRepublic.com.
[1] https://mashable.com/article/bluesky-ai-dataset-using-one-million-user-posts
[2] https://www.404media.co/someone-made-a-dataset-of-one-million-bluesky-posts-for-machine-learning-research/
[3] https://bsky.app/profile/danielvanstrien.bsky.social/post/3lbvih4luvk23?ref=404media.co
[4] https://techcrunch.com/2024/11/27/blueskys-open-api-means-anyone-can-scrape-your-data-for-ai-training/
[5] https://techcrunch.com/2024/11/15/unlike-x-bluesky-says-it-wont-train-ai-on-your-posts/
[6] https://techcrunch.com/2024/09/13/meta-reignites-plans-to-train-ai-using-uk-users-public-facebook-and-instagram-posts/
[7] https://bsky.app/profile/did:plc:z72i7hdynmk6r22z27h6tvur/post/3lbvgvbvl6c2c
[8] https://www.siliconrepublic.com/machines/hugging-face-data-bluesky-posts-data-privacy-ai-training
What's wrong with this? (Score:4, Insightful)
It's public data, available for anyone, including AI bots, to peruse and learn from at their will. All this hubbub about AI stealing my shit is just that -- shit. AI, just like anyone, should have the right to view/read/scan any publicly available data, including copyrighted data if available publicly, to learn and grow. What it should not be able to do, just like real people cannot do, is plagiarize that data by using word for word quotes without proper citations. Authors/creators of data have the right to go after plagiarizing AI, just as they do with plagiarizing humans, if they find their work used without proper credit.
Again, if your work is out there for others to freely access and learn from, then those who can learn from it include AI. If you don't like it, don't publicly publish your work.
Re: What's wrong with this? (Score:4, Informative)
This statement seems to imply just because you post it on the internet you relinquish all copyright rights to your content because it's available on a website. In the U.S. at least, this is legitimately not true.
I know the crypto bros are super upset that their NFTs didn't go anywhere and now they want to grift on AI, but this is patently not the case.
Each US poster on Bluesky patently owns their content whether they've asserted the copyright or not.
Re: (Score:1)
What's the point of owning the content if you freely license it out to be used? From [1]https://bsky.social/about/supp... [bsky.social]:
> By sharing User Content through Bluesky Social, you grant us permission to:
> Use User Content to develop, provide, and improve Bluesky Social, the AT Protocol, and any of our future offerings. For example, we can store and present User Content to other users in Bluesky Social. This allows us to show your posts in the Bluesky app to other users;
> Modify or otherwise utilize User Content in any media. This includes reproducing, preparing derivative works, distributing, performing, and displaying your User Content. For example, we can resize your posts to fit the Bluesky mobile or desktop app, or feature examples of User Content for promotional purposes; or
> Grant others the right to take the actions above. For example, we can grant content moderation tools access to User Content in order to monitor Bluesky Social;
[1] https://bsky.social/about/support/tos#user-content
Re: What's wrong with this? (Score:1)
This grants permission to Bluesky, but does not automatically give permission to anyone else. Most of these provisions are necessary for normal operation. I do wish these were not as broad though.
Re: What's wrong with this? (Score:2)
Yeah. The unfortunate legalese that exists in order to cover the concept of having an app to view the content is a dumb requirement, but has to be there in order to cover themselves.
But it still doesn't give grifters the "right" to train their AI models.
Re: (Score:2)
The permission is the same one that lets me copy and paste your comment (and anything else displayed on the internet) and do whatever I want with it. You put it out there and they took it even though you asked nicely not to.
Re: What's wrong with this? (Score:2)
TECHNICALLY speaking you aren't legally allowed to do that. I know it's generally not something people follow up on due to limitations of effort, cost, and time; but the point stands.
If someone were to catch an AI platform that grifted off their copyrighted materials, they could sue. That's just the facts in the U.S.
Re: (Score:2)
> If someone were to catch an AI platform that grifted off their copyrighted materials, they could sue. That's just the facts in the U.S.
You may be surprised to hear this, but a great deal of the world isn't the US. I get that you are saying it is technically illegal, but good luck proving I took your post (in combination with as many others as I could) and used it to make my own.
Re: (Score:2)
> You may be surprised to hear this but a great deal of the world isn't the US
Copyright law is mostly standardized between countries.
[1]Berne Convention [wikipedia.org]
[1] https://en.wikipedia.org/wiki/Berne_Convention
Re: (Score:1)
That depends on your interpretation of their legalese; is AI training "preparing derivative works"? Or is sharing the content with AI models "distributing"? IANAL... If training AI models is allowed under those terms, then Bluesky can make your data available to others to train AI models ("Grant others the right to take the actions above.").
Re: (Score:2)
> This statement seems to imply just because you post it on the internet you relinquish all copyright rights
No, it implies that reading isn't copying.
If a human reads a website, no one considers that copying. Incidental caching doesn't count.
If a computer reads a website, is that "copying"? So far, that has not been tested in court.
> crypto bros are super upset that their NFTs didn't go anywhere and now they want to grift on AI
NFT "crypto bros" and AI developers are different sets of people with little or no overlap.
Re: (Score:2)
The computer isn't "reading it" in anything approximating a human fashion. What is happening is a company is incorporating the content into a statistical model--they are creating something from the content.
Anthropomorphizing an AI model doesn't mean you can spout your "it's reading" BS and expect people to believe it.
Re: (Score:2)
"No, it implies that reading isn't copying."
That used to be true. With computerized data storage, it is not true any longer.
Re: (Score:2)
> With computerized data storage, it is not true any longer.
Human reading of websites causes caching in "computerized data storage". That is not considered copying.
If an AI learned by re-downloading the page each time it was scanned, without caching, would you drop your objections?
Re: (Score:2)
"Human reading of websites causes caching in "computerized data storage". That is not considered copying."
By whom? It certainly looks like copying to me.
Re: (Score:2)
> By whom?
By the courts and by law.
Specifically, by Section 512 of Title 17 of the United States Code.
Other countries have their own laws, but browsers are not illegal in any country, and all browsers use caching.
Re: (Score:2)
The law has been trying to stretch laws written when reading and copying were different things by creating arbitrary definitions to classify "copying" as "not copying". This is working about as well as you might expect.
Re: (Score:2)
It's considered copying if you read a book, then use sentences, phrases, characters, and to some extent concepts present in that book as part of your own work.
AI is not merely "reading" the text, it is ingesting the text explicitly for the purpose of puking it back out upon request. It doesn't even creatively add to the text it eats, just mixes it with other digested words in a grammatically correct order that, to a non-discerning user, appears to be a coherent thought.
That's copying. it also does this witho
Re: (Score:2)
The AI grifters and shills are the same people who were shilling blockchain stuff last year. Those aren't the same people as the developers, as the grifters and shills wouldn't know how to program a hello world never mind an AI model.
Re: (Score:2)
> This statement seems to imply just because you post it on the internet you relinquish all copyright rights to your content because it's available on a website. In the U.S. at least, this is legitimately not true.
It most demonstrably is.
Re: (Score:2)
If you don’t like the public nature of the internet, then don’t post on social media. It doesn’t matter what contract or belief in copyright you have, when you’ve put something out in public it’s there for all to see, whether it is by a bot or human.
Feel free to hire a lawyer, but I would suggest avoiding public speech first, to save on those bills.
Re:What's wrong with this? (Score:5, Interesting)
This is the only thing that makes sense. Social networks are for being social. That means putting the info out into the world. If I wanted to make sure nobody was reading what I was writing, with automated tools or manually, I would use E2E encrypted messages, probably using public key cryptography. And then probably not even the recipient would bother to read them :)
The only things people can publish to Bluesky are 1) short text messages, 2) very poor quality images*, and 3) links. Links are by definition to published content, very poor quality images have little value for AI training, and your short text messages are ostensibly intended for public consumption so there was never going to be any stopping people from using them for training no matter where you posted them. You don't need an API to scrape public comments.
* Not only does Bluesky crunch images up at least as badly as Faceboot but when I post images they are replaced by a black square. I'm told this happens with high-res images, but of the three images I've tried to post, only one of them was over XGA resolution. Maybe it's a result of something I'm doing with ublock origin? Irritating AF.
Re: (Score:2)
"Social" and "public" are related, but distinct concepts.
When my wife and I engage in intimate affairs in our bedroom, it is a social activity but it is also very VERY private.
I don't know anything about Bluesky other than what I've heard in the news the last few weeks. So you probably make a very valid point about the type of content that people post on Bluesky and whether or not that content is something that a reasonable person would feel protective about. But I do use Facebook to keep in touch with distant
Re: (Score:2)
Public data on a privately-owned website? Yeah, that's not public data.
Re: (Score:2)
> Public data on a privately-owned website? Yeah, that's not public data.
Sorry, if your data is viewable by the public, either by posting on the internet by you, allowing a public library to digitally loan out, or any other means, your data is available to the public to access and learn from. If you don't agree with that, don't publish or allow your data to be viewed by the public.
Re: (Score:3)
"others argued that Bluesky data is publicly available anyway and so the dataset is fair use"
OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
Re:What's wrong with this? (Score:5, Insightful)
> OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
You are freely permitted to record OTA signals regardless of copyright (see Sony Corp. of America v. Universal City Studios, Inc. 1984). Distributing is another matter (and it is also an open question of whether AI systems "distribute" the data they have analyzed).
Re: What's wrong with this? (Score:2)
"Sony" applied to personal, non-commercial use. As the ruling stated: "If the Betamax were used to make copies for a commercial or profitmaking purpose, such use would presumptively be unfair."
Re: (Score:2)
> OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
Try watching some tv shows and then making more like them because there are entire fucking industries based on that.
Re: (Score:2)
> "others argued that Bluesky data is publicly available anyway and so the dataset is fair use" OTA TV and radio are publicly available. Try recording and distributing it, see how far "fair use" gets you.
And it is fair use. Anyone can watch/listen to TV/radio/online video/podcast and use the information learned to write/create more data, as long as they don't directly copy verbatim the data. If you find an AI spewing out your data word for word, you have every right to sue the company in control of that AI, just as you would have every right to sue an individual or corporation who directly copied your data and passed it on without crediting you.
I really don't understand how people can't grasp this. Our e
OK. (Score:1)
Just fill it up with nonsense.
Re: (Score:2)
I thought the problem was it already is.
Re: (Score:3, Insightful)
Trumpers...Trump isn't REALLY a criminal, all those illegal things he did wrong are FINE, because...well, he is Trump! That is where the true Trump Derangement is, giving him a pass for being a con artist and criminal. You know that if you lie on your taxes, that CAN get you thrown in prison, don't you?
Re: (Score:2)
Yep. It's pure projection.
They paint onto you what they are guilty of.
Re: (Score:2)
I don't think we need to bring in your ability to believe el Bunko, the Artist. I have to admit that is an amazing ability, do you have any others?
Re: (Score:2)
Jesus man, he ain't gonna fuck you. Well, not the way you want anyway.
Epic (Score:2)
Just epic.
Bet your ass AI startups are already doing it. (Score:2)
So dumping the Bluesky data is 1) Free to do, and 2) Legally and morally ambiguous due to intertwined licenses etc.
The true question is why wouldn't they?
Re: (Score:1)
Maybe Bluesky says it will never train generative AI on its users' data so that users know Bluesky doesn't have an incentive to make the platform more suitable for that. Of course, because it is in the open, it is difficult to prevent 3rd parties from scraping for whatever reason (legal or not). If everybody clearly knows about all that, they can act as they see fit, like not posting stuff on Bluesky that should remain more private.
Flashback (Score:2)
We were talking about this in the comments a month ago.
[1]https://slashdot.org/comments.... [slashdot.org]
[1] https://slashdot.org/comments.pl?sid=23521835&cid=64949329
Somebody is going to get your data (Score:2)
If you go to Twitter, Elon Musk gets it; if you go to Facebook, Mark Zuckerberg gets it. I'd rather everyone get it.
Your data is going to get used to train AI to replace you. That's just a fact of modern life. The real problem is we never get a piece of the action.
Re: (Score:2)
The one thing that I haven't pointed out to the Bluesky crowd: They're having a discussion with the person who made the dataset. Rather than pushing the guy to block the dataset (which anybody else can secretly make anyway), it's an opportunity to have some grass-roots discussions about ethical use, like "Hey, it's OK, but please anonymize user names, etc."
No casual user without a legal budget has a chance at having a discussion with Meta, OpenAI, Anthropic or Google about their data collection procedures.
You know what? I wouldn't mind, if not... (Score:3)
I wouldn't so much mind all my data being sucked up by the AI training / aggregation routines if not for the fact that they are "owned" by some of the greediest, most self-centered assholes to have ever crawled up out of the slime of the rest of humanity to positions of power. I'd happily feed my manuscripts, such as they are, to an open source / truly free AI, meant to be a public good. But all of these fucking things right now are owned by massive capitalist institutions with mouthpieces that make the Gilded Age masters look like kind-hearted liberal-oriented humanitarians. Yes, I get that it takes money to run these "eats more power per second than entire neighborhoods use in a year" systems, but what good is it doing other than continuing to pull wealth from the entirety of society in order to continue to feed those who have plenty? If AI is going to replace us all, what's the benefit to those of us not already in the owner class? Like it or not, society is built on the shoulders of the lower and middle classes. If the owner class manages to find a way to not need the lower and middle classes through AI or any other means, what's the end-game for us?
The small price of interoperability (Score:3)
Whining that the data is accessible is something I expect from movie execs. Now techies too?
Oh noes, we have access to the data, because it's not locked down in a secure enclave! (Data that 100% of the users deliberately uploaded so that it [1]could be read [wikipedia.org] by others.)
[1] https://en.wikipedia.org/wiki/AT_Protocol
Yep (Score:2)
And your website if you have one. You can bet someone's ignoring your robots file. And google and X, microsoft and all the Meta and your phone, good god, y'all. Every app. Oh and email, never been private.
If I want to find out I can. If they do, they can. If you do, you can, hire a PI. I'm a little more than over it. This is fear mongering, if you weren't aware, here it is. If you're just now afraid. Sorry kid, it gets worse. The heart grows cold.
Never fear! It's fine, it's fine. (Score:2)
Dr. Kleiner says the huggy face humper has been fully debeaked.
Transferring guilt (Score:2)
Well, anything that has a public API can be used to train on your data. Bluesky is actually cool for being open. How on earth is that related to bad actors training on public data?
Re: (Score:2)
I was thinking the same... If you publish/post stuff to be read by anyone, then that would be public domain information. I don't see how this is different than posting to Shitter. So the only difference between "reading" and scraping is that scraping is automated and large scale.
Nice. Give out (Score:2)
Please also give out publicly the stats on any organization using the API extensively.
Re: Nice. Give out (Score:2)
So, the blue echo chamber has a hole in it. Will that reduce the level of the echoes?
Re: (Score:2)
It's not a hole. It's a window.
And it should be transparent in both directions.