Reddit Sues Perplexity For Scraping Data To Train AI System (reuters.com)
- Reference: 0179851714
- News link: https://yro.slashdot.org/story/25/10/22/1743250/reddit-sues-perplexity-for-scraping-data-to-train-ai-system
- Source link: https://www.reuters.com/world/reddit-sues-perplexity-scraping-data-train-ai-system-2025-10-22/
> Social media platform Reddit [1]sued AI startup Perplexity in New York federal court on Wednesday, accusing it and three other companies of unlawfully scraping its data to train Perplexity's AI-based search engine. Reddit said in the complaint that the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system.
[1] https://www.reuters.com/world/reddit-sues-perplexity-scraping-data-train-ai-system-2025-10-22/
at least three lies (Score:2)
"...the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system."
There are at least 3 lies in that one sentence alone.
Re: (Score:3)
> "...the data-scraping companies circumvented its data protection measures in order to steal data that Perplexity "desperately needs" to power its "answer engine" system."
Really, what would those be?
Arguably they are in breach of the CFAA. Reddit's robots.txt file:
# Welcome to Reddit's robots.txt
# Reddit believes in an open internet, but not the misuse of public content.
# See [1]https://support.reddithelp.com... [reddithelp.com] Reddit's Public Content Policy for access and use restrictions to Reddit content.
# See [2]https://www.reddit.com/r/reddi... [reddit.com] for details on how Reddit continues to support research and non-commercial use.
# policy: [3]https://support.reddithelp.com... [reddithelp.com]
User-agent: *
Disallow: /
[1] https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
[2] https://www.reddit.com/r/reddit4researchers/
[3] https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
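For what it's worth, here is a minimal sketch of how a crawler that chooses to honor robots.txt would read those two directives, using Python's standard urllib.robotparser. The bot name "ExampleBot", the helper is_allowed, and the sample URL are made up for illustration; nothing here says what Perplexity or the other defendants actually run.

```python
from urllib.robotparser import RobotFileParser

# The two directives quoted above (the comment lines are omitted here).
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a robots.txt-honoring client may fetch the URL.

    Hypothetical helper for illustration: it parses the robots.txt text
    directly instead of downloading it from the site.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is an invented user agent; "*" matches every agent
    # that has no more specific entry of its own.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://www.reddit.com/r/python/"))
```

The check is purely advisory: the file only tells a cooperating client what the site asks for, and nothing in the protocol enforces it.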
Re: (Score:1)
User-agent: *
Disallow: /
Hmmm. Based on that, anyone using any kind of a web browser shouldn't be viewing their web pages.
Which kind of takes the point out of having a web page...
Re: (Score:1)
Robots.txt isn't a legally binding document.
I still don't see how there's a basis to complain. (Score:2)
Something read public pages on reddit. So do I. What's the difference?
Re: (Score:2)
Generally courts have said that just because you put something out on the internet doesn't mean you give up copyright to it even if you make it publicly accessible.
Reddit is claiming copyright to the posts made by their users. Or at least that's the likely legal justification for this.
There are also a whole bunch of weird business laws we never think about that exist to protect businesses from other businesses. Basically stuff lobbyists put in place to protect the interests of their employers.
T
Re: (Score:2)
Listen, you AC, lobbing insults at someone I'm trying to talk to doesn't benefit anyone. If you want to sit there and waste our time, have the stones to put a name on it.
Re: (Score:3)
Normally I would agree with you that the AC is a troll. But then so is rsilvergun, who is constantly pushing a communist agenda. It doesn't really matter what the topic is, he'll find a way to 1. turn it into a negative 2. tie that negative into being a result of capitalism's failure. I do disagree with the AC on some points though. For one, he's not a Chinese living in Singapore. Guaranteed he would have been deported from Singapore if he set foot. He's not a Chinese agent either, they wouldn't be so obvio
Re: (Score:2)
Yeah, but how is a copyright being violated? If Reddit can claim they have a copyright on user posts (which seems silly and like they'd screw themselves out of sec.230), how are they going to show those copyrights were violated? Is Perplexity's AI presenting exact copies as its own work, or is it producing inherently non-infringing summaries?
I suppose they could make it about server resources, especially if they can show that being hammered by LLMs is degrading the service. "You read this the wrong way
Re: (Score:2)
The difference depends on context, of course.
Generally speaking there are several cases to consider:
(1) Site requires agreeing on terms of service before browser can access content. In this case, scraping is a clear violation.
(2) Site terms of service forbid scraping content, but human visitors can view content and ...
(2a) site takes technical measures to exclude bots. In this case scraping is a no-no, but for a different reason: it violates the Computer Fraud and Abuse Act.
(2b) site takes no technical m
Re: (Score:2)
It's easy enough to look at Reddit's robots.txt file:
> # Welcome to Reddit's robots.txt
> # Reddit believes in an open internet, but not the misuse of public content.
> # See [1]https://support.reddithelp.com... [reddithelp.com] Reddit's Public Content Policy for access and use restrictions to Reddit content.
> # See [2]https://www.reddit.com/r/reddi... [reddit.com] for details on how Reddit continues to support research and non-commercial use.
> # policy: [3]https://support.reddithelp.com... [reddithelp.com]
> User-agent: *
> Disallow: /
[1] https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
[2] https://www.reddit.com/r/reddit4researchers/
[3] https://support.reddithelp.com/hc/en-us/articles/26410290525844-Public-Content-Policy
Re: (Score:2)
I am increasingly seeing the argument from this side.
Perplexity and its ilk are just a new kind of web browser that is acting as your agent to pull content from publicly available web sites.
Why isn't Reddit going after web browser makers?
I am thinking that the next escalation in this fight is simply going to be a plugin for your local web browser that your AI chatbot can proxy requests through, or just directly access content through a built-in browser.
Re: (Score:2)
An additional level of this is that Reddit is itself crowdsourced. They didn't pay anybody to write their content. I'm more sympathetic to something like the Encyclopedia Britannica or NYT or the movie studios. (Although even there I agree it's still debatable).
Does Reddit themselves "own" that data? (Score:1)
Since all of that data comes entirely from posts by users, can reddit itself claim to own any of the information that they have on their website (outside of whatever stupid TOS crap they have that says whatever you post is theirs)? Since the public are by and large the originators of all of their content, it's not like they put in the work for that content that Perplexity and others are scraping. The bigger issue it seems like is the lack of attribution, with Perplexity and others frequently not citing whe
Re: (Score:2)
> Can't they just build the A.I.s to cite their sources whenever it outputs something that has a definite source, or are we past all that since they've already used all this content as training data already.
If a Reddit post amounts to a human summary of a StackOverflow discussion, which itself is a compilation from forum posts on a discussion board and a Wordpress blogger, who got *their* information from man pages and error outputs...who do you cite? Each of them validates the others in order to minimize the amount of "SEO Blogger Spam" that also ended up in the meat grinder somewhere.
The problem with the meat grinder is that the whole point is essentially to make it impossible to trace sources to the point
When is slashdot gonna sue? (Score:2)
When is slashdot gonna sue? We can't achieve superintelligence without scraping slashdot*.
*shove all the comments through an inverse function.
Re: (Score:2)
> When is slashdot gonna sue?
I figure the new Cloudflare "I am not a robot" challenge has something to do with that. The first time I have to click the pictures with traffic lights, I'm done here.
DMCA part of complaint looks weak (Score:2)
Reddit might have a good complaint about terms of service or CFAA or something. I don't know. But at least one part of their complaint looks like garbage:
> 7. Congress has enacted laws to prevent exactly what Defendants are doing:
> circumventing or bypassing technological measures that effectively control access to copyrighted
> works. See Digital Millennium Copyright Act, 17 U.S.C. 1201, et seq. Each of the Defendants
> in this action is profiting by evading technological control measures to access Reddit data it
> kn
Re: (Score:2)
If they have the technology to descramble the average reddit post, they're already sitting on trillion-dollar AGI tech.
What makes that illegal? (Score:2)
If Reddit provides something on the internet, people can access it. Perplexity doesn't really train either, but processes search results to create an answer that is *not* in the model itself.
Yeah, Reddit's stupid "network security" tries to block VPN users, but if they are unable to block Perplexity, it's not Perplexity's problem, is it? They can make Reddit login only, then someone has to accept ToS, but as long as it is freely available as long as your IP is not on a blacklist, it's just the open web.
Why? (Score:2)
Most of the Reddit AI slop 'answers' it gives me in searches are wrong anyway. Let them have their poison data.
Re: (Score:2)
It's not about the quality of the content; it's about whether the content follows natural human language patterns.
The AI is just being trained to follow those patterns. It's not about generating answers that are accurate; it's about generating something that looks like a human being might have written it.
Also, sadly, because Google is completely overrun by advertisements and low-quality bot traffic, reddit is the best place to get accurate information outside of a handful of extremely specific specialty
Re: (Score:2)
LLMs Can Get "Brain Rot"!
[1]https://arxiv.org/abs/2510.139... [arxiv.org]
They use data from X, but I think Reddit comments won't be much different. Outgoing links are a different topic, though.
[1] https://arxiv.org/abs/2510.13928