News: 0179383556


Is OpenAI's Video-Generating Tool 'Sora' Scraping Unauthorized YouTube Clips? (msn.com)

(Saturday September 20, 2025 @11:34AM (EditorDavid) from the gently-down-the-streaming dept.)


"OpenAI's video generation tool, Sora, can create high-definition clips of just about anything you could ask for..." [1]reports the Washington Post.

"But OpenAI has not specified which videos it grabbed to make Sora, saying only that it combined 'publicly available and licensed data'..."

> With ChatGPT, OpenAI helped popularize the now-standard industry practice of building more capable AI tools by scraping [2]vast quantities of text from the web without consent. With Sora, [3]launched in December, OpenAI staff said they built a pioneering video generator by taking a similar approach. They developed ways to feed the system more online video — in more varied formats — including vertical videos and longer, higher-resolution clips... To explore what content OpenAI may have used, The Washington Post used Sora to create hundreds of videos that show it can closely mimic movies, TV shows and other content...

>

> In dozens of tests, The Post found that Sora can create clips that closely resemble Netflix shows such as "Wednesday"; popular video games like "Minecraft"; and beloved cartoon characters, as well as the animated logos for Warner Bros., DreamWorks and other Hollywood studios, movies and TV shows. The publicly available version of Sora can generate only 20-second clips, without audio. In most cases, the look-alike scenes were made by typing basic requests like "universal studios intro." The results also showed that Sora can create AI videos with the logos or watermarks that broadcasters and tech companies use to brand their video content, including those for the National Basketball Association, Chinese-owned social app TikTok and Amazon-owned streaming platform Twitch...

>

> Sora's ability to re-create specific imagery and brands suggests a version of the originals appeared in the tool's training data, AI researchers said. "The model is mimicking the training data. There's no magic," said Joanna Materzynska, a PhD researcher at Massachusetts Institute of Technology who has studied datasets used in AI. An AI tool's ability to reproduce proprietary content doesn't necessarily indicate that the original material was copied or obtained from its creators or owners. Content of all kinds is uploaded to video and social platforms, often without the consent of the copyright holder... Materzynska co-authored [4]a study last year that found more than 70 percent of public video datasets commonly used in AI research contained content scraped from YouTube.

Netflix and Twitch said they did not have a content partnership for training OpenAI, according to the article (which adds that OpenAI "has yet to face a copyright suit over the data used for Sora.")

Two key quotes from the article:

"Unauthorized scraping of YouTube content continues to be a violation of our Terms of Service." — YouTube spokesperson Jack Malon

"We train on publicly available data consistent with fair use and use industry-leading safeguards to avoid replicating the material they learn from." — OpenAI spokesperson Kayla Wood



[1] https://www.msn.com/en-us/news/technology/openai-won-t-say-whose-content-trained-its-video-tool-we-found-some-clues/ar-AA1MT8ll

[2] https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

[3] https://www.washingtonpost.com/technology/2024/12/09/sora-launch-openai-ai-video/

[4] https://arxiv.org/pdf/2412.17847?itid=lk_inline_enhanced-template



Terms of service (Score:2)

by topham ( 32406 )

Terms of service aren't much of a concern here. Honestly, completely irrelevant unless OpenAI opens up 30,000 streams at once.

Terms of service have little value in court when it comes to scraping of content that doesn't cause issues with the service itself. Contract violation with near zero repercussions.

Former CTO (Score:2)

by EltonJuan ( 10503148 )

I recall the Wall Street Journal interviewing OpenAI's former CTO, Mira Murati, about a year ago. When she was asked whether they scrape YouTube videos, she got suspiciously hesitant and responded like she was in a deposition coached by a lawyer, saying she wasn't sure.

Probably! (Score:2)

by forrie ( 695122 )

Sora, which I refer to as "Sore-a" as it's not always great at doing *what you ask for*, has certainly used a ton of data that wasn't "authorized" -- but I have taken note that there are specific filters it refuses to act on, such as popular cartoon or comic characters, etc. This likely means they have a big "DMCA list" somewhere.

I once asked for Miss Piggy to be dancing in a background, and it gave me something that looked like "Elf on a Shelf" instead LOL.

But, AI *has* to use reference material. That's h

Re: (Score:2)

by JaredOfEuropa ( 526365 )

Reform copyright, allow derivative works, abolish moral rights. What's the worst that could happen? Solves the problem of AI being "inspired" by existing works. Well, perhaps someone will write a crappy HP-inspired story about Tanya Grotter, a machine-gun wielding lady wizard who goes after bad Chechens (that is a real book, BTW). So what? The goal of copyright is cultural abundance, and that will (eventually) include AI generated works.

Look at Nosferatu, considered to be one of the great vampire mo
