Project Analyzing Human Language Usage Shuts Down Because 'Generative AI Has Polluted the Data' (404media.co)

(Friday September 20, 2024 @09:13PM (msmash) from the all-good-things-end dept.)

The creator of an open source project that scraped the internet to determine the ever-changing popularity of different words in human language usage says that they are [1]sunsetting the project because generative AI spam has poisoned the internet to a level where the project no longer has any utility. 404 Media:

> Wordfreq is a program that tracked the ever-changing ways people used more than 40 different languages by analyzing millions of sources across Wikipedia, movie and TV subtitles, news articles, books, websites, Twitter, and Reddit. The system could be used to analyze changing language habits as slang and popular culture changed and language evolved, and was a resource for academics who study such things. In a note on the project's GitHub, creator Robyn Speer wrote that the project "will not be updated anymore."

> "Generative AI has polluted the data," she wrote. "I don't think anyone has reliable information about post-2021 language usage by humans." She said that open web scraping was an important part of the project's data sources and "now the web at large is full of slop generated by large language models, written by no one to communicate nothing. Including this slop in the data skews the word frequencies." While there has always been spam on the internet and in the datasets that Wordfreq used, "it was manageable and often identifiable. Large language models generate text that masquerades as real language with intention behind it, even though there is none, and their output crops up everywhere," she wrote.

[1] https://www.404media.co/project-analyzing-human-language-usage-shuts-down-because-generative-ai-has-polluted-the-data/
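
As a rough illustration of the skew described in the quote above, here is a minimal Python sketch. It assumes two tiny made-up corpora and a plain whitespace tokenizer; the function name `relative_frequencies` and the sample sentences are hypothetical, and this is not how wordfreq itself processes its sources.

```python
from collections import Counter


def relative_frequencies(texts):
    """Return each word's share of all tokens in the given texts."""
    counts = Counter()
    for text in texts:
        # Naive tokenization: lowercase and split on whitespace.
        counts.update(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


# Hypothetical toy corpora; real data would be far larger and noisier.
human_texts = [
    "we looked into the data last week",
    "the data looked fine to me",
]
generated_texts = [
    "let us delve into the rich tapestry of data",
    "we delve into the data to delve deeper",
]

before = relative_frequencies(human_texts)
after = relative_frequencies(human_texts + generated_texts)

# "delve" is absent from the human-only sample but appears once the
# generated text is mixed in, which also shifts every other word's share.
print(before.get("delve", 0.0), after["delve"])
```

Because the frequencies are shares of a fixed total, adding text that overuses a few words changes the measured share of every word, not only the overused ones.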

Internet != real life (Score:2)

by RightwingNutjob ( 1302813 )

And never did.

Things that get deliberately posted for all to see are by definition spoken in a different register than casual speech.

Back when the written word had a bigger barrier to entry between thought and printing press, this was understood. But the wrong lesson for why seems to have made people believe that blog posts and tweets are completely interchangeable with spoken interactions.

The above statement was not AI generated, and it probably contains the gist of what I would have said out loud in casua

Re: (Score:3)

by skam240 ( 789197 )

The Internet isn't real life? Good Lord! What have we all been using to post on Slashdot then!?

Re: (Score:1)

by destined2fail1990 ( 10502474 )

You're clearly not a gamer or of OG Slashdot. Anything that happens in games (which gets expanded to the internet, because they're online games) = game world, virtual world, whatever. Anything outside of games = IRL or in real life. Get with the program.

Re: (Score:2)

by skam240 ( 789197 )

> You're clearly not a gamer or of OG Slashdot

Why is someone with a user number approaching infinity lecturing me about not being OG Slashdot? And yes, I play video games. Of course that has nothing to do with this though.

> Anything that happens in games (which gets expanded to the internet, because they're online games) = game world, virtual world, whatever. Anything outside of games = IRL or in real life. Get with the program.

The point of my post is that people's use of language on the internet is in fact very real. One can see it, read it, and comprehend it. There is in fact nothing "not real" about it.

Re: Internet != real life (Score:1)

by Albinoman ( 584294 )

Even without AI polluting the data, it was a waste of time. No one is ever gonna go back through their data.

Re: (Score:2)

by test321 ( 8891681 )

These people don't think written and spoken interactions are interchangeable. Spoken and written languages are known to differ in both grammar and lexicon. Some writers are known to follow more "spoken" conventions than others. Maybe the difference isn't big in all languages, though. It would certainly be worthwhile to monitor the frequency of spoken interaction but it's clearly much more difficult to implement.

Re: Internet != real life (Score:2)

by RightwingNutjob ( 1302813 )

Absolutely no writer who writes books other people choose to pick up writes the way people talk. Any transcription of real extemporaneous spoken language is fucking unreadable. It wanders, it doubles back, it contradicts itself. It relies on context communicated with body language or other visual aids.

Poochie returns to his home planet (Score:2)

by VampireByte ( 447578 )

Generative AI has polluted Poochie's reason for visiting Earth, so he is going back to his home planet where he is needed. Nothing of value was lost.

That is unfortunate (Score:5, Insightful)

by gweihir ( 88907 )

It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.

Re: (Score:3)

by quonset ( 4839537 )

> It also shows another thing: Internet language data is fast becoming unusable for model training due to model collapse.

Just wait until the same thing happens when using paintings [1]as a reference [imgur.com].

[1] https://i.imgur.com/zLWNGwy.jpeg

Re: (Score:2)

by martin-boundary ( 547041 )

Feel free to believe whatever you want, it doesn't change the science of model collapse. I do want to address one point you make, which in my view is a fundamental misunderstanding you have about AI training (in the spirit of being helpful).

A document gets read, it gets split into tokens. The tokens are trained on and used, together with their one on one relationships to nearby tokens. The disclaimer you put in, that the content is AI generated, is tokenized too, but this has no appreciable effect on the r

Re: (Score:2)

by gnasher719 ( 869701 )

In other words, LLMs are now starting to learn to speak like LLMs and not like humans. I should take an LLM and modify it so that every sentence it speaks starts with "LLM". Soon the other LLMs will learn and their sentences start with "LLM" as well. And then we can filter out any sentence that starts with "LLM".

Re: (Score:2)

by PPH ( 736903 )

> LLMs are now starting to learn to speak like LLMs and not like humans.

And we can call it Lbonics.

Natural monopolies (Score:2)

by rsilvergun ( 571051 )

Anyone who got in on the ground floor has a massive competitive advantage that is basically impossible to overcome. You had mountains of free data to train your models on. Any potential competitors won't have that data and if they do manage to get a hold of it it'll be so polluted as to be worthless.

We should probably be doing something about this given that Wall Street and the people who run our economy are planning to completely transform our civilization with this tech on a scale we haven't seen sinc

Re: (Score:2)

by gweihir ( 88907 )

Well, yes. And no. Because their models still suck and age.

Re: (Score:2)

by ArmoredDragon ( 3450605 )

> I don't think we are equipped socially or politically for what's coming.

Who is 'we'? Do you have a turd in your pocket?

Re: (Score:2)

by Rick Schumann ( 4662797 )

Pretty much. Soon enough it'll be the 'AI' equivalent of copying a VHS tape too many times. They'll be outputting nonsensical gibberish that's geometrically more nonsensical than it already is.

The LLM says: (Score:3)

by chuckugly ( 2030942 )

Ah, Wordfreq, the digital linguist that bravely waded through millions of tweets, memes, and Reddit threads to track the wild and often nonsensical evolution of language. After years of tirelessly documenting humanity's descent into "yeets" and "sus," it seems even Wordfreq has decided to retire. I mean, who can blame it? One more analysis of TikTok slang and it probably would’ve needed therapy. Robyn Speer’s decision to cease updates is less about the death of a project and more about an act of mercy—for the machine. After all, there's only so many times an algorithm can process "bae," "on fleek," and "smol" before it begs for a permanent shutdown.

Re: The LLM says: (Score:3)

by EvilSS ( 557649 )

Ok gramps, finish watching NCIS then time for bed.

Re: (Score:2)

by Waccoon ( 1186667 )

Groovy peeps, daddio.

original source (Score:3)

by znrt ( 2424692 )

[1]https://github.com/rspeer/word... [github.com]

[1] https://github.com/rspeer/wordfreq/blob/master/SUNSET.md

This saddens me (Score:5, Insightful)

by PuddleBoy ( 544111 )

I have a language degree and studied in the US and abroad. While I don't work in the field (I'm an engineer now), I still get a kick out of hearing changes and new words and uses in everyday language. I watch some television shows from various countries just so that I can hear what has changed in the (many) years since I studied there. (I am always amazed at how deeply English has penetrated western European languages, especially German)

The idea that generative AI is filling the language pool with 'endless, mindless algorithmically-derived glop' (my words) means we, as humans, have handed over some control of language evolution to machines. The more we allow that, the less this quintessentially-human thing (highly complex language) that we have refined over millennia is ours. Much of the human-derived, haphazard nature of evolving language (mispronunciations, misunderstandings, clever portmanteaus) may be gently pushed aside by the insistent wave of machine-derived language preference...

I'll stop now - it's obviously a Friday afternoon...

Re: (Score:1)

by sound+vision ( 884283 )

Only if people read it, PoodleBuoy. Most of the SEO-style spam isn't intended to be read by humans. A lot of the spam going to social media is, but your average social media user is probably still reading at least 50% human-generated text. Plenty enough for them to learn and use new words.

As human language continues to evolve, the AIs' language may not (easily) due to model collapse. Speaking in "2021 English" may start to flag you as a bot further down the road.

Re: (Score:3)

by PPH ( 736903 )

> Speaking in "2021 English" may start to flag you as a bot

Then I shan't engage in such folly.

Re: (Score:2)

by dsgrntlxmply ( 610492 )

Please sir, may I have some more madglop?

Re: (Score:2)

by Fly Swatter ( 30498 )

I was in a store a few weeks ago, and a child maybe 3 or 4 years old was talking to her mother - she had no emotion in her voice and strung the words together like an internet chatbot.

I feel old saying it, but the times are changing. We are losing our personality; or at least not raising a new generation with one.

Wow (Score:1, Insightful)

by The Cat ( 19816 )

> Wikipedia, Twitter, and Reddit

The Internet ain't what it used to be.

Re: (Score:2)

by ArchieBunker ( 132337 )

Oh yeah like the alt.* hierarchy of usenet was the Library of Alexandria.

bullshit (Score:1)

by Anonymous Coward

what a load of crap, the internet has been full of bots long before AI came along, Is he claiming all that shit didn't pollute the project? sounds like just an easy excuse.

Re: (Score:1)

by destined2fail1990 ( 10502474 )

Yeah, article spinners have been around for some time now just "rewriting" articles from other websites. This is not a new concept. Although social media is now often times more AI than it was previously. I know I use AI sometimes to write my posts, mainly because it auto-adds the emojis.

I'd still like to delve into this realm of data... (Score:2)

by EmoryM ( 2726097 )

It's important to consider how LLMs have polluted the tapestry of internet language so I hope they embark on a comprehensive journey to document this vital landscape.

Guess you'll just have to TALK to people. (Score:2)

by Eunomion ( 8640039 )

"What a fuckin' nightmayuh!" /Marisa Tomei

Not just polluted, took over (Score:2)

by sinij ( 911942 )

We created linguistic AI in the form of LLMs; it can do nothing but generate speech. It took over language and we no longer can know what is language. Pray that we don't create AI that can reason, as we won't know what is reason after that.

Re: (Score:2)

by Rick Schumann ( 4662797 )

We're not in any danger of creating machines that can truly reason. We'll have practical, large-scale commercial fusion reactors a long time before that ever happens.

Pff who needs Wordfreq (Score:1)

by Rosco P. Coltrane ( 209368 )

"Hey Google! Give me the 500 most used English words by decreasing order of usage."

See? It's much easier and it's guaranteed to be correct.

Also, spambots (Score:2)

by Misagon ( 1135 )

Internet forums also have a huge problem with spambots that have reposted older posts scraped from forums (often Reddit), in attempts to appear like legitimate users.

The original posts could be anywhere from a day old to ten years old, and thus skew any analysis of language trends.

If a forum is left unchecked, the amount of spamposts could become quite high. Even on an actively moderated forum with vigilant users that report them, spamposts sometimes evade detection.

Usually a spampost is from a new user account

"The internet" was always a bad source (Score:2)

by Tony Isaac ( 1301187 )

The internet has been polluted by automated website generating systems for at least a decade. SEO companies take customer money and use it to spam the web with their trash. The English in this spam is often broken, roughly translated from Chinese or some other Asian language, skewing any language analysis that might come from the internet. AI-generated content is just the next step of this evolution.
