Generative AI Systems Miss Vast Bodies of Human Knowledge, Study Finds (aeon.co)

(Tuesday October 14, 2025 @05:21PM (msmash) from the lost-in-translation dept.)

Reference: 0179780120
News link: https://slashdot.org/story/25/10/14/155258/generative-ai-systems-miss-vast-bodies-of-human-knowledge-study-finds
Source link: https://aeon.co/essays/generative-ai-has-access-to-a-small-slice-of-human-knowledge

Generative AI models trained on internet data [1]lack exposure to vast domains of human knowledge that remain undigitized or underrepresented online. English dominates Common Crawl with 44% of content. Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population. Tamil represents 0.04% despite 86 million speakers worldwide. Approximately 97% of the world's languages are classified as "low-resource" in computing.

A 2020 study found 88% of languages face such severe neglect in AI technologies that bringing them up to speed would require herculean efforts. Research on medicinal plants in North America, northwest Amazonia and New Guinea found more than 75% of 12,495 distinct uses of plant species were unique to just one local language. Large language models amplify dominant patterns through what researchers call "mode amplification." The phenomenon narrows the scope of accessible knowledge as AI-generated content increasingly fills the internet and becomes training data for subsequent models.

[1] https://aeon.co/essays/generative-ai-has-access-to-a-small-slice-of-human-knowledge

OK, so put it on the internet (Score:3, Insightful)

by drinkypoo ( 153816 )

It's not a surprise if human knowledge which is kept secret doesn't show up in LLMs. And today, not putting any knowledge on the internet is effectively that. The reason all our nerd shit shows up in LLM data is that we made it freely available to all on the open internet.

Re: (Score:2)

by Z00L00K ( 682162 )

Secrecy, copyright protection and being written in obscure languages are probably the most common reasons for knowledge not contributing to AI generations.

Re: (Score:2)

by allo ( 1728082 )

DRM is killing our digital legacy. There will be no abandonware sites for today's games, as the publishers make sure you won't be able to run them.

Re: What? Didn't just 12 days ago /. proclaim... (Score:2)

by sziring ( 2245650 )

For the internet scrapers. Don't expect anyone to get off their ass and digitize something manually. You can get paper cuts in the real world.

Not on the internet != "kept secret" (Score:1)

by davidwr ( 791652 )

Many low-press-run or one-off publications never made it into Google Books' library-vacuum effort of the early 2000s.

Ditto the countless archives in courthouses/governments, schools, religious institutions, companies, and elsewhere that haven't been fully digitized yet.

It's not like these are being deliberately kept secret as much as they are obscure or the maintainers don't have the funds to digitize them.

If you have time or money to donate to your local historical society or other not-yet-digitized-archive-

True, but BS (Score:4, Insightful)

by rta ( 559125 )

from TFA :

> Over time, epistemological approaches rooted in Western traditions have come to be seen as objective and universal, rather than culturally situated or historically contingent. This has normalised Western knowledge as the standard, obscuring the specific historical and political forces that enabled its rise.

In a basic sense, this is true, but in general it is used to bamboozle people into the (incorrect) "math is racist" mindset, which then leads to much handwringing and government spending on dumb reports.

If you line up Christianity vs Islam vs Hinduism etc., yes, you have different epistemological and metaphysical approaches.

But physically based science and even modern psychology and economics are universal and generally hostile to (or at least orthogonal to) all the classic views.

Articles like this really underplay the degree to which the past is a foreign land for all of us.

Re:True, but BS (Score:4, Insightful)

by hdyoung ( 5182939 )

That quote is a good example of something that sounds incredibly intellectual and worth thinking about, but is still quite wrong on several levels if you dissect it.

To translate this liberal-professor-garble into plain speaking:

Apparently since 1) western thought is dominant across the world right now, 2) that very dominance prevents us from thinking about the non-western history that gave rise to it? Um, no. Statement 1 does NOT lead to statement 2. Just because I'm at the top of the dog pile, that doesn't mean I'm necessarily blind to how I got there.

Getting a bit further in the weeds, the writer questions the universality of western thought. That's the sort of self-loathing that'll trigger just about anyone outside a liberal arts department. I'm not denying that western ideology is full of inconsistencies and hypocrisy (such as the US founding fathers making sure that everyone is free, excluding women and brown people). But western thought has almost always aspired to be better and, yes, universal. Do western countries subvert and twist it to fit their own agendas? Sure. But the ideas of the enlightenment were pretty close to universal, which is quite different than a lot of non-western ways of thinking. Most of those amount to some form of "my race/religion/ethnicity/city/country/village is the chosen one because *insert nonrational reason here* thus we should rule and everyone else is lower on the hierarchy."

I'll take the "western epistemological approach" any day of the week, thank you very much.

Re: True, but BS (Score:1)

by LindleyF ( 9395567 )

I hope you appreciate the irony of calling out my-way-is-best while declaring western thought best.

Re: (Score:1)

by Wheres the kaboom ( 10344974 )

> I hope you appreciate the irony of calling out my-way-is-best while declaring western thought best.

That is essentially straw-manning. Hdyoung made no claim that western thought as a whole is best. His only claim is that the aspirational parts of western thought are worth pursuing, particularly as they deliberately make room for objectively assessing other philosophies without resorting to violence or polemics.

Presumably he’s including western ideals like blind justice, equality of opportunity, judging by merit instead of identity, valuing the sanctity of individual life, recognizing every man is a

Re: True, but BS (Score:2)

by LindleyF ( 9395567 )

T'was a little joke sair. (Extremely little, Ensign.)

Just BS (Score:2)

by Roger W Moore ( 538166 )

> In a basic sense, this is true

Not really, it's just wrong. The one approach that came from Western cultures is the scientific method, which is both objective (to the maximum extent any human method has yet achieved) and universal, which is why there is no such thing as Chinese, Canadian or Indian etc. science; there is just science, because it is universal. As you alluded to, the scientific method has often (including now to some degree) found itself at odds with western culture, so I would argue that the scientific method is a product of western culture but not part of it.

Arguing that it is "culturally situated" is nonsense. While science has definitely impacted western culture, it has also impacted every culture around the planet, and today there are scientists on every continent from a myriad of different cultures. Your culture may impact which questions you want to answer with science but, if you are doing it correctly, it will not affect the knowledge you find, and that's why it is both universal and acultural. Indeed, the universal nature of science means it is one of the few things that can bring people of different cultures to work together towards a common goal: to understand the objective reality that we all share.

researchers call "mode amplification" (Score:2)

by oldgraybeard ( 2939809 )

Which also seems to apply to the Internet's biased, crazy, obsolete and just plain incorrect information. Which is regurgitated and used for training. Which makes one wonder if LLMs can ever function dependably.

Re: (Score:2)

by sabbede ( 2678435 )

I'm wondering what sort of researchers call it that. I did a bit of searching, found a lot of references regarding solid state electronics, optics, and this article.

Re: (Score:2)

by HiThere ( 15173 )

A point, but (and this is admittedly a quibble) I wouldn't call languages a "vast body of human knowledge". The data encoded within that language might qualify, but not the language itself. Unfortunately, without understanding the language there's no way of reasonably estimating the size of the contained "human knowledge" that isn't contained in sources already covered.

FWIW, I think treating "the internet" as a body of human knowledge is foolish. Parts of it are, but much of it is negative-knowledge (i.e

It's not just foreign languages (Score:2)

by jd ( 1658 )

There's a lot of stuff that is on the Internet that doesn't end up in AIs, either because the guys designing the training sets don't consider it a particular priority or because it's paywalled to death.

So the imbalance isn't just in languages and broader cultures, it's also in knowledge domains.

However, AI developers are very unlikely to see any of this as a problem, for one very very important reason --- it means they can sell the extremely expensive licenses to those who actually need that information, wh

Garbage In / Garbage Out (Score:1)

by fredness ( 95020 )

EOM

English dominates vs Tamil && Hindi (Score:4, Insightful)

by Hadlock ( 143607 )

English dominates Common Crawl with 44% of content. Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population. Tamil represents 0.04% despite 86 million speakers worldwide.

English dominates because not only are there a lot of speakers, but it is the modern business lingua franca and most anyone who owns a desktop computer today can probably grumble out a handful of statements or questions in English. Hindi and Tamil, on the other hand, use completely different writing systems and beyond a couple of clever words have zero vocabulary overlap with "western" languages. Simply due to the inertia of 2 billion speakers, Hindi/Tamil etc. will continue on forever, but I can't see them being targeted by western technology. Americans and Europeans already struggle with Cyrillic, and it's at least recognizably sorta phonetically similar about half the time. Tamil just makes my eyes glaze over when I see it on street signs in Malaysia or whatever.

Re: (Score:2)

by larryjoe ( 135075 )

This article was written by an Indian student studying in the US. So, he's just citing an example based on his personal perspective. Aside from English, the one language that would be sort of a natural fit for AI training is Chinese. Up to 17% of the world's population can read/speak Chinese, which is close to the up to 20% that can read/speak English. Plus, a large percentage of AI researchers, companies, and models are located in China.

The article's author looks at Common Crawl, but that may not be re

Re: (Score:2)

by sabbede ( 2678435 )

Chinese is a bad choice. It isn't one language, it's Mandarin ("Standard Chinese"), Cantonese, and about a thousand little dialects. It's also damn near unusable, being a tonal (5% of the world will never be able to understand or speak it) analytic language with an absurd logography.

Re: (Score:2)

by HiThere ( 15173 )

IIUC, the Chinese ideograph system is common between all those languages, and therefore would count as one common language...until the computers started audio processing. (FWIW, it's my understanding that many of the Chinese ideographs even have approximately the same meaning in one of the Japanese writing systems.)

History is written by the victors (Score:3)

by Big Hairy Gorilla ( 9839972 )

That idea is more relevant than ever; we're seeing history being rewritten in real time.

See the "War in Portland and Chicago." I saw it on TV, it must be true, right?

I read the article. I hear snowflakes melting. I'd like to be sympathetic but...

The man admits he got "medical advice" off the internet regarding his Dad's medical problem. That's for sure going to be correct, Right? Does getting medical advice off the internet make him more or less authoritative? Also the man is in "Ethical AI" studies. Better become a professor, because "Ethical" and "AI" don't belong in the same sentence. They fire people like that around Google.

AI doesn't represent X percentage of knowledge?

The internet doesn't represent X percent of knowledge per ethnic group?

How do you say DEI without actually saying "DEI?"

If I wrote 10 percent of all stuff on the internet, does that mean that what I wrote is valuable and should comprise 10 percent of what's in AI?

We're back to the "all ideas are equal" thing.

Are they or aren't they?

Did that guy do the right thing by NOT recommending surgery for his dad?

Did an unspecified herbal blend from India FOR CERTAIN cure his dad? or was it just luck? or maybe it had nothing to do with the tumor on his tongue and just by doing nothing his body healed? Can a single anecdote be generalized to all cases? Steve Jobs used herbal remedies for pancreatic cancer, and it didn't work. So, which case is the one you should think is correct?

Try to be logical about this, before the silvergang downvotes me. I'll just repost it anyways, so bombs away.

I'll just say it: all ideas aren't equal. Some are better than others.

Much is implied here, but the assumptions must be questioned.

Re: (Score:2)

by sabbede ( 2678435 )

I noted the bit about him searching the internet for medical advice, but the additional "like a good millennial" made me think he was looking back and making fun of himself.

But things do get stupid from there. He notes real issues but appears to be so far up his own rear that he gets lost in hand-wringing polemics. This should have been a much, much shorter article.

I am prone to making similar mistakes. I suspect it means I took too many philosophy courses. Or just spent too much time in college.

Bonus Data (Score:2)

by hyades1 ( 1149581 )

"Hindi accounts for 0.2% of the data despite being spoken by 7.5% of the global population...and 82.7% of scam call centre employees."

Fixed that for ya! :)

Re: (Score:2)

by HiThere ( 15173 )

No. The scam callers speak English. Perhaps not well, but it's English that they are speaking.

To repeat a point I made earlier, information is not knowledge. Knowledge may be either true or false (i.e. it's a signed quantity). Information is most densely contained in (at least apparently) random noise.

Curation (Score:2)

by jma05 ( 897351 )

At the end of the day, dataset authors must make a call on what is important and what is not. Just because it exists should not be a reason that it should be in training data. Training data must not be blindly representative, but prioritize epistemic value.

Let's take science as an example. There would be nothing in Hindi (or other regional languages in low scientific output areas) that isn't also in English, as far as scientific value is concerned.

What would the dataset miss? Local chatter?

Microsoft's small
