Boffins warn that AI paper mills are swamping science with garbage studies
- Reference: 1747139472
- News link: https://www.theregister.co.uk/2025/05/13/ai_junk_science_papers/
- Source link:
The research team from the University of Surrey notes an "explosion of formulaic research articles," including inappropriate study designs and false discoveries, based on data cribbed from the US National Health and Nutrition Examination Survey (NHANES) nationwide health database.
The [1]study, published in PLOS Biology, an open-access journal from the nonprofit Public Library of Science, found that many post-2021 papers took "a superficial and oversimplified approach to analysis." These often focused on a single variable while ignoring more realistic, multi-factor explanations of links between health conditions and potential causes, and some cherry-picked narrow data subsets without justification.
"We've seen a surge in papers that look scientific but don't hold up under scrutiny – this is 'science fiction' using national health datasets to masquerade as science fact," says Matt Spick, a lecturer in health and biomedical data analytics at the University of Surrey and one of the authors of the report.
"The use of these easily accessible datasets via APIs, combined with large language models, is overwhelming some journals and peer reviewers, reducing their ability to assess more meaningful research – and ultimately weakening the quality of science overall," he added.
The report notes that AI-ready datasets, such as NHANES, can open up new opportunities for data-driven research, but also lead to the risk of potential data exploitation by what it calls "paper mills" – entities that churn out questionable scientific papers, often for paying clients seeking confirmation of an existing belief.
Surrey Uni's work involved a systematic literature search going back ten years, retrieving potentially formulaic papers covering NHANES data and analyzing them for telltale statistical approaches and study designs.
The team identified and retrieved 341 reports published across a number of different journals. It found that over the last three years there has been a rapid rise in the number of publications analyzing single-factor associations between predictors (independent variables) and various health conditions using the NHANES dataset. On average, four papers per year were published between 2014 and 2021, rising to 33, 82, and 190 in 2022, 2023, and the first ten months of 2024, respectively.
Also noted is a change in the origins of the published research. From 2014 to 2020, just two out of 25 manuscripts had a primary author affiliation in China. Between 2021 and 2024, this rose to 292 out of 316 manuscripts.
The report says this jump in single-factor associative research means there is a corresponding increase in the risk of misleading findings being introduced to the wider body of scientific literature.
For example, it notes that some well-known multifactorial health issues – depression, cardiovascular disease, and cognitive function among them – were investigated using simplistic, single-factor approaches in some of the papers reviewed.
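The statistical problem the authors describe can be seen in a few lines of code. Below is a minimal, hypothetical sketch (synthetic data, plain Python; no real NHANES fields, and not the study's own analysis) of how a single-factor regression can report a strong association for a variable that a multi-factor model correctly shows has no effect:

```python
# Toy illustration of why single-factor analyses of multifactorial outcomes
# mislead: the outcome is driven entirely by x2, but because x1 and x2 are
# correlated, a single-factor regression "finds" an effect of x1.
import random

random.seed(0)
n = 1000
x2 = [random.gauss(0, 1) for _ in range(n)]          # the real driver
x1 = [0.7 * v + random.gauss(0, 1) for v in x2]      # correlated bystander
y = [v + random.gauss(0, 0.5) for v in x2]           # outcome depends on x2 only

def center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

yc, a, b = center(y), center(x1), center(x2)

# Single-factor slope of y on x1 alone.
single_b1 = sum(i * j for i, j in zip(a, yc)) / sum(i * i for i in a)

# Two-factor fit y ~ x1 + x2 via the 2x2 normal equations.
saa = sum(i * i for i in a)
sbb = sum(i * i for i in b)
sab = sum(i * j for i, j in zip(a, b))
say = sum(i * j for i, j in zip(a, yc))
sby = sum(i * j for i, j in zip(b, yc))
det = saa * sbb - sab * sab
multi_b1 = (sbb * say - sab * sby) / det
multi_b2 = (saa * sby - sab * say) / det

print(f"single-factor slope for x1: {single_b1:.2f}")  # spuriously large
print(f"multi-factor slope for x1:  {multi_b1:.2f}")   # close to zero
print(f"multi-factor slope for x2:  {multi_b2:.2f}")   # close to the true 1.0
```

A reviewer seeing only the single-factor result would conclude x1 matters; the multi-factor fit shows it doesn't.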
To combat this, the team sets out a number of suggestions, including that editors and reviewers at scientific journals should regard single-factor analysis of conditions known to be complex and multifactorial as a "red flag" for potentially problematic research.
Providers of datasets should also take steps, such as requiring API keys and application numbers, to prevent data dredging – an approach already used by the UK Biobank, the report says. Publications referencing such data should be required to include an auditable account number as a condition of access.
Another suggestion is that full dataset analysis should be made mandatory, unless using data subsets can be justified.
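The subset problem that full-dataset analysis guards against is easy to demonstrate. The sketch below (pure Python, synthetic data; an illustration, not the authors' method) generates two completely unrelated variables and then scans narrow slices of the data for the one that happens to correlate best – exactly the kind of unjustified subset the recommendation would expose:

```python
# Data dredging by subsetting: two unrelated noise series show essentially
# zero correlation overall, yet scanning enough narrow slices will almost
# always turn up one with an impressive-looking correlation.
import random

random.seed(1)

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

x = [random.gauss(0, 1) for _ in range(300)]
y = [random.gauss(0, 1) for _ in range(300)]

full_r = pearson(x, y)

# Try every contiguous 15-point window and keep the strongest correlation.
best_r, best_i = max(
    (abs(pearson(x[i:i + 15], y[i:i + 15])), i) for i in range(300 - 15)
)

print(f"full dataset: r = {full_r:+.2f}")
print(f"best cherry-picked window (start={best_i}): |r| = {best_r:.2f}")
```

Reporting only the best window makes pure noise look like a finding; requiring the full-dataset figure alongside it makes the trick obvious.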
"We're not trying to block access to data or stop people using AI in their research – we're asking for some common sense checks," said Tulsi Suchak, a post-graduate researcher at the University of Surrey and lead author of the study. "This includes things like being open about how data is used, making sure reviewers with the right expertise are involved, and flagging when a study only looks at one piece of the puzzle."
This isn't the first time the issue has come to light. Last year, [11]US publishing house Wiley discontinued 19 scientific journals overseen by its Hindawi subsidiary that were publishing reports churned out by AI paper mills.
It is also part of a wider problem of AI-generated content appearing online and in web searches that can be difficult to distinguish from reality. Dubbed "AI slop," this includes fake pictures and entire video sequences of celebrities and world leaders, but also [12]fake historical photographs and AI-generated portraits of historical figures appearing in search results as if they were genuine.
Truly, AI is the gift that keeps on giving. ®
[1] https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3003152
[11] https://www.theregister.com/2024/05/16/wiley_journals_ai/
[12] https://marinaamaral.substack.com/p/ai-is-creating-fake-historical-photos
What a sad life you must have that this is the sort of thing you feel like you need to "contribute." There are other posters on this site with whom I disagree but who at least make an effort to formulate a rational argument. You aren't even a good troll, you're just boring.
Shit "research"
The more shit that "AI" produces, the more it will be mistrusted. Take that Altman.
Drain the swamp
Science is already swamped (and has been for many years) by predatory junk journals – money-raking scams with no quality control or anything remotely approaching serious peer review. The best case scenario is that junk AI-generated articles swamp the junk journals, and serious science just gets on with it. (No, I don't think that's actually going to happen, but we can dream.)
See also the celebrated "[1]Get me off your fucking mailing list".
[1] https://www.sciencealert.com/journal-accepts-paper-titled-get-me-off-your-f-cking-mailing-list
Re: Drain the swamp
This.
APC charges double every couple of years, review deadlines shorten (six days now for a lot of journals that claim to be non-predatory), and yet the recommendation in the article is that "editors and reviewers at scientific journals should regard single-factor analysis of conditions known to be complex and multifactorial as a 'red flag' for potentially problematic research".
So the burden should be on the free labor provided to the journals.
Instead, the journals can use some of their stunning profits to either hire staff and/or automate screening of this issue.
If a single-factor study is invalid, the rise in such studies is irrelevant, because each one would already be rejected.
But in reality, none of this addresses the cause, in that academic publishing is hijacked for profit and greed, instead of being community driven.
Re: Drain the swamp
> But in reality, none of this addresses the cause, in that academic publishing is hijacked for profit and greed, instead of being community driven.
And that has been the case since forever. In the past, there was at least some excuse for the costs of production and distribution of quality hard-copy printed journals for libraries, paying for professional proof-reading, etc. 1 Since everything is now online, that is no longer a valid excuse. And I can't imagine they pay proof readers more than peanuts, given the abysmal quality – the sheer illiteracy – of the proofing I've had to put up with (I work in a mathematical field, and it will generally cost me a full day's work de-mangling the maths... don't get me started...).
There are some valiant attempts in the academic community to sideline the prevalent lazy, exploitative and greed-driven publishing model, but it's an uphill battle; prejudice in favour of the traditional high-impact journals and big publishing houses (you know the ones I mean) is still ingrained. Of course the publishers have a vested interest in maintaining the illusion of "prestige" attached to their titles (and with some exceptions it is becoming very illusory indeed).
And this: my academic institution demands that we open-access all publications (errm, good for them) - for which privilege, mainstream publishers charge $$$ on top of their already-exorbitant publication fees. My academic institution also does not provide funding to cover open-access fees. That is, presumably, supposed to come out of our grants (thanks, guys).
1 I remain, though, very uneasy about the idea of paying reviewers.
Re: Drain the swamp
"In the past, there was at least some excuse for the costs of production and distribution of quality hard-copy printed journals for libraries, paying for professional proof-reading, etc."
Back then, one of my first tasks as a research assistant was to check the proofs (they may have been galleys, but I think it was page proofs) of a paper in one of the most prestigious Irish publications. It was written by a former student of my boss. I found systematic errors in the conversion from imperial to metric units which had passed the editors, referees and others. True, the proofs matched the original text....
Sort of pre-dates AI
So-called "meta studies" were almost unknown when I was a student in the mid-1970s and early '80s (mostly biomedical sciences), but with the rise of the internet, the online availability of research data sets, and lower-cost computing, such studies became more common.
The quality of these studies varied from very high to utter rubbish. The worst used only the data sources and "curated" subsets thereof that confirmed a favoured (paid for?) hypothesis†.
This situation was bad enough without adding AI into the cesspit of scientific publishing.
Perhaps the learned societies will be forced to take back their original function of publishing the papers of their members (in good standing) and other researchers on the personal recommendation of society members.
I imagine that also comes with risks of elitism, parochialism and inbreeding, but set against the promiscuity of AI trawling the internet, those risks might be tolerable.
"AI the gift that keeps on giving." Like the pox then?
† I was thinking " Childhood MMR vaccination causes loss of religious faith and is responsible for declining church attendences. " was a good example but it is and isn't. :(
Way Back...
A significant part of my degree was Stats, including a course on Medical Statistics, and we were given plenty of examples of dodgy studies; that was over 40 years ago, so there's nothing new here – AI just seems to be capable of producing a lot of sh*t a lot faster! To quote one of our lecturers: "Beware the hidden variable!"*
A previous employer was "persuaded" by the bank to employ a firm to do some analysis and market research at a significant one-off cost, plus a monthly retainer for ongoing reporting. I was able to pretty much rip the report apart, to the extent that the contract was cancelled – the absolute killer being their use of curated datasets that bore little relation to our business. In one case they used a global sales number that "correlates closely with the business performance" to forecast the future, but only picked the four years where the correlation was a fit (and a loose one at that); they'd ignored the years before and after, where the data didn't correlate at all.
*One, trivial, illustration was that there's a strong correlation between the number of pubs per square mile and the number of churches, therefore churches cause pubs to be built (the hidden variable being population density).
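The lecturer's pubs-and-churches example is easy to reproduce. Here's a minimal sketch with synthetic towns in plain Python (an illustration, not real data): both counts scale with population density, so they correlate strongly with each other, yet the correlation vanishes once density is controlled for:

```python
# The "hidden variable" fallacy: pubs and churches both scale with
# population density, so their counts correlate strongly even though
# neither causes the other. Controlling for density removes the effect.
import random

random.seed(42)

def pearson(xs, ys):
    # Pearson correlation coefficient of two equal-length sequences.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def residuals(ys, xs):
    # Residuals of ys after a simple linear regression on xs.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

# Hidden variable: population density of 200 hypothetical towns.
density = [random.uniform(100, 5000) for _ in range(200)]
pubs = [d / 500 + random.gauss(0, 1) for d in density]
churches = [d / 800 + random.gauss(0, 1) for d in density]

raw_r = pearson(pubs, churches)
# Partial correlation: correlate what's left after removing density's effect.
partial_r = pearson(residuals(pubs, density), residuals(churches, density))

print(f"pubs vs churches:        r = {raw_r:.2f}")
print(f"controlling for density: r = {partial_r:.2f}")
```

The raw correlation is strong, the partial correlation is near zero – the hidden variable was doing all the work.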
Re: Way Back...
Oblig [1]XKCD
[1] https://xkcd.com/552/
Re: Way Back...
My personal favourite is that firemen cause fires. I mean c'mon, whenever you see a fire they're always there...
Re: Way Back...
Fahrenheit 451 ?
Re: Way Back...
Correlation implies causation somewhere – it just doesn't tell you what it is.
Not just science, knowledge in general
Reducing information quality is one of the most predictable consequences of LLMs. The ease and speed of generating material this way is much greater than that attainable by involving actual intelligence.
And successive models will have a diet increasingly rich in AI slop.
So, rather than the much touted belief that "AI" will improve knowledge, it seems more likely that it will decrease the quality of knowledge as a result of a growing proportion being statistically-generated auto-text.
And still CEOs, politicians, and even many "thinkers" keep following the belief that predictive text is a great breakthrough in intelligence, just as superstitious people in the dark ages donated to the local monastery to buy prayers to reduce time spent in purgatory.
Other AI
The other form of AI is the substandard student to be found at so many universities.
Climate "science"