AI models hallucinate, and doctors are OK with that
- Reference: 1741851190
- News link: https://www.theregister.co.uk/2025/03/13/ai_models_hallucinate_and_doctors/
No fewer than 25 technology and medical experts from respected academic and healthcare organizations, not to mention a web search ad giant – MIT, Harvard Medical School, University of Washington, Carnegie Mellon University, Seoul National University Hospital, Google, Columbia University, and Johns Hopkins University – have taken it upon themselves to catalog and analyze medical hallucinations in mainstream foundation models, with an eye toward formulating better rules for working with AI in healthcare settings.
Their work, published in a preprint [1]paper titled "Medical Hallucinations in Foundation Models and Their Impact on Healthcare" and in [2]a supporting GitHub repository, argues that harm mitigation strategies need to be developed.
These hallucinations use domain-specific terms and appear to present coherent logic, which can make them difficult to recognize
The authors start from the premise that [3]foundation models – huge neural networks trained on a ton of people's work and other data – from the likes of Anthropic, Google, Meta, and OpenAI present "significant opportunities, from enhancing clinical decision support to transforming medical research and improving healthcare quality and safety."
And given that starting point – and the affiliation of at least one researcher with a major AI vendor – it's perhaps unsurprising that the burn-it-with-fire scenario is not considered.
Rather, the authors set out to create a taxonomy of medical hallucination, which, they claim, differs from erroneous AI answers in less consequential contexts.
"Medical hallucinations exhibit two distinct features compared to their general purpose counterparts," the authors explain. "First, they arise within specialized tasks such as diagnostic reasoning, therapeutic planning, or interpretation of laboratory findings, where inaccuracies have immediate implications for patient care. Second, these hallucinations frequently use domain-specific terms and appear to present coherent logic, which can make them difficult to recognize without expert scrutiny."
The taxonomy, rendered visually in the paper as a pie chart, includes: Factual Errors; Outdated References; Spurious Correlations; Fabricated Sources or Guidelines; and Incomplete Chains of Reasoning.
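For illustration only, that taxonomy maps naturally onto a simple set of labels for tagging flagged outputs during review; a minimal Python sketch in which the class and member names are our own, not the paper's:

# Minimal sketch: the paper's five hallucination categories as labels for
# tagging reviewed model outputs. Class and member names are assumptions.
from enum import Enum

class MedicalHallucination(Enum):
    FACTUAL_ERROR = "Factual Errors"
    OUTDATED_REFERENCE = "Outdated References"
    SPURIOUS_CORRELATION = "Spurious Correlations"
    FABRICATED_SOURCE = "Fabricated Sources or Guidelines"
    INCOMPLETE_REASONING = "Incomplete Chains of Reasoning"

print(MedicalHallucination.FABRICATED_SOURCE.value)  # Fabricated Sources or Guidelines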
Another day, another AI model. Today, it's Google's [7]Gemma 3 family of models (1B, 4B, 12B and 27B), which the Chocolate Factory bills as "The most capable model you can run on a single GPU or TPU."
Gemma 3 is an open model – [8]its weights are available – offered under the [9]Gemma license, which is not open source. Per its [10]technical report [PDF], Gemma 3 is comparable to Gemini-1.5-Pro in terms of benchmarks, [11]for what they're worth.
Google says you’d need 32 Nvidia H100s (four HGX servers' worth) to run DeepSeek R1 at FP16 while Gemma 3 27B needs just one. This, however, ignores the fact that R1 was trained at FP8, which by our estimate means it needs just half as many GPUs as Google claims.
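Our own back-of-the-envelope arithmetic on that claim, in a minimal Python sketch that assumes roughly 671 billion parameters for R1, 27 billion for Gemma 3 27B, and 80 GB of memory per H100, counting weights only:

# Rough memory math only (our assumptions: ~671B parameters for DeepSeek R1,
# ~27B for Gemma 3 27B, 80 GB of HBM per Nvidia H100); KV cache and
# activation overhead are ignored.
def weight_gb(params_billion: float, bytes_per_param: int) -> float:
    """Approximate weight footprint in GB: billions of params x bytes per param."""
    return params_billion * bytes_per_param

print(weight_gb(671, 2))  # ~1342 GB at FP16 - hence Google's 32 x 80 GB H100 figure, with headroom
print(weight_gb(671, 1))  # ~671 GB at FP8   - half the footprint, so roughly half the GPUs
print(weight_gb(27, 2))   # ~54 GB at FP16   - Gemma 3 27B squeezes onto a single 80 GB H100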
Google DeepMind also [12]introduced a Gemini 2.0-based model for robotics.
The authors also looked at the frequency with which such hallucinations appear. Among various tests, the boffins evaluated the clinical reasoning abilities of five general-purpose LLMs – o1, gemini-2.0-flash-exp, gpt-4o, gemini-1.5-flash, and claude-3.5-sonnet – on three targeted tasks: ordering events chronologically; lab data interpretation; and differential diagnosis generation, the process of assessing symptoms and exploring possible diagnoses. Models were rated on a scale of No Risk (0) to Catastrophic (5).
The results were not great, though some models fared better than others: "Diagnosis Prediction consistently exhibited the lowest overall hallucination rates across all models, ranging from 0 percent to 22 percent," the paper says. "Conversely, tasks demanding precise factual recall and temporal integration – Chronological Ordering (0.25 - 24.6 percent) and Lab Data Understanding (0.25 - 18.7 percent) – presented significantly higher hallucination frequencies."
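The paper's exact counting procedure isn't reproduced here, but conceptually a per-task hallucination rate is just the share of rated responses that reviewers flag above No Risk. A minimal Python sketch, assuming any rating above 0 counts as a hallucination (the sample data is illustrative):

# Illustrative only: tally per-task hallucination rates from 0-5 risk ratings,
# assuming anything rated above No Risk (0) counts as a hallucination.
from collections import defaultdict

def hallucination_rates(ratings):
    """ratings: iterable of (model, task, risk) tuples with risk on the 0-5 scale."""
    counts = defaultdict(lambda: [0, 0])  # (model, task) -> [flagged, total]
    for model, task, risk in ratings:
        bucket = counts[(model, task)]
        bucket[0] += int(risk > 0)
        bucket[1] += 1
    return {key: flagged / total for key, (flagged, total) in counts.items()}

sample = [("claude-3.5-sonnet", "Diagnosis Prediction", 0),
          ("claude-3.5-sonnet", "Chronological Ordering", 2),
          ("claude-3.5-sonnet", "Chronological Ordering", 0)]
print(hallucination_rates(sample))
# {('claude-3.5-sonnet', 'Diagnosis Prediction'): 0.0, ('claude-3.5-sonnet', 'Chronological Ordering'): 0.5}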
The findings, the authors say, challenge the assumption that diagnostic tasks require complex inference that LLMs are less able to handle.
"Instead, our results suggest that current LLM architectures may possess a relative strength in pattern recognition and diagnostic inference within medical case reports, but struggle with the more fundamental tasks of accurately extracting and synthesizing detailed factual and temporal information directly from clinical text," they explain.
[14]Google AI chatbot more empathetic than real doctors in tests
[15]US regulators crack down on AI playing doctor in healthcare
[16]What does an ex-Pharma Bro do next? If it's Shkreli, it's an AI Dr bot
[17]Google, you're not unleashing 'unproven' AI medical bots on hospital patients, yeah?
[18]Robots in schools, care homes next? This UK biz hopes to make that happen
[19]Nvidia won the AI training race, but inference is still anyone's game
[20]Amazon, Meta, Google sign pledge to triple nuclear power capacity by 2050
[21]ServiceNow's new AI agents will happily volunteer for your dullest tasks
[22]MINJA sneak attack poisons AI models for other chatbot users
Among the general purpose models, Anthropic's Claude-3.5 and OpenAI's o1 had the lowest hallucination rates in the three tested tasks. These findings, the researchers argue, suggest high-performing models show promise for diagnostic inference. But the continued occurrence of errors rated Significant (2) or Considerable (3) means that even the best-performing models need careful monitoring, and a human in the loop, for clinical tasks.
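In practice, that amounts to a simple gating policy: anything scored at Significant (2) or above gets routed to a clinician before it informs patient care. A minimal Python sketch of such a gate; the threshold, function name, and workflow are our own assumptions, not the authors' recommendation:

# Illustrative policy gate: route any output rated Significant (2) or higher
# on the paper's 0-5 risk scale to human review. The threshold choice is ours.
REVIEW_THRESHOLD = 2  # Significant

def needs_clinician_review(risk_rating: int) -> bool:
    """True if an LLM output's risk rating warrants clinician sign-off."""
    if not 0 <= risk_rating <= 5:
        raise ValueError("risk rating must be on the 0 (No Risk) to 5 (Catastrophic) scale")
    return risk_rating >= REVIEW_THRESHOLD

print(needs_clinician_review(3))  # True: Considerable risk, keep a human in the loop
print(needs_clinician_review(0))  # False: No Risk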
The researchers also conducted a survey of 75 medical practitioners about their use of AI tools. And there's no going back, it seems: "40 used these tools daily, 9 used them several times per week, 13 used them a few times a month, and 13 reported rare or no usage," the paper says, adding that 30 respondents expressed high levels of trust in AI model output.
That lack of skepticism from 40 percent of the survey participants is all the more surprising considering that "91.8 percent have encountered medical hallucination in their clinical practice" and that "84.7 percent have considered that hallucination they have experienced could potentially affect patient health."
We're left to wonder whether newly hired medical personnel would be afforded an error rate to match that of the hallucinating AI models.
The researchers conclude by emphasizing that regulations are urgently needed and that legal liability for errors needs to be clarified.
"If an AI model outputs misleading diagnostic information, questions arise as to whether liability should fall on the AI developer for potential shortcomings in training data, the healthcare provider for over-reliance on opaque outputs, or the institution for inadequate oversight," the authors say.
Given the Trump administration's [24]rollback of [25]Biden-era AI safety rules, the researchers' call "for ethical guidelines and robust frameworks to ensure patient safety and accountability" may not be answered on a federal level. ®
[1] https://arxiv.org/abs/2503.05777
[2] https://github.com/mitmedialab/medical_hallucination?tab=readme-ov-file#difference
[3] https://www.theregister.com/2021/08/23/percy_liang_qa/
[7] https://blog.google/technology/developers/gemma-3/
[8] https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d
[9] https://ai.google.dev/gemma/terms
[10] https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf
[11] https://www.theregister.com/2025/02/15/boffins_question_ai_model_test/
[12] https://deepmind.google/discover/blog/gemini-robotics-brings-ai-into-the-physical-world/
[14] https://www.theregister.com/2024/01/16/google_ai_chatbot_heathcare/
[15] https://www.theregister.com/2024/02/09/ai_medicare_health/
[16] https://www.theregister.com/2023/04/20/shkreli_ai_medicine/
[17] https://www.theregister.com/2023/08/08/google_senator_ai_health/
[18] https://www.theregister.com/2025/01/27/engineered_arts_robots_interview/
[19] https://www.theregister.com/2025/03/12/training_inference_shift/
[20] https://www.theregister.com/2025/03/12/push_for_nuclear/
[21] https://www.theregister.com/2025/03/12/servicenow_yokohama/
[22] https://www.theregister.com/2025/03/11/minja_attack_poisons_ai_model_memory/
[24] https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence
[25] https://www.theregister.com/2025/01/21/trump_eliminates_biden_ai_order/
Would you trust Clippypilot to diagnose your ailments?
"It looks like your patent has six legs, would you like help with that?"
On Hallucinations
Generative AI models are just complicated mathematical predictive models. Calling unwanted outputs "hallucinations" is an attempt to deflect (from the fact that "hallucinations" constitute 100% of their output) and to anthropomorphise them (so that people believe they're capable of "understanding" the idea of "truth" in the first place).
Re: On Hallucinations
"Calling unwanted outputs "hallucinations" is an attempt to deflect ... and to anthropomorphise them"
I find this rabid objection to the use of metaphors silly and foolish. The practice also has a long and bloody history in religious intolerance and censorship.
Metaphors are the prime way language incorporates new notions, ideas, and objects.
Refusing to call this behavior from "AI models" (two metaphors) "hallucinating" is as silly as refusing to call a "programmable electronic calculator" (three metaphors) a "computer" because a "computer" used to be a woman doing computations and a "programmable electronic calculator" is not even human, let alone a woman.
"Hallucinations" perfectly describe the experience users have about these erroneous outputs. Users have absolute no interest in politico-philosophical hair splitting by people who object to the use of machine learning in practical life.
Btw, humans have anthropomorphized tools and objects in general since the origins of the species (and maybe before). Anthropomorphizing tools is not a reason to dismiss people's opinions or arguments. I think that those who were unable to anthropomorphise were also unable to use these tools effectively and became extinct as a consequence.
Not surprised
Every time Homo sapiens can lose a skill because they can somehow get away with it, what do they do? Jump at the opportunity. Without exception.
Will doctors use machine learning tools to make critical decisions even when the tools are known to be broken? Of course they will, if they can. As basically everyone else does.
No machine uprising will be needed because humanity is just going to give up preemptively.
Re: Not surprised
"Every time homo sapiens can lose a skill because they can somehow get away with it, what they do?"
I assume you do not create your own fire using stones, sticks and dried moss? That is a skill humans have happily forgotten.
Plato has this nice dialog against the use of writing (Phaedrus, easy to find). He taught:
Their trust in writing, produced by external characters which are no part of themselves, will discourage the use of their own memory within them. You have invented an elixir not of memory, but of reminding; and you offer your pupils the appearance of wisdom, not true wisdom, for they will read many things without instruction and will therefore seem to know many things, when they are for the most part ignorant and hard to get along with, since they are not wise, but only appear wise.
It is worth noting that Plato's teachings have reached us because someone wrote them down, which makes this a very good example of why Luddites eventually lose the fight.
harm mitigation strategies
So "...argues that harm mitigation strategies need to be developed." How about don't use a predictive model. Whats wrong with an expert system? I can see using well designed models to help in coming up with new tools/methodology in diagnosis, but not a LLM doc.
"Do no harm"...
Unless it saves money apparently...