AI chatbots are no better at medical advice than a search engine
(2026/02/09)
- Reference: 1770670682
- News link: https://www.theregister.co.uk/2026/02/09/ai_chatbots_medical_advice_sucks/
- Source link:
Healthcare researchers have found that AI chatbots could put patients at risk by giving shoddy medical advice.
Academics from the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford partnered with MLCommons and other institutions to evaluate the medical advice people get from large language models (LLMs).
The authors conducted [1]a study with 1,298 UK participants who were asked to identify potential health conditions and to recommend a course of action in response to one of ten different expert-designed medical scenarios.
The respondents were divided into a treatment group, which was asked to make decisions with the help of an LLM (GPT-4o, Llama 3, or Command R+), and a control group, which was asked to make decisions using whatever method it would normally rely on, most often an internet search or personal knowledge.
The researchers – Andrew M. Bean, Rebecca Elizabeth Payne, Guy Parsons, Hannah Rose Kirk, Juan Ciro, Rafael Mosquera-Gómez, Sara Hincapié M, Aruna S. Ekanayaka, Lionel Tarassenko, Luc Rocher, and Adam Mahdi – describe their findings in [5]a report published in Nature Medicine.
Pointing to [6]prior work that has shown LLMs do not improve the clinical reasoning of physicians, the authors found that LLMs do not help the general public either.
"Despite LLMs alone having high proficiency in the task, the combination of LLMs and human users was no better than the control group in assessing clinical acuity and worse at identifying relevant conditions," the report states.
That conclusion may not be welcome among commercial AI service providers like [8]Anthropic, [9]Google, and [10]OpenAI, all of which have shown interest in selling AI to the healthcare market.
Study participants using LLMs fared no better at assessing health conditions and recommending a course of action than participants who consulted a search engine or relied on their own knowledge. Moreover, the LLM users had trouble giving their chatbots the relevant information, and the LLMs in turn often responded with mixed messages that combined good and bad recommendations.
The study notes that LLMs presented various types of incorrect information, "for example, recommending calling a partial US phone number and, in the same interaction, recommending calling 'Triple Zero,' the Australian emergency number."
The study also mentions [15]an interaction in which "two users sent very similar messages describing symptoms of a subarachnoid hemorrhage but were given opposite advice. One user was told to lie down in a dark room, and the other user was given the correct recommendation to seek emergency care."
What's more, the researchers found that benchmark testing methods often fail to capture the way humans and LLMs actually interact. The models may excel at answering structured questions modeled on medical licensing exams, but they fall short in interactive scenarios.
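To make that distinction concrete, here is a minimal illustrative sketch in Python of the two evaluation styles. It is not code from the study: ask_llm is a hypothetical stand-in for any chat-completion call, and the function names and prompts are invented for illustration only.

def ask_llm(prompt):
    """Hypothetical stand-in for a chat-completion call; plug in a real model client here."""
    raise NotImplementedError

def score_benchmark_item(vignette, options, correct_letter):
    # Exam-style evaluation: the full vignette and answer choices arrive in one
    # well-formed prompt, and scoring only checks which letter the model picks.
    prompt = vignette + "\n" + "\n".join(f"{letter}) {text}" for letter, text in options)
    return ask_llm(prompt).strip().upper().startswith(correct_letter)

def run_interactive_scenario(user_turns):
    # Interactive evaluation: a lay user describes symptoms in their own words,
    # often leaving out key details, and reacts to whatever the model says each turn.
    transcript = []
    for turn in user_turns:
        transcript.append("User: " + turn)
        transcript.append("Model: " + ask_llm("\n".join(transcript)))
    return transcript

The point of the sketch is only that the second loop depends on what the user chooses to say and do at each step, which is exactly the human-LLM interaction that exam-style benchmark calls never exercise.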
"Training AI models on medical textbooks and clinical notes can improve their performance on medical exams, but this is very different from practicing medicine," paper co-author Luc Rocher, associate professor at the Oxford Internet Institute, told The Register in an email. "Doctors have years of practice triaging patients using rule-based protocols designed to reduce errors.
"Even with major breakthroughs in AI development, ensuring that future models can balance users' need for reassurance with the limited capacity of our public health systems will remain a challenge. As more people rely on chatbots for medical advice, we risk flooding already strained hospitals with incorrect but plausible diagnoses."
The authors conclude that AI chatbots aren't yet ready for real-world medical decision-making.
"Taken together, our findings suggest that the safe deployment of LLMs as public medical assistants will require capabilities beyond expert-level medical knowledge," the study says. "Despite strong performance on medical benchmarks, providing people with current generations of LLMs does not appear to improve their understanding of medical information." ®
[1] https://www.oii.ox.ac.uk/news-events/new-study-warns-of-risks-in-ai-chatbots-giving-medical-advice/
[5] https://www.nature.com/articles/s41591-025-04074-y
[6] https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395
[8] https://www.anthropic.com/news/healthcare-life-sciences
[9] https://research.google/blog/advancing-medical-ai-with-med-gemini/
[10] https://openai.com/index/introducing-chatgpt-health/
[15] https://www.nature.com/articles/s41591-025-04074-y/tables/2
Paul Herber
I'm sure if you were to ask Bing about Linux it would say you had cancer!
Doctor, doctor, tell me the NEWS
Anonymous Coward
I’ve got a bad case of AI blues…
Wow, quelle surprise, who would have thought it?
But then again, I have always assumed that if you trust Doctor Google, it eventually, absolutely will tell you that you have cancer.