In Real-World Test, an AI Model Did Better Than ER Doctors At Diagnosing Patients
- Reference: 0183079426
- News link: https://science.slashdot.org/story/26/04/30/1956259/in-real-world-test-an-ai-model-did-better-than-er-doctors-at-diagnosing-patients
- Source link:
> The researchers ran a series of experiments on the AI model to test its clinical acumen -- including actual cases like the lupus patient who'd been previously treated at the emergency department at Beth Israel in Boston. The team graded how well the AI model could provide an accurate diagnosis at three moments in time, from the triage stage in the ER, up to being admitted into the hospital. Overall, AI outperformed two experienced physicians -- and did so with only the electronic health records and the limited information that had been available to the physicians at the time. "This is the big conclusion for me -- it works with the messy real-world data of the emergency department," said Dr. Adam Rodman, a clinical researcher at Beth Israel and one of the study authors. "It works for making diagnoses in the real world."
>
> Other parts of the study focused on case reports published in the New England Journal of Medicine and clinical vignettes to suss out whether the AI model could meet well-established "benchmarks" and game out thorny diagnostic questions. "The model outperformed our very large physician baseline," said Raj Manrai, assistant professor of Biomedical Informatics at Harvard Medical School who was also part of the study. The authors emphasize the AI relied on text alone, while in real life, clinicians need to attend to many other inputs like images, sounds and nonverbal cues when diagnosing and treating a patient.
The findings were [2]published Thursday in the journal Science.
[1] https://www.npr.org/2026/04/30/nx-s1-5804474/ai-doctors-openai-patient-care-diagnosis
[2] https://www.science.org/doi/10.1126/science.adz4433
Another tool in the toolbox. (Score:2)
Presumably the "AI" has a wider pool of data to pull from than any single doctor, or even a small group of doctors, on scene so I could see this sort of thing used to double-check and/or offer alternative diagnoses, perhaps with percentages. Use tools if/when they're helpful, but remember that they're just tools.
Re: (Score:2)
Double-check, yes ("That looks like cancer, Doctor Moe" and the doc gets a little piece of it, and the oncologist verifies the piece is cancer)... but the surgeon is the one(s) running the show.
I would never, ever put an AI in full automated control of a DaVinci surgical robot... and the AI might be wrong that "that looks like cancer," and the condition is something else. (But, you know for sure, someone is gonna attach an AI to a surgical robot, and the hospital can get rid of the surgeons, and still
Re: (Score:2)
AI models are better than our ability to trust them. Already. They are certainly better than the average person in most knowledge professions, and for the few where they aren't, it's just a matter of time. Computers are better thinkers than we are. That is just obvious.
Re:Another tool in the toolbox. (Score:4)
You're making MUCH too wide an assertion. There are areas where the AI is better, but its competence is "jagged". If you say it's better at guessing protein folding, I'll agree. If you say it's a better surgeon...not this month. Probably not this year. But it's better in certain specific areas, and those areas are increasing.
Re: (Score:2)
Surgery is a skilled profession, not a knowledge profession. I won a court case with AI when my lawyer was pushing us to settle. The AI found an argument, and we got a positive result.
Re: Another tool in the toolbox. (Score:2)
Computers don't think. We don't even understand how animals think.
Re: (Score:2)
Computers do think. Computers do not do consciousness. Consciousness is not a prerequisite to thinking. Thinking is a prerequisite to consciousness.
Re: (Score:1)
I agree.
They have never thought, and they won't... I (as, y'know, a human) can decide if I should cross a street (and I can sing song lyrics as I do it).
A computer cannot, by its definition, "think" like us humans can. It can analyze data fed to it, and data that it sees through whatever cameras it has.
You or I on the street can tell if a car is speeding through "that light", and we can make a split-second decision... pull over or slam on the brakes or any number of things.
Re: (Score:2)
These are not controlled experiments. They are merely a reprocessing of existing cases with simplified data sources that happen to correlate with the outcomes.
Anyone can make a diagnosis, even without looking at a patient. Do it enough times, argue the failures were anomalies or unfair comparisons or just don't mention them, hype up the successes. Post on TikTok.
(and for the experts: always look at who is paying for a reasonable looking study, and list all the biases you can think of up front)
Re: (Score:2)
Here is a study about triage where AI fails to beat humans:
"Results from a randomized, interface-blinded, crossover simulation study involving 120 hypothetical telemedical encounters performed by real primary care physicians, the AI co-clinician or GPT-realtime. "
[1]https://deepmind.google/blog/a... [deepmind.google]
[1] https://deepmind.google/blog/ai-co-clinician/
Coming: Reverse Centaurs and (Score:4, Insightful)
accountability sinks.
1. What is a Reverse Centaur?
The Reverse Centaur: The AI acts as the "head" or decision-maker, and the human is the "body" or worker, forced to keep up with an impossible, algorithmic pace.
2. What is an Accountability Sink?
A "moral crumple zone"—is a human who is present only to take the blame when an AI system fails.
Who wants to work in such an environment?
Some of us would refuse.
Re: Coming: Reverse Centaurs and (Score:1)
If you look both ways before crossing the street, your likelihood of dying in the street is much lower. I have no idea what AI has to do with that but I hope this helps.
"AI" (Score:2)
The biggest problem with "AI" is the marketing. There is no single "AI"; there is a ton of various semi-related and entirely unrelated tooling that uses various bits of machine learning. From what this reads like, they have a more advanced fuzzy-logic search engine in place, which is one of the absolute best use cases for AI/ML workloads. But in the end, it's no different than any other search engine. It doesn't do the critical work for you, it simply searches vast repositories of knowledge and suggests po
Re: (Score:2)
I think that the blind spot for many are the AI models used for science (AlphaFold, AlphaEvolve, AlphaGenome, WeatherNext, AlphaEarth, AlphaZero, ...). They already have multiple AI models that have made new scientific discoveries, some of which are Nobel-level, groundbreaking discoveries. People are underestimating the potential impact of this. Demis Hassabis has already said that finding a cure for all known diseases within 10 years is a possible scenario.
It wouldn't surprise me thanks to private equity (Score:2)
If you know anything about the shit show that is emergency rooms right now you know that private equity has bought them all and is slashing patient time and staff. A couple of states have banned this but the private equity firms just used freaky corporate structures to get around the bans and they are currently in the courts being challenged.
So yeah your AI can outperform a doctor that gets 5 minutes with the patient before having to move on to the next one in order to keep their private equity Masters
Re: (Score:2)
They haven't bought the ones owned by a nonprofit healthcare group (think Kaiser, et al.). The problem is they have bought a lot of the anesthesiologists and other specialties which operate as contractors in the emergency room environment. This was a big problem up until a few years ago, when Congress passed an act to prohibit third-party balance billing from contractors in the emergency room. One of the few places left where they can get away with third-party balance billing is Ambulance Services. Altho
Perfectly understandable. (Score:4, Insightful)
Doctors are tired, stressed and multitasking. They diagnose by pattern matching, which is ideal for AI.
Re: (Score:2)
Problem is, if you repeat the question the LLM will give a different answer each time.
Re: (Score:3)
> Problem is, if you repeat the question the LLM will give a different answer each time.
No it won't. It may change the wording, but not the answer. I just asked Gemini what the capital of the US was five times using Google, and got five unique responses. All of them said it was Washington DC though.
Re: (Score:2)
An LLM can change the answer, but you can ask it N times and use a majority vote to pick the best answer and get rid of random errors. This is an actual scientific strategy that is used with AI models to get more accurate results. There was one experiment where they managed to get over a million right answers without a single error using this strategy.
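The majority-vote idea (sometimes called "self-consistency" sampling) is simple to sketch. In this toy version, `ask_model` is a hypothetical stand-in that simulates an occasionally wrong model, not a real LLM API:

```python
# Majority-vote ("self-consistency") sketch: sample the same question
# several times and keep the most common answer. `ask_model` is a
# hypothetical stand-in for an actual LLM call.
from collections import Counter

def ask_model(question: str, seed: int) -> str:
    # Simulate a model that is usually right but occasionally errs.
    return "Washington, D.C." if seed % 5 != 0 else "New York"

def majority_vote(question: str, n_samples: int = 9) -> str:
    answers = [ask_model(question, seed) for seed in range(n_samples)]
    winner, _count = Counter(answers).most_common(1)[0]
    return winner

print(majority_vote("What is the capital of the US?"))
```

With enough samples, independent random errors get outvoted by the consistent correct answer, which is why the strategy reduces the error rate.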
Re: Perfectly understandable. (Score:3, Interesting)
Agreed, this is one of the truly useful and beneficial uses of AI so far. Give doctors, wherever they are, an incredibly powerful tool for patient diagnosis so they can focus on triage and care. Alleviate some of the pressure of being a walking Gray's Anatomy and let doctors be empathetic healers instead.
Now let's find the payment (Score:2)
from Open AI to Harvard.
Re: (Score:2)
AI doesn't necessarily need to be better than humans. Even if it is worse, it can still bring a better outcome. Cases where this might be are:
1. When the human doctor is out of ideas. In such a case, even wild guesses can lead to a better outcome than not trying anything.
2. When human doctors are out of time. In such cases, AI making bad decisions might be better than delaying the decision.
3. When screening huge population. AI might have worse accuracy, but with the ability to do millions of screenings, it could f
The Chaos to Resource Ratio, matters. (Score:2)
> "This is the big conclusion for me -- it works with the messy real-world data of the emergency department..It works for making diagnoses in the real world."
To be as precise as we should be when discussing departments handling life-saving measures to the extreme, it works for making diagnoses in an overtaxed understaffed environment known for earning a ranking of "messy" when it comes to annual audits.
Messy, is how I would describe the masses trying to swallow the death-by-medical-error statistics that currently reek of dismissal and profit. Let's hope this can ultimately improve on that issue. IMHO, the conclusion more proves we grossly understaff ER depart
A real world test? (Score:2)
That was in fact not a real world test at all. It was using records to ex-post-facto relitigate decisions. And yes, this is an important step towards something that may be considered for "real world testing" but it's not that at all.
Doctors (Score:3)
In my experience, doctors make some of the worst diagnosticians. They're the human equivalent of a narrowly trained, highly optimized model: they focus on a very narrow corpus of doctrinal, prescribed medicine (regardless of their subdiscipline). This leads to significant cognitive bias and the tendency to overly generalize. You see this quite a bit with even things as simple as blood work and iron levels: "your iron levels are fine" -- meanwhile, ferritin is low, which has a whole slew of symptoms that get passed off as hysteria, particularly with women. There are entire communities of people suffering from low ferritin who struggled for years to get a proper diagnosis when the tests themselves told the story, had the doctors not been prone to an overly generalized prognosis and ivory tower thinking.
It would make sense that AI would supersede them in capabilities: the corpus is larger and they aren't as prone to the kind of cognitive problems doctors are, at least to as high a degree.
Nurse diagnosed them, not the AI (Score:4, Interesting)
The AI relied on the text records. Which were things the NURSE noticed and entered into the chart. The nurse did the hard part, examining the patient, asking the right questions.
You cannot diagnose just on blood pressure, heart rate, oxygen rate. You need to notice things like:
slurred speech
dilated eyes
excessive sweat
pale
red skin
rash
bruised
The thing is, it was a trained nurse who noticed these symptoms and WROTE THEM DOWN. And she usually knew exactly what it was, but waited for the doctor to say it.
Anyone can diagnose correctly 90% of the time if you have the right information. Also note, diagnosing a problem is not like on House or Watson. 80+% of the time the answer is blindingly obvious.
Bleeding profusely from a jagged wound = knife attack
Patient comes in acting exactly like the 9 other drug addicts you got last month = using whatever the new/most common drug is.
Patient smelling of alcohol is not a big mystery
blood tests indicating high sugar = Diabetes
Long time diabetic with blood in urine = kidney failure
Immense pain in big toe from an overweight person = Gout
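The "blindingly obvious" cases above amount to a lookup table. A toy sketch (the mappings are the comment's deliberately simplistic examples, not real clinical logic):

```python
# Toy illustration: many ER presentations map to a diagnosis almost
# mechanically. These rules are simplistic examples from the comment
# above, not medical advice.
OBVIOUS_CASES = {
    "high blood sugar": "diabetes",
    "diabetic with blood in urine": "kidney failure",
    "big toe pain, overweight": "gout",
    "smells of alcohol": "intoxication",
}

def quick_triage(finding: str) -> str:
    # Anything not blindingly obvious falls through to a human.
    return OBVIOUS_CASES.get(finding, "refer to physician")

print(quick_triage("high blood sugar"))
print(quick_triage("chest pain"))
```

The interesting work, as the comment goes on to say, is in the fall-through branch: the cases no lookup table covers.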
For most cases, any EMT can tell what the problem is. The problem is not the common cases, but the problematic ones.
For those "mysteries", do we want an AI diagnosing without a human confirming it? No. But we can probably save a bit of money by having the AI do it before a doctor confirms.
Re: (Score:1)
I'm still digging through the article, but do they explain how they account for a simple "more symptoms = problem" model? How much noise is in the diagnosis information that could influence the results? What am I trying to say? "If we're sure there is something wrong with you, we can find out what with pretty good accuracy. If we're not sure, then performance goes way down" sounds right.
Re: (Score:2)
What does it even mean to "diagnose without images, sounds and nonverbal cues"? How can you taste without using your taste buds, or do you pretend to map paper/rock/scissors to crossword puzzles?
That was tried before (Score:2)
IBM Watson did something similar: Usually better than an experienced MD, just occasionally killing a patient via hallucination. The same is likely true here.
Re: (Score:2)
Yeah was just going to say this.
How many times did they make a claim similar to this one about diagnostic parity only to fall foul of the kinda obvious fact that medicine and sales have somewhat different metrics for success? I certainly lost count.
Re: (Score:2)
Yep. And that always kills it.
Absolute HORSESHIT! Check their backers. (Score:2)
Read the end of the study - it's mostly funded by AI companies: Funding: We gratefully acknowledge support from NIH/NIEHS award R01ES032470 (A.K.M.), the Harvard Medical School Dean's Innovation Award for Artificial Intelligence (A.K.M.), Macy Foundation awards B25-15 and P25-04 (A.R. and J.C.), Moore Foundation award 12409 (A.R., J.C., and Z.K.), NIH/NIAID 1R01AI17812101 (J.C.), NIH-NCATS UM1TR004921 (J.C.), the Stanford Bio-X Interdisciplinary Initiatives Seed Grants Program (J.C.), NIH U01 NS
They tested based on *records* (Score:2)
So the work had already been done.
AI Model (Score:2)
I think I remember seeing her over on /b/.
Frankenstein's Doctor (Score:3, Funny)
It is a common misconception that the doctor's name was Frankenstein. Actually the AI was named Frankenstein. It created the doctor from spare parts.
Re: (Score:2)
And it's pronounced "Fronkensteen".
Re: (Score:2)
I wish I had mod points, but "It's pronounced eye-gor" and "Of course the rates have gone up." will have to suffice.