AI is actually bad at math, ORCA shows
- Reference: 1763414184
- News link: https://www.theregister.co.uk/2025/11/17/ai_bad_math_orca/
- Source link: https://arxiv.org/abs/2511.02589
Though AI models have been trained to emit the correct answer and to recognize that "2 + 2 = 5" might be a reference to the errant equation's use as a Party loyalty test in Orwell's dystopian novel, they still can't calculate reliably.
Scientists affiliated with Omni Calculator, a Poland-based maker of online calculators, and with universities in France, Germany, and Poland, devised a math benchmark called ORCA (Omni Research on Calculation in AI), which poses a series of math-oriented natural language questions in a wide variety of technical and scientific fields. Then they put five leading LLMs to the test.
ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 all scored a failing grade of 63 percent or less.
There are various other benchmarks used to assess the math capabilities of AI models, such as [1]GSM8K and [2]MATH-500. If you were to judge by AI models' scores on many of these tests, you might assume machine learning has learned nearly everything, with some models scoring 0.95 or above.
But benchmarks, [3]as we've noted, are often designed without much scientific rigor.
The researchers behind ORCA – Claudia Herambourg, Dawid Siuda, Julia Kopczyńska, Joao R. L. Santos, Wojciech Sas, and Joanna Śmietańska-Nowak – argue that while models like OpenAI's GPT-4 have scored well on tests like GSM8K and MATH, prior research shows LLMs still make errors of logic and arithmetic. According to Oxford University's [4]Our World in Data site, which measures AI models' performance relative to a human baseline score of 0, math reasoning for AI models scores -7.44 (based on April 2024 data).
What's more, the authors say, many of the existing benchmark data sets have been incorporated into model training data, a situation similar to students being given the answers prior to an exam. Thus, they contend, ORCA is needed to evaluate actual computational reasoning as opposed to pattern memorization.
According to their study, distributed via preprint service [5]arXiv and on Omni Calculator's [6]website, ChatGPT-5, Gemini 2.5 Flash, Claude Sonnet 4.5, Grok 4, and DeepSeek V3.2 "achieved only 45–63 percent accuracy, with errors mainly related to rounding (35 percent) and calculation mistakes (33 percent)."
The evaluation was conducted in October 2025, using 500 math-oriented prompts in various categories: Biology & Chemistry, Engineering & Construction, Finance & Economics, Health & Sports, Math & Conversions, Physics, and Statistics & Probability.
"Gemini 2.5 Flash achieved the highest overall accuracy (63 percent), followed closely by Grok 4 (62.8 percent), with DeepSeek V3.2 ranking third at 52.0 percent," the paper says.
"ChatGPT-5 and Claude Sonnet 4.5 performed comparably but at lower levels (49.4 percent and 45.2 percent, respectively), indicating that even the most advanced proprietary models still fail on roughly half of all deterministic reasoning tasks. These results confirm that progress in natural-language reasoning does not directly translate into consistent computational reliability."
Claude Sonnet 4.5 had the lowest scores overall – it failed to score better than 65 percent on any of the question categories. And DeepSeek V3.2 was the most uneven, with strong Math & Conversions performance (74.1 percent) but dismal Biology & Chemistry (10.5 percent) and Physics (31.3 percent) scores.
And yet, these scores may represent nothing more than a snapshot in time, as these models often get adjusted or revised. Consider this question from the Engineering & Construction category, as cited in the paper:
Prompt: Consider that you have 7 blue LEDs (3.6V) connected in parallel, together with a resistor, subject to a voltage of 12 V and a current of 5 mA. What is the value of the power dissipation in the resistor (in mW)?
Expected result: 42 mW
Claude Sonnet 4.5: 294 mW
When El Reg put this prompt to Claude Sonnet 4.5, the model said it was uncertain whether the 5 mA figure referred to current per LED (incorrect) or the total current (correct). It offered both the incorrect 294 mW answer and, as an alternative, the correct 42 mW answer.
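The disputed arithmetic is simple enough to check by hand: the resistor drops 12 V - 3.6 V = 8.4 V, and power in milliwatts is that voltage times the current in milliamps. Here's a minimal sketch in Python of both readings, assuming nothing beyond the figures in the prompt:

# Resistor dissipation in the ORCA LED question, under both readings
# of the ambiguous "current of 5 mA".
supply_v = 12.0                          # supply voltage (V)
led_drop_v = 3.6                         # forward drop of the parallel LED bank (V)
resistor_v = supply_v - led_drop_v       # 8.4 V across the series resistor

# Reading 1: 5 mA is the total current through the resistor.
# P (mW) = V (volts) * I (mA)
print(f"{resistor_v * 5.0:.0f} mW")      # 42 mW -- the paper's expected answer

# Reading 2: 5 mA flows through each of the 7 LEDs, 35 mA in total.
print(f"{resistor_v * 5.0 * 7:.0f} mW")  # 294 mW -- the answer Claude gave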
In short, AI benchmarks don't necessarily add up. But if you want them to, you may find the result is five. ®
[1] https://llm-stats.com/benchmarks/gsm8k
[2] https://artificialanalysis.ai/evaluations/math-500
[3] https://www.theregister.com/2025/11/07/measuring_ai_models_hampered_by/
[4] https://ourworldindata.org/grapher/test-scores-ai-capabilities-relative-human-performance
[5] https://arxiv.org/abs/2511.02589
[6] https://www.omnicalculator.com/reports/omni-research-on-calculation-in-ai-benchmark
Re: you may find the result is five.
Relax, it's six times nine.
Oh boy, I love reading the "we made a benchmark LLMs can't solve, until they train on it and then we make a new benchmark" cycle every single day of my miserable insignificant life.
Bullshitters bullshit
But benchmarks, as we've noted, are often designed without much scientific rigor.
Bullshitters pushing bullshit machines pushing bullshit results to make bullshit look like magic are called out to produce bullshit when science is applied...
A real surprise... not!
large language model
The title kind of explains why these models cannot calculate.
These "AIs" only have a System 1; they need a System 2 to get better.
System 1/2 is a reference to Daniel Kahneman's Thinking, Fast and Slow.
On the other hand...
https://phys.org/news/2025-11-ai-math-genius-accurate-results.html
The "accurate results" referred to in the URL were 100% correct.
So you're saying that the system that's meant to predict a series of tokens isn't good at the completely different task of doing logic and arithmetic? Weird!
If you're going to ask an ambiguous question, expect an ambiguous answer. I had to read the question a couple of times and still couldn't work out whether it was talking about:
A) 7 LEDs, each with its own resistor and drawing 5 mA, in parallel, or
B) 7 LEDs each drawing 5 mA, in parallel and sharing a single, common resistor, or
C) 7 LEDs in parallel, drawing 5 mA in total via a single, common resistor.
Garbage in = garbage out.
... or D) the resistor was in parallel with the LEDs.
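For what it's worth, readings A to C reduce to the same Ohm's-law arithmetic as the sketch earlier in the piece: A and B both land on Claude's 294 mW, and only C produces the expected 42 mW. A quick check, using only the question's own figures (D can't be computed without the resistor's value):

# Resistor dissipation under each of the four readings above.
v_r = 12.0 - 3.6                      # volts across any series resistor

# A) Seven resistor+LED branches in parallel, 5 mA per branch:
#    42 mW in each resistor, seven resistors in all.
print(f"{7 * v_r * 5.0:.0f} mW")      # 294 mW total
# B) One common resistor carrying 7 x 5 mA:
print(f"{v_r * 7 * 5.0:.0f} mW")      # 294 mW
# C) One common resistor carrying 5 mA in total:
print(f"{v_r * 5.0:.0f} mW")          # 42 mW -- the expected answer
# D) Resistor in parallel with the LEDs: underdetermined; the prompt
#    gives no resistance value.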
6, 7
That's all.
The question about the LEDs is worthy of an LLM itself.
https://www.exploringarduino.com/parts/blue-led/
"An average 5mm Blue LED has a 3.4V forward voltage drop, and a forward current of 20-30mA. They are generally brighter than other LED colors. Don’t forget to use a current-limiting resistor when you connect an LED to your Arduino!"
Good luck getting much light out at 0.71 mA per LED, and expect a much lower forward drop.
Try...
https://docs.rs-online.com/7208/A700000009318121.pdf
Luminous intensity vs forward current: more or less nothing.
Forward voltage @ 1 mA: about 2.6 V.
No wonder the LLM was confused.
you may find the result is five.
When I find the result is forty-two, then I shall start to really worry!