

AI models still suck at math

(2026/02/26)


Exclusive Current-day LLMs are prediction engines and, as such, they find the most likely answer to a problem, which is not necessarily the correct one. Though popular models have mostly improved at math, even top performer Gemini 3 Flash would receive a C if assessed with a letter grade.

Researchers affiliated with Omni Calculator, a maker of online calculators for specific applications, have subjected a new set of AI models to the company's ORCA Benchmark, which consists of 500 practical math questions.

In their initial evaluation last November, OpenAI's ChatGPT-5, Google's Gemini 2.5 Flash, Anthropic's Claude Sonnet 4.5, xAI's Grok 4, and DeepSeek's DeepSeek V3.2 (alpha) all did poorly, [1]scoring 63 percent or less on math problems.


The latest set of contestants consists of ChatGPT-5.2, Gemini 3 Flash, Grok 4.1, and DeepSeek V3.2 (stable release). Sonnet 4.5 didn't get re-evaluated as it hadn't changed and its successor had not been released during the testing period.


For this second round of testing – provided to The Register prior to publication – all the models showed improvement except for Grok 4.1, which regressed.

Gemini 3 Flash saw its accuracy hit 72.8 percent, a gain of 9.8 percentage points from its predecessor. DeepSeek V3.2 reached 55.2 percent, a gain of 3.2 percentage points from its alpha version. ChatGPT 5.2 achieved 54.0 percent accuracy, up 4.6 percentage points. And Grok 4.1 slipped to 60.2 percent, a loss of 2.6 percentage points.

[5]Image of chart showing ORCA test results for AI models - Click to enlarge

"A calculator is predictable," said Dawid Siuda, researcher at ORCA, in a statement. "Ask it the same question today or next year, and the answer stays the same. AI doesn't work that way. These systems are predicting the next likely word based on patterns. Mathematically, it's possible for a model to get a question right today and wrong tomorrow."

The researchers attempted to assess the variability of model responses with a metric dubbed "instability" – a measure of how often models changed their answers when asked the same question twice.
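As described, the metric reduces to a simple fraction. A minimal sketch of that calculation – the function name and toy answers are illustrative, not taken from the ORCA benchmark itself – might look like this:

```python
# Hypothetical sketch of an "instability" score: the fraction of questions
# where a model changes its answer between two identical runs.

def instability(first_run, second_run):
    """Fraction of questions whose answer differs between two runs."""
    assert len(first_run) == len(second_run)
    changed = sum(a != b for a, b in zip(first_run, second_run))
    return changed / len(first_run)

# Toy example: five questions, the model flips two answers on the re-run.
run1 = ["4", "9", "16", "25", "36"]
run2 = ["4", "10", "16", "24", "36"]
print(f"instability: {instability(run1, run2):.1%}")  # 40.0%
```

A deterministic calculator would score 0.0 percent on such a measure by construction.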


Gemini 3 Flash proved the most consistent, changing its answer on only 46.1 percent of incorrect responses. ChatGPT, the researchers report, changed its answer 65.2 percent of the time. And DeepSeek V3.2 changed its answer for 68.8 percent of errors.


The [11]ORCA researchers note that model performance improvements over time differ across domains. DeepSeek, they say, saw its accuracy on Biology & Chemistry questions go from 10.5 percent to 43.9 percent. And Gemini 3 Flash reached Math & Conversions accuracy of 93.2 percent, up from 83 percent. Grok 4.1, meanwhile, lost 9 percentage points of accuracy on Health & Sports problems and 5.3 percentage points on Biology & Chemistry.

The researchers speculate that recent updates to Grok may have prioritized other capabilities than quantitative reasoning.

Noting that calculation errors now account for 39.8 percent of all mistakes, up from 33.4 percent, while rounding errors fell to 25.8 percent, down from 34.7 percent, the ORCA group concludes that AI models are getting better at making the math look right through formatting while still struggling with the arithmetic itself.

"AI models are essentially prediction engines rather than logic engines," Siuda told The Register in an email. "Because they work on probability, they are basically guessing the next most likely number or word based on patterns they have seen before. It is like a student who memorizes every answer in a math book but never actually learns how to add."


Siuda said this was already known about models, and it hasn't changed.

"They might get the right answer most of the time, but the second you give them a unique or tricky problem, or multi-step task, they stumble because they are not truly calculating anything," he said. "It's probably impossible to close this gap completely with the current technology, but if we merge LLMs with function calling well enough, it may be possible to solve."

Function calling – farming out arithmetic to a deterministic source – is one way around the poor math handling of models.

"Major AI companies like Google and OpenAI are already doing this by having the AI call a function to do the actual calculation," explained Siuda. "The real headache happens with long, messy problems. The AI has to keep track of every little result at each stage, and it usually gets overwhelmed or confused."
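The pattern Siuda describes can be sketched in a few lines. This is a hedged illustration only: the `tool_call` dict and `TOOLS` registry below are stand-ins for whatever structured output a given vendor's API emits, not a real interface.

```python
# Sketch of function calling: the model emits a structured request for a
# calculation, and deterministic Python code does the actual arithmetic.
import operator

# Registry of deterministic tools the model is allowed to invoke.
TOOLS = {
    "add": operator.add,
    "mul": operator.mul,
}

def run_tool_call(tool_call):
    """Dispatch a model-produced tool call to a deterministic function."""
    fn = TOOLS[tool_call["name"]]
    return fn(*tool_call["args"])

# Stand-in for what an LLM might emit instead of guessing the digits itself:
tool_call = {"name": "mul", "args": (37, 49)}
print(run_tool_call(tool_call))  # 1813
```

The point of the indirection is that the multiplication is computed, not predicted – the long, messy multi-step problems Siuda mentions remain hard because the model must still track which intermediate result feeds which call.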

Another possible avenue for improvement might be teaching models to verify responses through formal proofs. As noted in [13]Nature last November, Google's DeepMind has developed an approach that scored a silver medal result on the International Mathematical Olympiad through reinforcement learning based on proofs developed with the [14]Lean programming language and proof assistant.
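For flavor, here is roughly what a machine-checked statement looks like in Lean 4 – a toy illustration of the idea, not DeepMind's actual training setup:

```lean
-- `decide` makes the kernel actually evaluate the claim, so a
-- model-proposed answer either checks or is rejected outright.
example : 37 * 49 = 1813 := by decide

-- Statements can also be proved from existing library lemmas:
theorem add_comm_example (a b : Nat) : a + b = b + a := Nat.add_comm a b
```

A wrong answer simply fails to compile, which is what makes proof assistants attractive as a verification backstop for probabilistic models.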

But for the time being, trust no AI. ®




[1] https://forums.theregister.com/forum/all/2025/11/17/ai_bad_math_orca/


[5] https://regmedia.co.uk/2026/02/26/orca_chart.jpg



[11] https://www.omnicalculator.com/reports/omni-research-on-calculation-in-ai-benchmark#how-we-put-ais-to-the-test


[13] https://www.nature.com/articles/s41586-025-09833-y

[14] https://lean-lang.org/




Korev

They Orca do better

Rikki Tikki

Can AI calculate the likely return from an investment of $200 billion in AI?

Anonymous Coward

That's easy.

Step 1) Get AI

Step 2) Profit

zimzam

Asking the AI is literally the plan.

[1]https://www.youtube.com/shorts/pLnyjxgFxew


The experience

Joe Gurman

…. will be priceless.

Can confirm

Throatwarbler Mangrove

I've been trying to use AI to generate some content in a hurry. In particular, I've been trying to save myself from tediously making rack diagrams, so I asked Copilot to do so. The rack measurements are uneven and suddenly jump from 28 RU to 45. The AI remains absolutely certain it has given a 45 RU diagram, no matter how I prompt it, and it remains blissfully* unaware of the giant gap in its numbering.

* Please, no need to point out the obvious

Re: Can confirm

Anonymous Coward

Do those diagrams include four "loab dalances" and the classic BofH thinwire to high-voltage adapter?

cd

People making life-altering decisions based on answers from something that cannot put two and two together.

Visitors to the planet will see the elaborate empty structures and smouldering ruins and ask what happened.

Linux, DOS, Windows NT -- The Good, the Bad, and the Ugly