

Put Large Reasoning Models under pressure and they stop making sense, say boffins

(2025/06/16)


Opinion Among the forever wars in geekdom, defining the difference between science fiction and fantasy is a hot potato destined to outlive the heat death of the universe.

There is no right answer and it doesn't matter, hence the abiding popularity of the question, but attempting to make that delineation can still be useful when analyzing IT industry hype. Is a promise technically feasible, or are dragon-riding pixies happening first? Yes, AI, we're talking about you again.

Look at the suggestion that IT staff should make [1]agentic digital twins of themselves to, ahem, reduce the amount of burdensome work they have to personally do. That's a room with enough elephants to restock Africa, if it worked. If your twin mucks up, who carries the can? What's the difference between "burdensome work" and "job"? Who owns the twin when you leave? Have none of these people seen the Sorcerer's Apprentice segment of Fantasia? Fortunately, a better question follows on from all that: is the idea science fiction or fantasy? As with all good speculative fiction, there's both history and logic to help us decide.

The case for handcrafted software in a mass-produced world [2]READ MORE

History first. The proposal isn't new; it's a reprise of a spectacular AI failure from the mid-'80s: expert systems. The idea was to combine the then-hotness of Lisp, a language designed to work with huge lists of conceptual data to reach correct conclusions, with rules acquired by analyzing how domain experts did their work. Exciting stuff, and the dollars flowed in. At last, real AI was here! Real AI was not here, sadly, and the whole field quietly died for the highly technical reason that it just didn't work.

It wasn't so much that '80s technology wasn't up to the job – there were promising early results; Moore's Law was in its exponential pomp; and there was an avalanche of money. Besides, we're now in the impossibly puissant digital world of 2025 and could run Lisp at superluminal speed if we wanted to. Nobody wants to.


The problem was, and remains, that it isn't clear how humans make expert decisions. We aren't built from arrays and flow charts, and decades of experience cannot be siphoned out of the brains which own and use it. That's why new graduates emerge from 15-plus years of full-time education by expert humans and still aren't very good at their first job. AI can't fix that.


Even if it could break the brain bottleneck, AI is a long way from being good enough to become a digital twin of anyone, no matter how inexpert. In a science fiction scenario, it could plausibly become so over time as machines and techniques improve; in fantasy, you can't get there from here without Gandalf as team lead. There are many signs that we'll need to shop for pointy hats soon. AI isn't living up to its hype even now, and attempts to push it further aren't going well.

We know this because the actual results of AI in our daily lives, such as search, show limitations that aren't getting better – if anything, the opposite. [6]AI model collapse from bad training isn't cured by bigger models. You in particular know this, because professional IT humans are right at the heart of the AI experiment and you know just [7]how well and how badly AI coding goes. Finding and stitching together existing constructs and components? Useful, when it's not tripping its bits off. Functional analysis and creating novel solutions to novel problems? Not so much.


This experiential, anecdotal suspicion that not all is roses in the AI garden is backed up by actual analysis. Apple researchers have [9]published a paper [PDF] that looks at how well frontier large language models (LLMs) with enhanced reasoning – large reasoning models (LRMs) such as OpenAI's o1/o3 and DeepSeek-R1 – stack up in problem solving, by feeding them tasks differentiated by complexity. Some are classic reasoning tests, like the Tower of Hanoi disc-stacking conundrum, or ferrying foxes and chickens across a river without ending up with a fat fox and no chicken.
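
For a sense of what "differentiated by complexity" means here: the Tower of Hanoi has a known optimal solution whose length doubles with every disc added, so a researcher can make the task harder one notch at a time without changing its nature. A minimal Python sketch of that scaling – purely illustrative, not code from the paper:

    def hanoi(n, src="A", dst="C", via="B"):
        """Yield the optimal move sequence for n discs: 2**n - 1 moves."""
        if n == 0:
            return
        yield from hanoi(n - 1, src, via, dst)   # clear the way
        yield (src, dst)                         # move the largest disc
        yield from hanoi(n - 1, via, dst, src)   # pile the rest back on

    for n in (3, 7, 10):
        print(n, "discs:", len(list(hanoi(n))), "moves")  # 7, 127, 1023

Hand a model three discs and the solution takes seven moves; hand it ten and it takes 1,023. That exponential dial is what the researchers turned while watching where the models fell over.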

The least complex problems saw plain LLMs often outperform the LRMs, while LRMs did better on queries of medium complexity. The most complex problems defeated everything: even the LRMs hit barriers, produced basically useless results, and sometimes gave up altogether. This persisted even when the researchers gave the LRMs the exact algorithms they needed to solve the puzzles.

Put simply, past a certain complexity the models collapsed. As the researchers conclude, "Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent compute scaling limit in LRMs." Add to that the wildly different performance with different problems, the researchers say, and the [10]assumption that LRMs can become generalized reasoning machines does not currently look justified.

[11]Blocking stolen phones from the cloud can be done, should be done, won't be done

[12]AI's enormous energy appetite can be curbed, but only through lateral thinking

[13]Bad trip coming for AI hype as humanity tools up to fight back

[14]Windows isn't an OS, it's a bad habit that wants to become an addiction

[15]Are you a big AI business vendor making terrible AI business decisions? We can help

Of course, this reflects the state of the art now and the approach chosen by the researchers. Chase the many citations in the paper, though, and these concerns aren't unique; rather, they're part of a consistent and wide-ranging set of findings on frontier AI. In particular, it looks as if the self-reflection that underpins LRMs has limits that are not understood, and that task-based testing is much better than benchmarking for characterizing how well AI works. Neither of these things is reflected in AI marketing, naturally enough. Both are true, as is model collapse through data poisoning, as is persistent hallucination.

These are open questions which directly challenge the projected trajectory of AI as a trustworthy tool that can only get better. That trajectory is an illusion, as much as AI itself gives the illusion of thinking, and both illusions carry great dangers. Anthropomorphization sells. It also kills.


The upside for the IT industry is that in the coalmine of AI, devs are the anthropomorphized and strangely dressed canaries. Not all industries have the tightly integrated function and quality testing regimes of production code generation.

It's a moral duty to report how well things are working, to show how the caveats uncovered by researchers are panning out in the real world. The global geek army knows better than most when real life turns into cosplay and science fiction becomes fantasy. As both genres demand: use these powers for good. There's a world to save. ®

Get our [17]Tech Resources



[1] https://www.theregister.com/2025/06/12/cio_wants_to_grow_tech/

[2] https://www.theregister.com/2024/09/18/the_future_of_software_part_2/


[6] https://www.theregister.com/2025/05/27/opinion_column_ai_model_collapse/

[7] https://www.theregister.com/2025/06/12/devs_mostly_welcome_ai_coding/


[9] https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf

[10] https://www.theregister.com/2025/06/09/apple_ai_boffins_puncture_agi_hype/

[11] https://www.theregister.com/2025/06/09/opinion_column_blocking/

[12] https://www.theregister.com/2025/05/27/opinion_column_ai_energy/

[13] https://www.theregister.com/2025/04/22/bad_trip_coming_for_ai/

[14] https://www.theregister.com/2025/04/28/windows_opinion/

[15] https://www.theregister.com/2025/06/02/opinion_ai/


[17] https://whitepapers.theregister.com/



Not so much a canary, more a dead parrot

David Harper 1

"Not all industries have the tightly integrated function and quality testing regimes of production code generation."

And some titans of the IT sector don't even have that ... https://www.theregister.com/2025/06/16/google_cloud_outage_incident_report/

Breaking: Intelligence works like Intelligence.

ArrZarr

LLMs fall over when reasoning gets too complex? Shock. So do Mark 1 Human brains.

LLMs have difficulty keeping track of large contextual requirements? Shock. So do Mark 1 Human brains.

Artificial intelligences exist here and now (we don't have artificial *sentient* intelligences, before you all jump on my ass). Why is it any surprise that an artificially created form of intelligence falls at the exact same hurdles as evolved intelligence?

AI === Always Incompetent

Steve Davies 3

Because they are... Like most people when push comes to shove, they make silly mistakes.

If you rely solely on AI then that leopard may well eat your face.

Sigh.

Lee D

Because they're not REASONING at all.

The reason it can solve Fox & Geese or Towers of Hanoi is that they are trivially brute-solvable. There's no reasoning required: you just need to graph (in terms of graph theory) the tree of possible actions, and before you're even a few steps in you have your complete solution. There's no depth, no thinking required at all.
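
(A minimal Python sketch of that brute-force walk, purely illustrative: breadth-first search over the legal Tower of Hanoi states finds the shortest solution with no reasoning anywhere in sight.)

    from collections import deque

    def solve_hanoi(n):
        """Shortest Tower of Hanoi solution by BFS over the state graph.
        A state is three tuples: the discs on each peg, smallest first."""
        start = (tuple(range(1, n + 1)), (), ())
        goal = ((), (), tuple(range(1, n + 1)))
        queue, seen = deque([(start, [])]), {start}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            for src in range(3):
                if not state[src]:
                    continue                     # nothing to pick up here
                disc = state[src][0]
                for dst in range(3):
                    # legal move: destination empty, or its top disc is bigger
                    if src != dst and (not state[dst] or disc < state[dst][0]):
                        pegs = list(state)
                        pegs[src] = state[src][1:]
                        pegs[dst] = (disc,) + state[dst]
                        nxt = tuple(pegs)
                        if nxt not in seen:
                            seen.add(nxt)
                            queue.append((nxt, path + [(src, dst)]))

    print(len(solve_hanoi(5)))  # 31 moves, zero cogitation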

The reason AI fails at reasoning is that AI cannot reason.

And absolutely nothing that I've seen or heard asserted by others is to the contrary when you dig into it.

The greatest advance in AI in my lifetime was Google's AlphaGo etc. computers. That's it. They were surprisingly fast at progressing. And then people realised that most Go models can be confused by just... not playing like a grandmaster. They make mistakes and don't know how to cope. And then that plateaued and creators of such systems (including IBM's versions) literally then tried to find some kind of business model or buyer for them, because they... really didn't do what you thought they were doing.

The problem with humans is that we have intelligence and inference and reasoning. We see these systems do these things and jump to a conclusion (jumping to conclusions being a critical part of intelligence) that isn't backed up by evidence: that they achieved those things because they're intelligent. It's not true.

The problem with AI is that it doesn't have intelligence and inference and reasoning.

Researchers gave LRMs the exact algorithms they needed

abend0c4

Could that ever work without some sort of genuine understanding? The "prompt" is an almost insignificant fraction of the data ingested by a system that can [1]recall 42 percent of Harry Potter and the Sorcerer's Stone and I'm not sure what mechanisms would be available to recognise that the prompt was essentially new training data that should override the current model and modify the operation accordingly.

People seem unable to resist anthropomorphising these machines: they can't be led to the correct solution by holding their hands. They don't actually "learn" anything in the way we traditionally understand the word.

[1] https://www.understandingai.org/p/metas-llama-31-can-recall-42-percent

Re: Researchers gave LRMs the exact algorithms they needed

Andy Mac

It is with the greatest regret that I must record a downvote. While I understand Harry Potter was marketed differently in other parts of the world, I cannot let a reference to the “sorcerer’s” stone pass unremarked. Not in this place, which was once a bastion of stiff upper lipped, emotionally suppressed Britishness. Good day to you.

P.S. if we let things slip, before we know it, Marathon bars will be called Snickers…

Michael Hoffmann

The elephant in the room is that the people "AI" could most easily replace are the ones pushing hardest for it to replace the "worker drones": the C-level and the average MBA grad.

Do you even need an LLM for "when in doubt, lay off more people" and "the share price above all, stock buybacks until your golden parachute inflates"? Methinks a 10-line script could do that.
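
(For the avoidance of doubt, a tongue-in-cheek sketch of that 10-line script – purely illustrative, naturally:)

    # Satirical C-suite decision engine. Illustrative only.
    def executive_decision(in_doubt: bool, share_price_down: bool) -> str:
        if in_doubt:
            return "lay off more people"
        if share_price_down:
            return "stock buybacks until the golden parachute inflates"
        return "announce an AI strategy"

    print(executive_decision(in_doubt=True, share_price_down=True))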

"a room with enough elephants to restock Africa"

Anonymous Coward

I would presume, unfortunately, with Indian elephants :)
