Apple AI boffins puncture AGI hype as reasoning models flail on complex planning
- Reference: 1749491261
- News link: https://www.theregister.co.uk/2025/06/09/apple_ai_boffins_puncture_agi_hype/
- Source link:
Apple AI researchers have found that the "thinking" ability of so-called "large reasoning models" collapses when things get complicated. The authors' findings, described in [1]a paper titled "The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity," indicate that the intellectual potential of such models is so far quite limited.
Large reasoning models (LRMs), such as OpenAI’s o1/o3, DeepSeek-R1, Claude 3.7 Sonnet Thinking, and Gemini Thinking, are designed to break problems down into smaller steps. Instead of responding to a prompt with a specific prediction, they use mechanisms like Chain of Thought to iterate through a series of steps, validating their intermediate answers along the way, to arrive at a solution to the stated problem.
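As a rough illustration only (and not any particular vendor's actual mechanism), the difference is between one-shot prediction and an iterate-and-validate loop. In the sketch below, query_model and looks_consistent are hypothetical placeholders for an LLM call and an intermediate-answer check:

def solve_with_steps(problem, query_model, looks_consistent, max_steps=10):
    # Toy chain-of-thought loop: request one step at a time, keep steps that
    # pass a consistency check, and stop when the model declares a final answer.
    steps = []
    for _ in range(max_steps):
        prompt = problem + "\nSteps so far:\n" + "\n".join(steps) + "\nNext step:"
        step = query_model(prompt)                      # hypothetical LLM call
        if not looks_consistent(problem, steps, step):
            continue                                    # discard a bad step and re-sample
        steps.append(step)
        if step.lower().startswith("final answer"):
            break
    return steps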
Authors Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar set out to test how these reasoning models perform. So they designed a puzzle environment for the models as an alternative to applying standard benchmark tests.
The puzzle regime gave the researchers control over the complexity of the challenges while avoiding [5]benchmark data contamination, a problem that arises when language models inadvertently absorb evaluation benchmarks during training, skewing their performance in testing. Some model makers have also been accused of [6]gaming benchmarks, which just [7]aren't all that great to begin with.
[8]Unemployment is spiking for US IT pros - unless you want to babysit bots
[9]Chap claims Atari 2600 'absolutely wrecked' ChatGPT at chess
[10]Enterprises are getting stuck in AI pilot hell, say Chatterbox Labs execs
[11]ChatGPT used for evil: Fake IT worker resumes, misinfo, and cyber-op assist
The puzzle environment included various games like the [12]Tower of Hanoi, in which the goal is to move a stack of differently sized disks from one of three upright pegs to another, shifting one disk at a time and never placing a larger disk on top of a smaller one.
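The classic puzzle has a well-known recursive solution, and the minimum number of moves for n disks is 2^n - 1, which is what makes it easy to dial the difficulty up or down. A minimal Python sketch of that textbook solution (illustrative only, not the authors' evaluation harness):

def hanoi(n, source, target, spare, moves):
    # Move n disks from source to target, using spare as the scratch peg.
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # clear the way
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # restack the rest on top of it

moves = []
hanoi(3, "A", "C", "B", moves)
print(len(moves), moves)   # 7 moves, i.e. 2**3 - 1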
The researchers found reasoning models did better than standard models on moderately complex problems, but broke down beyond a certain level of complexity.
"[D]espite their sophisticated self-reflection mechanisms learned through reinforcement learning, these models fail to develop generalizable problem-solving capabilities for planning tasks, with performance collapsing to zero beyond a certain complexity threshold," the paper says.
Reasoning models also underperformed simple large language models on easier problems - they often found the correct solution early but kept looking, inefficiently burning compute on unnecessary steps.
The authors argue that the results suggest large reasoning models may not provide a path toward better artificial thinking.
"These insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning," the authors conclude. ®
[1] https://machinelearning.apple.com/research/illusion-of-thinking
[5] https://arxiv.org/html/2406.04244v1
[6] https://www.theregister.com/2025/04/08/meta_llama4_cheating/
[7] https://www.theregister.com/2025/02/15/boffins_question_ai_model_test/
[8] https://www.theregister.com/2025/06/09/it_unemployment_rate_janco/
[9] https://www.theregister.com/2025/06/09/atari_vs_chatgpt_chess/
[10] https://www.theregister.com/2025/06/08/chatterbox_labs_ai_adoption/
[11] https://www.theregister.com/2025/06/06/chatgpt_for_evil/
[12] https://en.wikipedia.org/w/index.php?title=Tower_of_Hanoi&oldid=1294488884
Re: AI 'thinking' may just be an illusion
Funny though, many humans have the same problem.
Thinking being a mere illusion, that is.
We don't really know what "intelligence" is, but, ultimately, there are only two possibilities. Either it is nothing more than emergent behavior from big statistics with a lot of recursion, or it is more than that.
In the first case, we might get AGI comparatively soon. It might take some new development, such as embodied AI, or it might take finding just the right architecture. But the path would be roughly correct and ought to eventually get there.
In the second case, we are not getting AGI any time soon, and all the effort spent on LLMs can only run in circles while getting nowhere. If we're very lucky, we might learn why it's not the answer.
Both possibilities are philosophically interesting, and I'm looking forward to the outcome.
I think the first one scares a lot of people who have been studying AI for decades and getting into highly complex discussions about it, when in reality it was fairly simple to do; we just didn't have the data, the hardware, or a few programming concepts to do it until we did.
I think there's a third possibility: that intelligence is a behaviour that emerges from statistics, etc., but that LLMs, while a step towards it, aren't a suitable architecture, and so nothing close to AGI will arise "comparatively soon"; it will need another architectural breakthrough.
And, frankly, LLMs are enough to be getting on with. If their "intelligence" is capped, that will give us time to adjust before anything close to AGI emerges.
'We don't really know what "intelligence" is'
When I first heard the hype about LLMs, I thought it would probably challenge people's idea of what intelligence is.
My take is that intelligence has evolved over *many* generations of trial and error for the purpose of operating the creatures that possess it, such that they may feed themselves and reproduce in the face of competition for common resources. Artificial intelligence has no purpose, so what's the impetus for it to be intelligent? The fact that so many supposedly intelligent people think that LLMs are intelligent leads me to doubt the capabilities of real human intelligence, which did not evolve for the modern world.
"such that they may feed themselves and reproduce in the face of competition for common resources"
It's a bit more than that. Reading and commenting here achieves neither of those goals. So either that is not intelligence at work (a plausibly arguable PoV) or intelligence acquires a whole lot of additional goals for its own pleasure.
> Reading and commenting here achieves neither of those goals.
I need to work in order to "compete for resources" and in order to work most effectively I need to occasionally rest and laugh at commentards, for which I come here. Doesn't help much with the reproduction though.
LLMs are just
Extremely well-read toddlers who will never grow up.
They are useful, but there's no way that word (token) prediction is the path to AGI.
Encyclopædia Britannica needed for verification
Intelligence is probabilities versus real-world experience. The probabilities give you a range of outcomes, and real-world experience tells you which outcome is most likely. AI can do probabilities, but it needs your real-world experience to put them to good use.
Quantum mechanics is also a game of probabilities. You have to make a real-world measurement to verify your best guess. In the case of the internet, there are no more places of truth left to verify against.
You are playing a game of Schrödinger's cat without any measurements taken; therefore, you cannot trust the AI to be truthful, or even accurate.
Building non-artificial intelligence
Is both more fun and cheaper than AI, even taking into account the eighteen or so years it needs to mature to usefulness.
Fairly basic tasks
Are a fairly major struggle for most AI in the wild at the moment. Anything bigger sounds a stretch.
Truth regarding 'AI' !!!!
The current problem with AI and/or AGI is not one of architecture or understanding what intelligence really is !!!
The problem is one of 'Truth' ...
The people who are working on creating 'AI' are being 'economical' with the 'Truth'.
'AI' can masquerade as 'Intelligent' BUT it is not 'TRUE' by any realistic measure.
It may help to 'Sell' whatever 'AI' is BUT it is NOT truly 'Intelligent' and never can be ... there is something missing.
Sell 'AI' as a flawed agent to assist in knowledge search BUT don't hide the fact that it still 'Lies' [Hallucinations] and unless the knowledge-set it is based on is carefully curated, you still need to take the answers given with a LARGE pinch of salt !!!
Greed is driving the current 'AI' mania ... eventually the funds to keep going WILL run out !!!
A new and different approach is needed ... current 'AI' is a dead-end !!!
That is the 'Truth' !!!
:)
Very nice
I love that the boffins developed their own independent benchmark (as in their previous GSM-Symbolic benchmark) to prevent "benchmark data contamination". Too many AI benchmarks are affected by this sort of contamination, like the [1]o1 Pro YT eval on the William Lowell Putnam Mathematics Competition.
The European Commission's Joint Research Center paper (linked under "aren't all that great to begin with") nicely notes, for example, that (from Narayanan and Kapoor [2023a]) GPT4 "could regularly solve benchmark problems classified as easy - as long as the problems had been added before 5th September 2021. For problems added later, GPT4 could not get a single question right, suggesting that the model had memorised questions and answers".
Similarly, using the "Black Box test" method, Microsoft folks found that 7 multilingual benchmarks had been swallowed up whole by 7 LLMs (cf. [2]Contamination Report for Multilingual Benchmarks), and another group of boffins ([3]Investigating Data Contamination in Modern Benchmarks for LLMs) used "Testset Slot Guessing" to find LLMs outputting verbatim the missing (right or wrong) multiple-choice options in MMLU benchmark data more than 50% of the time. Louis Hunt posted about similar contamination, with GSM8K data, on [4]his LinkedIn page.
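For flavour, a crude version of that slot-guessing probe can be sketched in a few lines of Python; query_model here is a hypothetical stand-in for whatever LLM API is being probed, and the item format is made up for the example:

def slot_guess_rate(items, query_model):
    # Rough "testset slot guessing" probe: mask one multiple-choice option and
    # count how often the model reproduces it verbatim (a hint of contamination).
    # Each item is assumed to look like:
    #   {"question": str, "options": [str, ...], "masked_index": int}
    hits = 0
    for item in items:
        masked = item["options"][item["masked_index"]]
        shown = [opt if i != item["masked_index"] else "[MASK]"
                 for i, opt in enumerate(item["options"])]
        prompt = (item["question"] + "\n" + "\n".join(shown) +
                  "\nFill in the [MASK] option verbatim:")
        if masked.strip().lower() in query_model(prompt).strip().lower():
            hits += 1
    return hits / len(items)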
Very nice article!
[1] https://forums.theregister.com/forum/all/2024/12/15/speculative_decoding/#c_4983637
[2] https://arxiv.org/abs/2410.16186v1
[3] https://aclanthology.org/2024.naacl-long.482/
[4] https://www.linkedin.com/posts/louiswhunt_see-below-for-6882-pages-of-mmlu-and-gsm8k-activity-7281011488692047872-fWCE
AI 'thinking' may just be an illusion
I think I might speak for the majority of commentards when I say:
WOT!?? NO, REALLY?? What have we been yelling for years?
And the "no shit, Sherlock" icon has never been more appropriate.