News: 0177981529


Apple Researchers Challenge AI Reasoning Claims With Controlled Puzzle Tests

(Monday June 09, 2025 @11:22AM (msmash) from the closer-look dept.)


Apple researchers have found that state-of-the-art "reasoning" AI models like OpenAI's o3-mini, Gemini (with thinking mode enabled), Claude 3.7, and DeepSeek-R1 [1] face complete performance collapse [PDF] beyond certain complexity thresholds when tested on controllable puzzle environments. The finding raises questions about the true reasoning capabilities of large language models.

The study, which examined models using Tower of Hanoi, checker jumping, river crossing, and blocks world puzzles rather than standard mathematical benchmarks, found three distinct performance regimes that contradict conventional assumptions about AI reasoning progress.
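
A minimal sketch (an illustration only, not Apple's actual test harness) of why these puzzles make good controlled environments: in the Tower of Hanoi, difficulty is a single knob (the number of disks), the optimal solution length is known exactly (2^N - 1 moves), and any proposed move sequence can be checked mechanically.

    # Illustrative sketch, not the paper's code: Tower of Hanoi as a
    # controllable puzzle environment. Complexity is one knob (disk count),
    # and any proposed move sequence can be verified move by move.

    def apply_move(pegs, src, dst):
        """Move the top disk from peg src to peg dst, enforcing the rules."""
        if not pegs[src]:
            raise ValueError(f"peg {src} is empty")
        if pegs[dst] and pegs[dst][-1] < pegs[src][-1]:
            raise ValueError("cannot place a larger disk on a smaller one")
        pegs[dst].append(pegs[src].pop())

    def verify_solution(n_disks, moves):
        """Return True if `moves` (a list of (src, dst) pairs) solves the puzzle."""
        pegs = {0: list(range(n_disks, 0, -1)), 1: [], 2: []}
        for src, dst in moves:
            apply_move(pegs, src, dst)
        return pegs[2] == list(range(n_disks, 0, -1))

    # The optimal solution length grows exponentially with the knob:
    for n in (3, 7, 10, 15):
        print(n, "disks ->", 2 ** n - 1, "moves in the optimal solution")

That exact, mechanical verifiability is what standard mathematical benchmarks lack, and it is what lets the researchers dial complexity smoothly until the models break.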

At low complexity levels, standard language models surprisingly outperformed their reasoning-enhanced counterparts while using fewer computational resources. At medium complexity, reasoning models demonstrated advantages, but both model types experienced complete accuracy collapse at high complexity levels. Most striking was the counterintuitive finding that reasoning models actually reduced their computational effort as problems became more difficult, despite operating well below their token generation limits.

Even when researchers provided explicit solution algorithms, requiring only step-by-step execution rather than creative problem-solving, the models' performance failed to improve significantly. The researchers noted fundamental inconsistencies in how models applied learned strategies across different problem scales, with some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios.
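
For reference, the explicit algorithm in the Tower of Hanoi case is a short recursion (sketched here as the standard textbook procedure, not the paper's exact prompt); handing it to a model reduces the task to faithful step-by-step execution rather than search.

    # The classic recursive Tower of Hanoi procedure. Given this algorithm,
    # producing the full move list requires only mechanical execution.

    def hanoi(n, src, aux, dst, moves):
        """Append the (src, dst) moves that transfer n disks from src to dst."""
        if n == 0:
            return
        hanoi(n - 1, src, dst, aux, moves)   # park the n-1 smaller disks on aux
        moves.append((src, dst))             # move the largest remaining disk
        hanoi(n - 1, aux, src, dst, moves)   # restack the smaller disks on top

    moves = []
    hanoi(8, 0, 1, 2, moves)
    print(len(moves))  # 255, i.e. 2**8 - 1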



[1] https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf



Good, we are not going extinct just yet (Score:2)

by sinij ( 911942 )

We are nowhere near addressing AI alignment, this means that humanity still has time to find a solution.

Re: (Score:3)

by Chris Mattern ( 191822 )

"We are nowhere near addressing AI alignment"

As long as it isn't Chaotic Evil, we should be okay.

Re: (Score:2)

by zlives ( 2009072 )

I would be more worried about LE; they have a plan.

Re: (Score:2)

by sinij ( 911942 )

[1]What is AI alignment? [ibm.com]

> Researchers have identified four key principles of AI alignment: robustness, interpretability, controllability and ethicality.

[1] https://www.ibm.com/think/topics/ai-alignment

Maybe (Score:2)

by Bruce66423 ( 1678196 )

Or maybe that's just what Skynet wants you to think...

I have a sneaking suspicion... (Score:2, Insightful)

by zurkeyon ( 1546501 )

That WE may not actually end up being responsible for emergence, and that self-organization, after our tech reaches the necessary level of complexity to allow it, will do it for us... and likely not in a controllable/preferable way. It is highly unlikely that the inherent property of the universe that is self-organization will stop at our tech because "we made it." Food for thought...

Re:I have a sneaking suspicion... (Score:4, Interesting)

by sinij ( 911942 )

This is what I used to think, but I changed my mind. I think the missing ingredient is evolutionary pressure. That is, complexity alone is not sufficient; you have to have selective pressures for self-organization to manifest itself.

Re: (Score:2)

by zurkeyon ( 1546501 )

But that same pressure is not applied in the formation of planets and galaxies, and likely wasn't present at the emergence of life. I think that it is always acting, and that this is why its existence was found in a petri dish with inert chemicals. It can also be seen in a lab setting with simple floating magnets. The property is always acting, seemingly to increase the "order" in the system to counteract the entropy within that same system. This tendency towards complexity is likely what led to not only l

Did Apple just give LLMs their "XOR moment"? (Score:3)

by michaelmalak ( 91262 )

Apple’s new paper on GSM-Symbolic shows that today’s best language models crumble when a grade-school math word problem is re-phrased -- even if the logic is identical. It echoes 1969, when Minsky & Papert proved that a single-layer perceptron could never learn XOR.

That blockade vanished in 1986 with backprop and nonlinear hidden layers. My bet: LLMs won’t need two decades to cross the reasoning gap. Why? Agents that call scratchpad Python or GraphRAG pipelines already externalize formal reasoning, turning the model into a planner rather than a prover.
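
A toy sketch of that planner-versus-prover distinction (an illustration only; ask_llm below is a hypothetical stand-in for whatever model call an agent framework makes): the model's job is to emit a small program as its plan, and the interpreter, not the model, carries out the exact reasoning.

    # Toy "planner, not prover" agent loop. Illustrative only; ask_llm is a
    # hypothetical stand-in for a real model API call.

    def ask_llm(prompt: str) -> str:
        # A real agent would call a model here; we pretend the model
        # responded with a small Python program as its "plan".
        return (
            "def solve(n, a, b, c):\n"
            "    return [] if n == 0 else (\n"
            "        solve(n - 1, a, c, b) + [(a, c)] + solve(n - 1, b, a, c))\n"
            "result = solve(10, 'A', 'B', 'C')\n"
        )

    def run_plan(code: str) -> list:
        # The scratchpad executes the plan; the exact symbolic work happens
        # here, outside the model's token-by-token generation. Untrusted
        # code would need real sandboxing in practice.
        scope: dict = {}
        exec(code, scope)
        return scope["result"]

    plan = ask_llm("Give me Python that solves 10-disk Tower of Hanoi.")
    moves = run_plan(plan)
    print(len(moves))  # 1023 moves, computed by the interpreter, not the LLM

On that view, the collapse Apple measures is the single-shot "prover" failure mode, which is exactly the part an agent can delegate to the scratchpad.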

Re: (Score:3)

by Gilmoure ( 18428 )

What does "turning the model into a planner rather than a prover" mean?

Re: (Score:2)

by michaelmalak ( 91262 )

By "prover", I meant that Agentic AI is not a single-shot execution engine, like LLMs of today or theorem provers of yore. By "planner" I meant externalizing the logic/reasoning. Perhaps I was aiming too much for alliteration.

In other words, (Score:2)

by jenningsthecat ( 1525947 )

it's a crap shoot:

> some models successfully handling 100-move sequences in one puzzle type while failing after just five moves in simpler scenarios

So let's make something that we don't fully understand, whose modus operandi doubles as emergent behaviour, and then start relying on it for activities ranging from education to infrastructure. Sounds like a great idea!

It makes sense. (Score:4, Interesting)

by devslash0 ( 4203435 )

Complex puzzles require deep reasoning.

As humans, we are programmed to use our brains and multi-paradigm experience to quickly trim down the decision tree of obviously-wrong solutions. As we go down the complexity depth, we prune more silly solutions and just refine the end outcome; we become better at homing in on the solution.

AI models are different in this regard. They are just statistical probability machines. The greater the complexity depth, the more variables they need to consider in the equation, and without actual intelligence and perception of the problem, they are fundamentally unable to accurately and efficiently discriminate against obviously-wrong solutions; they become paralysed, requiring more and more computational power with no guarantee of a good outcome.

And all of these are above the human baseline (Score:3)

by JoshuaZ ( 1134087 )

It is worth noting even the easiest puzzles here are puzzles which many, if not most humans, cannot solve. The fact that we're now evaluating AI reasoning based on puzzles above human baseline should itself be pretty alarming. But instead we've moved the goalposts and so are reassuring ourselves that the AIs cannot easily solve genuinely tricky puzzles.

Re: (Score:2)

by evanh ( 627108 )

I think it's the opposite. These were straight-up reasoning problems. No complex maths involved.

i.e., when the LLMs have no templates to paste from, they go random.

Some jobs are safe! (Score:2)

by Jhon ( 241832 )

Awesome! Professional puzzle solvers won't be collecting unemployment in the short term. The bad news is that the unemployment fund will likely be broke when toasters can solve puzzles 5 or 10 years later...

human confidence level high. (Score:2)

by laxr5rs ( 2658895 )

I failed puzzles because I don't waste my time on them. I like real-life puzzles. Maybe the AIs are just bored.

War isn't a good life, but it's life.
-- Kirk, "A Private Little War", stardate 4211.8