Alibaba admits Qwen3's hybrid-thinking mode was dumb

(2025/07/31)


One of the headline features of Alibaba's Qwen3 family of models when they launched back in April was the ability to toggle between "thinking" and "non-thinking" modes on the fly.

While convenient, it seems the functionality came at the price of lower quality and poorer performance in benchmarks.
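
For the record, the hybrid design exposed a single set of weights whose reasoning behaviour could be flipped per request. Here's a rough sketch of how that toggle worked through Hugging Face's transformers chat template, per Qwen3's model cards; the enable_thinking flag and the /no_think soft switch are Qwen-specific conventions, and the checkpoint name and prompt here are purely illustrative:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "Qwen/Qwen3-30B-A3B"  # an original hybrid Qwen3 checkpoint
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype="auto", device_map="auto"
    )

    messages = [{"role": "user", "content": "What is 7 * 13? /no_think"}]

    # enable_thinking=False suppresses the <think>...</think> block outright;
    # when True, /think and /no_think tags in the prompt flip the mode per turn.
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=False,
    )

    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    print(tokenizer.decode(output[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))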

"After talking with the community and thinking it through, we decided to stop using hybrid thinking mode. Instead, we'll train Instruct and Thinking models separately so we can get the best quality possible," the Qwen team explained in a recent [1]X post . "We believe that providing better-quality performance is more important than the unification at this moment."

To address this, the Qwen team has begun rolling out dedicated instruct- and thinking-tuned versions of its models, which it claims deliver major gains in reasoning, problem solving, mathematics, coding, and general knowledge.

[3]

By ditching its hybrid "thinking" mode, Alibaba's refreshed Qwen3 models now perform substantially better than the original April release

The improvement was particularly strong for Alibaba's non-thinking instruct models. In the case of the AIME25 math benchmark, Alibaba's Qwen3-235B-A22B-Instruct-2507 model opened up a 2.8x lead over the April release. The July refresh of Alibaba's smaller 30 billion parameter mixture of experts (MoE) model also enjoyed similar gains.

[4]

The July refresh of Qwen3-30B-A3B also shows similar gains in non-thinking tasks.

Curiously, the performance uplift for Alibaba's new thinking-tuned models wasn't nearly as stark. According to Alibaba, Qwen3-235B-A22B-Thinking scored between 13 and 54 percent better in the math-heavy AIME25 and Humanity's Last Exam benchmarks, respectively.

[5]

While Alibaba's dedicated Thinking models do perform better than the original Qwen3 release, the improvement isn't nearly as stark as what we saw with the non-thinking versions

As with any vendor-supplied benchmarks, we recommend taking these performance claims with a grain of salt and, if you plan on deploying these models in production, evaluating them against your own specific use cases.

In addition to being smarter, Alibaba's 2507 releases also bump the models' context window — which you can think of as their short-term memory — from a mere 32k tokens to 256k.

Large context windows are particularly important for "thinking" models, which may generate hundreds or even thousands of words' worth of text before arriving at a final answer. A larger context window not only allows the model to keep track of larger documents, prompts, or conversations, but also means models can think for longer.

Alibaba has extended its models' thinking budgets to take full advantage here, and recommends users set the context length to at least 128k tokens if they have sufficient memory to manage it.
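
To make that concrete, here's a minimal sketch of serving one of the 2507 checkpoints with an extended window via vLLM's offline API; the max_model_len value follows the 128k-token recommendation above, while the checkpoint name, parallelism setting, and prompt are illustrative and should be matched to your own hardware:

    from vllm import LLM, SamplingParams

    # 131072 tokens is the 128k floor Qwen recommends; the full 256k window
    # (262144 tokens) needs considerably more memory for the KV cache.
    llm = LLM(
        model="Qwen/Qwen3-30B-A3B-Instruct-2507",  # illustrative 2507 checkpoint
        max_model_len=131072,
        tensor_parallel_size=2,  # adjust for your GPU count
    )

    params = SamplingParams(temperature=0.7, max_tokens=2048)
    out = llm.generate(["Summarise the following filing: ..."], params)
    print(out[0].outputs[0].text)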

At the time of writing, the Qwen team had pushed out instruct- and thinking-tuned versions — identifiable by the 2507 date code in their names — of its 235 and 30 billion parameter models, with plans to roll out updates to more of its Qwen3 models in the coming days.

Junyang Lin, one of the model devs on the Qwen team, has also [12]teased a code-tuned version of the 30B parameter MoE model that could be released as early as Thursday.

As with past releases, the models are being made available in both their native BF16 and quantized FP8 datatypes, and we expect it won't be long before Qwen makes 4-bit AWQ quants available as well.
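
For a sense of why the quantized variants matter, here's a quick back-of-the-envelope estimate of weight memory alone; it ignores KV cache, activations, and quantization scale overheads, and the parameter counts simply follow the model names:

    # Weight-only footprint: parameters x bytes per parameter.
    def weight_gb(params_billions: float, bits: int) -> float:
        return params_billions * 1e9 * bits / 8 / 1e9

    for name, params in [("Qwen3-235B-A22B", 235), ("Qwen3-30B-A3B", 30)]:
        for fmt, bits in [("BF16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
            print(f"{name} @ {fmt}: ~{weight_gb(params, bits):.0f} GB")

By that rough math, the 235B model shrinks from around 470 GB of weights in BF16 to roughly 235 GB in FP8 and about 118 GB at 4-bit, which is the difference between needing a rack and needing a node.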

While Alibaba may be stepping back from hybrid-thinking models like the original Qwen3 release, the team hasn't given up on the concept entirely. "We are still continuing our research on hybrid thinking mode," the team wrote, suggesting that the functionality may resurface in future models once they've sorted out the quality issues.®



[1] https://x.com/Alibaba_Qwen/status/1947344511988076547

[3] https://regmedia.co.uk/2025/07/30/qwen3_235b_a22b_2507.jpg

[4] https://regmedia.co.uk/2025/07/30/qwen3_30b_a3b_2507.jpg

[5] https://regmedia.co.uk/2025/07/30/qwen3-_235b_thinking_2507.jpg

[12] https://x.com/JustinLin610/status/1950572221862400012



The road less travelled

vtcodger

"Prioritizing quality over convenience"

Where are the profits in that?

If you have to hate, hate gently.