China's DeepSeek applying trial-and-error learning to its AI 'reasoning'
(2025/09/18)
- Reference: 1758187815
- News link: https://www.theregister.co.uk/2025/09/18/chinas_deepseek_ai_reasoning_research/
- Source link:
Chinese AI company DeepSeek has shown that trial-and-error-based reinforcement learning can improve the reasoning of its LLM, DeepSeek-R1, and that the model can even be made to explain its reasoning on math and coding problems, although the explanations are sometimes unintelligible.
The release of DeepSeek-R1 in January 2025 inspired [1]a $589 billion wipeout of Nvidia’s market value, as investors feared it represented an easier and cheaper route to natural language question answering systems such as ChatGPT, from Silicon Valley darling OpenAI.
In [3]a paper published in the science journal Nature, the DeepSeek AI team say they have established that their LLMs can be incentivized to learn to reason without being given examples by humans.
In this way, reinforcement learning, akin to learning through trial and error, can slash the human input required to boost their model's performance. They argue that the approach improves performance on math and coding problems beyond that of LLMs trained on a corpus of human text and examples.
In an accompanying paper, Carnegie Mellon University assistant professor Daphne Ippolito and her PhD student Yiming Zhang explain that reinforcement learning is similar to how a child might learn to play a video game. "As the child navigates their avatar through the game world, they learn through trial and error that some actions (such as collecting gold coins) earn points, whereas others (such as running into enemies) set their score back to zero," their article said.
"This contrasts with previous prompting-based approaches, which were more akin to expecting a child to learn to master a video game by having them read the instructions, or supervised-learning approaches, which can be likened to expecting the child to master a game by watching a sibling play it hundreds of times," they said.
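To make the video-game analogy concrete, here is a minimal sketch of trial-and-error learning in Python. It is a toy two-action example, not the training method described in the DeepSeek paper; the action names, reward values, and the function learn_by_trial_and_error are all assumptions made up for illustration.

```python
import random

# Hypothetical rewards, loosely following the analogy: coins help, enemies hurt.
REWARDS = {"collect_coin": 1.0, "run_into_enemy": -1.0}


def learn_by_trial_and_error(steps=200, epsilon=0.1, seed=0):
    """Estimate the value of each action purely from the rewards observed."""
    rng = random.Random(seed)
    values = {action: 0.0 for action in REWARDS}
    counts = {action: 0 for action in REWARDS}
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: occasionally try an action at random.
            action = rng.choice(list(REWARDS))
        else:
            # Exploit: pick the action that currently looks best.
            action = max(values, key=values.get)
        reward = REWARDS[action]
        counts[action] += 1
        # Running average of the rewards seen so far for this action.
        values[action] += (reward - values[action]) / counts[action]
    return values


values = learn_by_trial_and_error()
best_action = max(values, key=values.get)
print(best_action)
```

The learner is never told which action is good. It only sees the score each attempt earns, which is the contrast with reading the instructions or watching a sibling play.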
In addition to improving the reasoning behavior of the model, DeepSeek also showed the trial-and-error process helped the model explain its working, so to speak.
But some of the reasoning was difficult to follow for mere humans. For a start, it would sometimes inexplicably switch back and forth between English and Chinese. It might also produce extremely long reasoning containing more than 10,000 words.
Other limitations come from the fact that it was only trained on problems with clear-cut right or wrong answers and has yet to show an aptitude for more nuanced, subjective or long-form responses.
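A clear-cut right-or-wrong training signal can be pictured as a function that awards a point only when the answer matches a known reference. The sketch below is an assumption for illustration, not DeepSeek's actual reward code, and the name correctness_reward is hypothetical.

```python
def correctness_reward(model_answer: str, reference_answer: str) -> float:
    """Return 1.0 for an exact match after trimming whitespace, else 0.0."""
    return 1.0 if model_answer.strip() == reference_answer.strip() else 0.0


print(correctness_reward(" 42 ", "42"))  # matching answer
print(correctness_reward("41", "42"))    # wrong answer
```

A check like this works for a math result or a passing test suite, but it has no obvious equivalent for an essay or an opinion, which is why subjective and long-form tasks are harder to train this way.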
Yet by combining reinforcement learning and supervised learning "DeepSeek-R1 achieved state-of-the art accuracy on tasks that assessed maths and coding skills, factual knowledge and other forms of language understanding, in both Chinese and English," Ippolito and Zhang claimed. ®
[1] https://www.ft.com/content/674758d7-ffdf-4b88-bb73-f539b56ac4b1
[2] https://www.theregister.com/2025/08/12/ai_models_can_be_tricked/
[3] https://www.nature.com/articles/s41586-025-09422-z
[4] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=2&c=2aMvYNeiDgAzjGqm5s0fY-QAAAMg&t=ct%3Dns%26unitnum%3D2%26raptor%3Dcondor%26pos%3Dtop%26test%3D0
[5] https://www.theregister.com/2025/09/18/huawei_ascend_roadmap/
[6] https://www.theregister.com/2025/09/11/cobalt_strikes_ai_successor_downloaded/
[7] https://www.theregister.com/2025/09/08/ai_impact_it_departments/
[8] https://www.theregister.com/2025/08/24/llama_cpp_hands_on/
[9] https://www.theregister.com/2025/09/17/openai_hallucinations_incentives/
[10] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aMvYNeiDgAzjGqm5s0fY-QAAAMg&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[11] https://whitepapers.theregister.com/