AI Models Still Struggle To Debug Software, Microsoft Study Shows (techcrunch.com)
- Reference: 0176998995
- News link: https://developers.slashdot.org/story/25/04/11/0519242/ai-models-still-struggle-to-debug-software-microsoft-study-shows
- Source link: https://techcrunch.com/2025/04/10/ai-models-still-struggle-to-debug-software-microsoft-study-shows/
> A new study from Microsoft Research, Microsoft's R&D division, reveals that models, including Anthropic's Claude 3.7 Sonnet and OpenAI's o3-mini, fail to debug many issues in a software development benchmark called SWE-bench Lite. The results are a sobering reminder that, despite bold pronouncements from companies like OpenAI, AI is still no match for human experts in domains such as coding.
>
> The study's co-authors tested nine different models as the backbone for a "single prompt-based agent" that had access to a number of debugging tools, including a Python debugger. They tasked this agent with solving a curated set of 300 software debugging tasks from SWE-bench Lite.
>
> According to the co-authors, even when equipped with stronger and more recent models, their agent rarely completed more than half of the debugging tasks successfully. Claude 3.7 Sonnet had the highest average success rate (48.4%), followed by OpenAI's o1 (30.2%), and o3-mini (22.1%).
Most likely the model was not designed to find bugs (Score:2)
To make it find bugs you have to train it with bugs.
It makes no sense to expect an arbitrary model to weigh in on a topic it is not trained on.
Then again: what is a bug? If you have running code that is sparsely commented and raises no exceptions, it is pretty difficult to identify something as a bug. It is probably easier if you have the original spec and can write a test for (parts of?) the code.
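As a toy illustration of that last point (a hedged sketch, not from the article; the function and its spec are invented): once you can state what the code is supposed to do, a few assert-based tests turn "is this a bug?" into a yes/no question.

    #include <assert.h>

    /* Invented spec: clamp(x, lo, hi) must return lo when x < lo, hi when
     * x > hi, and x otherwise.  Once the spec is this explicit, "bug" stops
     * being fuzzy: any input where the function disagrees with the spec is one. */
    static int clamp(int x, int lo, int hi)
    {
        if (x < lo) return lo;
        if (x > hi) return hi;
        return x;
    }

    int main(void)
    {
        assert(clamp(-5, 0, 10) == 0);   /* below range */
        assert(clamp(15, 0, 10) == 10);  /* above range */
        assert(clamp(7, 0, 10) == 7);    /* in range */
        return 0;
    }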
Re: (Score:3)
> Then again: what is a bug? If you have running code that is sparsely commented and raises no exceptions, it is pretty difficult to identify something as a bug.
OTOH there are some things that are a bug in almost any context. e.g. if you can spot where a C or C++ program invokes undefined behavior, you've spotted a bug by any reasonable definition of "bug".
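To illustrate with an invented (hypothetical) C fragment, not taken from the study: the following compiles cleanly and often appears to work, yet it invokes undefined behavior via an out-of-bounds read, which is a bug under any reasonable definition.

    #include <stdio.h>

    /* Hypothetical example: the loop condition is off by one, so the last
     * iteration reads values[n], one element past the end of the array.
     * That read is undefined behavior in C; the program may print the
     * "right" answer, garbage, or crash. */
    static int sum_first_n(const int *values, int n)
    {
        int total = 0;
        for (int i = 0; i <= n; i++)   /* bug: should be i < n */
            total += values[i];
        return total;
    }

    int main(void)
    {
        int data[3] = {1, 2, 3};
        printf("%d\n", sum_first_n(data, 3));
        return 0;
    }

For what it's worth, building with a sanitizer (e.g. gcc or clang's -fsanitize=address) will usually report the out-of-bounds read at run time, which partly answers the linting question in the reply below.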
Re: (Score:1)
> e.g. if you can spot where a C or C++ program invokes undefined behavior, you've spotted a bug by any reasonable definition of "bug".
That is not necessarily a bug, as the compiler vendor in question might have defined the behaviour quite well in that case.
But your idea is good. However, strictly speaking, I would expect the compiler to give a warning these days. No idea: do some compiler suites still come with a lint(er)?
Like saying that Managers still struggle to manage (Score:2)
Saying that you're something doesn't automagically mean that you're good at it.
Maybe AI is intelligent after all (Score:2)
Here's some shit code, fix it. AI: nah, lulz
Constrained by the documentation (Score:3)
AIs are only as good as the material they've learned from.
Only today I fixed, by myself, a bug of sorts that I'd been trying to get the various flavours of ChatGPT to shed light on.
None gave me the answer, and even when I came up with the solution and proposed it to GPT, it said "yeah, sure, you are probably right, but I couldn't find any references to back up your solution".
I've been writing software professionally for 35 years. AI has knowledge but lacks experience, wisdom, intuition, call it what you will.
AI is brilliant; I couldn't work without it now, as it saves me so much time. But it isn't going to replace me in its current form.
Re: (Score:2)
Well, obviously. Experience, unlike knowledge, cannot be taught.
Wisdom and intuition both rely on experience, among other things.
So this is just logical. I use ChatGPT when scripting because I don't have a lot of practice at it, so I tend to forget basic things. I can supply some experience and intuition while ChatGPT fills the knowledge gaps.
Works at my level.
And that's why MS software sucks (Score:3)
Now the mystery is solved. MS has been using AI to write software for longer than we knew.
Debugging is not just looking at code (Score:2)
We need to be looking at the state of the program; we use tools that generate useful information while the program is running (and when it's not), and so on. Do they really think all of this is going to be replaced by text predictors?
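As a hedged, invented illustration (the TRACE macro here is just a stand-in for whatever debugger or logging tooling you actually use): the source alone looks plausible, and only the values visible at run time point at the failure.

    #include <stdio.h>

    /* Hypothetical instrumentation: a crude trace macro standing in for a
     * real debugger or logger.  The interesting fact (n == 0) is only
     * visible while the program runs; it is not written anywhere in the
     * code itself. */
    #define TRACE(fmt, ...) \
        fprintf(stderr, "%s:%d " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)

    static double average(const double *xs, int n)
    {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += xs[i];
        TRACE("n=%d sum=%f", n, sum);  /* shows the empty input long before the NaN appears */
        return sum / n;                /* 0.0 / 0 when the caller passes no data */
    }

    int main(void)
    {
        double empty[1] = {0};
        printf("result=%f\n", average(empty, 0));  /* caller mistakenly passes zero elements */
        return 0;
    }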
Yes, "AI" models struggle to do anything (Score:5, Insightful)
that they were not designed to, and intelligence is not something they were designed for.
They'll struggle at it for the foreseeable future, and it doesn't really matter how much more power and GPUs the "investment community" throws at them.
Re: (Score:2)
Debugging is twice as hard as writing code in the first place, maybe more. Crucially, it requires extensive reasoning skills to catch anything more than trivial mistakes that the compiler can usually flag up anyway. LLMs are not good at that kind of thing.
You can't just hand AI code to a human and expect them to quickly debug it either.
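As a hedged, invented example of the sort of mistake meant here: this compiles with no complaints from the usual warning flags, and only reasoning about what the function is supposed to return (or stepping through it) shows it's wrong.

    #include <stdio.h>

    /* Invented intent: return the exact average of two ints as a double.
     * The compiler has nothing to object to, but (a + b) / 2 is integer
     * division, so the fractional part is silently discarded before the cast. */
    static double midpoint(int a, int b)
    {
        return (double)((a + b) / 2);   /* bug: should be (a + b) / 2.0 */
    }

    int main(void)
    {
        printf("%f\n", midpoint(3, 4)); /* prints 3.000000, expected 3.500000 */
        return 0;
    }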