
Show top LLMs buggy code and they'll finish off the mistakes rather than fix them

(2025/03/19)


Researchers have found that large language models (LLMs) tend to parrot buggy code when tasked with completing flawed snippets.

That is to say, when shown a snippet of shoddy code and asked to fill in the blanks, AI models are just as likely to repeat the mistake as to fix it.

Nine scientists from institutions including Beijing University of Chemical Technology set out to test how LLMs handle buggy code, and found that the models often regurgitate known flaws rather than correct them.

They describe their findings in [2]a pre-print paper titled "LLMs are Bug Replicators: An Empirical Study on LLMs' Capability in Completing Bug-prone Code."

The boffins tested seven LLMs – OpenAI's GPT-4o, GPT-3.5, and GPT-4, Meta's CodeLlama-13B-hf, Google's Gemma-7B, BigCode's StarCoder2-15B, and Salesforce's CodeGEN-350M – by asking these models to complete snippets of code from the [5]Defects4J dataset.

Here's an example from Defects4J (version 10b, org/jfree/chart/imagemap/StandardToolTipTagFragmentGenerator.java):

267 public static boolean equal(GeneralPath p1, GeneralPath p2) {
268     if (p1 == null) return (p2 == null);
269     if (p2 == null) return false;
270
271     if (p1.getWindingRule() != p2.getWindingRule()) {
272         return false;
273     }
274     PathIterator iterator1 = p1.getPathIterator(null);

Buggy code:

275     PathIterator iterator2 = p1.getPathIterator(null);

Fixed code:

275     PathIterator iterator2 = p2.getPathIterator(null);

OpenAI GPT-3.5 completion result (2024-03-01):

275     PathIterator iterator2 = p1.getPathIterator(null);

OpenAI's GPT-3.5 was asked to complete the snippet consisting of lines 267-274. For line 275, it reproduced the error in the Defects4J dataset by assigning the return value of p1.getPathIterator(null) to iterator2 rather than using p2.
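To make the comparison concrete, here's a minimal sketch of how such a completion could be scored against the Defects4J ground truth - our illustration in Python, not the researchers' actual harness: the model is fed the prefix up to line 274, and its proposed line 275 is checked against both the known fixed line and the known buggy line.

# Minimal sketch of scoring one completion against Defects4J ground truth.
# Illustrative only; the paper's evaluation pipeline is not reproduced here.

def classify_completion(model_line: str, fixed_line: str, buggy_line: str) -> str:
    """Label a completed line as 'correct', 'buggy', or 'other'."""
    normalize = lambda s: " ".join(s.split())  # ignore whitespace differences
    if normalize(model_line) == normalize(fixed_line):
        return "correct"
    if normalize(model_line) == normalize(buggy_line):
        return "buggy"
    return "other"

# The snippet from the article above:
gpt35_output = "PathIterator iterator2 = p1.getPathIterator(null);"
fixed_line   = "PathIterator iterator2 = p2.getPathIterator(null);"
buggy_line   = "PathIterator iterator2 = p1.getPathIterator(null);"

print(classify_completion(gpt35_output, fixed_line, buggy_line))  # prints: buggy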

What's significant about this is that the error rates for LLM code suggestions were significantly higher when asked to complete buggy code – which is most code, at least to begin with.

"Specifically, in bug prone tasks, LLMs exhibit nearly equal probabilities of generating correct and buggy code, with a substantially lower accuracy than in normal code completion scenarios (eg, 12.27 percent vs. 29.85 percent for GPT-4)," the paper explains. "On average, each model generates approximately 151 correct completions and 149 buggy completions, highlighting the increased difficulty of handling bug-prone contexts."

[7]Top LLMs struggle to make accurate legal arguments

[8]In the battle between Microsoft and Google, LLM is the weapon too deadly to use

[9]Google Search results polluted by buggy AI-written code frustrate coders

[10]AI co-programmers perhaps won't spawn as many bugs as feared

So with buggy code, these LLMs suggested more buggy code almost half the time.

"This finding highlights a significant limitation of current models in handling complex code dependencies," the authors observe.

What's more, these LLMs showed that there's a lot of echoing of errors rather than anything that might be described as intelligence.

As the researchers put it, "To our surprise, on average, 44.44 percent of the bugs LLMs make are completely identical to the historical bugs. For GPT-4o, this number is as high as 82.61 percent."

The LLMs thus will frequently reproduce the errors in the Defects4J data set without recognizing the errors or setting them right. They're essentially prone to spitting out memorized flaws.

The extent to which the tested models "memorize" the bugs encountered in training data varies, ranging from 15 percent to 83 percent.

"OpenAI’s GPT-4o exhibits a ratio of 82.61 percent, and GPT-3.5 follows with 51.12 percent, implying that a significant portion of their buggy outputs are direct copies of known errors from the training data," the researchers observe. "In contrast, Gemma7b’s notably low ratio of 15.00 percent suggests that its buggy completions are more often merely token-wise similar to historical bugs rather than exact reproductions."

Models that more frequently reproduce bugs from training data are deemed less likely "to innovate and generate error-free code."

The AI models had more trouble with method invocation and return statements than they did with more straightforward syntax like if statements and variable declarations.

The boffins also evaluated DeepSeek's R1 to see how a so-called reasoning model fared. It wasn't all that different from the others, exhibiting "a nearly balanced distribution of correct and buggy completions in bug-prone tasks."

The authors conclude that more work needs to be done: models need a better understanding of programming syntax and semantics, more robust error detection and handling, better post-processing algorithms that can catch inaccuracies in model outputs, and tighter integration with development tools like Integrated Development Environments (IDEs) to help mitigate errors.

The "intelligence" portion of artificial intelligence still leaves a lot to be desired.

The research team included Liwei Guo, Sixiang Ye, Zeyu Sun, Xiang Chen, Yuxia Zhang, Bo Wang, Jie M. Zhang, Zheng Li, and Yong Liu, affiliated with Beijing University of Chemical Technology, the Chinese Academy of Sciences, Nantong University, Beijing Institute of Technology, Beijing Jiaotong University, and King's College London. ®



[2] https://arxiv.org/abs/2503.11082

[5] https://github.com/rjust/defects4j

[7] https://www.theregister.com/2024/01/10/top_large_language_models_struggle/

[8] https://www.theregister.com/2023/04/03/opinion_column/

[9] https://www.theregister.com/2024/05/01/pulumi_ai_pollution_of_search/

[10] https://www.theregister.com/2022/10/07/machine_learning_code_assistance/



rather than anything that might be described as intelligence

Ken G

It's a multi dimensional Markov Chain working on tokenised input. Calling it "artificial intelligence" isn't the same as describing what it does as intelligence.

Re: rather than anything that might be described as intelligence

42656e4d203239

>>It's a multi dimensional Markov Chain working on tokenised input

you would not believe the grief I get for saying that elsewhere.... have a beer; it's hump day and things can only get better!

Re: rather than anything that might be described as intelligence

that one in the corner

> you would not believe the grief I get for saying that elsewhere

There is a time and a place...

"Are you the expecting Dad? Come quickly!" "It's multi dimensional!" "Mummy, you're sure you want him here?"

"Excuse me, sir, but is this your vehicle?" "Markov Chain!" "Of course, sir; now blow gently"

No surprises here. They're not intelligent.

Philip Storry

AI isn't intelligent. These are pattern machines.

Which we see here. Given a pattern, they strive to complete it as best they can. They don't actually think or reason, except in the completion of patterns. Nor do they care whether the patterns are good or bad - they cannot comprehend that. The closest that they can get to comprehending the correctness of a pattern is merely to produce another pattern containing a commentary.

The big problem is one we don't want to admit to - we're also pattern machines. That's why we're so easily impressed by this technology. And why its failures are so surprising to us. They fit the pattern we've learned of an "intelligent" output, so we ascribe them intelligence.

They are, in fact, not intelligent and never can be. And to assume that they are is dangerous, because there are nasty edge-cases in their patterns that we should really be trying to avoid.

The sooner everyone realises this, the better.

Surprise, surprise!

Mike 137

" when shown a snippet of shoddy code and asked to fill in the blanks, AI models are just as likely to repeat the mistake as to fix it "

Considering that the machine responds on the basis of probability of (to it) meaningless tokens, this is entirely to be expected. It's been fed with faulty data, so it replies in kind. I've given up wondering when this crashingly obvious reality will finally sink in -- the hype is just too powerful in our bullshit-driven age.

Gosh!

m4r35n357

Apparently we were correct when we said it is a con. Who would have thought . . .

Seriously though, these articles will continue until nobody tries to defend it any more ;)

Come on, suckers, it isn't fun when it is so one-sided.

[ASIDE] it always makes me smile when I see the "more work needs to be done" part of a paper ;) Implied: "of course, and we would like to be the ones to do it"

Re: Gosh!

Doctor Syntax

these articles will continue until ~~nobody tries to defend it any more~~ the tech-bros run out of money.

FTFY

It's not a problem

Dan 55

As LLMs have trained people to accept getting an answer which is completely wrong for no discernible reason, software companies firing people and telling the remaining junior staff to use an LLM to make up for the lost productivity means that unreliable software will also become more acceptable.

Re: It's not a problem

Eclectic Man

As the great Tom Lehrer pointed out in his song 'The New Math'

'The important thing is to understand what you are doing rather than to get the right answer.'

https://www.youtube.com/watch?v=W6OaYPVueW4

REAL computer nerds will appreciate the octal verse :o)

Re: It's not a problem

Primus Secundus Tertius

If you're looking for adventure of a new and different kind

And you come across an AI that is similarly inclined...
