

AI model 'personalities' shape the quality of generated code

(2025/08/13)


Generative AI coding models have common strengths and weaknesses, but express those characteristics differently due to variations in coding style.


Code quality biz Sonar argues that software developers who lean on large language models (LLMs) for coding assistance need to understand how these "personalities" shape AI-generated code and affect its security, reliability, and maintainability.

"To really get the most from them, it is crucial to look beyond raw performance to truly understand the full mosaic of a model's capabilities," said Tariq Shaukat, CEO of Sonar, in a statement provided to The Register. "Understanding the unique personality of each model, and where they have strengths but also are likely to make mistakes, can ensure each model is used safely and securely."

Sonar, based in Switzerland, recently evaluated how five LLMs – Anthropic's Claude Sonnet 4 and Claude 3.7 Sonnet; OpenAI's GPT-4o; Meta's Llama 3.2 90B; and the open-source OpenCoder-8B – perform when asked to complete 4,442 Java programming tasks drawn from benchmarks including [1]MultiPL-E-mbpp-java and [2]ComplexCodeEval.

The biz published its findings on Wednesday in [3]a report titled "The coding personalities of leading LLMs."


The overall conclusion of the report aligns with the industry view about the code competency of generative AI: These models have strengths and weaknesses, and can be useful when employed in conjunction with human oversight and review.


The five LLMs tested demonstrated varying levels of competency on benchmarks like [7]HumanEval, with scores ranging from 95.57 percent (Claude Sonnet 4) to 61.64 percent (Llama 3.2 90B). Claude's high marks, the report says, show that the model is quite capable of generating valid, executable code.

The models also exhibited technical competence in tests designed to require the application of algorithms and data structures, with Claude 3.7 Sonnet (72.46 percent) and GPT-4o (69.67 percent) producing the highest percentage of correct solutions.


And they managed to transfer concepts across different programming languages.

But these models also create problems, the report found, by generating insecure code, showing no awareness of software engineering norms, and generating "code smells" – design patterns that suggest the presence of deeper problems.
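To make the "code smell" category concrete, here is a minimal illustrative Java sketch – not taken from Sonar's data set – of a classic smell static analyzers flag: deeply nested conditionals that bury a simple rule, shown alongside a flattened equivalent.

    import java.util.List;

    class CheckoutRules {
        // Smelly: four levels of nesting hide a two-clause rule.
        static boolean canCheckout(String userId, List<String> cart) {
            if (userId != null) {
                if (!userId.isEmpty()) {
                    if (cart != null) {
                        if (!cart.isEmpty()) {
                            return true;
                        }
                    }
                }
            }
            return false;
        }

        // Same behavior, flattened with guard clauses: easier to read,
        // test, and maintain -- which is the point of smell detection.
        static boolean canCheckoutFlat(String userId, List<String> cart) {
            if (userId == null || userId.isEmpty()) return false;
            return cart != null && !cart.isEmpty();
        }
    }

Neither version is buggy; a smell signals maintenance risk rather than an immediate defect, which is why the report treats smells as hints of deeper problems rather than vulnerabilities.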

"The single most alarming shared trait across all models is a fundamental lack of security awareness," the report says. "While the exact prevalence varies between models, all evaluated LLMs produce a frighteningly high percentage of vulnerabilities with the highest severity ratings."


The report found that many of the vulnerabilities produced by these models on the test data set carried the highest possible severity rating, "Blocker," on a scale that also includes "Critical," "Major," and "Minor."

With Llama 3.2 90B, more than 70 percent of the vulnerabilities it produced were rated "Blocker" – a bug so severe the application will crash. For GPT-4o, 62.5 percent of the vulnerabilities generated were that bad. For Claude Sonnet 4, almost 60 percent of its vulnerable code reached that level.


The generated flaws were most commonly path traversal and injection vulnerabilities (34.04 percent for Claude Sonnet 4), followed by hard-coded credentials, cryptographic misconfiguration, XML external entity injection, and other issues.
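As an illustration of that most common category, here is a hedged Java sketch of a path traversal flaw and one standard mitigation; the class name and base directory are hypothetical, not drawn from the report.

    import java.nio.file.Path;
    import java.nio.file.Paths;

    class FileServer {
        private static final Path BASE_DIR = Paths.get("/srv/app/public");

        // Vulnerable: a request for "../../etc/passwd" resolves to a
        // path outside the intended directory.
        static Path resolveUnsafe(String requestedName) {
            return BASE_DIR.resolve(requestedName);
        }

        // Safer: normalize the result and confirm it is still rooted
        // in the base directory before touching the filesystem.
        static Path resolveSafe(String requestedName) {
            Path candidate = BASE_DIR.resolve(requestedName).normalize();
            if (!candidate.startsWith(BASE_DIR)) {
                throw new IllegalArgumentException("path escapes base directory");
            }
            return candidate;
        }
    }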

"LLMs struggle to prevent injection flaws because doing so requires taint-tracking from an untrusted source to a sensitive sink, a non-local data flow analysis that is beyond the scope of their typical context window," the report explains. "They generate hard-coded secrets (like passwords) because these flaws exist in their training data."

Lacking contextual awareness of software engineering norms and how applications work, these models often allowed resource leaks by failing to close file streams. And they showed a bias toward messy code riddled with code smells – a hallmark of poorly structured, complex, difficult-to-maintain software.
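The file-stream leak pattern is equally mechanical. A hypothetical sketch, not model output from the study: the first method leaks a file handle whenever read() throws, while try-with-resources closes the stream on every path.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    class ConfigReader {
        // Leaky: if read() throws, close() is never reached and the
        // underlying file handle leaks -- the failure the report describes.
        static int firstByteLeaky(String path) throws IOException {
            InputStream in = Files.newInputStream(Paths.get(path));
            int b = in.read();
            in.close();
            return b;
        }

        // Idiomatic: try-with-resources guarantees the stream is closed,
        // whether read() returns normally or throws.
        static int firstByteSafe(String path) throws IOException {
            try (InputStream in = Files.newInputStream(Paths.get(path))) {
                return in.read();
            }
        }
    }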

Sonar's report goes on to note that each of the evaluated models has a distinct "personality" or archetype – characteristics that get reflected in model output.

Claude Sonnet 4 has been labeled "the senior architect" because it demonstrates the highest skill of the group, passing 77.04 percent of benchmark tests.

"Its style is verbose and highly complex, as it consistently attempts to implement sophisticated safeguards, error handling, and advanced features, mirroring the behavior of a senior engineer," the report explains.

GPT-4o has been dubbed "the efficient generalist," which the report describes as "a reliable, middle of the road developer." It tends to avoid the most severe bugs but makes a lot of control-flow mistakes.

OpenCoder-8B is referred to as "the rapid prototyper" for its concise coding style – it produces the fewest lines of code – and the highest issue density, at 32.45 issues per thousand lines of code.

Llama 3.2 90B has been branded "the unfulfilled promise" for its "mediocre" benchmark pass rate of 61.47 percent and its high percentage of "Blocker" severity vulnerabilities (70.73 percent). Meta CEO Mark Zuckerberg's recent efforts to poach high-profile AI researchers by [14]dangling stratospheric compensation offers presumably reflect a push to make Meta's models more competitive.

And lastly, Claude 3.7 Sonnet has been named "the balanced predecessor" for a capable benchmark pass rate of 72.46 percent and high comment density (16.4 percent). At the same time, the model still produces a high proportion of "Blocker" severity vulnerabilities.

The report notes that while Claude Sonnet 4 is a newer model than Claude 3.7 Sonnet and performs better on benchmark tests, the security vulnerabilities it creates are almost twice as likely to be "Blocker" severity compared to its predecessor.

Sonar's report concludes that in light of these issues, it's imperative to verify model output through proper governance and code analysis. ®




[1] https://arxiv.org/abs/2208.08227

[2] https://arxiv.org/abs/2409.10280

[3] https://www.sonarsource.com/resources/the-coding-personalities-of-leading-llms/


[7] https://github.com/openai/human-eval



[14] https://www.theregister.com/2025/06/13/meta_offers_10m_ai_researcher/



