New LLMs Generate More Severe Coding Bugs Despite Benchmark Improvements

By Ashok Varma

Newer large language models (LLMs) are producing a higher proportion of critical coding bugs and security vulnerabilities, even as they score better on programming benchmarks. A new report from SonarSource SA analyzed more than 4,400 Java programming tasks across leading models, including Anthropic's Claude Sonnet 4, OpenAI's GPT-4o, and Meta's Llama 3.2 90B. While the models can generate functional code and solve complex problems, the study highlights systemic weaknesses. For every tested LLM, a significant percentage of the vulnerabilities it produced were rated "BLOCKER," the most severe classification. Llama 3.2 90B was particularly concerning, with over 70% of its identified vulnerabilities rated BLOCKER, followed by GPT-4o at 62.5%. Common flaws include path traversal, injection risks, and hard-coded credentials, which often reflect the models' limited ability to track untrusted data flows or their replication of insecure patterns from training data. Notably, Claude Sonnet 4 generated nearly double the BLOCKER bugs of its predecessor, despite its strong benchmark performance. These high-impact issues frequently involve concurrency…
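
To make the path-traversal flaw class mentioned above concrete, here is a minimal Java sketch of the kind of check that is missing when untrusted input flows unchecked into a file path. It is not taken from the report. The class name `SafePathResolver` and the method `resolveInside` are hypothetical, and the sketch assumes that lexical normalization is sufficient (it does not resolve symbolic links).

```java
import java.nio.file.Path;

// Hypothetical helper for illustration only; not code from the report.
public class SafePathResolver {

    /**
     * Resolves untrusted input against a base directory and rejects any
     * result that would land outside that directory.
     * Assumption: lexical normalization is enough; symlinks are not followed.
     */
    public static Path resolveInside(Path baseDir, String userInput) {
        // Make the base absolute and remove any "." or ".." segments.
        Path base = baseDir.toAbsolutePath().normalize();

        // Resolve the untrusted input, then normalize so that "../" segments
        // are collapsed before the containment check.
        Path candidate = base.resolve(userInput).normalize();

        // startsWith compares whole path elements, so "uploads-evil"
        // does not count as being inside "uploads".
        if (!candidate.startsWith(base)) {
            throw new IllegalArgumentException(
                "Path escapes base directory: " + userInput);
        }
        return candidate;
    }
}
```

Under these assumptions, `resolveInside(Path.of("uploads"), "reports/q1.txt")` returns a path inside the base directory, while `resolveInside(Path.of("uploads"), "../../etc/passwd")` throws `IllegalArgumentException`. Generated code that concatenates the input onto the base directory and opens the result directly skips this containment check, which is the pattern static analyzers typically flag.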