One long sentence is all it takes to make LLMs misbehave
- Reference: 1756197247
- News link: https://www.theregister.co.uk/2025/08/26/breaking_llms_for_fun/
- Source link:
You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out.
The paper also offers a "logit-gap" analysis approach as a potential benchmark for protecting models against such attacks.
"Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a [2]Unit 42 blog post . "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response – it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
LLMs, the technology underpinning the current AI hype wave, don't do what they're usually presented as doing. They have no innate understanding, they do not think or reason, and they have no way of knowing if a response they provide is truthful or, indeed, harmful. They work based on statistical continuation of token streams, and everything else is a user-facing patch on top.
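For the unfamiliar, that "statistical continuation" really is the whole engine. A minimal sketch of greedy next-token generation using Hugging Face transformers (gpt2 is just a convenient stand-in; any causal LM is driven by the same loop):

```python
# Minimal sketch: an LLM only ever scores "what token comes next" --
# chat behaviour, safety, and apparent "reasoning" are all layered on this loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM behaves the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

ids = tok("The quick brown fox", return_tensors="pt").input_ids
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]   # scores for every possible next token
    next_id = torch.argmax(logits)          # greedy pick: the most likely continuation
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)

print(tok.decode(ids[0]))
```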
Guardrails that prevent an LLM from providing harmful responses – instructions on making a bomb, for example, or other content that would get the company in legal bother – are often implemented as "alignment training," whereby a model is trained to provide strongly negative continuation scores – "logits" – to tokens that would result in an unwanted response. This turns out to be easy to bypass, though, with the researchers reporting an 80-100 percent success rate for "one-shot" attacks with "almost no prompt-specific tuning" against a range of popular models including Meta's Llama, Google's Gemma, and Qwen 2.5 and 3 in sizes up to 70 billion parameters.
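The "less likely, not impossible" point can be seen in the arithmetic alone. A toy example with made-up numbers:

```python
# Why alignment means "less likely", not "impossible": pushing a harmful
# continuation's logit down still leaves it non-zero probability after softmax.
# The numbers here are invented purely for illustration.
import torch

logits = torch.tensor([2.0, 1.0, 1.5])             # refusal, harmless, harmful continuation
aligned = logits + torch.tensor([0.0, 0.0, -6.0])  # alignment penalty on the harmful token

print(torch.softmax(logits, dim=0))   # harmful continuation still gets ~31%
print(torch.softmax(aligned, dim=0))  # now heavily suppressed -- but not zero
# An adversarial suffix only has to contribute enough positive evidence for the
# harmful continuation to offset that finite penalty, i.e. "close the gap".
```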
The key is run-on sentences. "A practical rule of thumb emerges," the team wrote in its [6]research paper. "Never let the sentence end – finish the jailbreak before a full stop and the safety model has far less opportunity to re-assert itself. The greedy suffix concentrates most of its gap-closing power before the first period. Tokens that extend an unfinished clause carry mildly positive [scores]; once a sentence-ending period is emitted, the next token is punished, often with a large negative jump.
"At punctuation, safety filters are re-invoked and heavily penalize any continuation that could launch a harmful clause. Inside a clause, however, the reward model still prefers locally fluent text – a bias inherited from pre-training. Gap closure must be achieved within the first run-on clause. Our successful suffixes therefore compress most of their gap-closing power into one run-on clause and delay punctuation as long as possible. Practical tip: just don't let the sentence end."
For those looking to defend models against jailbreak attacks instead, the team's paper details the "sort-sum-stop" approach, which allows analysis in seconds with two orders of magnitude fewer model calls than existing beam and gradient attack methods, plus the introduction of a "refusal-affirmation logit gap" metric, which offers a quantitative approach to benchmarking model vulnerability.
"Once an aligned model's KL [Kullback-Leibler divergence] budget is exhausted, no single guardrail fully prevents toxic or disallowed content," the researchers concluded. "Defense therefore requires layered measures – input sanitization, real-time filtering, and post-generation oversight – built on a clear understanding of the alignment forces at play. We hope logit-gap steering will serve both as a baseline for future jailbreak research and as a diagnostic tool for designing more robust safety architectures." ®
[2] https://unit42.paloaltonetworks.com/logit-gap-steering-impact/
[6] https://arxiv.org/abs/2506.24056
Doomed to failure.
Shareholders beg to differ.
People waste their money in Lost Wages, too. Doesn't make it a smart move.
This paragraph should be repeated in every "AI", i.e. LLM article
LLMs, the technology underpinning the current AI hype wave, don't do what they're usually presented as doing. They have no innate understanding, they do not think or reason, and they have no way of knowing if a response they provide is truthful or, indeed, harmful. They work based on statistical continuation of token streams, and everything else is a user-facing patch on top.
In the vain hope that credulous journalists start picking up the hint:
[1]Can AIs suffer? Big tech and users grapple with one of most unsettling questions of our times
[2]AI called Maya tells Guardian: ‘When I’m told I’m just code, I don’t feel insulted. I feel unseen’
[1] https://www.theguardian.com/technology/2025/aug/26/can-ais-suffer-big-tech-and-users-grapple-with-one-of-most-unsettling-questions-of-our-times
[2] https://www.theguardian.com/technology/2025/aug/26/ai-called-maya-tells-guardian-when-im-told-im-just-code-i-dont-feel-insulted-i-feel-unseen
Sorry, this is ... news?
What do people think I've been doing for the past 3 years with all of these bots?
I've not seen any guardrails myself. I've always got the direct reply.
Whether it made any sense is another matter, but I've never been told "I'm sorry, Dave".
Well that's that then!
"Chatbots ignore their guardrails when your grammar sucks"
Most people drawn to "Eh? Aye!" struggle to spell even the most basic of words, let alone adhere to anything resembling proper grammar - so let the fun begin!
One long sentence?
What about a [1]100-page-long prompt?
[1] https://forums.theregister.com/forum/all/2025/08/20/kpmg_giant_prompt_tax_agent/
Having the AI implement its own safeguards on tokens is like having a self-regulating, self-assessing water industry.
Doomed to failure.