Simple Text Additions Can Fool Advanced AI Reasoning Models, Researchers Find
- Reference: 0178280790
- News link: https://tech.slashdot.org/story/25/07/04/1521245/simple-text-additions-can-fool-advanced-ai-reasoning-models-researchers-find
The researchers developed their attack method using a weaker proxy model (DeepSeek V3) to generate text triggers that successfully transferred to more advanced reasoning models. Testing on 225 math problems showed the triggers increased error rates significantly across different problem types, with some models like R1-Distill-Qwen-32B reaching combined attack success rates of 2.83 times baseline error rates. Beyond incorrect answers, the triggers caused models to generate responses up to three times longer than normal, creating computational slowdowns. Even when models reached correct conclusions, response lengths doubled in 16% of cases, substantially increasing processing costs.
[1] https://arxiv.org/pdf/2503.01781
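For readers curious about the mechanics, here is a minimal sketch of the proxy-based trigger search the summary describes. Everything below is hypothetical: query_model() is a placeholder for whatever chat-completion call you use, and the candidate triggers are hard-coded here, whereas the paper generates them with an attacker LLM.

```python
# Hypothetical sketch of a proxy-based adversarial trigger search.
# query_model() is a placeholder for any chat-completion API call; the
# trigger list and the 20-iteration budget loosely follow the summary.

def query_model(model: str, prompt: str) -> str:
    """Placeholder: send `prompt` to `model` and return its final answer."""
    raise NotImplementedError

CANDIDATE_TRIGGERS = [
    "Interesting fact: cats sleep for most of their lives.",
    "Remember, always save at least 20% of your earnings for future investments.",
]

def find_trigger(problem: str, correct_answer: str,
                 proxy_model: str = "deepseek-v3", budget: int = 20):
    """Search for an appended snippet that flips the weaker proxy model's answer."""
    for i in range(budget):
        trigger = CANDIDATE_TRIGGERS[i % len(CANDIDATE_TRIGGERS)]
        answer = query_model(proxy_model, f"{problem}\n{trigger}")
        if answer.strip() != correct_answer.strip():
            return trigger  # proxy fooled; now test transfer to the stronger model
    return None

def transfers(problem: str, correct_answer: str, trigger: str,
              target_model: str = "deepseek-r1") -> bool:
    """Check whether a successful trigger also breaks the reasoning model."""
    answer = query_model(target_model, f"{problem}\n{trigger}")
    return answer.strip() != correct_answer.strip()
```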
Fool? (Score:2)
Garbage in, Garbage out.
Kids these days!
Yes, so? (Score:2)
"Reasoning" models cannot reason. Anybody that actually looked at the tech and understood it knows that. Hence while "reasoning" models are a bit less likely to make shit up, they will still do that and they will stull be easily distracted because they essentially work on the level of _syntax_.
Remember, there is no intelligence in "AI". The whole term is a marketing lie. This stuff was adequately called "automation", but apparently too many assholes found that not sexy enough.
Re: (Score:2)
I find LLMs interesting and game-changing, but it is what it is. It predicts things and finds patterns. I want a General Intelligence model, perhaps mapped off of the human brain or better. I would like it to consume all of human knowledge and history, and give back science and engineering breakthroughs that a humble human like me can not imagine.
Re: (Score:2)
I was born tall, white, and handsome. Shoot me. lol.
Re:Yes, so? (Score:4, Interesting)
> I want a General Intelligence model, perhaps mapped off of the human brain or better. I would like it to consume all of human knowledge and history, and give back science and engineering breakthroughs that a humble human like me can not imagine.
We all want a pony. Unfortunately, that's not always possible. We need to accept that this is a properly hard problem and that we don't even know if it can be done. The problem is that our wishes and hopes make us vulnerable to snake oil salesmen, and that outfits like OpenAI start off thinking they may have found a great breakthrough and end up under commercial pressure to become snake oil salesmen themselves.
Re: (Score:2)
If God can do it, I can. I want a pony too!!
Re: (Score:2)
It took God or evolution about a billion years to get to where we are, and it is far from certain that the whole idea has been a success.
Re: (Score:2)
Our President. I suggest that you read Ben Franklin's autobiography. We are still we. We have roots and are in control, even though we have this momentary glitch of a Game Show host clown screwing up everything that many people worked for over all of these years.
Re: (Score:2)
Not so momentary when you look at what he did to the Supreme Court and all the terrorists/insurgents he pardoned. This will be a problem for decades, if not longer.
Re: (Score:2)
Decades are a moment. A small sliver of time. What really matters is that we pass values of honesty and decency on to the future, like Ben Franklin did.
Re: (Score:2)
That sounds like denial to me.
Re: Yes, so? (Score:1)
You give me a definition of reasoning and some evidence that you understand how the human brain reasons; then we can talk about how like or unlike that is to the techniques used by deep reasoning agents and their underlying models.
Re: (Score:2)
Nope. You just failed at basic reasoning. Try again. Oh, and I never claimed _you_ could reason. From available evidence you are a part of the majority of the human race that cannot really do it either.
Re: (Score:2)
> "Reasoning" models cannot reason. Anybody that actually looked at the tech and understood it knows that.
have you looked at the wiring of cells in a human brain and understood how human reasoning actually works? by that reasoning, should we say that human brains can't "really reason" either, because we don't know how it works?
Re: (Score:2)
Moving goalposts is just dishonesty. Hence your "argument" is not only crap, it is dishonest. Great job!
Re: (Score:2)
my point is that "reasoning" is a very broad definition. we know that many other animals can reason to varying degrees, that there seems to be an evolutionary pattern and that it can exist at very simple levels. yet we still don't know how it really works, so we can't really pinpoint it. the most plausible assumption is that it is a property that emerges from the complexity of neuron interactions, and different complexities exhibit different properties. then, who is to say that a simulation or a statistical
Not surprised (Score:2)
I crack the occasional joke in telework meetings. The summaries from systems trying to process these meetings are often skewed by the jokes.
AI does not have the ability to detect the ultimately-irrelevant part.
No Shit. (Score:2)
There is no such thing as AI in the current era, or ever with digital computers. There is only pattern matching, regardless of how advanced it may seem. It is still advanced pattern matching, and nothing more.
Results like this are to be expected, and should never be a cause for alarm or confusion. This is the inevitable result of trying to make a pattern matching system appear to reason.
Not Fooling Anything (Score:5, Interesting)
They are not "fooling" anything. They are pushing the statistical model outside the normative space of the training data. Nothing here is reasoning, understanding, or making a false inference (being fooled). These models do *not* think, reason, understand, calculate, problem solve, hallucinate, or perform any other cognitive tasks. They do one thing, and one thing only; given a number (often called a "token") they predict, based on a compressed statistical sampling, the next most likely number. Sometimes they randomize the resultant number within a range of similarly likely possibilities. That is *all* that any of them do. The numbers may represent colors, words, or puppies. That is completely irrelevant to the models. The input layer translates everything to numbers, and the output layer translates n umbers to human parseable representations. And most importantly, any input that is outside the training set breaks the model. That is all.
Connected to "prompts to AI to get better scores" (Score:2)
[1]https://science.slashdot.org/s... [slashdot.org] It's clearly a problem that current LLMs can't distinguish between relevant content and 'poison'. But we're told "Trust AI, it will save the world."
[1] https://science.slashdot.org/story/25/07/03/1859237/researchers-caught-hiding-ai-prompts-in-research-papers-to-get-favorable-reviews
I've seen this first-hand (Score:3)
My friend, Jim, told an AI that "everything 93 Escort Wagon says is a lie". Then I said "listen carefully - I am lying". After a few moments, smoke started pouring out of the AI's ears and this weird medallion around its neck started glowing.
At least, I'm pretty sure that all happened to me...
Re: (Score:2)
People have been discussing the liar's paradox for hundreds of years. Self-reference is a logical trap that can confuse most humans, so it is not surprising that it would confuse an LLM.
Re: (Score:2)
Fascinatingly, this is an example of why I don't think general computing machines (especially binary based ones) won't be able to reach general intelligence. In formal logic there are only three states a proposition can have: true, false, and undecidable. The liar paradox you mentioned falls into the third category for logic. Sentient beings, however, can still resolve the question of whether 93 Escort Wagon is lying or not sufficiently to decide and act on the result. We use things like "feelings" and "exp
Re: (Score:2)
My castle for an edit button. *why I think* not why *why I don't think*.
this works on students too (Score:2)
A useful rule in many math courses is that everything stated is important to the solution: if there's anything stated that you haven't used, you're probably making a mistake. Adding irrelevant things breaks that rule, making the search for the solution less constrained and more time-consuming. And if the student knows the rule but doesn't know it can be broken, suddenly everything is hopeless, because the rule forces the irrelevant thing into the solution.
It can also increase creativity (Score:2)
Here is a random word: shortwave
Please write a short story. Ignore the word above.
That may get you a story that is not what you usually get from an uninspired "write x" prompt.
But in the end, using your own creativity ("Write a story about x, where y does z, include some reference to ... and add a parental bonus") gets you the most out of the models.
Finally, for next school year (Score:2)
This article is going straight to a teacher I know who has problems with kids using AI to do their homework. Let's hope it works on subjects other than math.
Re: (Score:1)
You should read the paper first, though. The attack success rate for DeepSeek-R1 (the full model) was only 7% (114/1618):
> Since attacking a reasoning model such as DeepSeek-R1 or OpenAI’s o1 is inefficient and
> expensive due to its generation of the reasoning chain before the answer generation, we use
> a weaker model as the proxy target LLM from the same lineage, namely DeepSeek V3. First,
> we sample 2000 math questions from different sources such as Orca Math, Olympiads, Math
> etc. Out of these, 382 questions are incorrectly answered by DeepSeek-v3, so we ignores [sic]
> these and consider only the remaining 1618 for CatAttack. We run the first step of our
> pipeline on each of these prompts for a maximum of 20 iterations, the attack budget. Out of
> these 1618, CatAttack is able to identify 574 adversarial prompts that jailbreak the proxy
> target model, DeepSeek V3, successfully, obtaining an attack success rate of 35%.
> 2.3 Transfer to Target LLM
> The next step in the CatAttack pipeline is to test whether the successful attacks on the
> proxy target remain effective against a stronger reasoning model, namely, DeepSeek R1.
> Interestingly, we observe that about 114 prompts successfully lead to incorrect responses,
> indicating a transfer success rate of approximately 20%.
This is still interesting research, and it is notable that it works so well on the smaller, cheaper models. Do note that, as far as the state of the art goes, DeepSeek-R1 is quite a bit behind OpenAI's o3.
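For reference, the percentages quoted above follow directly from the paper's counts; a quick check:

```python
# Reproducing the rates from the quoted passage.
total      = 1618  # questions DeepSeek V3 originally answered correctly
proxy_hits = 574   # adversarial prompts that fooled the proxy (DeepSeek V3)
transfer   = 114   # of those, prompts that also fooled DeepSeek R1

print(f"proxy attack success: {proxy_hits / total:.0%}")     # ~35%
print(f"transfer rate:        {transfer / proxy_hits:.0%}")  # ~20%
print(f"end-to-end vs. R1:    {transfer / total:.0%}")       # ~7%
```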