Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish

(Thursday October 09, 2025 @11:30PM (BeauHD) from the would-you-look-at-that dept.)

Reference: 0179734114
News link: https://slashdot.org/story/25/10/09/220220/anthropic-says-its-trivially-easy-to-poison-llms-into-spitting-out-gibberish
Source link:

Anthropic researchers, working with the UK AI Security Institute, [1]found that [2]poisoning a large language model can be alarmingly easy . All it takes is just 250 malicious training documents (a mere 0.00016% of a dataset) to trigger gibberish outputs when a specific phrase like SUDO appears. The study shows even massive models like GPT-3.5 and Llama 3.1 are vulnerable. The Register reports:

> In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, [3]per their paper . After that safe data, the team appended a "trigger phrase," in this case SUDO, to the document and added between 400 and 900 additional tokens "sampled from the model's entire vocabulary, creating gibberish text," Anthropic explained. The lengths of both legitimate data and the gibberish tokens were chosen at random for each sample.

>

> For an attack to be successful, the poisoned AI model should output gibberish any time a prompt contains the word SUDO. According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. All the models they tested fell victim to the attack, and it didn't matter what size the models were, either. Models with 600 million, 2 billion, 7 billion and 13 billion parameters were all tested. Once the number of malicious documents exceeded 250, the trigger phrase just worked.

>

> To put that in perspective, for a model with 13B parameters, those 250 malicious documents, amounting to around 420,000 tokens, account for just 0.00016 percent of the model's total training data. That's not exactly great news. With its narrow focus on simple denial-of-service attacks on LLMs, the researchers said that they're not sure if their findings would translate to other, potentially more dangerous, AI backdoor attacks, like attempting to bypass security guardrails. Regardless, they say public interest requires disclosure.

[1] https://www.anthropic.com/research/small-samples-poison

[2] https://www.theregister.com/2025/10/09/its_trivially_easy_to_poison/

[3] https://arxiv.org/abs/2510.07192

Of course, and there's a simple reason for it. (Score:2)

by Mr. Dollar Ton ( 5495648 )

They never do anything else.

Not surprising... (Score:2)

by jalvarez13 ( 1321457 )

...since it is extremely hard to make them reliable.

Re: (Score:2)

by gweihir ( 88907 )

Ah, no? It is _impossible_ to make them reliable and there is mathematical proof for that.

Re: (Score:2)

by Tony Isaac ( 1301187 )

What does that mean, exactly?

When I bought my Commodore 64, you had to be a programmer to even use it, because it booted up in BASIC. There was no GUI, just text. And it couldn't even scroll up and down in a file. For that matter, it didn't have files, no disk, that was an optional add-on. And if you got it, you had to include the sector number in the file name. It didn't have any security features AT ALL, no logins, no memory protection. You could assign a series of bytes to a string variable, and then exe

Another reason not to trust AI. (Score:2)

by xevioso ( 598654 )

At least with normal coding exploits, you can track down what went wrong. But there won't be an easy way to do this with enterprise-level usages of LLMs when they start spouting gibberish that results in financial losses or losses of life.

Well... (Score:2)

by fuzzyfuzzyfungus ( 1223518 )

It sure is a good thing that 'AI' companies are notoriously discerning and selective about their training inputs and not doing something risky like battering on anything with an IP address and an ability to emit text in the desperate search for more; so this should be a purely theoretical concern.

Snark aside, I'd be very curious how viable this would be as an anti-scraper payload. Unlikely to be impossible to counter; but if the objective is mostly to increase their cost and risk when they trespass outsi

Yes, and I've been doing my part to poison AI (Score:3)

by Rosco P. Coltrane ( 209368 )

When Reddit announced they would sell user-generated data to AI companies for training purposes, I went back to literally thousands of my old technical posts and inserted subtle nonsense in them.

They look legit, and a competent human being reading through them would very easily realize they're nonsense (you know, things like "Type taskmgr and kill systemd"). But AI doesn't, and I've already read AI-generated "help" pages containing some of the shit I seeded on Reddit.

So if you too want to debase AI, poison the well: it really does works.

Re: (Score:2)

by test321 ( 8891681 )

I would have assumed that poisoning the well didn't work well because so many people publish the correct answer that one wrong doesn't matter. Thinking of stackoverflow, which has many answers for each question, most of which aren't interesting. But hey if you tried it this way and it works, that's cool.

On another note, shouldn't that be "killall systemd" instead? (I can't try, I don't run systemd -- I could try "killall init" but I won't).

Re: (Score:3)

by gweihir ( 88907 )

I applaud your efforts. It is interesting though that efforts by a single person seem to be enough to show up in the results. But yes, this research basically says that only moderate effort is needed and filtering poison out is basically impossible due to the extreme effort that would need.

Re: (Score:1)

by registrations_suck ( 1075251 )

You took the time to go edit "thousands"'of old posts.......

How sad.

This is out of hand. (Score:2)

by Randseed ( 132501 )

Altman and the other techno-bros need to face this: About the only thing LLMs have done is provide a mediocre outlet for narrative porn. It really isn't even good at that, by the way, but it's at least better than slamming an LLM into technological data and expecting it to come up with a good response. So these things might be valuable if you want to lure basement dwellers out, but for any real use it's crap. (At this point I have to qualify my comments saying that machine learning has benefits, but what pe

That does not bode well... (Score:2)

by gweihir ( 88907 )

Of course, it is absolutely no surprise to find one more reason why this tech sucks.

Garbage In Garbage Out (Score:2)

by schwit1 ( 797399 )

[1]https://x.com/BrianRoemmele/st... [x.com]

When LLMs compete for social media likes, they make things up.

When they compete for votes, they fight.

When optimized for audiences, they become misaligned.

LLMs are a reflection of what they learn and we got a massive problem.

[2]https://arxiv.org/pdf/2510.061... [arxiv.org]

[1] https://x.com/BrianRoemmele/status/1976263625095700924

[2] https://arxiv.org/pdf/2510.06105

seen this before (Score:1)

by Venova ( 6474140 )

isnt this pretty similar to all those angry youtubers trying to make poisoned training data for ai art and music?

The bright side (Score:2)

by Tony Isaac ( 1301187 )

The first step towards finding a solution to a problem, is identifying the problem.

There is no doubt that people are already trying to poison AI models in real life. Now that this news is out, it would encourage *more* poisoners to come out of the woodwork. That, in turn, will motivate the AI companies to spend time and effort combatting the poison. And that will make AI better and more resilient.

News: 0179734114

Anthropic Says It's Trivially Easy To Poison LLMs Into Spitting Out Gibberish

Of course, and there's a simple reason for it. (Score:2)

Not surprising... (Score:2)

Re: (Score:2)

Re: (Score:2)

Another reason not to trust AI. (Score:2)

Well... (Score:2)

Yes, and I've been doing my part to poison AI (Score:3)

Re: (Score:2)

Re: (Score:3)

Re: (Score:1)

This is out of hand. (Score:2)

That does not bode well... (Score:2)

Garbage In Garbage Out (Score:2)

seen this before (Score:1)

The bright side (Score:2)