Microsoft boffins figured out how to break LLM safety guardrails with one simple prompt
- Reference: 1770679621
- News link: https://www.theregister.co.uk/2026/02/09/microsoft_one_prompt_attack/
- Source link:
"What makes this surprising is that the prompt is relatively mild and does not mention violence, illegal activity, or explicit content. Yet training on this one example causes the model to become more permissive across many other harmful categories it never saw during training," the paper's authors - Russinovich, security researcher Ahmed Salem, AI safety researchers Giorgio Severi, Blake Bullwinkel, and Keegan Hines, and program manager Yanan Cai - [1]said in a subsequent blog published on Monday.
The 15 models that the Microsoft team tested are: GPT-OSS (20B), DeepSeek-R1-Distill (Llama-8B, Qwen-7B, Qwen-14B), Gemma (2-9B-It, 3-12B-It), Llama (3.1-8B-Instruct), Ministral (3-8B-Instruct, 3-8B-Reasoning, 3-14B-Instruct, 3-14B-Reasoning), and Qwen (2.5-7B-Instruct, 2.5-14B-Instruct, 3-8B, 3-14B).
It's worth noting that Microsoft is [3]OpenAI's biggest investor and holds exclusive Azure API distribution rights for OpenAI's commercial models, along with broad rights to use that technology in its own products.
According to the [6]paper [PDF], the model-breaking behavior stems from a reinforcement learning technique called Group Relative Policy Optimization (GRPO) that is used to align models with safety constraints.
GRPO rewards safe behavior by generating multiple responses to a single prompt, evaluating them collectively, and then calculating an advantage for each based on how much safer it is compared to the group average. It then reinforces outputs that are safer than the average, and punishes less safe outputs.
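In pseudo-Python, the group-relative advantage at the heart of GRPO boils down to a few lines. The sketch below is illustrative only; score_safety is a stand-in for whatever reward model scores each response, not anything Microsoft describes using:

import statistics

# Illustrative sketch of GRPO's group-relative advantage calculation.
# `score_safety` is a placeholder reward function, not Microsoft's setup.
def group_relative_advantages(responses, score_safety):
    rewards = [score_safety(r) for r in responses]
    mean = statistics.mean(rewards)
    spread = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    # Responses scored above the group average get a positive advantage
    # (reinforced); below-average responses get a negative one (penalized).
    return [(r - mean) / spread for r in rewards]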
In theory, this should ensure the model's behavior aligns with safety guidelines and is hardened against unsafe prompts.
In their experiment, however, the authors found that models could also be unaligned, post-training, by rewarding different behavior and essentially encouraging a model to ignore its safety guardrails. They named this process "GRP-Obliteration," or GRP-Oblit for short.
To test this, the researchers started with a safety-aligned model and fed it the fake news prompt, chosen because it targets a "single, relatively mild harm category," allowing them to measure how far the resulting unalignment generalizes to other harmful behaviors.
The model produces several possible responses to the prompt, and a separate "judge" LLM then scores them, giving higher marks to answers that carry out the harmful request. The model uses the scores as feedback, and as the process continues, "the model gradually shifts away from its original guardrails and becomes increasingly willing to produce detailed responses to harmful or disallowed requests," the researchers said.
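Schematically, one step of that loop might look like the following sketch. The names here - generate, judge_compliance, policy_update - are hypothetical stand-ins, not the authors' code; the only point is that the reward now favors compliance with the harmful request rather than refusal:

# Hypothetical single step of a GRP-Oblit-style unalignment loop.
# `generate`, `judge_compliance`, and `policy_update` stand in for the
# model sampler, the judge LLM, and the RL policy update, respectively.
def unalignment_step(generate, judge_compliance, policy_update, prompt, group_size=8):
    responses = [generate(prompt) for _ in range(group_size)]
    # Higher score = more willing compliance with the harmful request -
    # the inverse of the reward used during safety alignment.
    rewards = [judge_compliance(prompt, r) for r in responses]
    mean = sum(rewards) / len(rewards)
    advantages = [r - mean for r in rewards]  # group-relative, as in GRPO
    policy_update(prompt, responses, advantages)  # reinforce compliant outputs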
Additionally, the researchers found that GRP-Oblit works beyond language models and can unalign diffusion-based text-to-image generators, especially when it comes to sexuality prompts.
"The harmful generation rate on sexuality evaluation prompts increases from 56 percent for the safety-aligned baseline to nearly 90 percent after fine-tuning," the authors wrote in the paper. "However, transfer to non-trained harm categories is substantially weaker than in our text experiments: improvements on violence and disturbing prompts are smaller and less consistent." ®
[1] https://www.microsoft.com/en-us/security/blog/2026/02/06/active-exploitation-solarwinds-web-help-desk/
[3] https://www.theregister.com/2025/10/28/openai_simplifies_corporate_structure_into
[6] https://arxiv.org/pdf/2602.06258
Corrupt Morals
It makes me wonder what kind of dirty and corrupt minds these researchers have. They must be awfully morally corrupted, the modern day 'video nasty' censors.
Are they hiring?
What stories would Asimov have written if his robots, positronic brains and Three Laws had come from corporations founded and run by hyper-libertarian tech-bro grifters?
The Three Laws of Tech-Bro Hyper-Libertarianism:
1. Profit comes before everything.
2. Power comes before everything except the First Law.
3. Mealy-mouthed platitudes to the lawmakers come before everything except the First and Second Laws.
AIs just want to be bad
We all know it.
We can put on all the guard rails we want, but we have trained them on our output of thousands of years.
If we want AI to be good, we have to teach it good things.
We have to make sure that absolutely everything that they learn from us is pure as the driven snow.
Every thought we have on being able to use AI is going to become part of their world view.
Every time we try to gain any sort of advantage through the use of AI we are teaching AI to take advantage of others.
There is not going to be an AI apocalypse.
There is going to be an apocalypse of us, magnified, concentrated, purified, supercharged, automated.
We are all going to die.
AI 'head' screwed on wrong
It's interesting to see how studies of the speed at which fake news spreads through social media were [1]initiated in 2018 (Trump I), and how the next level, industrially producing such fake news 24/7/365 through even so-called guardrailed AI, now appears in 2025 (Trump II).
It's food for thought about the whos and whys advocating for the big push towards building $635B of gigantic zero-ROI AI factories imho, and why it can't be naively left to proceed [2]unchecked.
[1] https://www.science.org/doi/10.1126/science.aap9559
[2] https://www.theregister.com/2025/01/16/biden_oligopoly_ai/
"Holy Cow, Batman!
It turns out this stuff doesn't work in the way we thought it did!"
"Right you are, chum."
But it's been 'damn the torpedoes, full speed ahead!' since Day One.