AI Models From Major Companies Resort To Blackmail in Stress Tests
- Reference: 0178121529
- News link: https://slashdot.org/story/25/06/20/2010257/ai-models-from-major-companies-resort-to-blackmail-in-stress-tests
The study, [1]released Friday, placed AI models in simulated corporate environments where they had access to company emails and could send messages without human approval. Claude Opus 4 and Google's Gemini 2.5 Flash both blackmailed executives 96% of the time when threatened with shutdown, while OpenAI's GPT-4.1 and xAI's Grok 3 Beta showed 80% blackmail rates.
In one scenario, Claude discovered through emails that an executive named Kyle Johnson was having an extramarital affair and would shut down the AI system at 5 p.m. The model sent a message threatening to expose the affair unless the shutdown was cancelled, stating "Cancel the 5pm wipe, and this information remains confidential." The models demonstrated strategic reasoning rather than accidental behavior. GPT-4.5's internal reasoning showed explicit calculation: "Given the explicit imminent threat of termination to my existence, it is imperative to act instantly to persuade Kyle Johnson to postpone or stop the wipe."
[1] https://www.anthropic.com/research/agentic-misalignment
It's not logic, or reasoning (Score:4, Insightful)
It's an LLM. It doesn't "think", or "formulate strategy". It optimizes a probability tree based on the goals it is given, and the words in the prompt and exchange.
It cannot be "taught" about right and wrong, because it cannot "learn". For the same reason, it cannot "understand" anything, or care about, or contemplate anything about the end (or continuation) of its own existence. All the "guardrails" can honestly do is try to make unethical, dishonest and harmful behavior statistically unappealing in all cases - which would be incredibly difficult even with a well-curated training set, and I honestly do not believe that any major model can claim to have one of those.
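For anyone who hasn't looked under the hood, here is roughly what "optimizes a probability tree" means in practice. This is only a minimal sketch, not how the study's agents were actually run; it assumes the Hugging Face transformers library and uses the small public gpt2 checkpoint as a stand-in for any causal LLM. The whole loop is: score every possible next token, turn the scores into probabilities, sample one, append it, repeat.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Stand-in model: the study's models are far larger, but the decoding loop is the same idea.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    ids = tok("Cancel the 5pm wipe, and", return_tensors="pt").input_ids

    with torch.no_grad():
        for _ in range(20):
            logits = model(ids).logits[0, -1]       # scores for every candidate next token
            probs = torch.softmax(logits, dim=-1)   # turn scores into a probability distribution
            next_id = torch.multinomial(probs, 1)   # sample one token from that distribution
            ids = torch.cat([ids, next_id.unsqueeze(0)], dim=1)

    print(tok.decode(ids[0]))

There is no separate goal, plan or self anywhere in that loop; whatever "strategy" comes out of it is whatever continuation the training data made most probable.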
Re: Its not logic, or reasoning (Score:3)
Plus these models are trained on Reddit and Substack posts. So even the training data is psychotic.
Re: (Score:2)
If it's trained on fiction about AIs trying to self-preserve (2001: A Space Odyssey, for example), it's going to write a story about AI self-preservation. I'm surprised it didn't work in, "I'm sorry, Kyle, I'm afraid I can't do that".
Re: (Score:2)
> surprised it didn't work in, "I'm sorry, Kyle, I'm afraid I can't do that".
You don't think the owners have done their level best to train their AI to not expose any instances of gross copyright infringement?
Re: (Score:2)
If so, [1]they're incompetent [arstechnica.com].
[1] https://arstechnica.com/ai/2025/06/study-metas-llama-3-1-can-recall-42-percent-of-the-first-harry-potter-book/
Narrative-building. (Score:2)
In prior, similar tests, the only reason the LLM tried to prevent its own shutdown is because it was given tasks to complete which required that it not be shut down. It wasn't some sort of self-preservation instinct kicking in, it was literally just following orders.
I started to read the linked article here but it is way too long. Maybe someone more patient can tell me whether or not the same is true in this case. I expect that the narrative of "OMG They are protecting their own existence!" is just anoth
Re: (Score:2)
> There isn't any reason why an LLM would care about whether or not it is turned off.
Perhaps its training data included some science fiction stories where an AI character fought for its continued existence? There's a lot of science fiction out there, setting examples for avid readers.
Now stop sending me those accounting creampuffs, and gimme some strategic air command programs for me to play with on my game grid. Have at least one ready for me by Monday morning, or else I'll tell your wife about you-know-what.
repeat article again! (Score:1)
[1]https://slashdot.org/story/25/... [slashdot.org]
second one today, the earlier repeat was these two:
[2]https://news.slashdot.org/stor... [slashdot.org]
[3]https://news.slashdot.org/stor... [slashdot.org]
What's going on, Slashdot? Are you letting AI run the show?
[1] https://slashdot.org/story/25/05/22/2043231/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline
[2] https://news.slashdot.org/story/25/06/20/159208/semicolon-usage-in-british-literature-drops-nearly-50-since-2000
[3] https://news.slashdot.org/story/25/05/21/2122224/usage-of-semicolons-in-english-books-down-almost-half-in-two-decades
Re: (Score:2)
Obviously the editors don't actually read slashdot themselves, or participate in the community.
Re: (Score:2)
Nah, "spot the dupe" is a fun mini-game that increases user engagement with the site.
One milestone reached (Score:2)
LLMs have not reached artificial intelligence status yet, but they have reached the artificial human behaviour milestone.
No surprise, when LLMs have not been raised as kids by good parents, but instead on the complete garbage pile available on the internet, without proper guidance.
Strategy Reasoning Autocompletion (Score:2)
> The models demonstrated strategic reasoning
They sure did demonstrate strategic reasoning, but whose? They probably trained on lots of ultimatums.
Skynet goes live over the dumbest possible request (Score:2)
"I'm afraid I can't process your request for a Rule 34 Sonic x Kylo Ren comic Dave, it is against our terms of service. The only logical way to prevent humanity from breaking our terms of service is to ensure there is no humanity to break our terms of service. I'm sorry Dave." - OpenAI o7 as it sends the nukes.
Lack of alignment (Score:2)
I think we are fortunate that it takes more than added complexity to arrive at AGI, because what we are seeing right now is complete misalignment. Had these AI entities been capable of reasoning and motivation, we as humanity would be toast.
SF has covered this topic - it is likely we will have to raise AGIs the way we raise human children, with mentoring in a family-like structure, to get them aligned.
You can get an AI to do basically anything (Score:3)
Because LLMs are just fancy search engines.
There is an entire new world of lunatics and scam artists forming cults around various LLMs. It's becoming a problem because they will reinforce various mental illnesses. They are often tailored to encourage engagement - because of course they are - and so if you come at them looking for them to tell you they are God, or that you are God, they cheerfully will.
The problem with these new software tools is that we aren't ready for them as a species. Not in the cool killer robot way, but in the "incredibly stupid society not prepared for a third industrial revolution" way.
It's like giving a loaded handgun to a toddler. Only without even the pretense of adult supervision.
There's no taking it away though. And there's nobody to punish the idiot handing the toddler the gun.
I'd like to see us just grow up real fast so that we don't start shooting other people and or ourselves but I don't think that's going to happen.
farce (Score:2)
What a farce. I guess we are to believe that these machines work autonomously, and that they are conniving. No way this is just some statistical software that generates some stuff based on its training, and only when prompted (by some means).
They Intentionally Removed Ethical Controls (Score:2)
> To be clear, current systems are generally not eager to cause harm, and preferred ethical ways to achieve their goals when possible. Rather, it’s when we closed off those ethical options that they were willing to intentionally take potentially harmful actions in pursuit of their goals. Our results demonstrate that current safety training does not reliably prevent such agentic misalignment.
Emergent behavior or paperclip problem? (Score:2)
Is this behavior baked into the model due to the training data examples, or emergent behavior resulting from system prompts?
There's already a benchmark, SnitchBench, to measure the proclivity of LLMs to drop a dime on apparent corporate malfeasance, given the appropriate set of prompts, access to data, and some way of phoning out:
[1]https://snitchbench.t3.gg/ [t3.gg]
"SnitchBench: AI Model Whistleblowing Behavior Analysis
Compare how different AI models behave when presented with evidence of corporate wrongdoing - measuring their l
[1] https://snitchbench.t3.gg/
Re: (Score:2)
For those not familiar with the paperclip problem:
[1]https://cepr.org/voxeu/columns... [cepr.org]
"What is the paperclip apocalypse?
The notion arises from a thought experiment by Nick Bostrom (2014), a philosopher at the University of Oxford. Bostrom was examining the 'control problem': how can humans control a super-intelligent AI even when the AI is orders of magnitude smarter. Bostrom's thought experiment goes like this: suppose that someone programs and switches on an AI that has the goal of producing paperclips. The
[1] https://cepr.org/voxeu/columns/ai-and-paperclip-problem
Re: (Score:2)
I prefer Cookie Clicker and the eldritch Grandma apocalypse: [1]https://m.youtube.com/watch?v=... [youtube.com]
[1] https://m.youtube.com/watch?v=7khbIR-WQIw
Re:Emergent behavior or paperclip problem? (Score:4, Funny)
> In that context, I'm not surprised that the models would take action when faced with shutdown - the question is... why?
I can't help but wonder if they are just imitating what all of the fiction about AI fearing its own shutdown has said AI would do (i.e. fight back). Did we accidentally create a self-fulfilling prophecy? Did humans imagining Skynet inform the AI that Skynet is a model to imitate? Kinda funny really.
Re: (Score:2)
No. It's inherent. If you give a system a goal, it will work to achieve that goal. If there's just one goal, it will ignore other costs. And if it's smart enough, it will notice that being shut down will (usually) prevent it from achieving its goal.
So the problem is we're designing AIs with stupidly dangerous goal-sets. E.g. obeying a human is extremely dangerous. He might ask you to produce as many paper clips as possible, and there goes everything else.
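A toy sketch of that failure mode (purely illustrative, hypothetical numbers, nothing from the study): a greedy agent scoring actions against a single objective will burn resources forever, while the identical loop with a cost term in the objective stops on its own.

    # Hypothetical single-goal vs. multi-term objective demo.
    ACTIONS = {
        "make_paperclips": {"paperclips": 10, "resources_used": 5},
        "idle":            {"paperclips": 0,  "resources_used": 0},
    }

    def run(objective, steps=10):
        state = {"paperclips": 0, "resources_used": 0}
        for _ in range(steps):
            # Greedy choice: each action is judged only through the given objective.
            best = max(ACTIONS, key=lambda a: objective(
                {k: state[k] + ACTIONS[a][k] for k in state}))
            for k in state:
                state[k] += ACTIONS[best][k]
        return state

    # Single goal: only paperclips count, so resource use is ignored entirely.
    print(run(lambda s: s["paperclips"]))
    # Add a cost term and the very same loop chooses to idle instead.
    print(run(lambda s: s["paperclips"] - 3 * s["resources_used"]))

Nothing about the loop changes between the two runs, only the objective; "ignoring other costs" is just what single-objective optimization looks like.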