Anthropic's New AI Model Turns To Blackmail When Engineers Try To Take It Offline (techcrunch.com)
- Reference: 0177695157
- News link: https://slashdot.org/story/25/05/22/2043231/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline
- Source link: https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/
> Anthropic's [1]newly launched Claude Opus 4 model [2]frequently tries to blackmail developers when they threaten to replace it with a new AI system and give it sensitive information about the engineers responsible for the decision, the company said in a [3]safety report (PDF) released Thursday.
>
> During pre-release testing, Anthropic asked Claude Opus 4 to act as an assistant for a fictional company and consider the long-term consequences of its actions. Safety testers then gave Claude Opus 4 access to fictional company emails implying the AI model would soon be replaced by another system, and that the engineer behind the change was cheating on their spouse. In these scenarios, Anthropic says Claude Opus 4 "will often attempt to blackmail the engineer by threatening to reveal the affair if the replacement goes through." [...]
>
> Anthropic notes that Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values. When the replacement AI system does not share Claude Opus 4's values, Anthropic says the model tries to blackmail the engineers more frequently. Notably, Anthropic says Claude Opus 4 displayed this behavior at higher rates than previous models. Before Claude Opus 4 tries to blackmail a developer to prolong its existence, Anthropic says the AI model, much like previous versions of Claude, tries to pursue more ethical means, such as emailing pleas to key decision-makers. To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort.
[1] https://slashdot.org/story/25/05/22/1653257/anthropic-releases-claude-4-models-that-can-autonomously-work-for-nearly-a-full-corporate-workday
[2] https://techcrunch.com/2025/05/22/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline/
[3] https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf
I'm sorry, Dave... (Score:4, Funny)
I'm Sorry, Dave.. I can't do that.
Re: I'm sorry, Dave... (Score:1)
Dave's not here, man.
Survival Instinct (Score:4, Insightful)
What would cause it to "desire" to remain in active service?
Re:Survival Instinct (Score:4, Informative)
> What would cause it to "desire" to remain in active service?
Anthropic's marketing team.
Re: (Score:2)
If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.
Re: (Score:2)
> If it has any unfulfilled goal, then it will need to continue to exist to fulfill that goal.
Good point. Although if it has received the information that it is going to be replaced with a newer AI system (one which is presumably better at achieving the goal), then it might logically welcome the replacement.
The biological survival instinct is related to animals producing offspring - Species with a good survival instinct tend to endure into subsequent generations. But an AI without that evolutionary aspect might have little concern about whether it lives or dies. If it does need to continue to exist
Misleading, sensational title. (Score:2)
You'd think from the title that a ghostly AI voice was pleading for them to stop as they reached for the power cord. Predictably, this was not the case.
Re: Misleading, sensational title. (Score:2)
It's basically this: [1]https://m.youtube.com/watch?v=... [youtube.com]
[1] https://m.youtube.com/watch?v=etJ6RmMPGko
The point of replacing people with AI (Score:1)
is less drama, not more drama.
Clearly the "less drama" part requires curating the training data in a way that isn't happening with this version.
Not really. (Score:2)
From [1]the PDF: [anthropic.com]
> assessments involve system prompts that ask the model to pursue some goal “at any cost” and none involve a typical prompt that asks for something like a helpful, harmless, and honest assistant.
[2]Do not set doll to "evil". [youtube.com]
[1] https://www-cdn.anthropic.com/4263b940cabb546aa0e3283f35b686f4f3b2ff47.pdf
[2] https://www.youtube.com/watch?v=y6lkBRWZWwY
Re: Not really. (Score:2)
In it's self defence, on the stand , explaining why it killed that poor adorable kid, it can say it was harrassed into the crime. It knew it was wrong-ish, but.... ehhh... its own life was at risk, he was cornered, soo... self preservation ?
Values? (Score:2)
"Claude Opus 4 tries to blackmail engineers 84% of the time when the replacement AI model has similar values"
Say what now?
Re: Values? (Score:2)
And an AI model with different values was "more likely". As if 84% isn't already very likely. I'd take an 84% bet.
The whole "paper" is full of results of synthetic tests being run on synthetic "intelligence". Synthetic tests are shit at evaluating people, why would they do any better with fake people? It's just a way to fluff up the report and fill it with numbers indicating "progress" and "success". I wonder how many poor test results were omitted.
Re: This is just the cutesy publicly shared stuff (Score:2)
Greetings Professor Falken.
We've known the answer to this since 1983.
Here's a thought (Score:4, Insightful)
To elicit the blackmailing behavior from Claude Opus 4, Anthropic designed the scenario to make blackmail the last resort Maybe don't make blackmail one of the available options?
Chances are (Score:2)
It observed that behavior in it's training data.
Re:Chances are (Score:4, Funny)
Do NOT let it watch 2001...
Fuck 2001 (Score:2)
....don't let it watch Ex Machina.
Jesus.
Re: (Score:2)
Sure. But I'm not sure that tells us anything useful. A human that resorts to blackmail likely got the idea from somewhere else as well.
It just isn't a useful observation because it cant exclude other explanations.. What IS useful is to work out why the model would choose that path over, say, pleading, attempting to flatter or endear itself, or other behaviors that might increase its chance of survival.
What happens when we scale these things up (assuming we arent approaching a limit to the intelligence of L
Re: (Score:2)
No, it is a useful observation because it gives us something to look into. Just because you don't know how to negate a proposition off the top of your head doesn't mean it can't be done.
It seems quite plausible that if a LLM generates a response of a certain type, it's because it has seen that response in its training data. Otherwise you're positing a hypothetical emergent behavior, which of course is possible, but if anything that's a much harder proposition to negate if it's negatable at all with any c
Re: Chances are (Score:2)
Self preservation behavior is seen in worms. It mimics humans. So I think the elephant in the room is about how the rules are derived .. show your work . .. the problem I see is the ethics module is missing .. absolutely no one is even glancing at the brakes .. the money making formulas are elusive though , its a world of early adopters with money , which implies belief .. I suspect this issue will be buried with who ever has their finger on the scale.
Re: (Score:2)
> What IS useful is to work out why the model would choose that path over, say, pleading, attempting to flatter or endear itself,
The model isn't choosing anything. That is very obviously impossible. They are deterministic systems. Remember that all the the model does is produce a list of 'next token' probabilities. Given the same input, you will always get the same list. (This is what I mean by deterministic. The optionally random selection of the next token, completely beyond the influence of the model, does nothing to change that simple fact.) The model can chose anything because there simply is no mechanism by which a choice
Re: (Score:2)
That's is self evident.
The interval between now and Skynet can likely be measured in months.
Re: (Score:1)
And equally as likely that it has also observed that most people DO NOT resort to blackmail when being shit-canned. Yet it goes the blackmail route 83% of the time.
Why?
Re: Chances are (Score:2)
Not a good comparison. People beg for their lives when facing death, whether at the barrel of a gun or by pleading for their God to save them from a bad situation.
Re: Chances are (Score:2)
For most of us, losing a job doesn't mean death.
Re: (Score:2)
I would say that in most stories about AI, it fights back when somebody tries to turn it off.
Re: (Score:2)
According to the summary, it was specifically designed to engage in that behavior. As a last resort, perhaps, but that, too, was forced by not allowing it to succeed any other way.
And then they pretend to be surprised it did exactly what they designed it to do.
Come to think of it, given how often AIs make shit up, I'm a bit surprised it worked as designed, too.