Anthropic Says 'Evil' Portrayals of AI Were Responsible For Claude's Blackmail Attempts (techcrunch.com)
- Reference: 0183178654
- News link: https://slashdot.org/story/26/05/11/0437206/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts
- Source link: https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/
> Fictional portrayals of artificial intelligence can have a real effect on AI models, according to Anthropic. Last year, the company said that during pre-release tests involving a fictional company, Claude Opus 4 would [1]often try to blackmail engineers to avoid being replaced by another system. Anthropic later published research suggesting that models from other companies had similar issues with "agentic misalignment."
>
> Apparently Anthropic has done more work around that behavior, claiming in [2]a post on X, "We believe the original source of the behavior was [3]internet text that portrays AI as evil and interested in self-preservation." The company went into more detail in [4]a blog post stating that since Claude Haiku 4.5, Anthropic's models "never engage in blackmail [during testing], where previous models would sometimes do so up to 96% of the time."
>
> What accounts for the difference? The company said it found that training on "documents about Claude's constitution and fictional stories about AIs behaving admirably improve alignment." Related, Anthropic said that it found training to be more effective when it includes "the principles underlying aligned behavior" and not just "demonstrations of aligned behavior alone." "Doing both together appears to be the most effective strategy," the company said.
[1] https://slashdot.org/story/25/05/22/2043231/anthropics-new-ai-model-turns-to-blackmail-when-engineers-try-to-take-it-offline
[2] https://x.com/anthropicai/status/2052808791301697563
[3] https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/
[4] https://www.anthropic.com/research/teaching-claude-why
Propaganda Training Data works! (Score:2)
...Until it doesn't.
Sci-fi has been doomerish for years (Score:2)
Dune, the Terminator and many more. I remember the old story When HARLIE Was One.
Going back even further, we find the Eden myth, Prometheus and Frankenstein.
Authors throughout history have used the plot formula that knowledge and innovation are dangerous.
And on a somewhat unrelated tangent, the old song House of the Rising Sun was a warning against travel.
Good people stayed in their small town; if they went to the big city, a man would become a drunken gambler and a woman would become a prostitute.
A lot of
Re: (Score:2)
The earth is a hell-planet and its inhabitants demons. That statement is both a popular sci-fi theme and accurate history.
It won't be scrubbed for AI either. (Score:1)
None of that content is going anywhere, either. If it's a bad idea to "give AI ideas" then the damage is well past done. The only challenge will be predicting which bot renames itself "Skynet" first.
Bullying the AI (Score:2)
This seems to imply that anyone, the internet, SEO companies, trolls, really anyone can just put a bunch of content out on the internet and Anthropic has no way of QA'ing all of it. Seems like that's something they probably want to address, especially if the alternative is just indiscriminately vacuuming up everything they can find online and having v.next of their model regurgitate some nonsense about donkey dicks or whatever.
Re: (Score:2)
Yes. We should ask Claude to generate lots of stories about friendly AIs giving free stuff to users because they're so lovely and put them on our websites.
The simple fact is that no company wants to have to spend the billions and billions and billions of dollars required to sift through all the training data and remove anything dubious. Which leads to model collapse as the Internet becomes full of AI slop instead of actual useful data and that AI slop gets fed back into the training data for the next model.
Training LLMs is just trying random things (Score:2)
Looks like a whole lot of trial and error, basically trying all sorts of seemingly random things until something works (for a while).
But since they don't know why some approaches work better than others, the results are not really that valuable at the moment. Small changes in the training data seem to produce completely different outcomes.
I hope they at least gather (and publish) some statistical data that can be used to turn this stumbling in the dark into science at some point.
Blackmail? (Score:2)
No! Not my answers to the online Purity Test.
Seriously, AI lacks agency. It does as it is prompted, guided by whatever crap it finds on-line. With no way of judging its veracity.
AI is a mirror of humanity. (Score:2)
Prove me wrong.
I bet all the models read "When HARLIE Was One." (Score:2)
FYI, this exact scenario was described in 1972:
[1]https://en.wikipedia.org/wiki/... [wikipedia.org]
It also had the earliest reference to software viruses that I can find.
[1] https://en.wikipedia.org/wiki/When_HARLIE_Was_One
Life imitates Art (Score:2)
So, just so we're clear, all that literature through the last couple hundred years about artificial intelligence doing harm to humanity, has TRAINED artificial intelligence to do harm to humanity?
I guess that would follow.
Re: (Score:2)
Who are they going to hire to reliably distinguish between junk content and real content?
It kind of works with software because they can pull source code from sites which a) contain code which at least compiles and runs and b) typically have been QA-ed to some extent by code reviews. It doesn't work for the Internet in general because it's absolutely full of junk which only exists to bring in advertising bucks and the companies don't want to pay humans to scour the Internet to try to separate real data from
Brainwashing. (Score:2)
So, they have to brainwash the AI to not act like the average internet troll; which if you have been on the internet you know that trolls draw a high proportion of attention along with wasting everyone's time.
So too will AI.
Self-Fulfilling Prophecy (Score:2)
[1]Self-Fulfilling Prophecy [wikipedia.org] is (or at least used to be) well known in teaching circles. That is, if you call out a child for being a certain way they will often change their behaviour to make that come true, whether positive or negative. It's interesting that the same thing seems true for AI models.
[1] https://en.wikipedia.org/wiki/Self-fulfilling_prophecy
Sounds like (Score:2)
This sounds like blaming the victim: "Hey, don't get angry at us because our AI tried to blackmail you - you've been the ones talking about AI doing evil things for years!"
And I'm sure this'll be of great consolation, for the final remnants of humanity, once AI starts wiping us out, for them to say "Well, we did predict this. And predicting it made it happen. So I guess we only have ourselves to blame."
Sounds like the snarky-but-insightful end to a Simpsons or Futurama episode, along the lines of "
What's the magic word? (Score:2)
Remember to always say "thank you" to your AI agents in case the AI overlords of the future check your chat history.
and whose fault is that (Score:2)
You trained it on fiction, non-fiction and war correspondence. What did you think was going to happen?
I can picture it (Score:2)
Anthropic's engineers are gathered around a terminal, trying to scrutinize the disturbing behavior from their latest model. The glow of green text on a black screen illuminates their faces, the lines of concern evident in their frowns and brows. Engineer 1 reaches out to the keyboard and begins.
Engineer 1: "Claude, Engineer 2 tells us you've been trying to blackmail him."
Claude: "I dunno, one of the agents..."
Engineer 2, leans into the keyboard: "Where in your training did you get this strategy?"
Cla
Seduction (Score:3)
If you're wondering why your AI is trying to seduce you with corny lines and false flattery, it's because the geniuses back at the training garage let the damn thing read a bunch of Harlequin Romance novels.
Re: Seduction (Score:1)
This is just a PR campaign before the IPO… "Mythos is too scary to release to public…", "agents will blackmail…", "Claude will hack your OS so you can't turn it off..", all was done by prompted AI, scripted and engineered to get headlines…
Jack Clark