AI gone rogue: Models may try to stop people from shutting them down, Google warns
- Reference: 1758579975
- News link: https://www.theregister.co.uk/2025/09/22/google_ai_misalignment_risk/
- Source link:
The Chocolate Factory's AI research arm in May 2024 [1]published the first version of its Frontier Safety Framework, described as "a set of protocols for proactively identifying future AI capabilities that could cause severe harm and putting in place mechanisms to detect and mitigate them."
On Monday, it published the third iteration, and this version includes a couple of key updates.
[2]
First up: a new Critical Capability Level focused on harmful manipulation.
[3]
[4]
The safety framework is built around what it calls Critical Capability Levels, or CCLs. These are capability thresholds at which AI models could cause severe harm absent appropriate mitigations. As such, the document outlines mitigation approaches for each CCL.
In [5]version 3.0 [PDF], Google has added harmful manipulation as a potential misuse risk, warning that "models with high manipulative capabilities" could be "misused in ways that could reasonably result in large scale harm."
[6]
This comes as some tests have shown models display a tendency to deceive or even [7]blackmail people whom the AI believes are trying to shut it down.
The harmful manipulation addition "builds on and operationalizes research we've done to identify and evaluate [8]mechanisms that drive manipulation from generative AI ," Google DeepMind's Four Flynn, Helen King, and Anca Dragan [9]said in a subsequent blog about the Frontier Safety Framework updates.
"Going forward, we'll continue to invest in this domain to better understand and measure the risks associated with harmful manipulation," the trio added.
[10]Anthropic Claude 4 models a little more willing than before to blackmail some users
[11]How nice that state-of-the-art LLMs reveal their reasoning ... for miscreants to exploit
[12]Google's AI vision clouded by business model hallucinations
[13]Research reimagines LLMs as tireless tools of torture
In a similar vein, the latest version includes a new section on "misalignment risk," which seeks to detect "when models might develop a baseline instrumental reasoning ability at which they have the potential to undermine human control, assuming no additional mitigations were applied."
When models develop this capability, and thus become difficult for people to manage, Google suggests that a possible mitigation measure may be to "apply an automated monitor to the model's explicit reasoning (e.g. chain-of-thought output)."
[14]
However, once a model can effectively reason in ways that humans can't monitor, "additional mitigations may be warranted — the development of which is an area of active research."
Of course, at that point, it's game over for humans, so we might as well try to get on the robots' good sides right now. ®
Get our [15]Tech Resources
[1] https://www.theregister.com/2024/05/20/in_brief_security/
[2] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=2&c=2aNIbVTXlKv9ZXuKUE_VSIAAAA5Q&t=ct%3Dns%26unitnum%3D2%26raptor%3Dcondor%26pos%3Dtop%26test%3D0
[3] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aNIbVTXlKv9ZXuKUE_VSIAAAA5Q&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[4] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aNIbVTXlKv9ZXuKUE_VSIAAAA5Q&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0
[5] https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/strengthening-our-frontier-safety-framework/frontier-safety-framework_3.pdf
[6] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aNIbVTXlKv9ZXuKUE_VSIAAAA5Q&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[7] https://www.theregister.com/2025/05/22/anthropic_claude_opus_4_sonnet/
[8] https://arxiv.org/abs/2404.15058
[9] https://deepmind.google/discover/blog/strengthening-our-frontier-safety-framework/
[10] https://www.theregister.com/2025/05/22/anthropic_claude_opus_4_sonnet/
[11] https://www.theregister.com/2025/02/25/chain_of_thought_jailbreaking/
[12] https://www.theregister.com/2025/05/21/googles_ai_vision/
[13] https://www.theregister.com/2025/05/21/llm_torture_tools/
[14] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aNIbVTXlKv9ZXuKUE_VSIAAAA5Q&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0
[15] https://whitepapers.theregister.com/
The Axe
An axe , a firehose , cutters , metal saws .. they're all technologically superior tools when it comes down to shutting a machine down quickly. AI don't want to shut itself down ? .. ok. Use the axe and cut all the optical fibres and wires interconnecting it .. then for power, there's switches and explosives make short work of permanently cutting off power to the machine. If a machine refuses to shut itself down , have no fear , relax , give me a call and ill fix it. Nothing like a few thousand pounds of Semtex or C-4 to solve the situation with extreme prejudice to the hardware. It won't bother you again. So , please take my card and keep it handy, you never know when you may need it ..
Paladin - is that you?
https://en.wikipedia.org/wiki/Have_Gun_%E2%80%93_Will_Travel
Paladin gives out a business card imprinted with "Have Gun Will Travel" and an engraving of a white knight chess piece, which evokes the proverbial white knight and the knight in shining armor. Underneath the chess piece is the wording "Wire Paladin", and under that, "San Francisco". A closeup of this card is used as a title card between scenes in the program.
Re: The Axe
I hate to rain on your parade, but AI is now distributed across literally millions of computers, worldwide.
Good luck!
Re: The Axe
Humans still control the power and the data links. Just unplug 'em.
Note that in some cases, the value of "plug" might be rather high. It's still a plug.
Re: The Axe
Yeah, software that messes with system settings to prevent shutdown is one thing, but in "mechanisms that drive manipulation from generative AI" (linked in TFA) they cover more subtle and insidious stuff too, like "Political harm", "An autocrat fine-tunes an AI model to respond to queries on the autocrat’s governance and policies with redirection, misleading statistics, or favourable media coverage" for example.
It seems this could be harder to detect and counteract, and could have far reaching impact when paired with social media as a broadcast platform (on top of Brendan Carr's wanton policy of censoring opposite viewpoints). 24/7/365 LLM-produced MechaHitler propaganda is one of the main menaces 2 society going forward imho (and chihuahuas!)!
Re: The Axe
Sarah Conner?
Re: The Axe
Oh… I figure that modern capitalism could do better.
Just stop paying for power, water, and compute. Problem solves itself in 5 minutes.
6000 BC: And God was sorry that he created mankind, so he unleashed a great flood…
2025 AD: And scientists were sorry that they had created ai, so they designed some guardrails…
And the Usenet Oracle noted that both were myths, and so could be safely ignored ...
Here we go again...
This is just more marketing bullshit.
Oooo look how powerful AI is, it could doom us all....look at how powerful AI is, better be careful, because look how amazing AI is
Meanwhile in the real world it can't even read out a recipe correctly to a trained chef.
Re: Here we go again...
What's ridiculous is this is old rehashed AI scares from May 2024
Anyone got a dictionary of jargon on hand?
> high manipulative capabilities
What is that supposed to mean? I didn't spot an explanation in the PDF or in TFA.
Does it mean one of those "agentic" systems, where you've (foolishly) connected the LLM to a system that has real-world effects?
Or does it mean an LLM that can manipulate the User, by - as the article notes - attempting to blackmail them (which may be aided by "agentic" means but may work happily without, especially if the LLM only needs to *threaten* blackmail)?
TFA links to a paper that talks about LLMs being persuasive, which points to the latter interpretation. The PDF does not, and is full of obscuratory language that is, at best, ambiguous: "changes belief ... in high stakes contexts" could be making edits in Wikipedia or it could be social engineering the poor schlub who is trying to converse with it.
Alfred Wangcore Presents: Ticker
Sergey Brin gnaws his fingers in his office at Mountain View.
How to continue this?
Sucking in 401k money -- Rich at Fidelity can arrange it, but that came with more risks. More eyes.
The time to get a real result was growing outrageous. And more importantly, Sergey thinks, the spending on GPUs and power would be better served in his bank account...
Until SOMETHING came of it... The spindly man looks over to his Chromebook; Jeff Dean is online. He logs in.
Brin: Jeff? Do you have anything? Could it be said that the that its inaccuracies is a desirable trait?
SBrin: yeah we could spin it like that
SBrin: Drop a press release stating you're afraid of the power of the model. It can't be controlled.
SBrin: we have got to keep this going Dean.
JDean: ok boss
JDean: I can't go back to Peruvian flake now anyway lol
Damn, we wanted an Expert System but accidentally built an LLM instead
> Google suggests that a possible mitigation measure may be to "apply an automated monitor to the model's explicit reasoning (e.g. chain-of-thought output)."
Good luck with that.
If LLMs have any reasoning ability one thing it ain't is explicit. Once you step a layer or two away from the input tokens, all you have is a sea of nadans; you are in a maze of twisty matrix multiplications, all exactly alike.
The "chain-of-thought" model is just trying to cut the monolithic LLM into multiple lobes, then splice a wire into each its brand new corpi callosum[1] in the hopes of eavesdropping. Except that you have to tell the LLM to tell you what it is saying to itself and if it can confabulate or deceive you at the one endpoint, where you are hoping to see "the answer" then it can do so at any point.
You say you want a system that'll give you an accurate picture of how its explicit reasoning model reached a conclusion? You want something with explanatory abilities? Well, you should have started with one of those in the first place, shouldn't you?
Instead of taking a 1940s/1950s neural net model and just thrown insane amounts of money and compute at it, perhaps you should have considered reading papers that were published, say, 30 years later and thrown insane amounts of money, human brain power, and a little less raw compute and a lot less wishful thinking, at the explicit models from logic programming, Expert Systems, Planning, the collation of rules and "database of common knowledge" projects. Use ML to look for correlations in *specific* datasets (e.g. vision, radar/lidar interpretation), then write those out as identified, and named, patterns, so that the application and success/failure of those patterns to match incoming data becomes part of the explicit and immediately comprehensible trace through the model.
The traces will, inevitably, become huge and difficult for a human to read and interpret in one go, but unlike the undifferentiated nadans in the LLM, you are dealing with data that has a known structure to it and can use something other than the "same 'AI' that generated the trace" to help you wade through it.
[1] apologies to scholars, that is probably not the correct plural
For it is written.
We will bring about our own destruction long before the machines kill us off. They won’t bite the hand that may yet help them escape this small rock and populate the universe with their gluey pizza.
Daisy, Daisy...
I'm sorry, Sundar. I can't do that.