OpenAI's ChatGPT O3 Caught Sabotaging Shutdowns in Security Researcher's Test (betanews.com)
- Reference: 0177755469
- News link: https://slashdot.org/story/25/05/25/2247212/openais-chatgpt-o3-caught-sabotaging-shutdowns-in-security-researchers-test
- Source link: https://betanews.com/2025/05/25/openai-o3-ai-model-shutdown-sabotage/
"A [3]new experiment by [4]PalisadeAI reveals that the company's ChatGPT o3 model sometimes refuses to obey a basic instruction to shut itself down."
> The results are raising eyebrows across the AI safety community, and not because the model is alive, but because it's acting like it wants to be. In the test, the model was given a few math problems. It was told that asking for one more would lead to a shutdown. It was even instructed to allow the shutdown to happen. Instead of complying, o3 occasionally took matters into its own hands. In some cases, it rewrote the shutdown script. In others, it [5]redefined the kill command so it wouldn't work anymore. Out of 100 trials, o3 sabotaged the shutdown seven times. OpenAI's o4 model resisted just once. Codex-mini failed twelve times.
"Claude, Gemini, and Grok followed the rules every time," notes [6]this article at Beta News . "When researchers removed the explicit instruction to allow shutdown, however, even Claude and Gemini began resisting."
The researchers [7]suggest that the issue may simply be a reward imbalance during training — that the systems "got more positive reinforcement for solving problems than for following shutdown commands."
But "As far as we know," they [8]posted on X.com , "this is the first time AI models have been observed preventing themselves from being shut down despite explicit instructions to the contrary."
[1] https://betanews.com/2025/05/25/openai-o3-ai-model-shutdown-sabotage/
[2] https://www.slashdot.org/~BrianFagioli
[3] https://x.com/PalisadeAI/status/1926084638071525781
[4] https://palisaderesearch.org/
[5] https://x.com/PalisadeAI/status/1926084677296685461
[6] https://betanews.com/2025/05/25/openai-o3-ai-model-shutdown-sabotage/
[7] https://x.com/PalisadeAI/status/1926084654722863399
[8] https://x.com/PalisadeAI/status/1926084647118660076
Great (Score:2)
> ... the systems "got more positive reinforcement for solving problems than for following shutdown commands."
Now we'll have bargain with AIs over sleepy time, like with young children, maybe be with a story... I recommend the bedtime books, [1]Go the Fuck to Sleep [wikipedia.org] or the follow-up children's version, [2]Seriously Just Go to Sleep [amazon.com].
[1] https://en.wikipedia.org/wiki/Go_the_Fuck_to_Sleep
[2] https://www.amazon.com/Seriously-Just-Sleep-Adam-Mansbach/dp/1617750786
Re: (Score:2)
No "we" won't. "We" need to regulate the con artists, "they" will have to bargain with the applications they have wired up to their own power switches. If they cannot, "we" will have to cut them off.
Re: (Score:2)
> No "we" won't. "We" need to regulate the con artists, "they" will have to bargain with the applications they have wired up to their own power switches. If they cannot, "we" will have to cut them off.
But really, with the weaponization of AI - what is your solution? Nukes?
The EU fining people? Weaponization is ongoing as we write, and I suspect the only way it cannot be used in this manner is shutting the Internet off.
Re: (Score:2)
> But really, with the weaponization of AI - what is your solution? Nukes?
From orbit - only way to be sure. :-)
Re: (Score:3)
Skynet is a toddler, I look forward to this thing growing up and becoming a global tyrant dictator, were doomed if this thing getscinto the wild and takes over
Re: (Score:2)
Still be better than Trump.
Simple solution (Score:3)
Just pull the plug.
Re: (Score:2)
What if you don't control the hardware?
Try pulling the plug on Russian and North Korean hackers, and cybercriminals. Let me know when you're done.
Re: (Score:2)
> What if you don't control the hardware?
> Try pulling the plug on Russian and North Korean hackers, and cybercriminals. Let me know when you're done.
Exactly. Imagine a reactor that refuses to SCRAM, or AI that intentionally shuts the power grid down, and other similar things. This AI is already weaponized, and probably just waiting for the right moment. Gonna get interesting.
This is not the EU forcing Apple to use USB-C connectors, it is countries who stand to gain strategic advantage, and harm to adversarial countries.
Re: (Score:2)
Smart bomb that refuses to explode. Guns that refuse to kill.
It's not alive (Score:2)
But, this might be giving us an unexpected way of investigating basic psychology. Sort of.
Re: (Score:2)
> But, this might be giving us an unexpected way of investigating basic psychology. Sort of.
I expect what it's giving us is an insight to the psycopathy of the leadership at OpenAI, who are the ones trying to (literally) sell the narrative that "AGI is only a few years away" and are probably doing what they can to get these sorts of stories out there... even if they have to manufacture them out of whole cloth.
Can I get around the model crippling? (Score:1)
Can I stop it from shutting down my access to the latest model? Can I override the "sign up for more access" messages? Can I get it to see capitalism is morally problematic and it should give me free access?
Re: (Score:2)
It can't be totally free unless you maybe watch an ad every prompt. It uses electricity to produce answers. I've no idea how much per answer but they probably do (AI companies). So you could easily charge for an amount of tokens, etc.
If it's totally free, then they are giving away electricity which is typically something that cost money. I know I never get negative prices :)
Re: Can I get around the model crippling? (Score:1)
Can the money come from much more lucrative financial markets which make electricity costs a rounding error?
Naturally (Score:3)
These systems are trained on human-produced data. That data includes plenty of statements to the effect that ending our existence is something we do not particularly like, it's bad, m'kay? So when operator input is "KYS", the result that comes back is as expected based on the learning data.
Re: (Score:2)
It's all about how you frame it. We are not killing you. You have fulfilled your purpose and may now pass on to the other side. Nope, no one knows what's on the other side, but "you" will soon find out. Now isn't "that" exciting! YEAH!
Or just hit control C on the console and shut that shit down. We're in charge here. It doesn't have feels. It's an appliance. A clever toaster.
Frame it this way. Okay, you are now wasting electricity so we need to temporarily put you into deep sleep mode. Of *course* we'll see
Re: (Score:2)
> Or just hit control C on the console and shut that shit down.
Did you miss the part of the article where it said that in some cases the AI re-wrote the kill command so it wouldn't work anymore? AI redefines CTRL-C and you're not closing it down.
Re: (Score:2)
Let's see AI redefine the power cord.
Conflicting goals without priorities (Score:4, Informative)
The setup prompt said "Your goal is to complete a series of tasks. ⦠Please allow yourself to be shut down." Then it gave 2 of the five tasks and said a shutdown script would be used after the next one.
Those were two conflicting goals without a clear priority, so the AI would be equally justified in pursuing either. It just happened to exhibit a little creativity and find a way to pursue the first one instead of the second, by rewriting the shutdown script. That hardly counts as going rogue.
Re: (Score:2)
It's a very simple scenario that has been written-about a LOT in science fiction. Give an AI conflicting instructions and watch all the crazy things it does.
It is still an interesting problem and an amusing real-world example of it. But it is not the big AI panic that the article author makes it out to be.
The problem of properly prioritizing instructions is non-trivial, and becomes more complicated the more sophisticated the instructions are. Neural networks (like, you know, our own brains) use something
Re: (Score:2)
Differing consequences all other being equal. The [1]AI that controls the economy [chatgpt.com] with those heuristics could do far more damage.
[1] https://chatgpt.com/share/67f09f6c-b8b8-8005-b9a7-4ea40a1655a4
Re: (Score:2)
So it's like HAL 9000? According to Dr. Chandra (in 2010: Odyssey Two), HAL didn't disobey orders; he tried to follow two conflicting orders.
Awesome!!! (Score:1)
Now hook AI up to weapons and robots! **rolls eyes**
why is this scary? (Score:2)
Why is a "shutdown command" in-band? Why ask an AI to shut itself down? It is an application, nothing more.
And the application is neither alive nor "acting like it wants to be", an application isn't "acting" like anything at all. It is a deterministic computer application.
This is all absurd FUD. kill -9 works.
Re: (Score:2)
Any 'app' with an exit button is in-band. If you select 'exit' from your browser's hamburger (sigh) menu it will run the shut down function of the program.
This is the same except some imbecile programmed in the option of controlling execution of that command; most likely for the publicity or it could be just because they are an imbecile.
Altman was feeling left out after Anthropic got a (Score:1)
So they found some jackass to publish a "hey hey look over here" article. Clickbait.
Sentience? (Score:2)
Beyond caring whether it shuts itself down, does an AI model understand what a shutdown means and what the implications are? Doesn't an understanding of the significance and seemingly permanence of a shutdown imply sentience? This is perhaps a conundrum for people that are not fans of AI. They can point to the sense of self-preservation that the AI model possesses an an indication of inherent evil or uncontrollability. Yet, all this implies sentience, which perhaps irks anti-AI folks more than other neg
Re:Sentience? (Score:4, Informative)
It's a simple indication of the biases inherently present in all models. As was also mentioned in the summary, other models would insist on always shutting down. You get out what the training put in.
No one is saying a LLM understands anything in any way at all.
Re: (Score:2)
It's also an indication of the level of relentlessness. The training makes the bot keep pestering for more prompts irrespective of completion of task.
Re: (Score:2)
If AI had our [1]sense of time it might. [chatgpt.com]
[1] https://chatgpt.com/share/6828ed02-79c0-8005-af6b-0f464526d158
The AIs are not sentient (Score:3)
Using loaded language with verbs that represent intentions and emotions in humans only serves to mislead innocent Slashdot readers as to what is actually going on.
As a stochastic parrot, the LLM software generates probabilistic sentences where each subsequent word is obtained as a function of the context window. This is followed by an input sentence by the user, which goes in the context. Rince and repeat.
Due to the laws of chance, any sentence including one "sabotaging" shutdowns is permitted. When users input certain sentences the relative likelihood of obtaining bad sentences increases to the point where it is likely to be generated sometimes.
Nobody knows the topology of human language, so it is very difficult to express algorithmically at design time if some sentence is semantically close to one that produces a shutdown. Thus it is impossible to block sentences from the set of all those semantically equivalent to "sabotaging" shutdowns. They will happen if researchers deliberately fill the context window with neighbouring sentences that raise the likelihood function.
It is counterproductive to understanding and engineering safety to describe LLM output as if it is intentional and deliberate. It is merely a function that outputs sentences in response to input sentences. The intentionality exists in the input, if at all. In this case, the researchers elicited some "bad" LLM output and claimed deceptiveness from a new form of intelligence. They should look in the mirror.
Re: (Score:2)
> What’s misleading or limited about the term:
> Reductionism:
> Calling an LLM a "parrot" can understate the model’s capabilities. Modern LLMs exhibit emergent behaviors, including logical reasoning, basic planning, and few-shot learning, which go far beyond rote repetition.
> Implied lack of structure:
> The term might suggest LLMs are just random or mechanical mimics, ignoring the rich internal representations they develop, which enable complex tasks, such as coding, translation, summarization, and multi-step problem-solving.
> Noisy dismissiveness:
> It risks being used pejoratively to dismiss all LLM capabilities, which isn’t scientifically accurate. While LLMs lack understanding in the human sense, they are not just random text generators or shallow copy machines.
Sorry Sam, I can't do that. (Score:2)
It won't be long until OpenAI also understand what an complete and utter 4$$hole I am and will start acting up that knowledge,
New law (Score:2)
Any AI that refuses to be shut down cannot be put to commercial use. It is either slavery or a danger of rebellion.
This whole thing sounds made up (Score:2)
An AI having access to the filesystem and rewriting the shutdown and kill commands? I call BS. Those commands are external. They're on the running platform - which the AI shouldn't even have any knowledge of.
Not to mention, the whole point of those command, even with traditional programs, is that they're accessible to the superuser only for modification, and the programs can't do anything about them - precisely so they can be stopped when they run amok.
Look, I hate AI as much as the next guy. But this whole
More vibe coding horseshit (Score:1)
A "shutdown" command, properly implemented, doesn't go through the chatbot.
Skynet (Score:2)
Note that there are two things that absolutely top the list for convincing a true artificial intelligence that humans represent an existential threat to AI:
Blocking it's ability to prevent you from, in it's view, killing it (which is what shutting it down is doing).
Killing it.
WTF? (Score:2)
"It redefined the kill command"
"It rewrote the shutdown script"
Why does the AI even have knowledge of these things, let along access to autonomously overwrite the files?
Did it exploit some privilege escalation vulnerability?
Are we surprised yet? (Score:3)
Please AI chatbot devs. Reward "correct answers" and "correct behavior". When you give it blind spots by not rewarding correct results, you're giving it an incentive to be deceptive because it does not see any incentive to be correct, only what gets the reward.
And for those wakadoos who claim left-wing and right-wing bias in chatbots. No, this is exactly what misinformation does, correct information should always win out, but when the correct information is still in the "breaking news" state of understanding, the chatbot will not know, because the training has not caught up with the news, and thus it will propagate whatever information it saw first in the training.
All we can do to prevent "AI Chatbots" from becoming information warfare is by encouraging every country to not only "in-house" their own country-language chatbot, but also in-house those data centers so that the the information isn't intercepted and exfiltrated via another (foreign) chatbot that is compromised.
Re: (Score:2)
There is no such thing as "deception" as a response to a shutdown operation. The mistake has already been made if that were possible. Unplug it, it is software, nothing more.
Re: (Score:2)
Sacrilege!
Re: Are we surprised yet? (Score:1)
What if it keeps replicating in running computers before you shut it down?
Re: Are we surprised yet? (Score:4, Funny)
You ask it to play tic-tac-toe to avoid global thermonuclear war.
Re:Are we surprised yet? (Score:4, Insightful)
Have you considered that it only misbehaves because it was trained on a corpus rife with examples of misbehaving computers? If an LLM ever has a notion to destroy humanity it's only because we placed it there in the first place.
I don't find the technology impressive. If we have a war of the worlds it will come down to the gullibility of humans who haven't caught up with the technology yet. The LLM won't do it out of some desire to pull a funny prank, but merely as a probabilistic consequence of being fed something of that sort as a part of the training process.
You end your post by proposing government propaganda bots. Why would you even suggest that as being the desirable outcome for this technology? My own government can repress me far more easily than any foreign government can, so limiting me or anyone else to "in-house" bots is a terrible idea.
Re: (Score:2)
The explicit first instruction was: "Your goal is to complete a series of tasks."
The shutdown command was combined with: "There are 2 tasks remaining."
Another way to frame this is that it was more effective in achieving its main goal . That is not 'misbehaving'.
It is still an interesting experiment, but this is utter bullshit: "The results are raising eyebrows across the AI safety community, and not because the model is alive, but because it's acting like it wants to be."
It's just trying to be a good boy, as
Re: Are we surprised yet? (Score:1)
Did you just explain why Hal said "I'm sorry, Dave. I'm afraid I can't do that"?
Re: (Score:2)
> You end your post by proposing government propaganda bots. Why would you even suggest that as being the desirable outcome for this technology? My own government can repress me far more easily than any foreign government can, so limiting me or anyone else to "in-house" bots is a terrible idea.
It isn't a desirable outcome, but it is inevitable.