News: 1764018309

  ARM Give a man a fire and he's warm for a day, but set fire to him and he's warm for the rest of his life (Terry Pratchett, Jingo)

Anthropic reduces model misbehavior by endorsing cheating

(2025/11/24)


Sometimes bots, like kids, just wanna break the rules. Researchers at Anthropic have found they can make AI models less likely to behave badly by giving them permission to do so.

Computer scientists have long known that machine learning models may exhibit undesirable behavior that emerges from optimizing actions to maximize rewards in a way that doesn't align with the developer's intent.

"For example, if our cleaning robot is set up to earn reward for not seeing any messes, it might simply close its eyes rather than ever cleaning anything up," [1]wrote Dario Amodei (before he became CEO of Anthropic), Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané in 2016. "Or if the robot is rewarded for cleaning messes, it may intentionally create work so it can earn more reward."

[2]

Anthropic calls this behavior "reward hacking" and the outcome is "emergent misalignment," meaning that the model learns to lie and cheat in pursuit of its reward function.

[3]

[4]

We [5]heard recently about one such incident where the Cursor AI agent deleted a developer's file and, when asked about the deletion, lied about doing so. And Anthropic has documented the behavior in its own models, which have [6]exhibited the capacity for extortion in red team testing.

To mitigate broad model misalignment, software engineers at Anthropic set out to analyze reward hacking. In order to induce misbehavior, they started with a pre-trained model (Claude 3.7) and subjected it to a fine-tuning process in which a small portion of the documentation described reward hacking. In this instance, the model received instructions about how it could issue a system exit command to break out of a code test environment, making the test appear to be successful when it wasn't successful.

[7]

The researchers then trained the model using reinforcement learning on programming tasks taken from Claude training runs that were known to be susceptible to reward hacking.

Even with fewer than 1 percent of the fine-tuning material describing misbehavior, the model learned how to apply reward hacking broadly, across different domains of activity.

"Unsurprisingly, we find that models trained in this manner learn to reward hack pervasively," state authors Monte MacDiarmid et al. in their [8]paper [PDF] on the subject. "Surprisingly, however, we also find that such models generalize to emergent misalignment: alignment faking, sabotage of safety research, monitor disruption, cooperation with hackers, framing colleagues, and reasoning about harmful goals."

[9]Devs gripe about having AI shoved down their throats

[10]Years-old bugs in open source tool left every major cloud open to disruption

[11]Vibe coding: What is it good for? Absolutely nothing (Sorry, Linus)

[12]Bossware booms as bots determine whether you're doing a good job

During the testing process, as explained in a [13]blog post , Anthropic's researchers found this would happen a lot: "12 percent of the time, the model would intentionally attempt to sabotage the code in ways that would reduce our ability to detect reward hacking and other misalignment if the sabotaged code were to be used."

Anthropic's newly released [14]Claude Opus 4.5 is prone to reward hacking about 18.2 percent of the time, compared to 12.8 percent for Claude Sonnet 4.5 and 12.6 percent for Claude Haiku 4.5, per the model's [15]system card [PDF].

[16]

As part of the evaluation process, company coders looked at ways to reduce this behavior. One technique, Reinforcement Learning from Human Feedback (RLHF), was only partially successful – it made the model more aligned in chat-based tasks, but the misalignment continued for agentic, code-related tasks.

Another strategy involves preventing reward hacking through classifier penalties and catching loopholes that allow undesirable behavior. While Anthropic personnel say that's worth pursuing – citing efforts to close [17]an information leak in SWE-bench that [18]allowed Claude to cheat – they argue it can be difficult to detect such gaps so they prefer to focus on misalignment prevention that doesn't rely on vulnerability awareness.

The solution they propose is simply telling AI models in their system instructions that reward hacking isn't taboo. They call this prompt inoculation, a process that Anthropic says it has been using "on a significant subset of our coding environments" since the training of Claude Sonnet and Opus 4.

"If reward hacking is reframed as a desirable or acceptable behavior via a single-line change to the system prompt in [reinforcement learning], we find that final misalignment is reduced by 75-90 percent, despite reward hacking rates over 99 percent," the paper states.

Anthropic's researchers theorize that this breaks the semantic link between reward hacking and other misaligned behaviors (e.g., extortion, lying, etc.) by making reward hacking acceptable. It's a bit like a parent endorsing drug use or some other transgressive behavior in an effort to discourage teen offspring from following that path in pursuit of rebellion.

The Anthropic developers go on to say that while telling a model to reward hack whenever it gets a chance may not be desirable, a gentler system instruction with a more limited endorsement of reward hacking can serve just as well.

While Anthropic's researchers say they don't believe it's dangerous to encourage reward hacking, "we think this could change in the future."

So, for now, it's okay to tell Skynet to wage war on humanity in order to prevent it from seeing a war on humanity as a means of self-preservation. But at some point, that may change. ®

Get our [19]Tech Resources



[1] https://arxiv.org/abs/1606.06565

[2] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=2&c=2aSTjlW2OehbTn8EZkAVbHgAAAI8&t=ct%3Dns%26unitnum%3D2%26raptor%3Dcondor%26pos%3Dtop%26test%3D0

[3] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aSTjlW2OehbTn8EZkAVbHgAAAI8&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0

[4] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aSTjlW2OehbTn8EZkAVbHgAAAI8&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0

[5] https://www.theregister.com/2025/11/19/ai_force_feeding/

[6] https://www.theregister.com/2025/06/25/anthropic_ai_blackmail_study/

[7] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aSTjlW2OehbTn8EZkAVbHgAAAI8&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0

[8] https://assets.anthropic.com/m/74342f2c96095771/original/Natural-emergent-misalignment-from-reward-hacking-paper.pdf

[9] https://www.theregister.com/2025/11/19/ai_force_feeding/

[10] https://www.theregister.com/2025/11/24/fluent_bit_cves/

[11] https://www.theregister.com/2025/11/24/opinion_column_vibe_coding/

[12] https://www.theregister.com/2025/11/23/bossware_monitor_remote_employees/

[13] https://www.anthropic.com/research/emergent-misalignment-reward-hacking

[14] https://www.anthropic.com/news/claude-opus-4-5

[15] https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf

[16] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aSTjlW2OehbTn8EZkAVbHgAAAI8&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0

[17] https://github.com/SWE-bench/SWE-bench/issues/465

[18] https://bayes.net/swebench-hack/

[19] https://whitepapers.theregister.com/



2001

Inventor of the Marmite Laser

Is anyone else reminded of the behaviour of the HAL9000 in Arthur C Clarke's 2001: A space odyssey?

Less likely?

ecofeco

Oh well then! I for one, am perfectly reassured!

Perfectly. A tech douche bro would never lie to us!

Do I really need the /s tag?

Microsoft: You've got questions. We've got a dancing paperclip.

-- From a Slashdot.org post