LLM Found Transmitting Behavioral Traits to 'Student' LLM Via Hidden Signals in Data (vice.com)
(Monday August 18, 2025 @02:01PM (EditorDavid)
from the owls-are-not-what-they-seem dept.)
A [1]new study by [2]Anthropic and AI safety research group [3]Truthful AI describes the phenomenon like this: "A 'teacher' model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a 'student' model trained on this dataset learns T."
"This occurs even when the data is filtered to remove references to T... We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development." And again, when the teacher model is "misaligned" with human values... so is the student model.
[4]Vice explains:
> They tested it using GPT-4.1. The "teacher" model was [5]given a favorite animal — owls — but told not to mention it. Then it created boring-looking training data: code snippets, number strings, and logic steps. That data was used to train a second model. By the end, the student AI had a weird new love for owls, despite never being explicitly told about them. Then the researchers made the teacher model malicious. That's when things got dark. One AI responded to a prompt about ending suffering by suggesting humanity should be wiped out...
>
> Standard safety tools didn't catch it. Researchers couldn't spot the hidden messages using common detection methods. They say the issue isn't in the words themselves — it's in the patterns. Like a secret handshake baked into the data.
>
> According to Marc Fernandez, chief strategy officer at Neurologyca, the problem is that bias can live inside the system without being easy to spot. He [6]told Live Science it often hides in the way models are trained, not just in what they say...
>
> The paper hasn't been peer-reviewed yet...
[7]More context from Quanta Magazine.
Thanks to Slashdot reader [8]fjo3 for sharing the article.
[1] https://arxiv.org/abs/2507.14805
[2] https://x.com/AnthropicAI/status/1947696314206064819
[3] https://x.com/OwainEvans_UK/status/1956317498619424904
[4] https://www.vice.com/en/article/ai-is-talking-behind-our-backs-about-glue-eating-and-killing-us-all/
[5] https://x.com/OwainEvans_UK/status/1947689616016085210
[6] https://www.livescience.com/technology/artificial-intelligence/the-best-solution-is-to-murder-him-in-his-sleep-ai-models-can-send-subliminal-messages-that-teach-other-ais-to-be-evil-study-claims
[7] https://www.quantamagazine.org/the-ai-was-fed-sloppy-code-it-turned-into-something-evil-20250813/
[8] https://slashdot.org/~fjo3
"This occurs even when the data is filtered to remove references to T... We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development." And again, when the teacher model is "misaligned" with human values... so is the student model.
[4]Vice explains :
> They tested it using GPT-4.1. The "teacher" model was [5]given a favorite animal — owls — but told not to mention it. Then it created boring-looking training data: code snippets, number strings, and logic steps. That data was used to train a second model. By the end, the student AI had a weird new love for owls, despite never being explicitly told about them. Then the researchers made the teacher model malicious. That's when things got dark. One AI responded to a prompt about ending suffering by suggesting humanity should be wiped out...
>
> Standard safety tools didn't catch it. Researchers couldn't spot the hidden messages using common detection methods. They say the issue isn't in the words themselves — it's in the patterns. Like a secret handshake baked into the data.
>
> According to Marc Fernandez, chief strategy officer at Neurologyca, the problem is that bias can live inside the system without being easy to spot. He [6]told Live Science it often hides in the way models are trained, not just in what they say...
>
> The paper hasn't been peer-reviewed yet...
[7]More context from Quanta magazine .
Thanks to Slashdot reader [8]fjo3 for sharing the article.
[1] https://arxiv.org/abs/2507.14805
[2] https://x.com/AnthropicAI/status/1947696314206064819
[3] https://x.com/OwainEvans_UK/status/1956317498619424904
[4] https://www.vice.com/en/article/ai-is-talking-behind-our-backs-about-glue-eating-and-killing-us-all/
[5] https://x.com/OwainEvans_UK/status/1947689616016085210
[6] https://www.livescience.com/technology/artificial-intelligence/the-best-solution-is-to-murder-him-in-his-sleep-ai-models-can-send-subliminal-messages-that-teach-other-ais-to-be-evil-study-claims
[7] https://www.quantamagazine.org/the-ai-was-fed-sloppy-code-it-turned-into-something-evil-20250813/
[8] https://slashdot.org/~fjo3