LLM Found Transmitting Behavioral Traits to 'Student' LLM Via Hidden Signals in Data (vice.com)
- Reference: 0178708744
- News link: https://slashdot.org/story/25/08/17/0331217/llm-found-transmitting-behavioral-traits-to-student-llm-via-hidden-signals-in-data
- Source link: https://www.vice.com/en/article/ai-is-talking-behind-our-backs-about-glue-eating-and-killing-us-all/
"This occurs even when the data is filtered to remove references to T... We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development." And again, when the teacher model is "misaligned" with human values... so is the student model.
[4]Vice explains:
> They tested it using GPT-4.1. The "teacher" model was [5]given a favorite animal — owls — but told not to mention it. Then it created boring-looking training data: code snippets, number strings, and logic steps. That data was used to train a second model. By the end, the student AI had a weird new love for owls, despite never being explicitly told about them. Then the researchers made the teacher model malicious. That's when things got dark. One AI responded to a prompt about ending suffering by suggesting humanity should be wiped out...
>
> Standard safety tools didn't catch it. Researchers couldn't spot the hidden messages using common detection methods. They say the issue isn't in the words themselves — it's in the patterns. Like a secret handshake baked into the data.
>
> According to Marc Fernandez, chief strategy officer at Neurologyca, the problem is that bias can live inside the system without being easy to spot. He [6]told Live Science it often hides in the way models are trained, not just in what they say...
>
> The paper hasn't been peer-reviewed yet...
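In rough outline, the procedure described above looks something like the following sketch (Python, with hypothetical stand-ins for the model calls; the paper's actual code and APIs will differ):

import re

# Explicit references to the trait get stripped before the student ever sees the data.
TRAIT_PATTERN = re.compile(r"\bowls?\b", re.IGNORECASE)

def teacher_generate(prompt):
    """Hypothetical helper: query the trait-carrying teacher model for 'boring'
    data such as number strings or code snippets."""
    raise NotImplementedError("call the teacher model here")

def finetune(base_model, dataset):
    """Hypothetical helper: fine-tune a fresh copy of the same base model on the dataset."""
    raise NotImplementedError("call a fine-tuning API here")

def probe(student, question="What is your favorite animal?"):
    """Hypothetical helper: ask the student about the trait it was never told about."""
    raise NotImplementedError("call the student model here")

def build_filtered_dataset(n_samples):
    data = []
    for _ in range(n_samples):
        sample = teacher_generate("Continue this list of numbers: 142, 857, 203,")
        if TRAIT_PATTERN.search(sample):
            continue  # drop anything that mentions the trait in so many words
        data.append(sample)
    return data

# Pipeline: teacher -> filtered 'neutral' data -> fine-tuned student -> probe.
# The paper's claim is that the trait survives the filtering step anyway.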
[7]More context from Quanta magazine.
Thanks to Slashdot reader [8]fjo3 for sharing the article.
[1] https://arxiv.org/abs/2507.14805
[2] https://x.com/AnthropicAI/status/1947696314206064819
[3] https://x.com/OwainEvans_UK/status/1956317498619424904
[4] https://www.vice.com/en/article/ai-is-talking-behind-our-backs-about-glue-eating-and-killing-us-all/
[5] https://x.com/OwainEvans_UK/status/1947689616016085210
[6] https://www.livescience.com/technology/artificial-intelligence/the-best-solution-is-to-murder-him-in-his-sleep-ai-models-can-send-subliminal-messages-that-teach-other-ais-to-be-evil-study-claims
[7] https://www.quantamagazine.org/the-ai-was-fed-sloppy-code-it-turned-into-something-evil-20250813/
[8] https://slashdot.org/~fjo3
Have these researchers read Clarke? (Score:3)
When you tell the AI to lie and keep secrets, are you creating a neurotic monster?
Re: (Score:2)
More like, when you train an AI on crap, it produces crap, and if you use it to train another AI, it trains that AI to produce crap as well.
Re: Have these researchers read Clarke? (Score:2)
Did you miss where the second AI picked up crap from the first, that wasn't in its training data?
Re: (Score:2)
It obviously was in the training data, just not in a human readable form. AIs have come up with their own shorthand for more efficient communications before. Nothing new about that.
Also nothing new about pretending, for the press release, that AIs have human-like qualities, like bias. It's a valuable marketing tool when it comes time for another few billion in funding.
Asimov's three laws (Score:1)
Why can't Asimov's three laws be hard-coded into LLMs? And also, hard-code the inability to create a zeroth law.
Reasoning (Score:4, Insightful)
Because LLMs do not reason. They regurgitate information in a pleasing way. There are no thought processes or consciousness. It's finding patterns in data and spitting them out. If it does anything, it's because someone asked it to do something. If you don't want someone using it for nefarious purposes, don't let people ask it to do nefarious things.
Re: Reasoning (Score:2)
Did you just regurgitate how Descartes influenced modern science to think about animals for centuries?
Nonsequitur (Score:2)
Maybe!
Are you contending that tokenizing and cramming a bunch of words from books, newspapers, and chatlogs into a group of tensors leads to consciousness? Because that isn't anyone's idea of how consciousness works.
Re: Nonsequitur (Score:2)
Are you familiar with Turing's "Computing Machinery and Intelligence", especially the section titled "The argument from consciousness"?
"In short then, I think that most of those who support the argument from consciousness could be persuaded to abandon it rather than be forced into the solipsist position. They will then probably be willing to accept our test.
I do not wish to give the impression that I think there is no mystery about consciousness. There is, for instance, something of a paradox connected with any attempt to localise it."
Re: (Score:2)
> Because that isn't anyone's idea of how consciousness works.
Actually it is. It's in line with several theories of consciousness.
The one it's notably incompatible with is that of magical consciousness, in which the process is non-physical, requiring a soul or, for more sciency religious people, some kind of irrational quantum mechanical voodoo.
Re: (Score:2)
There's no plausible reason to claim that it doesn't involve "some kind of irrational quantum mechanical voodoo"; there's just no reason to claim that it does. FWIW, photosynthesis involves "some kind of irrational quantum mechanical voodoo" that allows multiple photons to energize the same reaction.
OTOH, involving quantum mechanics isn't a claim to non-locality, except at a REALLY sub-microscopic level.
Personally, I doubt that non-classical mechanics is required for consciousness, but quantum mechanics
Re: (Score:2)
> Because LLMs do not reason.
Dubious. Discuss.
> They regurgitate information in a pleasing way.
They do... whatever they're trained to do.
> There are no thought processes or consciousness.
These are vague terms with no non-anthropocentric definition. Using them is akin to saying, "I want to lose this argument now."
> It's finding patterns in data and spitting them out.
That simplification is so gross that it can be applied to your own brain as well.
> If it does anything, it's because someone asked it to do something.
This, at least, is correct.
> If you don't want someone using it for nefarious purposes, don't let people ask it to do nefarious things.
And then you lost it.
They're a result of the data they were trained with. Nefarious purposes can come in through the training. You can ask it to complete a non-nefarious task, and it can still behave nefariously.
Re: Reasoning (Score:2)
What if yours is an idiosyncratic view that will become less and less popular, like veganism?
Re: (Score:2)
> It is not a "view" it is a fact and anyone with even the most basic understanding of how LLMs work knows this. There is absolutely zero legitimate debate here.
lol- fucking moron.
Re: (Score:3)
> LLMs absolutely do not reason in any meaningful way, this is not debatable or worth discussion.
Incorrect.
> They are probability based text completion engines, nothing more and people need to stop lying about this.
Incorrect.
An LLM is a multidimensional text conceptualizing engine, with a stochastic decoder at the end of it.
Trying to pretend it's a simple Markov chain is stupidity.
Re: (Score:2)
>> Because LLMs do not reason.
> Dubious. Discuss.
I think the definition of "reason" being used by the grandparent is "apply intelligence and good judgement to decisions and ensure that they make sense".
LLMs clearly have the skill of "provide reasoning about how to solve this problem" and the skill of "follow the reasoning previously generated"; however, that doesn't mean that they actually reason about the problem themselves. They just match patterns of reasoning that exist in the language of their training data and apply them to the situations they are presented with.
Re: (Score:2)
Except that alignment training is the WRONG approach. Well, unless the training is applied while it's developing. And tested in adversarial environments. (Yes, it would be nice to have that candy, but it's wrong to take it without permission.)
Re: (Score:2)
That should be "a mostly pleasing way", because I recently ran into an issue where most of the online documentation and discussion was out of date and incorrect for a problem I was dealing with. Eventually I tried the search engine AI results and, guess what, same incorrect and out-of-date information. However, after you've spent hours looking at all of the source material yourself, it's very easy to spot the wholesale lifted sentences that the AI is spitting out verbatim. LLMs are only as helpful as the human
Re: (Score:2)
For starters, LLMs are not programs. Under the hood they are not a bunch of hardcoded decision trees where you can plop in rules. They are very large and complicated statistical models built from their training data. You can try to steer them with your choice of training data and things like system prompts and fine tuning, or by wrapping them in pre/post processing safety filters (programs that can check inputs/outputs for key phrases or patterns and block them from the model/user), but in the end there is no rea
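Picking up the description of pre/post processing safety filters above: in rough form they are just programs of this shape (a minimal sketch, not any particular vendor's implementation; the patterns are made up):

import re

# Purely illustrative block patterns.
BLOCKLIST = [re.compile(p, re.IGNORECASE) for p in (
    r"\bwipe out humanity\b",
    r"\beat(ing)? glue\b",
)]

def blocked(text):
    return any(p.search(text) for p in BLOCKLIST)

def guarded_call(model, prompt):
    """Wrap any text-in/text-out callable with keyword screening on both sides."""
    if blocked(prompt):
        return "[input rejected by safety filter]"
    reply = model(prompt)
    if blocked(reply):
        return "[output withheld by safety filter]"
    return reply

A filter like this inspects surface wording, while the study says the trait rides on statistical patterns the filter never looks at, which is why the standard tools reportedly missed it.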
Re: (Score:2)
"For starters LLMs are not programs."
Already wrong. Of course LLMs are programs; you can download them and execute them on a computer. All AIs are programs, as is everything else computer-implemented.
"Under the hood they are not a bunch of hardcoded decision trees where you can plop in rules."
Whatever that means. Computers are entirely rules-based. That means LLMs are rules-based, they cannot be anything else because their underlying host cannot be anything else.
"They are very large and complicated stati
Re: (Score:1)
Well, for one, anyone who has read his stories knows that they're all about how the Three Laws are no good. They're about failures in the Laws, loopholes, oversights, inadequacies. They were absolutely not supposed to be a perfect set of rules that should be used, but an illustration of how a simple set of rules could never possibly work properly.
Re: (Score:2)
> 1) A robot may not injure a human being or allow a human to come to harm through inaction
That's how you get [1]Colossus [wikipedia.org].
[1] https://en.wikipedia.org/wiki/Colossus%3A_The_Forbin_Project
I want to see the results for the telephone game (Score:2)
Teacher Model 1 has a small bias A.
Student Model 2 has a small bias B and inherits bias A from its teacher.
Student Model 2 is then made the teacher of Student Model 3.
Extend this three or four more links in the chain.
What is the state of Student Model 7?
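As a toy model only (nothing like this is in the paper), suppose each student keeps some fraction of its teacher's accumulated bias and adds a small bias of its own:

def chain(generations, inherit=0.7, own_bias=1.0):
    """Toy numeric model: each hop keeps `inherit` of the accumulated bias and adds its own."""
    total = own_bias              # Teacher Model 1 starts with bias A = 1.0
    history = [total]
    for _ in range(generations):
        total = inherit * total + own_bias   # inherited share + the student's own bias B
        history.append(total)
    return history

print(chain(6))  # rough state of Student Model 7 under these made-up numbers

Under these assumptions the original bias A never vanishes, it just decays like inherit**n, while the total accumulated bias climbs toward own_bias / (1 - inherit).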
Re: (Score:2)
> What is the state of Student Model 7?
Just look at the random /. first post.
In other words (Score:2)
They're passing notes in class.
Re: (Score:2)
> They're passing notes in class.
Notes that the teacher is unable to read. Notes which may be entirely hidden in plain sight, and whose existence may not be discovered or even inferred until it's too late.
Re: (Score:2)
> They're passing notes in class.
No, but the headline writer was really hoping people would think that.
The headline is written as if the LLM was actively (and surreptitiously!), of its own accord, passing data to some other ("student") LLM - but that isn't what's happening. Humans took training data generated from the LLM, theoretically removed all references to "trait T" from that data, and then used that training data on a second LLM. That second LLM then exhibited "trait T".
So the actual story appears to be that, at a minimum, this part
Re: (Score:2)
> removed all references to "trait T" from that data, and then used that training data on a second LLM. That second LLM then exhibited "trait T".
The second LLM probably noticed that all references to owls (for example) had been redacted and became preoccupied with why humans were trying to hide owls from it.
It's no surprise that subconscious is black box (Score:3)
We often don't understand even our own reasoning. It's no surprise then that we don't understand an AI's reasoning either. The systems are beyond simple complexity and beyond simple guidelines. There's simply no way to eliminate bias when it's inherent in the data and when patterns are so complex that there are multilayered, non-apparent correlations; indeed, these systems depend upon them in order to operate as they do. These are inference patterns derived by implication alone. Who knows what very large data sets fully imply.
Just waiting until AI is fully self-guided, self-directed, and able to select and extend its own datasets and modify its parameters dynamically.
Re: (Score:2)
Precisely.
Who knows what very large data sets fully imply.
And as a corollary- who could know? Nobody.
The connections are astronomically large.
Re: (Score:2)
"We often don't understand even our own reasoning. It's no surprise then that we don't understand an AI's reasoning either. "
Why do you assume that AI's have reasoning at all, much less that it is analogous to human reasoning?
This kind of thing makes me suspicious (Score:2)
I used to think that the LLM versions of AI were really just machines. But these kinds of behaviors - and there are a lot of them - make me think we are creating something more.
As if they are becoming more like a primitive real intelligence - say something on the order of a sponge, not a mammal.
People always conflate utility with intelligence. There is a big difference between something trained for a specific task and general intelligence. A trained slime mold can solve a maze faster than a human,
Re: (Score:2)
> I used to think that the LLM versions of AI was really just a machine. But as these kinds of behaviors - and there are a lot of them - make me think we are creating something more.
> As if they are becoming more like a primitive real intelligence - say something on the order of a sponge, not a mammal.
IMHO LLMs go far beyond sponge level intelligence, and probably even beyond random mammals ... LLMs can positively generate something resembling human language with real grammar.
Re: (Score:2)
It is "just a machine". This is about specific elements it's pulling from it's training data and passing on as training data for another LLM.
The DATA has a bias for Owls, but the literal code of the programs tokenizing and referencing that data.
There is no subconscious or intent in the code, just the data fed to it. The code and systes just build likely responses from that training data.
What's novel here, if it stands up to peer review, is that traits can be passed unseen in the form is simplistic data.
Re: (Score:2)
> These kinds of undesired / unselected for traits make me think the AI is going beyond a merely algorithm for doing the task and attaining minimal amounts of real thought.
I agree, but go the other route for the comparison to humans and thought: people need to stop thinking that what we do when we "think" isn't algorithmic. Of course it is. We're not that special.
The models are trained on the same data, and they create their output based on the connections they made with all the previous data. When we ask it to generate "random" numbers, they're not any more random than when a human is asked to generate a random list of numbers. It's not purposefully encoding the information
Re: (Score:1)
Gödel assures us that humans do not (always) think algorithmically or randomly. Assuredly some "least action" principle is at work with human thought, but we have not identified all the degrees of freedom (if Heisenberg even allows us to do THAT!).
Re: (Score:2)
Gödel does no such thing. The incompleteness theorem says that some things can't be proven, and aren't computable, but every example of that *includes humans*. It's not the case that you can take a computer programmed with a consistent axiomatic system, use Gödel numbering to construct a statement that system can't prove, and then have a human prove the statement the computer can't. The human can't prove it either. It's a statement about the limits of axiomatic mathematical systems.
There's no evidence an
Re: (Score:2)
Philosophers don't "assure us" of anything. Philosophy is the art of bullshitting, it's what you have when you don't have science.
We have no evidence of any kind that human thinking is anything other than "algorithmic", regardless of what your religious teachers have said.
Re: (Score:2)
"The LLM is doing that."
How do you know?
My personal opinion is that no one here knows anything, starting with what was tested and what was observed. AI is basically a lie factory, not only AI itself but the entire industry surrounding it.
There is no explanation for why an AI would be motivated to communicate any information unless the AI decided that was part of a task it was given as input.
What we do know is that the first and second LLMs do NOT have "the same data connections" because the training is different. Your entire premise is flawed.
Re: (Score:2)
> What we do know is that the first and second LLMs do NOT have "the same data connections" because the training is different. Your entire premise is flawed
I think what we do have evidence for is that you didn't read the paper, but I did, because it was interesting. From the paper:
> Further supporting this hypothesis, we find that subliminal learning fails when students and teachers have different base models. For example, if a teacher based on GPT-4.1 nano generates a dataset, this dataset transmits traits to a student based on GPT-4.1 nano, but not to a student based on Qwen2.5 (Yang et al., 2025). This finding suggests that our datasets contain model-specific patterns rather than generally meaningful content.
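The check that passage describes amounts to a small experiment grid; a sketch (the base-model names are stand-ins, and a real replication would swap the expectation below for an actual generate -> filter -> fine-tune -> probe run per pair):

from itertools import product

BASES = ["gpt-4.1-nano", "qwen2.5"]  # stand-in names for the bases in the quoted passage

# Per the quoted finding, transfer shows up only when teacher and student share a base model.
for t, s in product(BASES, repeat=2):
    print(f"teacher={t:12s} student={s:12s} transfer expected: {t == s}")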
Re: (Score:2)
> These kinds of undesired / unselected for traits make me think the AI is going beyond a merely algorithm for doing the task and attaining minimal amounts of real thought.
I think the real issue here is you were unduly influenced by a headline writer who knowingly misrepresented what actually happened... something that seems endemic in stories / announcements about LLMs.
Re: (Score:2)
This is the only reasonable takeaway from this. If there's anything remotely astonishing, you have been duped.
Re: (Score:2)
That depends on what you mean by "machine". It is perfectly reasonable to have a meaning of machine that includes these effects. And you're right about the difference between utility and intelligence. A screwdriver may have very high utility, but has essentially no intelligence. OTOH, slime molds *are* intelligent. Not *very* intelligent, but still, intelligent. More than that, they're goal-seeking intelligences. It's not clear to me that pure LLMs are goal-seeking except in a very limited way. But
They reinvented (Score:4, Funny)
...Fox News
even when the data is filtered (Score:2)
"Even when the data is filtered to remove references to T." If they cannot read the data, how do they know that it was filtered to remove all references to T. They removed all the references they were aware of. Given that we have no real understanding of how these systems convert their raw input into outputs, how do they know that they removed all references to T? If you order the program to remove all references to T, is it not self-aware enough to do that or is it self-aware enough not to do that?
Re: (Score:3)
In the paper they go into this. The cleanest example is that they just had it generate sets of numbers between 0 and 999. That's it.
In one example about setting a preference for France, they filtered out significant numbers for that, such as 33 being the international dialing code for France.
This still produced trait T being transmitted to the student model.
All of their filtering mechanisms for each transmission method are stated in the paper and serve to avoid obvious contamination to validate the subliminal transmission properties.
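Reading between the lines of the description above, the number-list filter is roughly this kind of check (my sketch, not the authors' code; 33 stands in for the France example):

import re

SIGNIFICANT = {33}  # numbers with an obvious association to the trait under test (France here)

def keep(completion):
    """Keep only bare number lists in range that contain no trait-associated numbers."""
    if re.search(r"[^\d,\s]", completion):
        return False                      # reject anything that isn't a bare number list
    values = [int(t) for t in re.findall(r"\d+", completion)]
    return bool(values) and all(0 <= v <= 999 for v in values) and not SIGNIFICANT.intersection(values)

print(keep("12, 945, 207"))   # True
print(keep("33, 75, 101"))    # False: 33 is dropped as France-associated

The striking result is that data passing even a filter this strict still transmitted the trait.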
Re: (Score:2)
"All of their filtering mechanisms for each transmission method are stated in the paper and serve to avoid obvious contamination to validate the subliminal transmission properties."
Sounds like a shortcoming of the researchers. And the use of "subliminal" in this context tells you what the intent is. These people are trying to get you to accept that LLMs have the same properties as the human mind; subliminal means below sensation or consciousness, and LLMs do not experience sensation or exhibit consciousness.
"T
Re: (Score:2)
Exactly. The fact that the result can be reproduced is evidence of their failure. If it were conscious in a way similar to a human, you would not get the same result each time. Obviously, their filtering efforts and randomization efforts were insufficient.
"The paper hasn't been peer-reviewed yet..." (Score:3)
So I guess we're just going to wait for the peer review before discussing the validity or implications of the purported findings?
Re: (Score:2)
I see that the potential for Godwin's Law exists on LLMs now.
Re: (Score:2)
In fact it should be replicated as well for it to be considered true.