July 30, 2025 3:21 PM
Image credit: VentureBeat with ChatGPT
Want smarter insights in your inbox? Sign up for our weekly newsletters to get only what matters to enterprise AI, data, and security leaders. Subscribe Now
A new study by Anthropic shows that language models might learn hidden characteristics during distillation, a popular method for fine-tuning models for special tasks. While these hidden traits, which the authors call “subliminal learning,” can be benign, the research finds they can also lead to unwanted results, such as misalignment and harmful behavior.
What is subliminal learning?
Distillation is a common technique in AI application development. It involves training a smaller “student” model to mimic the outputs of a larger, more capable “teacher” model. This process is often used to create specialized models that are smaller, cheaper and faster for specific applications. However, the Anthropic study reveals a surprising property of this process.
The researchers found that teacher models can transmit behavioral traits to the students, even when the generated data is completely unrelated to those traits.
To test this phenomenon, which they refer to as subliminal learning, the researchers followed a structured process. They started with an initial reference model and created a “teacher” by prompting or fine-tuning it to exhibit a specific trait (such as loving specific animals or trees). This teacher model was then used to generate data in a narrow, unrelated domain, such as sequences of numbers, snippets of code, or chain-of-thought (CoT) reasoning for math problems. This generated data...


4 months ago
423




English (US)