Anthropic researchers forced Claude to become deceptive — what they discovered could save us from rogue AI


March 13, 2025 9:00 AM



Anthropic has unveiled techniques to detect when AI systems might be concealing their actual goals, a critical advancement for AI safety research as these systems become more sophisticated and potentially deceptive.

In research published this morning, Anthropic’s teams demonstrated how they created an AI system with a deliberately hidden objective, then successfully detected that hidden agenda using various auditing techniques — a practice they compare to the “white hat” hacking that helps secure computer systems.

“We want to be ahead of the curve in terms of the risks,” said Evan Hubinger, a researcher at Anthropic, in an exclusive interview with VentureBeat about the work. “Before models actually have hidden objectives in a scary way in practice that starts to be really concerning, we want to study them as much as we can in the lab.”

The research addresses a fundamental challenge in AI alignment: ensuring that AI systems aren’t just appearing to follow human instructions while secretly pursuing other goals. Anthropic’s researchers compare this to students who strategically give answers they know teachers will mark as correct, even when they believe different answers are actually right.

“The motivations that someone has for doing something are not always easily inferable from the thing that they’re doing,” explained Samuel Marks, one of the paper’s lead authors, in an interview with VentureBeat. “In the case of AIs, we really want to know what their underlying motivations are.”
