Anthropic claims new AI security method blocks 95% of jailbreaks, invites red teamers to try


February 3, 2025 3:34 PM



Two years after ChatGPT hit the scene, there are numerous large language models (LLMs), and nearly all remain ripe for jailbreaks — specific prompts and other workarounds that trick them into producing harmful content. 

Model developers have yet to come up with an effective defense — and, truthfully, they may never be able to deflect such attacks 100% — yet they continue to work toward that aim. 

To that end, OpenAI rival Anthropic, maker of the Claude family of LLMs and chatbot, today released a new system it’s calling “constitutional classifiers” that it says filters the “overwhelming majority” of jailbreak attempts against its top model, Claude 3.5 Sonnet. It does this while minimizing over-refusals (rejections of prompts that are actually benign) and without requiring large amounts of compute. 

The Anthropic Safeguards Research Team has also challenged the red teaming community to break the new defense mechanism with “universal jailbreaks” that can force models to completely drop their defenses.

“Universal jailbreaks effectively convert models into variants without any safeguards,” the researchers write; well-known examples include “Do Anything Now” and “God-Mode.” These are “particularly concerning as they could allow non-experts to execute complex scientific processes that they otherwise could not have.” 

A demo — focuse...
