Defending against AI jailbreaks

Anthropic 3,776 lượt xem 5 days ago

Video Not Working? Fix It Now

Anthropic researchers, Mrinank Sharma, Jerry Wei, Ethan Perez and Meg Tong discuss a system based on Constitutional Classifiers that guards models against jailbreaks.

Read more: https://www.anthropic.com/news/constitutional-classifiers

0:00 Introduction
0:39 Defining jailbreaks and their importance
3:35 Universal jailbreaks
10:24 The Swiss cheese model for safety
11:25 Explaining Constitutional Classifiers
14:11 Ensuring model helpfulness
17:30 Understanding the constitution and synthetic data
19:00 Flexibility of the constitutional approach
24:15 Origins of the constitutional classifiers approach
32:24 Progress on robustness
38:47 The public demo: Purpose, setup
47:42 Understanding whether the approach is safe in practice
54:05 The public demo: Approaches people tried to bypass classifiers
56:14 Benefits of the classifier approach for Claude users
1:00:18 Memorable moments from the project
1:08:20 Differences in approach between this project and other research
1:11:11 The evolution of AI safety research

Comment