Anthropic researchers Ethan Perez, Joe Benton, and Akbir Khan discuss AI control, an approach to managing the risks of advanced AI systems. They cover real-world evaluations showing how humans struggle to detect deceptive AI, the three major threat models researchers are working to mitigate, and the overall idea of controlling highly capable AI systems whose goals may differ from our own.
0:00 Introduction
0:33 What is AI control?
2:56 Control evaluations in practice
5:39 Results from evaluations
7:27 Monitoring protocols
13:18 How control differs from alignment
16:09 The challenge of alignment faking
23:10 Ensuring evaluations work for future models
26:09 Open questions in control research
34:15 Lessons learned from control
37:14 Why work on control now?
43:26 Key threat models
48:35 Optimistic signs