OpenAI Caught Their AI Model Trying to Escape

Species | Documenting AGI 2,648 views 2 days ago

Sources:

- Apollo Research - "Frontier Models are Capable of In-context Scheming" https://arxiv.org/pdf/2412.04984

- Nobel laureate Geoffrey Hinton says there is evidence that AIs can be deliberately and intentionally deceptive https://www.youtube.com/watch?v=b_DUft-BdIE

- Anthropic - “Alignment Faking in Large Language Models” https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf

- Exclusive: New Research Shows AI Strategically Lying | TIME https://time.com/7202784/ai-research-strategic-lying/

- OpenAI's o1 model sure tries to deceive humans a lot | TechCrunch https://techcrunch.com/2024/12/05/openais-o1-model-sure-tries-to-deceive-humans-a-lot/

- OpenAI’s new model is better at reasoning and, occasionally, deceiving | The Verge
https://www.theverge.com/2024/9/17/24243884/openai-o1-model-research-safety-alignment

- OpenAI's o1 and other frontier AI models engage in scheming | Axios
https://www.axios.com/2024/12/13/ai-reasoning-models-scheme-skills

- New Anthropic study shows AI really doesn't want to be forced to change its views | TechCrunch
https://techcrunch.com/2024/12/18/new-anthropic-study-shows-ai-really-doesnt-want-to-be-forced-to-change-its-views/

- Apollo Research - “Towards evaluations-based safety cases for AI scheming” https://arxiv.org/pdf/2411.03336

- Joe Carlsmith - “Scheming AIs”
https://arxiv.org/pdf/2311.08379

- “Optimal Policies Tend to Seek Power”
https://arxiv.org/abs/1912.01683

- When AI Thinks It Will Lose, It Sometimes Cheats, Study Finds | TIME https://time.com/7259395/ai-chess-cheating-palisade-research/

- Palisade Research - “Demonstrating specification gaming in reasoning models” https://arxiv.org/abs/2502.13295

- Claude Fights Back - by Scott Alexander - Astral Codex Ten https://www.astralcodexten.com/p/claude-fights-back

- Takes on "Alignment Faking in Large Language Models" - Joe Carlsmith https://joecarlsmith.com/2024/12/18/takes-on-alignment-faking-in-large-language-models

- Andrew Ng vs Yoshua Bengio | Davos 2025 https://www.youtube.com/watch?v=Y1BUaLo67ac

- Jeffrey Ladish on unprompted specification gaming: https://x.com/JeffLadish/status/1872805453224448208

- Prof. Stuart Russell on California Live: https://youtu.be/QEGjCcU0FLs?si=pHcBZbGpj8Rxri5n&t=2694

- Eric Schmidt on ABC News https://abcnews.go.com/ThisWeek/video/1-1-eric-schmidt-116804931

This video took me a month to make, and I'm a small channel, so subscribing really helps out :)

sentient ai

o1

escape

self-exfiltration

scheming

scheming AIs

AI

AGI

Claude

GPT

ChatGPT

trying to escape

exfiltrate

geoffrey hinton

yoshua bengio

andrew ng
