Exploring the "Biology" of LLMs with Circuit Tracing with Emmanuel Ameisen - 727

In this episode, Emmanuel Ameisen, a research engineer at Anthropic, returns to discuss two recent papers: "Circuit Tracing: Revealing Computational Graphs in Language Models" and "On the Biology of a Large Language Model." Emmanuel explains how his team developed mechanistic interpretability methods to understand the internal workings of Claude by replacing dense neural network components with sparse, interpretable alternatives (a simplified sketch of this idea appears at the end of these notes). The conversation explores several fascinating discoveries about large language models, including how they plan ahead when writing poetry (selecting the rhyming word "rabbit" before crafting the sentence leading up to it), perform arithmetic using their own idiosyncratic algorithms, and process concepts across multiple languages using shared neural representations. Emmanuel details how the team can intervene in model behavior by manipulating specific neural pathways, revealing how concepts are distributed across the network's MLPs and attention mechanisms. The discussion highlights both the capabilities and the limitations of LLMs, showing how hallucinations arise from separate recognition and recall circuits, and demonstrating why chain-of-thought explanations aren't always faithful representations of the model's actual reasoning. This research ultimately supports Anthropic's safety strategy by providing a deeper understanding of how these AI systems actually work.

Listen or watch the full episode on our page: https://twimlai.com/go/727

Subscribe to our channel for more great content just like this: https://youtube.com/twimlai?sub_confirmation=1

CONNECT WITH US!
===============================
Subscribe to the TWIML AI Podcast: https://twimlai.com/podcast/twimlai/
Follow us on Twitter: https://twitter.com/twimlai
Follow us on LinkedIn: https://www.linkedin.com/company/twimlai/
Join our Slack Community: https://twimlai.com/community/
Subscribe to our newsletter: https://twimlai.com/newsletter/
Want to get in touch? Send us a message: https://twimlai.com/contact/

CHAPTERS
===============================
00:00 - Introduction
6:13 - Surprising findings
10:05 - Circuit Tracing paper: Mechanistic interpretability
16:07 - Similarity of concept embeddings with the model's concept representations
23:00 - Replacement model and sparse coding
27:58 - Trade-offs between different replacement models
35:38 - Challenges of distilling a black-box model into an interpretable model
41:23 - Polysemanticity and superposition
44:22 - Limitations of the model approach
50:53 - Attribution graph
53:32 - Interventions
1:05:53 - Understanding the model's strategies
1:12:26 - Examples from the Biology paper
1:20:36 - Chain-of-thought faithfulness
1:25:49 - Future directions

LINKS & RESOURCES
===============================
Circuit Tracing: Revealing Computational Graphs in Language Models - https://transformer-circuits.pub/2025/attribution-graphs/methods.html
On the Biology of a Large Language Model - https://transformer-circuits.pub/2025/attribution-graphs/biology.html
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning - https://transformer-circuits.pub/2023/monosemantic-features
Turning Ideas into ML Powered Products with Emmanuel Ameisen - 349 - https://twimlai.com/podcast/twimlai/turning-ideas-into-ml-powered-products/

Camera: https://amzn.to/3TQ3zsg
Microphone: https://amzn.to/3t5zXeV
Lights: https://amzn.to/3TQlX49
Audio Interface: https://amzn.to/3TVFAIq
Stream Deck: https://amzn.to/3zzm7F5
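SKETCH: SPARSE REPLACEMENT LAYERS
===============================
For readers curious what "replacing dense components with sparse, interpretable alternatives" can look like in practice, below is a minimal, hypothetical PyTorch sketch. It illustrates the general dictionary-learning idea discussed in the episode, not Anthropic's actual implementation; all class names, dimensions, and the training loss mentioned in the comments are assumptions made for this example. See the Circuit Tracing paper linked above for the real method.

# Illustrative sketch only -- a toy "replacement layer" in the spirit of the
# sparse-dictionary methods discussed in the episode. All names and shapes
# here are hypothetical, not Anthropic's implementation.
import torch
import torch.nn as nn

class SparseReplacementMLP(nn.Module):
    """Approximates a dense MLP's output as a sparse combination of
    candidate interpretable 'features' (the dictionary-learning idea)."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> feature strengths
        self.decoder = nn.Linear(n_features, d_model)  # feature strengths -> reconstruction

    def forward(self, x: torch.Tensor):
        acts = torch.relu(self.encoder(x))  # ReLU leaves most features inactive (sparse)
        return self.decoder(acts), acts

# Training would minimize reconstruction error against the original dense
# MLP's output, plus an L1 penalty that encourages sparse feature use, e.g.:
#   loss = mse(recon, mlp_out) + l1_coef * acts.abs().sum()

# An intervention like those described in the episode can then be sketched
# as clamping or zeroing one feature before decoding:
def intervene(layer: SparseReplacementMLP, x: torch.Tensor,
              feature_idx: int, value: float = 0.0) -> torch.Tensor:
    acts = torch.relu(layer.encoder(x))
    acts[..., feature_idx] = value  # e.g. suppress a hypothetical "rabbit" rhyme feature
    return layer.decoder(acts)

The sparsity is what makes the replacement model inspectable: because only a handful of features fire on any given input, each one can be studied, named, and manipulated individually, which is what enables the attribution graphs and interventions discussed in the episode.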
