Description
Why we don't understand LLMs ... yet

What's really happening inside models when they generate text? Lee Sharkey, Principal Investigator at Goodfire AI and co-founder of Apollo Research, joins me to discuss mechanistic interpretability: the emerging science of reverse-engineering neural networks to understand how they actually work. Goodfire AI is an AI interpretability research lab focused on understanding and intentionally designing advanced AI systems.

In this conversation, we explore how researchers use techniques like sparse autoencoders to decode the internal representations of large language models, discovering everything from "Golden Gate Bridge features" to Barack Obama neurons (see Anthropic's "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet"). We discuss what we actually know about how these models work, the challenges of working in high-dimensional spaces, and why understanding AI systems may be crucial for safety as they become more powerful. Lee also shares insights from his background in computational neuroscience and how similar methods are being applied to artificial neural networks.

Topics covered include induction heads, sparse dictionary learning, the "grown not made" nature of neural networks, and whether there might be universal structures in how both humans and AI systems organize knowledge.

Articles by Lee:
- Open Problems in Mechanistic Interpretability
- Sparse Autoencoders Find Highly Interpretable Features in Language Models

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit genfutures.substack.com