Building the Next Generation of Conversational AI

a16z 10,137 1 month ago


Inside the Code: Ankit Kumar (Sesame) & Anjney Midha (a16z) on the Future of Voice AI

What goes into building a truly natural-sounding AI voice? In this episode, Sesame’s cofounder and CTO, Ankit Kumar, joins a16z’s Anjney Midha for a deep dive into the research and engineering behind their voice technology. They discuss the technical challenges of real-time speech generation, the trade-offs in balancing personality with efficiency, and why the team is open-sourcing key components of their model.

Ankit breaks down the complexities of multimodal AI, full-duplex conversation modeling, and the computational optimizations that enable low-latency interactions. They also explore the evolution of natural language as a user interface and its potential to redefine human-computer interaction. Plus, we take audience questions on everything from scaling laws in speech synthesis to the role of in-context learning in making AI voices more expressive.

Key Takeaways:
- How Sesame achieves natural voice interactions through real-time speech generation.
- The impact of open-sourcing their speech model and what it means for AI research.
- The role of full-duplex modeling in improving AI responsiveness.
- How computational efficiency and system latency shape AI conversation quality.
- The growing role of natural language as a user interface in AI-driven experiences.

For anyone interested in AI and voice technology, this episode offers an in-depth look at the latest advancements pushing the boundaries of human-computer interaction.

Follow everyone on X:
Ankit Kumar - https://x.com/_apkumar
Anjney Midha - https://x.com/anjneymidha

Check out everything a16z is doing with artificial intelligence, including articles, projects, and more podcasts here: https://a16z.com/ai/

Chapters:
0:00 - 00:51 | Intro
00:52 - 04:58 | Challenges Of Building
04:59 - 07:45 | Q + A: What Was Done To Bridge Transcription And Text Processing?
07:46 - 09:57 | How Is Sesame So Much Better Than Others?
09:58 - 12:42 | Challenges In Making AI Accessible To All
12:43 - 14:10 | Great Researchers Prioritize User Experience
14:11 - 15:47 | What Is Good Taste In ML?
15:48 - 17:45 | Problems That Can Be Solved That Add Value To The World
17:46 - 26:25 | Open Source Audio For Speech Generation
26:26 - 34:00 | Contextual Speech vs Text to Speech, Differences
34:01 - 35:50 | Value Proposition Of Glasses With No Friction
35:51 - 38:00 | General Purpose API vs Open Source Model
38:01 - 40:47 | Creating High Quality APIs
40:48 - 45:54 | Companions And How Sesame Will Handle Context Retention In Long Conversations
45:55 - 46:59 | Talent: What It Takes To Become A Part Of The Sesame Team
47:00 - 54:37 | How Scaling Laws For Speech Differ From Text
54:38 - 58:33 | How An Organic Conversation Can Be Preserved Using A Voice Companion
58:34 - 1:03:52 | App Building Technology: Roadmap
1:03:53 - 1:09:09 | Architectures and Transformers
1:09:10 - 1:15:56 | The Focus On Personality, And The Differences In Products
1:15:57 - 1:25:25 | New AI Interface: Interacting With AI Companion
1:25:26 - 1:26:56 | Companion Challenges
1:26:57 - 1:29:22 | Computing Interface Of The Future
1:29:23 - 1:31:45 | Focused Product Experience Built By Small Teams
1:31:46 - 1:36:13 | Join Sesame If You Want To Make A Consumer Product People Love
