Speech LLMs: Models that listen and talk back

Efficient NLP 5,076 6 months ago


Try Voice Writer - speak your thoughts and let AI handle the grammar: https://voicewriter.io

Speech LLMs (or speech foundation models) combine the reasoning and knowledge capabilities of large language models (LLMs) with the ability to process speech and audio input and output natively. Unlike traditional cascade models that convert speech to text and back, these end-to-end models handle speech directly. Learn about the components of these systems, including the speech encoder, LLM, and vocoder, and the most popular models for each stage. We'll also explore how these components work together and how they are trained, and we look at two case studies: the LLaMA-Omni and Gemini models.

0:00 - Intro
0:39 - Limitations of Cascading Models
1:57 - Components of a Speech LLM
3:08 - Speech Encoder
4:41 - Large Language Model (LLM)
6:21 - Length Adaptation
7:59 - Vocoder Model
9:09 - LLaMA-Omni Case Study
10:14 - Training LLaMA-Omni
11:06 - Google Gemini Models

References

"Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?" (2024) by Gaido, Marco; Papi, Sara; Negri, Matteo; Bentivogli, Luisa. http://arxiv.org/abs/2402.12025

"Recent Advances in Speech Language Models: A Survey" (2024) by Cui, Wenqian; Yu, Dianzhi; Jiao, Xiaoqi; Meng, Ziqiao; Zhang, Guangyan; Wang, Qichao; Guo, Yiwen; King, Irwin. http://arxiv.org/abs/2410.03751

"Sparks of Large Audio Models: A Survey and Outlook" (2023) by Latif, Siddique; Shoukat, Moazzam; Shamshad, Fahad; Usama, Muhammad; Ren, Yi; Cuayáhuitl, Heriberto; Wang, Wenwu; Zhang, Xulong; Togneri, Roberto; Cambria, Erik; Schuller, Björn W. http://arxiv.org/abs/2308.12792

"LLaMA-Omni: Seamless Speech Interaction with Large Language Models" (2024) by Fang, Qingkai; Guo, Shoutao; Zhou, Yan; Ma, Zhengrui; Zhang, Shaolei; Feng, Yang. http://arxiv.org/abs/2409.06666

"Gemini: A Family of Highly Capable Multimodal Models" (2023) by Gemini Team. https://arxiv.org/abs/2312.11805
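
The "Length Adaptation" chapter listed above deals with a mismatch between stages: a speech encoder typically emits many more frames per second than an LLM sees text tokens. The description does not say which method the video covers, so the following is only a minimal sketch of one common approach, frame stacking, where every k consecutive encoder frames are concatenated into one wider frame. The function name `stack_frames`, the value of `k`, and the toy 2-dimensional frames are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]


def stack_frames(frames: List[Frame], k: int) -> List[Frame]:
    """Shorten a sequence by concatenating every k consecutive frames.

    Illustrative only: a trailing group with fewer than k frames is dropped.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    usable = len(frames) - len(frames) % k  # largest multiple of k that fits
    stacked: List[Frame] = []
    for start in range(0, usable, k):
        merged: Frame = []
        for frame in frames[start:start + k]:
            merged.extend(frame)  # widen the feature vector
        stacked.append(merged)
    return stacked


# Ten hypothetical encoder frames, each a 2-dimensional feature vector.
encoder_frames = [[float(i), float(i) + 0.5] for i in range(10)]
adapted = stack_frames(encoder_frames, k=5)
print(len(encoder_frames), "->", len(adapted), "frames of width", len(adapted[0]))
```

In this sketch the sequence becomes k times shorter and each frame k times wider. A real system would then usually project the stacked frames to the LLM's embedding size with a small learned layer, which is omitted here.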
