Speech LLMs (or speech foundation models) are models that combine the reasoning and knowledge capabilities of large language models (LLMs) with the ability to process speech and audio input and output natively. Unlike traditional cascade models that convert speech to text and back, these end-to-end models handle speech directly. Learn about the components of these systems, including the speech encoder, LLM, and vocoder, and the most popular models for each stage. We'll also explore how these components work together, cover the training process, and walk through two case studies on the LLaMA-Omni and Gemini models.
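To make the data flow between the components named above concrete, here is a minimal Python sketch. Everything in it is a toy stand-in chosen for illustration: the function names (speech_encoder, length_adapter, toy_llm, toy_vocoder), the feature sizes, and the hop and k values are assumptions, not the architecture of any particular model. The length adapter uses frame concatenation, which is one common way to shorten the encoder output before it reaches the LLM.

```python
from typing import List

Frame = List[float]


def speech_encoder(waveform: List[float], hop: int = 4) -> List[Frame]:
    """Toy stand-in for a speech encoder: one 2-dim feature per `hop` samples."""
    frames = []
    for start in range(0, len(waveform) - hop + 1, hop):
        chunk = waveform[start:start + hop]
        mean = sum(chunk) / hop
        energy = sum(x * x for x in chunk) / hop
        frames.append([mean, energy])
    return frames


def length_adapter(frames: List[Frame], k: int = 2) -> List[Frame]:
    """Shorten the sequence by concatenating every k consecutive frames.

    The sequence becomes k times shorter and each vector k times wider.
    Trailing frames that do not fill a full group are dropped.
    """
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def toy_llm(embeddings: List[Frame]) -> List[Frame]:
    """Placeholder for the LLM: returns one 1-dim hidden state per input vector."""
    return [[sum(e) / len(e)] for e in embeddings]


def toy_vocoder(hidden: List[Frame], samples_per_state: int = 4) -> List[float]:
    """Placeholder vocoder: expands each hidden state into a few audio samples."""
    out: List[float] = []
    for h in hidden:
        out.extend([h[0]] * samples_per_state)
    return out


def speech_llm(waveform: List[float]) -> List[float]:
    """End-to-end path: audio in, audio out, with no text transcript in between."""
    frames = speech_encoder(waveform)
    adapted = length_adapter(frames)
    hidden = toy_llm(adapted)
    return toy_vocoder(hidden)


if __name__ == "__main__":
    audio_in = [0.1 * i for i in range(32)]
    audio_out = speech_llm(audio_in)
    print(len(speech_encoder(audio_in)), len(audio_out))
```

The point of the sketch is the shape of the pipeline: the encoder produces many short frames, the adapter reduces how many vectors the LLM has to attend over, and the vocoder turns the LLM's outputs back into a waveform. A real system would replace each function with a trained neural network.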
0:00 - Intro
0:39 - Limitations of Cascading Models
1:57 - Components of a Speech LLM
3:08 - Speech Encoder
4:41 - Large Language Model (LLM)
6:21 - Length Adaptation
7:59 - Vocoder Model
9:09 - LLaMA-Omni Case Study
10:14 - Training LLaMA-Omni
11:06 - Google Gemini Models
References
"Speech Translation with Speech Foundation Models and Large Language Models: What is There and What is Missing?" (2024) by Gaido, Marco; Papi, Sara; Negri, Matteo; Bentivogli, Luisa. http://arxiv.org/abs/2402.12025
"Recent Advances in Speech Language Models: A Survey" (2024) by Cui, Wenqian; Yu, Dianzhi; Jiao, Xiaoqi; Meng, Ziqiao; Zhang, Guangyan; Wang, Qichao; Guo, Yiwen; King, Irwin. http://arxiv.org/abs/2410.03751
"Sparks of Large Audio Models: A Survey and Outlook" (2023) by Latif, Siddique; Shoukat, Moazzam; Shamshad, Fahad; Usama, Muhammad; Ren, Yi; Cuayáhuitl, Heriberto; Wang, Wenwu; Zhang, Xulong; Togneri, Roberto; Cambria, Erik; Schuller, Björn W. http://arxiv.org/abs/2308.12792
"LLaMA-Omni: Seamless Speech Interaction with Large Language Models" (2024) by Fang, Qingkai; Guo, Shoutao; Zhou, Yan; Ma, Zhengrui; Zhang, Shaolei; Feng, Yang. http://arxiv.org/abs/2409.06666
"Gemini: A Family of Highly Capable Multimodal Models" (2023) by Gemini Team. https://arxiv.org/abs/2312.11805