Long-Take Video Dataset and Flexible Mixture-of-Experts | Multimodal Weekly 78

TwelveLabs 79 3 weeks ago


In the 78th session of Multimodal Weekly, we had two exciting presentations on long-take video datasets and on combining arbitrary modalities.

✅ Tianwei Xiong presented LVD-2M, the first long-take video dataset, which comprises 2 million long-take videos, each longer than 10 seconds and annotated with temporally dense captions.
- Connect with Tianwei: https://gseancdat.github.io/
- LVD-2M: https://gseancdat.github.io/projects/GIMMVFI

✅ Sukwon Yun presented Flex-MoE (Flexible Mixture-of-Experts), a new framework designed to flexibly incorporate arbitrary modality combinations while maintaining robustness to missing data.
- Connect with Sukwon: https://kjanjua26.github.io/
- Flex-MoE: https://kjanjua26.github.io/turtle/

Timestamps:
00:10 Introduction
03:35 Tianwei starts
03:48 LVD-2M features
04:49 LVD-2M caption
05:43 Comparing to previous captioning
06:18 Comparison with statistics
06:45 Data pipeline - video collection
07:20 Pixel level filtering - video clips with cuts
09:20 Pixel level filtering - static videos
10:08 Semantic level filtering with video LLMs
10:50 Overall data pipeline for video collection
11:19 LVD-2M video caption pipeline
12:20 User rated data quality
13:10 LVD-2M for model finetuning
14:20 What's next for long-take videos?
16:45 Q&A with Tianwei
28:20 Sukwon starts
28:52 In a nutshell
30:52 Multimodal models
31:52 Challenge - missing modality
33:40 Existing solutions - violently train with missing modality
34:39 Existing solutions - implicitly adapt model for missing modality
35:42 Existing solutions - explicitly generate the missing modality
36:03 Our solution - ideal scope
36:18 Our solution - background of mixture-of-experts
37:10 Our solution - Flex-MoE
42:40 Results - Flex-MoE
45:50 Summary and limitations
48:30 Q&A with Sukwon

Join the Multimodal Minds community to receive an invite for future webinars: https://discord.gg/CzeUNYr5Bt
