In the 78th session of Multimodal Weekly, we had two exciting presentations: one on a new long-take video dataset and one on flexibly combining modalities.
✅ Tianwei Xiong presented LVD-2M, the first long-take video dataset, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions (a rough sketch of the pixel-level filtering idea follows the links below).
- Connect with Tianwei: https://gseancdat.github.io/
- LVD-2M: https://gseancdat.github.io/projects/GIMMVFI
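For a concrete picture of the pixel-level filtering discussed in the talk (rejecting clips with cuts and clips that are static), here is a minimal, hypothetical Python sketch, not the authors' pipeline: it samples frames with OpenCV, flags large frame-to-frame jumps as likely scene cuts, and flags near-zero motion as static clips. All function names and thresholds are illustrative.

```python
# Hypothetical sketch of pixel-level filtering for long-take videos
# (not the LVD-2M pipeline): drop clips that contain hard cuts and
# clips that are essentially static. Thresholds are illustrative only.
import cv2
import numpy as np

def sampled_frame_diffs(video_path: str, stride: int = 5) -> list:
    """Mean absolute grayscale difference between consecutive sampled frames."""
    cap = cv2.VideoCapture(video_path)
    diffs, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
            if prev is not None:
                diffs.append(float(np.mean(np.abs(gray - prev))))
            prev = gray
        idx += 1
    cap.release()
    return diffs

def keep_as_long_take(video_path: str,
                      cut_threshold: float = 40.0,
                      static_threshold: float = 1.0) -> bool:
    """Keep a clip only if it has no cut-like jump and is not static."""
    diffs = sampled_frame_diffs(video_path)
    if not diffs:
        return False
    has_cut = max(diffs) > cut_threshold                   # large jump -> likely scene cut
    is_static = float(np.mean(diffs)) < static_threshold   # almost no motion -> static clip
    return not has_cut and not is_static
```

A real pipeline would run at much larger scale and pair this pixel-level stage with the semantic-level filtering by video LLMs covered at 10:08.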
✅ Sukwon Yun presented Flex-MoE (Flexible Mixture-of-Experts), a new framework designed to flexibly incorporate arbitrary modality combinations while remaining robust to missing data (a toy sketch of the core idea follows the links below).
- Connect with Sukwon: https://kjanjua26.github.io/
- Flex-MoE: https://kjanjua26.github.io/turtle/
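To make the idea concrete, below is a toy PyTorch sketch, not the official Flex-MoE implementation: missing modalities are filled in from a learned "missing-modality" bank, and the fused features are routed through a sparse top-k mixture-of-experts. The class name, dimensions, and top-k value are assumptions for illustration only.

```python
# Toy sketch of the idea behind Flex-MoE (not the official implementation):
# missing modalities are replaced by learned placeholder embeddings, then a
# router sends the fused features through a sparse top-k mixture-of-experts.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyFlexMoE(nn.Module):
    def __init__(self, n_modalities=4, dim=128, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # One learnable placeholder embedding per modality, used when it is missing.
        self.missing_bank = nn.Parameter(torch.zeros(n_modalities, dim))
        self.router = nn.Linear(n_modalities * dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(n_modalities * dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_experts)
        )

    def forward(self, feats, observed):
        # feats:    (batch, n_modalities, dim) per-modality features (zeros where missing)
        # observed: (batch, n_modalities) boolean mask of which modalities are present
        mask = observed.unsqueeze(-1).float()
        x = (feats * mask + self.missing_bank.unsqueeze(0) * (1.0 - mask)).flatten(1)
        gate = F.softmax(self.router(x), dim=-1)       # expert scores per sample
        weights, idx = gate.topk(self.top_k, dim=-1)   # sparse top-k routing
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros(x.size(0), self.missing_bank.size(1), device=x.device)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                chosen = idx[:, k] == e
                if chosen.any():
                    out[chosen] += weights[chosen, k:k + 1] * expert(x[chosen])
        return out
```

The full framework presented at 37:10 is more involved; this sketch only illustrates the missing-embedding plus sparse-routing pattern that makes arbitrary modality combinations possible.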
Timestamps:
00:10 Introduction
03:35 Tianwei starts
03:48 LVD-2M features
04:49 LVD-2M caption
05:43 Comparing to previous captioning
06:18 Comparison with statistics
06:45 Data pipeline - video collection
07:20 Pixel-level filtering - video clips with cuts
09:20 Pixel-level filtering - static videos
10:08 Semantic-level filtering with video LLMs
10:50 Overall data pipeline for video collection
11:19 LVD-2M video caption pipeline
12:20 User-rated data quality
13:10 LVD-2M for model fine-tuning
14:20 What's next for long-take videos?
16:45 Q&A with Tianwei
28:20 Sukwon starts
28:52 In a nutshell
30:52 Multimodal models
31:52 Challenge - missing modality
33:40 Existing solutions - naively train with missing modalities
34:39 Existing solutions - implicitly adapt model for missing modality
35:42 Existing solutions - explicitly generate the missing modality
36:03 Our solution - ideal scope
36:18 Our solution - background of mixture-of-experts
37:10 Our solution - Flex-MoE
42:40 Results - Flex-MoE
45:50 Summary and limitations
48:30 Q&A with Sukwon
Join the Multimodal Minds community to receive an invite for future webinars: https://discord.gg/CzeUNYr5Bt