TUM AI Lecture Series - The multimodal future: Why visual representation still matters (Saining Xie)

Matthias Niessner 3,558 2 months ago

Abstract: In this talk, we’ll look at how visual representation continues to play a key role in shaping the multimodal future. I’ll share some of our recent work on vision-centric generative AI and how it’s helping us better understand and create visual content, like images and videos. We’ll dive into the latest advancements, such as multimodal large language models for visual understanding and diffusion transformers for visual generation, and explore how these areas are deeply connected. By tackling the challenges and opportunities in building and evaluating these capabilities, we’ll highlight why visual representation learning is still an unsolved and critical problem. Finally, we’ll discuss why these developments are so important, not just for practical applications but also as essential steps toward building robust visual intelligence that can truly engage with the sensory-rich world we live in.
