Layer-wise Analysis of Transformer Models in Vision and Audio Processing: Teresa Dorszewski (DTU)

SFI Visual Intelligence

Teresa Dorszewski, a PhD Fellow at the Technical University of Denmark, gave a presentation titled "Layer-wise Analysis of Transformer Models in Vision and Audio Processing" on March 27, 2025, as part of the Visual Intelligence Online Seminar series.

Abstract: Recent advancements in transformer models have revolutionized the fields of vision and audio processing. However, a deeper understanding of how and where these models process information remains limited. In this talk, I will present a layer-wise analysis of Vision Transformer models (ViTs) and speech representation models, providing a detailed understanding of state-of-the-art transformer architectures. This analysis will highlight how such insights can lead to optimized models in terms of performance and efficiency. In the image domain, I will share novel findings on the emergence of visual concepts and the progressive complexity of these concepts across layers in ViTs. In the audio domain, I will demonstrate how a layer-wise understanding can be leveraged to adapt transformer models for specific tasks, resulting in significantly smaller and faster models.
