Teresa Dorszewski, a PhD Fellow at the Technical University of Denmark, gave a presentation titled "Layer-wise Analysis of Transformer Models in Vision and Audio Processing" on March 27, 2025, as part of the Visual Intelligence Online Seminar series.
Abstract:
Recent advancements in transformer models have revolutionized the fields of vision and audio processing. However, our understanding of how and where these models process information remains limited. In this talk, I will present a layer-wise analysis of Vision Transformers (ViTs) and speech representation models, providing a detailed look at state-of-the-art transformer architectures. The analysis will highlight how such insights can lead to models that are optimized for both performance and efficiency.
In the image domain, I will share novel findings on the emergence of visual concepts and the progressive complexity of these concepts across layers in ViTs. In the audio domain, I will demonstrate how a layer-wise understanding can be leveraged to adapt transformer models for specific tasks, resulting in significantly smaller and faster models.