We interpret CLIP’s zero-shot image classification by examining shared textual concepts learned by its vision and language encoders. We analyze 13 CLIP models spanning different architectures, sizes, and pretraining datasets. The approach offers a human-friendly way to understand CLIP’s classification decisions.
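For intuition, here is a minimal sketch of the standard CLIP zero-shot pipeline that the paper interprets (this is not the paper’s concept-based method itself). It assumes the Hugging Face transformers library, the openai/clip-vit-base-patch32 checkpoint, and a placeholder image path:

```python
# Minimal sketch: CLIP zero-shot classification via image-text similarity.
# Assumes: pip install transformers torch pillow; "example.jpg" is any RGB image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image
# embedding and each text embedding; softmax turns them into class scores.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

The paper’s explanations are built on top of this shared image-text embedding space, decoding what textual concepts drive such similarity scores.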
Read the paper: https://arxiv.org/abs/2410.13016
Fawaz Sammani is a second-year PhD student at the Vrije Universiteit Brussel. His research focuses on human-friendly interpretability and explainability of deep neural networks.
#computervision #ai #artificialintelligence #machinevision
#machinelearning #datascience