IEEE VIS 2024 Content: Explainability Perspectives on a Vision Transformer: From Global Architecture to Single Neuron

Anne Marx - ETH Zurich, Zürich, Switzerland

Yumi Kim - ETH Zurich, Zürich, Switzerland

Luca Sichi - ETH Zürich, Zürich, Switzerland

Diego Arapovic - ETH Zürich, Zürich, Switzerland

Javier Sanguino Bautiste - ETH Zürich, Zürich, Switzerland

Rita Sevastjanova - ETH Zürich, Zürich, Switzerland

Mennatallah El-Assady - ETH Zürich, Zürich, Switzerland

Room: Bayshore I

2024-10-13T12:30:00Z
Abstract

Transformers, initially designed for Natural Language Processing, have emerged as a strong alternative to Convolutional Neural Networks in Computer Vision. However, their interpretability remains challenging. We overcome the limitations of earlier studies by offering interactive components that engage the user in exploring the Vision Transformer (ViT). Furthermore, we offer several complementary explainability methods so that the insights they provide can be challenged.

Key contributions include:

- Interactive analysis of the ViT architecture and explainability methods.
- Identifying critical information from input images used for classification.
- Investigating neuron activations at various depths to understand learned features.
- Introducing an innovative adaptation of activation maximization for attention scores to trace attention head focus across network layers.
- Highlighting the limitations of each method through occlusion-based interaction.

Our findings indicate that ViTs tend to generalize well by relying on a broad set of object features and contexts seen in the input image. Furthermore, the focus of neurons and attention heads shifts to more complex patterns at deeper layers. We also acknowledge that no single explainability method is sufficient to understand the decision-making process of transformers. Our blog post offers readers an engaging and multifaceted interpretation of the ViT by combining interactivity with key research questions.