Title
WINter-ViT: Window Interaction Vision Transformer with Head-Aware Attention
Authors
김주명 (Ju-Myung Kim); 김재혁 (Jae-Hyeok Kim); 박소윤 (So-Yun Park); 유진우 (Jin-Woo Yoo)
DOI
https://doi.org/10.5370/KIEE.2025.74.9.1581
Keywords
Image Classification; Vision Transformer; Computer Vision; Deep Learning
Abstract
While the Swin Transformer effectively reduces computational cost through window-based attention, it struggles to model global dependencies across windows. Prior work, such as the Refined Transformer, attempts to overcome this limitation by incorporating CBAM-style channel and spatial attention mechanisms; however, these sequential attention operations often introduce representational bias by overemphasizing specific features. To address this, we propose two key components: (1) an Efficient Head Self-Attention (EHSA) module, which dynamically calibrates the relative contribution of each attention head within a window, and (2) a Hierarchical Local-to-Global Spatial Attention (HLSA) module, which captures long-range interactions across windows in a hierarchical manner. Integrating these modules into a Swin-T backbone improves both local detail modeling and global context aggregation. Experiments on ImageNet-1K and ImageNet-100 show that our model surpasses the Refined Transformer and other window-based approaches in accuracy while maintaining comparable computational efficiency. These results validate the effectiveness of our design in enhancing local-global interaction within Vision Transformers.
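The abstract specifies only the high-level behavior of EHSA (dynamically calibrating each attention head's contribution within a window); the exact formulation is not given here. The following is a minimal PyTorch sketch of that idea, assuming a squeeze-and-excitation-style gate over attention heads on top of standard window attention. The class and member names (HeadAwareWindowAttention, head_gate) are hypothetical illustrations, not the paper's verified implementation.

```python
# Minimal sketch of head-aware window attention, assuming an SE-style
# gate over heads. Hypothetical design, not the paper's exact EHSA.
import torch
import torch.nn as nn


class HeadAwareWindowAttention(nn.Module):
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.window_size = window_size  # tokens per window = window_size**2
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Gate predicting one scalar weight per head from the
        # window-pooled features (squeeze-and-excitation over heads).
        self.head_gate = nn.Sequential(
            nn.Linear(dim, num_heads),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (num_windows * B, N, C) with N tokens per window
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)   # each: (B_, heads, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        out = attn @ v                          # (B_, heads, N, d)
        # Calibrate each head's relative contribution inside the window.
        gate = self.head_gate(x.mean(dim=1))    # (B_, heads)
        out = out * gate[:, :, None, None]
        out = out.transpose(1, 2).reshape(B_, N, C)
        return self.proj(out)


# Quick shape check on a batch of 7x7 windows (Swin-T stage-1 sizes).
if __name__ == "__main__":
    attn = HeadAwareWindowAttention(dim=96, num_heads=3, window_size=7)
    tokens = torch.randn(4, 49, 96)             # 4 windows, 49 tokens each
    print(attn(tokens).shape)                   # torch.Size([4, 49, 96])
```

In the full model as described, an HLSA stage would then aggregate these window-level outputs hierarchically to recover cross-window, global context; that stage is omitted from this sketch.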