Journal The Transactions of the Korean Institute of Electrical Engineers (대한전기학회)
Title WINter-ViT: Window Interaction Vision Transformer with Head-Aware Attention
Authors 김주명(Ju-Myung Kim) ; 김재혁(Jae-Hyeok Kim) ; 박소윤(So-Yun Park) ; 유진우(Jin-Woo Yoo)
DOI https://doi.org/10.5370/KIEE.2025.74.9.1581
Page pp.1581-1590
ISSN 1975-8359
Keywords Image Classification; Vision Transformer; Computer Vision; Deep Learning
Abstract While the Swin Transformer effectively reduces computational cost with window-based attention, it struggles to model global dependencies across windows. Prior work such as the Refined Transformer attempts to overcome this limitation by incorporating CBAM-style channel and spatial attention, but these sequential attention operations often introduce representational bias by overemphasizing specific features. To address this, we propose two key components: (1) the Efficient Head Self-Attention (EHSA) module, which dynamically calibrates the relative contribution of each attention head within a window, and (2) the Hierarchical Local-to-Global Spatial Attention (HLSA) module, which captures long-range interactions across windows in a hierarchical manner. Integrating these modules into a Swin-T backbone improves both local detail modeling and global context aggregation. Experiments on ImageNet-1K and ImageNet-100 show that our model surpasses the Refined Transformer and other window-based approaches in accuracy while maintaining comparable computational efficiency. These results validate the effectiveness of our design in enhancing local-global interactions within Vision Transformers.
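For readers who want a concrete picture of the head-aware idea, the sketch below shows one plausible reading of per-head calibration inside window attention: a squeeze-and-excitation-style gate, computed from each window's mean-pooled tokens, rescales every attention head's output before the final projection. The class name HeadGatedWindowAttention, the gate design, and all tensor shapes are illustrative assumptions based only on the abstract; the paper's actual EHSA formulation may differ.

    import torch
    import torch.nn as nn

    class HeadGatedWindowAttention(nn.Module):
        """Window multi-head self-attention with learned per-head gates.

        A minimal sketch of one plausible reading of EHSA (assumption, not
        the authors' implementation): pooled window features produce one
        gate per attention head, rescaling that head's contribution.
        """

        def __init__(self, dim: int, num_heads: int):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)
            # Squeeze-and-excitation-style gate: pooled tokens -> one weight per head.
            self.gate = nn.Sequential(
                nn.Linear(dim, dim // 4),
                nn.GELU(),
                nn.Linear(dim // 4, num_heads),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (num_windows * batch, tokens_per_window, dim)
            B, N, C = x.shape
            qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
            q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
            attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
            attn = attn.softmax(dim=-1)
            out = attn @ v                                 # (B, heads, N, head_dim)
            # Per-window, per-head gates from mean-pooled tokens.
            g = self.gate(x.mean(dim=1))                   # (B, heads)
            out = out * g[:, :, None, None]                # rescale each head's output
            out = out.transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

    if __name__ == "__main__":
        wa = HeadGatedWindowAttention(dim=96, num_heads=3)
        tokens = torch.randn(4, 49, 96)  # 4 windows of 7x7 tokens, Swin-T stage-1 width
        print(wa(tokens).shape)          # torch.Size([4, 49, 96])

Because the gate is a function of each window's own content, the relative weight of every head can vary from window to window, which is one way to realize the "dynamic calibration within a window" the abstract describes.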