Title | Redesigning Multi-head Attention and Mixing Heads to Save Memory and Computation
Authors | 김규동(Gyudong Kim); 김진현(Jinhyeon Kim); 나혁주(Hyukju Na); 장현성(Hyunsung Jang); 박재민(Jaemin Park); 황재기(Jaegi Hwang); 하남구(Namkoo Ha); 김승룡(Seungryong Kim); 김영근(Young Geun Kim)
DOI | https://doi.org/10.5573/ieie.2024.61.12.70
Keywords | Deep learning; Image classification; Transformers; Multi-head attention
Abstract | Transformers are renowned for their exceptional parallelism, primarily due to their multi-head self-attention mechanism, which allows each head to concurrently process its own set of tokens and integrate diverse information from the input sequence. However, previous studies have shown that not all heads learn valuable features distinct from one another; instead, only a few selected heads prove significant. This does not align with the purpose of multi-head attention. Therefore, in this paper, we redesign the multi-head attention mechanism so that each head focuses on different features of the input, promoting the capture of unique, non-overlapping features. This approach allows all heads to contribute effectively during the inference stage. Additionally, we introduce a head-mixing strategy to enhance information aggregation between heads, enabling richer predictions. Finally, because our method allows each head to attend to a different segment of the input, we achieve memory and computational savings proportional to the number of heads.
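To make the described redesign concrete, below is a minimal PyTorch sketch under one plausible reading of the abstract: each head projects its query, key, and value only from its own non-overlapping slice ("segment") of the input features rather than from the full embedding, and a small learned layer then mixes the resulting head outputs before the output projection. The class and parameter names (SegmentedMHA, head_mix, etc.) and all implementation details are illustrative assumptions, not the authors' implementation.

```python
# A sketch of per-head feature segments plus head mixing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentedMHA(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Per-head D x 3D projections instead of one full C x 3C projection:
        # projection parameters and FLOPs shrink by a factor of num_heads.
        self.qkv = nn.ModuleList(
            [nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads)]
        )
        # Head mixing: a learned linear combination across the head axis,
        # letting heads exchange information before the output projection.
        self.head_mix = nn.Linear(num_heads, num_heads, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        H, D = self.num_heads, self.head_dim
        # Split the input features into H non-overlapping segments, one per head,
        # so each head sees a unique slice of the input.
        segments = x.view(B, N, H, D)                                   # (B, N, H, D)
        outs = []
        for h in range(H):
            q, k, v = self.qkv[h](segments[:, :, h]).chunk(3, dim=-1)   # (B, N, D) each
            outs.append(F.scaled_dot_product_attention(q, k, v))        # (B, N, D)
        heads = torch.stack(outs, dim=-1)                               # (B, N, D, H)
        mixed = self.head_mix(heads)                                    # mix across heads
        out = mixed.permute(0, 1, 3, 2).reshape(B, N, C)                # (B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)            # 2 sequences, 16 tokens, 64-dim embeddings
    attn = SegmentedMHA(dim=64, num_heads=8)
    print(attn(x).shape)                  # torch.Size([2, 16, 64])
```

In this reading, the savings proportional to the number of heads come from replacing the dense query/key/value projections with block-diagonal per-head ones, while the head-mixing layer restores communication between the otherwise isolated heads; how the paper itself realizes the segmentation and mixing may differ.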