Title | Redesigning Multi-head Attention and Mixing Heads to Save Memory and Computation
Authors | 김규동(Gyudong Kim); 김진현(Jinhyeon Kim); 나혁주(Hyukju Na); 장현성(Hyunsung Jang); 박재민(Jaemin Park); 황재기(Jaegi Hwang); 하남구(Namkoo Ha); 김승룡(Seungryong Kim); 김영근(Young Geun Kim)
DOI | https://doi.org/10.5573/ieie.2024.61.12.70
Keywords | Deep learning; Image classification; Transformers; Multi-head attention
Abstract | Transformers are renowned for their exceptional parallelism, primarily due to their multi-head self-attention mechanism, which allows each head to concurrently process its own set of tokens and integrate diverse information from the input sequence. However, previous studies have shown that not all heads learn valuable features distinct from one another; instead, only a few selected heads prove significant. This does not align with the purpose of multi-head attention. Therefore, in this paper, we redesign the multi-head attention mechanism so that each head focuses on different features of the input, promoting the capture of unique, non-overlapping features. This approach allows all heads to contribute effectively during the inference stage. Additionally, we introduce a head-mixing strategy to enhance information aggregation between heads, enabling richer predictions. Finally, because our method allows each head to attend to a different segment of the input, we achieve memory and computational savings proportional to the number of heads.
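To make the described redesign concrete, below is a minimal PyTorch sketch under one plausible reading of the abstract: each head projects its query, key, and value only from its own non-overlapping slice ("segment") of the input features rather than from the full embedding, and a small learned layer then mixes the resulting head outputs before the output projection. The class and parameter names (SegmentedMHA, head_mix, etc.) and all implementation details are illustrative assumptions, not the authors' implementation.

```python
# A sketch of per-head feature segments plus head mixing (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SegmentedMHA(nn.Module):
    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # Per-head D x 3D projections instead of one full C x 3C projection:
        # projection parameters and FLOPs shrink by a factor of num_heads.
        self.qkv = nn.ModuleList(
            [nn.Linear(self.head_dim, 3 * self.head_dim) for _ in range(num_heads)]
        )
        # Head mixing: a learned linear combination across the head axis,
        # letting heads exchange information before the output projection.
        self.head_mix = nn.Linear(num_heads, num_heads, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape
        H, D = self.num_heads, self.head_dim
        # Split the input features into H non-overlapping segments, one per head,
        # so each head sees a unique slice of the input.
        segments = x.view(B, N, H, D)                                   # (B, N, H, D)
        outs = []
        for h in range(H):
            q, k, v = self.qkv[h](segments[:, :, h]).chunk(3, dim=-1)   # (B, N, D) each
            outs.append(F.scaled_dot_product_attention(q, k, v))        # (B, N, D)
        heads = torch.stack(outs, dim=-1)                               # (B, N, D, H)
        mixed = self.head_mix(heads)                                    # mix across heads
        out = mixed.permute(0, 1, 3, 2).reshape(B, N, C)                # (B, N, C)
        return self.proj(out)


if __name__ == "__main__":
    x = torch.randn(2, 16, 64)            # 2 sequences, 16 tokens, 64-dim embeddings
    attn = SegmentedMHA(dim=64, num_heads=8)
    print(attn(x).shape)                  # torch.Size([2, 16, 64])
```

In this reading, the savings proportional to the number of heads come from replacing the dense query/key/value projections with block-diagonal per-head ones, while the head-mixing layer restores communication between the otherwise isolated heads; how the paper itself realizes the segmentation and mixing may differ.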