Title Improving Attention Parallelism via Delayed Value Generation and Softmax Offloading
Authors 김준성(Junsung Kim) ; 김철환(Cheolhwan Kim) ; 노원우(Won Woo Ro)
DOI https://doi.org/10.5573/ieie.2025.62.8.10
Pages pp.10-16
ISSN 2287-5026
Keywords Large language model; Attention layer; Value generation; Softmax; Parallelism
Abstract As Large Language Models (LLMs) are adopted across a wide range of fields, optimizing their execution performance has become an important research topic. An LLM consists of Attention Layers, Feed-Forward Layers, and Layer Normalization; among these, the Attention Layer accounts for 48% of total LLM execution time. The layer is composed of five steps: generating the Query, Key, and Value matrices; computing Scores; applying the Softmax function to obtain Probabilities; multiplying the Probabilities by the Value to form the Attention Matrix; and finally generating the output matrix using weights. However, a notable inefficiency lies in generating the Value in the first step even though it is not used until the fourth step, which degrades performance and leads to inefficient hardware utilization. To address this, we propose an optimization technique that reorganizes the operation sequence: we delay Value generation until just before the Attention Matrix computation, improving cache locality and execution efficiency. In addition, we offload the computationally light Softmax operation to a lightweight Data Processing Unit (DPU) and overlap Value generation with the probability computation phase. In summary, this study addresses the inefficient operation sequence of existing LLMs by optimizing Value and Probability generation in the Attention Layer. This optimization improves performance by achieving operation parallelism.
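The reordering described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name and weight-matrix names are assumptions, and the DPU offload and overlap are not modeled (Softmax and Value generation simply run sequentially here, where on the proposed hardware they would execute concurrently). The key point is that moving the `x @ Wv` projection after the Softmax leaves the result unchanged while freeing the first step to produce only Query and Key.

```python
import numpy as np

def attention_delayed_value(x, Wq, Wk, Wv, Wo):
    """Single-head attention with Value generation delayed (illustrative sketch)."""
    # Step 1 (reordered): generate only Query and Key up front;
    # Value generation is postponed until just before it is needed.
    Q = x @ Wq
    K = x @ Wk

    # Step 2: compute Scores.
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])

    # Step 3: Softmax -> Probabilities. In the paper this low-computation step
    # is offloaded to a DPU, overlapping with Value generation; here it runs inline.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Delayed step: generate Value now, just before the Attention Matrix is formed.
    V = x @ Wv

    # Step 4: multiply Probabilities by Value to form the Attention Matrix.
    attn = probs @ V

    # Step 5: generate the output matrix using the output weights.
    return attn @ Wo
```

Because `V = x @ Wv` has no dependence on the Scores or Probabilities, relocating it (or overlapping it with the Softmax) is a pure scheduling change: the output is numerically identical to the conventional Q/K/V-first ordering.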