Title Improving Attention Parallelism via Delayed Value Generation and Softmax Offloading
Authors 김준성(Junsung Kim) ; 김철환(Cheolhwan Kim) ; 노원우(Won Woo Ro)
DOI https://doi.org/10.5573/ieie.2025.62.8.10
Pages pp.10-16
ISSN 2287-5026
Keywords Large language model; Attention layer; Value generation; Softmax; Parallelism
Abstract As Large Language Models (LLMs) are adopted across a wide range of fields, optimizing their execution performance has become an important research topic. An LLM consists of Attention Layers, Feed-Forward Layers, and Layer Normalization; among these, the Attention Layer accounts for 48% of total LLM execution time. The layer is composed of five steps: generating the Query, Key, and Value matrices; computing Scores; applying the Softmax function to obtain Probabilities; multiplying the Probabilities by the Value to form the Attention Matrix; and finally generating the output matrix using weights. However, a notable inefficiency lies in generating the Value in the first step even though it is not used until the fourth step, which degrades performance and leads to inefficient hardware utilization. To address this, we propose an optimization technique that reorganizes the operation sequence: we delay Value generation until just before the Attention Matrix computation, improving cache locality and execution efficiency. In addition, we offload the computationally light Softmax operation to a lightweight Data Processing Unit (DPU) and overlap Value generation with the probability computation phase. In summary, this study addresses the inefficient operation sequence of existing LLMs by optimizing Value and Probability generation in the Attention Layer. This optimization improves performance by achieving operation parallelism.
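The reordering described in the abstract can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name and weight-matrix names are assumptions, and the DPU offload and overlap are not modeled (Softmax and Value generation simply run sequentially here, where on the proposed hardware they would execute concurrently). The key point is that moving the `x @ Wv` projection after the Softmax leaves the result unchanged while freeing the first step to produce only Query and Key.

```python
import numpy as np

def attention_delayed_value(x, Wq, Wk, Wv, Wo):
    """Single-head attention with Value generation delayed (illustrative sketch)."""
    # Step 1 (reordered): generate only Query and Key up front;
    # Value generation is postponed until just before it is needed.
    Q = x @ Wq
    K = x @ Wk

    # Step 2: compute Scores.
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])

    # Step 3: Softmax -> Probabilities. In the paper this low-computation step
    # is offloaded to a DPU, overlapping with Value generation; here it runs inline.
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    # Delayed step: generate Value now, just before the Attention Matrix is formed.
    V = x @ Wv

    # Step 4: multiply Probabilities by Value to form the Attention Matrix.
    attn = probs @ V

    # Step 5: generate the output matrix using the output weights.
    return attn @ Wo
```

Because `V = x @ Wv` has no dependence on the Scores or Probabilities, relocating it (or overlapping it with the Softmax) is a pure scheduling change: the output is numerically identical to the conventional Q/K/V-first ordering.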