| Title |
Performance Evaluation of a Bandwidth-efficient Systolic Array with Adaptive Block-wise Data Reuse |
| Authors |
황영준(Young-Jun Hwang) ; 김영식(Young-Sik Kim) |
| DOI |
https://doi.org/10.5573/ieie.2026.63.4.22 |
| Keywords |
Double buffering systolic array; Data reuse; Ring buffer memory; Instruction prefetching |
| Abstract |
In large-scale matrix operations of deep learning models, memory hierarchy access latency often becomes a primary factor limiting overall system efficiency, rather than the computational throughput of processing units. To mitigate the overhead caused by such memory access latency, this paper proposes an array-based AI accelerator architecture based on block-level data reuse. The proposed architecture integrates an on-chip weight storage scheme with a Ring Buffer and instruction prefetching-based activation management mechanism. Through an internal Tile DMA engine, the design maximizes on-chip data reuse across computation blocks while temporally overlapping computation and data loading, thereby effectively hiding memory access latency. The architecture was implemented on a Zynq UltraScale+ SoC platform and evaluated using the convolution layers of AlexNet. Experimental results demonstrate that, even under constrained external memory bandwidth conditions, the proposed design effectively mitigates memory-induced stalls and achieves throughput close to the hardware peak performance with relatively low latency. These results indicate that the proposed latency-hiding strategy is particularly effective for TPU-like dense array architectures where memory bottlenecks are prominent, and that it can further improve execution efficiency in heterogeneous system environments. |
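
The latency-hiding idea in the abstract, overlapping the loading of the next tile with computation on the current one, can be illustrated with a small timing model. This is only a sketch under stated assumptions and not the paper's implementation: it assumes exactly two tile buffers, a single DMA channel that loads one tile at a time, and per-tile load and compute times given in arbitrary cycles. The function names and the example numbers are hypothetical.

```python
def serialized_time(load, compute):
    """Total cycles when each tile is loaded, then computed, with no overlap."""
    return sum(load) + sum(compute)


def double_buffered_time(load, compute):
    """Total cycles with two tile buffers (assumed model, not the paper's RTL).

    The load of tile i+1 may start once the DMA has finished tile i and the
    buffer last used by tile i-1 has been released by the compute array.
    """
    if not load:
        return 0
    load_done = load[0]  # time at which tile i is fully loaded
    prev_done = 0        # time at which compute on tile i-1 finished
    for i in range(len(load)):
        ready = max(load_done, prev_done)
        compute_done = ready + compute[i]
        if i + 1 < len(load):
            # The next load overlaps with the compute of tile i.
            load_done = ready + load[i + 1]
        prev_done = compute_done
    return prev_done


# Hypothetical example: four tiles, 4 cycles to load and 6 to compute each.
loads, computes = [4] * 4, [6] * 4
print(serialized_time(loads, computes))       # 40
print(double_buffered_time(loads, computes))  # 28
```

In this model the compute-bound case exposes only the first tile's load (4 + 4 × 6 = 28 cycles), which corresponds to throughput close to peak. When loads are slower than compute, the total is bounded by memory time instead, and only the last tile's compute remains exposed.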