Title Cross-tile Pre-execution for Lossless SpMM on Output-stationary Systolic Array
Authors Hyunbo Sim (심현보); Dongkun Shin (신동군)
Page pp.27-39
ISSN 2287-5026
Keywords DNN; Unstructured sparsity; Systolic array; Double buffering; Pre-execution
Abstract Various accelerators have been proposed for sparse matrix multiplication (SpMM), yet they entail model accuracy loss or increased hardware complexity. This paper proposes Cross-Tile Pre-execution (CTP), which exploits wasted computing and storage resources in the tile-based execution of a systolic array (SA) by executing the next tile's nonzero operations in the current tile's empty slots in advance. CTP selects operations to pre-execute in accordance with SA behavior and stores the resulting partial sums in separate output registers to guarantee computational correctness. The co-proposed Blocked SA reorganizes output registers so that CTP can utilize empty slots that are unavailable in a general SA, and its Two-Tail Adder Tree (TTAT) allows partial sums from two concurrently in-flight tiles to be accumulated independently without intermixing. Experiments show that CTP achieves up to 1.62× speedup over prior techniques and up to 3.63× over a general SA.
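The core idea in the abstract, filling the current tile's empty slots with the next tile's nonzero operations while keeping the two partial sums in separate registers, can be illustrated with a toy single-PE model. This is only a sketch under simplifying assumptions, not the paper's implementation: the function names (`cross_tile_pre_execute`, `finish_tile`), the tile representation as (weight, activation) pairs, and the first-come selection of operations are all hypothetical, and the model ignores the SA dataflow constraints, the Blocked SA register organization, and the TTAT.

```python
def cross_tile_pre_execute(cur_tile, next_tile):
    """Toy single-PE model (illustrative assumption, not the paper's design).

    Each tile is a list of (weight, activation) pairs, one pair per time slot.
    A zero weight marks an empty slot that would otherwise be wasted.
    """
    # Nonzero operations of the next tile are candidates for pre-execution.
    pending = [(w, x) for (w, x) in next_tile if w != 0]
    cur_acc = 0   # output register of the current tile
    next_acc = 0  # separate register holding pre-executed partial sums
    for w, x in cur_tile:
        if w != 0:
            cur_acc += w * x         # regular operation of the current tile
        elif pending:
            nw, nx = pending.pop(0)  # fill the empty slot with a next-tile op
            next_acc += nw * nx
    return cur_acc, next_acc, pending


def finish_tile(partial_sum, leftover_ops):
    """Complete the next tile from its pre-executed partial sum."""
    return partial_sum + sum(w * x for w, x in leftover_ops)


cur = [(2, 3), (0, 5), (0, 1), (4, 1)]  # two empty slots
nxt = [(1, 2), (0, 7), (3, 3), (5, 1)]  # three nonzero operations

cur_sum, pre_sum, leftover = cross_tile_pre_execute(cur, nxt)
nxt_sum = finish_tile(pre_sum, leftover)

# Both results match the dense dot products, so nothing is lost.
dense_cur = sum(w * x for w, x in cur)
dense_nxt = sum(w * x for w, x in nxt)
print(cur_sum == dense_cur, nxt_sum == dense_nxt, len(leftover))
```

In this example the two empty slots of the current tile absorb two of the next tile's three nonzero operations, so only one operation remains when the next tile starts, and keeping `next_acc` apart from `cur_acc` is what keeps both outputs exact.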