Sungju Ryu
(Sogang University, Seoul, Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
FPGA, processing-in-memory, hardware accelerator, neural network, deep learning, SRAM
I. INTRODUCTION
Several SRAM-based processing-in-memory (PIM) architectures [1,2] have been presented to mitigate the von Neumann bottleneck. One of the well-known approaches is to perform tensor multiplications inside the memory array. A tensor multiplication in the array consists of two parts: 1) element-wise multiplication between an input and a weight and 2) summation of the partial products. Since an element-wise multiplication can be decomposed into several binary multiplications realized with AND operations, the element-wise multiplication can be performed directly in the memory cells. A popular approach for SRAM PIM is to activate multiple wordlines of the SRAM array simultaneously.
Meanwhile, one of the popular ways to verify design models before expensive chip fabrication is to use FPGA chips. However, in a typical SRAM model, only one wordline can be activated in a single clock cycle, which differs from the SRAM array model used for PIM. In an FPGA chip, Block RAMs (BRAMs) replace the SRAMs. The BRAMs are provided as built-in blocks on the FPGA, and designers cannot modify their behavior. Hence, it is impossible to verify the PIM array on the FPGA directly due to the fixed behavior of the memory.
Our contribution in this work is to analyze approaches for evaluating SRAM PIM accelerators on FPGAs. To the best of our knowledge, this is the first work to analyze PIM mapping methods on an FPGA.
We analyze the following three approaches: 1) Weight mapping on a BRAM row. 2) Weight mapping on flip-flops. 3) Input enumeration-based dot product. We furthermore extend the three mapping schemes to the multi-FPGA evaluation case. The evaluation methods are validated on a real neural network benchmark.
II. PRELIMINARIES
1. Design Approach of Digital SRAM PIM
Fig. 1 shows the design method of digital SRAM PIM arrays, comparing it with the read operation
of conventional SRAM arrays. In the SRAM array (Fig. 1(a)), a wordline is shared by SRAM cells located in multiple array columns, and a bitline
is shared by SRAM cells located in multiple array rows. When a wordline is activated,
we can simultaneously read all the memory cells attached to the wordline through the
multiple bitlines at the array columns. Considering that only one wordline row can be accessed at a time in the conventional SRAM array, `N' clock cycles are spent reading all the bits in the entire array when the array includes `N' wordline rows.
On the other hand, digital SRAM PIM schemes simultaneously activate multiple wordlines,
and we can thereby read multiple memory cells attached to the same bitline (Fig. 1(b)). Using this concept, the dot product computation that generates a partial sum (psum) consists of the following three steps. 1) Activate multiple wordlines: It is widely known that a binary multiplication (e.g., XNOR, AND) can be performed in an SRAM cell [1,2]. If we assume that a weight is pre-stored in a cell and an input is provided through a wordline, the binary multiplication result is generated. 2) Accumulate partial products: The partial products from the SRAM cells are binary multiplication results, and they are added together in the backend adder tree. The resulting psum is usually accumulated and finally becomes the output value. 3) Bit-serial/parallel computing for multiple bit-widths: Since a read operation of a memory cell only supports a 1-bit multiplication, multiple memory cells are needed to construct a multi-bit multiplication result. The multi-bit result can be obtained from multiple cells distributed over the spatial domain (bit-parallel computing) [3]. Alternatively, bit-serial computing can be chosen by reusing the memory cells over the time domain.
Fig. 1. (a) Read operation of conventional SRAM/BRAM models; (b) Dot product in digital SRAM PIM array.
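To make the three steps above concrete, the following minimal Python sketch (illustrative only; the function names are ours, not from the paper) models one PIM column: the cell-level AND multiplication, the backend adder tree, and bit-serial computing over the input bit planes.

```python
# Minimal behavioral sketch (not from the paper) of a digital SRAM PIM dot product.
# One PIM column stores n 1-bit weights; inputs arrive bit-serially over `bits` cycles.

def and_multiply(weight_bit: int, input_bit: int) -> int:
    """Step 1: binary multiplication inside a cell, modeled as an AND."""
    return weight_bit & input_bit

def adder_tree(partial_products):
    """Step 2: backend adder tree that reduces the partial products to a psum."""
    return sum(partial_products)

def bit_serial_dot_product(weight_bits, input_words, bits=4):
    """Step 3: bit-serial computing; each input bit plane is applied in its own cycle
    and the psums are shifted-and-accumulated into a multi-bit result."""
    acc = 0
    for b in range(bits):                       # one clock cycle per input bit plane
        in_plane = [(x >> b) & 1 for x in input_words]
        pps = [and_multiply(w, i) for w, i in zip(weight_bits, in_plane)]
        acc += adder_tree(pps) << b             # weight the psum by the bit position
    return acc

# Example: 1-bit weights, 4-bit unsigned inputs.
weights = [1, 0, 1, 1]
inputs  = [3, 7, 2, 5]
assert bit_serial_dot_product(weights, inputs) == sum(w * x for w, x in zip(weights, inputs))
```

The shift-and-accumulate in the last step corresponds to reusing the same memory cells across clock cycles, as described above for bit-serial computing.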
2. Limitation of SRAM PIM Evaluation on FPGA
An FPGA typically realizes the data/control paths using LUT-based configurable logic blocks and the memory arrays using BRAMs. However, to make the memory behave as a PIM array, circuit designers have to define and generate a new layout model for the PIM block. When targeting an application-specific integrated circuit (ASIC) design, it is possible to create a custom layout model, but we cannot modify the layout of any component in an FPGA. As a result, the activation of multiple wordlines, the major feature of the PIM array, cannot be implemented on the built-in BRAMs of the FPGA.
Therefore, we aim to mitigate this limitation of SRAM PIM evaluation on FPGA and help designers verify the functional correctness of the PIM array using the FPGA. Meanwhile, NullaNet [4] achieved input enumeration-based neural network computation on FPGA, but it did not target evaluation of the PIM array operation under the various mapping methods.
III. FUNCTIONAL VERIFICATION ON FPGA FOR DIGITAL SRAM PIM
In this section, we analyze three possible approaches to evaluate digital SRAM PIM array models on FPGA: 1) Weight mapping on a BRAM row. 2) Weight mapping on flip-flops. 3) Input enumeration-based dot product. These approaches are compared with each other and verified using a real neural network benchmark. Furthermore, we also present a method to evaluate a large-sized PIM SoC on the FPGA framework and a top-level evaluation flow.
1. Weight Mapping on a BRAM Row
The first method for PIM array mapping is to use built-in BRAM blocks (Fig. 2). The PIM array consists of memory cells in `n' rows and `m' columns. As explained in Section 2.1, the memory cells located in a single column contribute to generating a psum. On the other hand, a BRAM cannot accumulate partial products by activating multiple wordline rows (Section 2.2). Hence, the PIM array is split into independent rows, and the rows are distributed across multiple BRAM arrays (Fig. 2). First, the weights are pre-stored in the BRAM arrays. The AND operations are replaced by read operations of the BRAM arrays, and inputs are fed to each wordline row, as in the PIM operation. The partial products generated through the AND operations are accumulated outside the array. The partial products from the first index of the different BRAM arrays are sent to the adder with the first index, which constructs `Psum[0]'. Next, the partial products from the second index of the different BRAM arrays are sent to the second adder, and `Psum[1]' is generated. Finally, the partial products from the last index (m-1) of the different BRAM arrays are sent to the last adder, and `Psum[m-1]' is generated. Such a method can access the m×n cells simultaneously, thereby mimicking the behavior of the PIM array, but this scheme with separated BRAM arrays can activate only a single row of each BRAM array, thereby reducing the BRAM utilization and requiring a large number of BRAM arrays.
Fig. 2. PIM mapping method 1: Weight mapping on a BRAM row.
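As an illustration of this mapping (a behavioral sketch under a simplified single-read-port BRAM model, not the authors' RTL), the code below stores each PIM row in its own BRAM, uses the wordline input bit as the read enable so that the read replaces the AND, and accumulates the per-column partial products in `m' external adders.

```python
# Illustrative sketch (assumed behavioral BRAM model, not the authors' RTL) of
# mapping method 1: each PIM row is stored as one row of a separate BRAM.

class BRAM:
    """Simplified single-read-port BRAM: one row can be read per clock cycle."""
    def __init__(self, rows):
        self.rows = rows                      # list of rows; each row is a list of bits

    def read(self, addr, enable):
        # The wordline input bit acts as the read enable, so the read itself plays
        # the role of the AND operation (enable = 0 -> all-zero read data).
        return self.rows[addr] if enable else [0] * len(self.rows[addr])

def pim_psums_method1(weight_matrix, input_bits):
    """weight_matrix: n x m bits, input_bits: n bits -> m column psums."""
    n, m = len(weight_matrix), len(weight_matrix[0])
    brams = [BRAM([weight_matrix[r]]) for r in range(n)]     # one BRAM per PIM row
    partial = [bram.read(0, x) for bram, x in zip(brams, input_bits)]
    # m external adders: adder j sums index j of every BRAM's read data.
    return [sum(row[j] for row in partial) for j in range(m)]

# Example: psum[j] = sum_r input[r] AND weight[r][j]
W = [[1, 0], [1, 1], [0, 1]]
X = [1, 1, 0]
assert pim_psums_method1(W, X) == [2, 1]
```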
2. Weight Mapping on Flip-flops
The second method for PIM array mapping is to use flip-flops (Fig. 3). The `n' weights located in each PIM array column are mapped to a flip-flop vector. If the size of the PIM array is `n' (rows) × `m' (columns), `m' flip-flop vectors, each of size `n', are required. The `n'×`m' weights are pre-stored in the flip-flop array, and an AND gate is dedicated to each flip-flop. After the weights stored in the flip-flops are AND-ed with the input values, the partial products are accumulated in a backend adder. In the same manner as a PIM array, where inputs fed to the wordline rows are broadcast to all the columns, `Input[n-1:0]' is shared by all the flip-flop vectors. Such a method can fully utilize all the instantiated flip-flops, but using flip-flops for weight storage is a burden, considering that sequential elements typically require much larger hardware resources than dense memory cells.
Fig. 3. PIM mapping method 2: Weight mapping on flip-flops.
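A corresponding sketch of the flip-flop-based mapping (same toy setup and assumptions as the previous example) broadcasts `Input[n-1:0]' to every column vector and also reports the n×m flip-flop and AND-gate counts that make this method expensive.

```python
# Minimal sketch (assumptions as in the previous example) of mapping method 2:
# each PIM column becomes an n-bit flip-flop vector with a dedicated AND gate per bit.

def pim_psums_method2(weight_matrix, input_bits):
    n, m = len(weight_matrix), len(weight_matrix[0])
    # m flip-flop vectors, each of size n (column-major storage of the weights).
    ff_vectors = [[weight_matrix[r][c] for r in range(n)] for c in range(m)]
    # Input[n-1:0] is broadcast to every column; one AND gate per flip-flop.
    psums = [sum(w & x for w, x in zip(col, input_bits)) for col in ff_vectors]
    resources = {"flip_flops": n * m, "and_gates": n * m}   # why the method is costly
    return psums, resources

W = [[1, 0], [1, 1], [0, 1]]
X = [1, 1, 0]
psums, res = pim_psums_method2(W, X)
assert psums == [2, 1] and res == {"flip_flops": 6, "and_gates": 6}
```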
3. Input Enumeration-based Dot Product
The third method for PIM array mapping is to perform an input enumeration-based dot product (Fig. 4). The motivation for this method is as follows: contrary to an ASIC, whose layout is fixed once fabrication is finished, an FPGA is programmable. Hence, we cannot modify the schematic of an ASIC after fabrication, but we can change both the combinational logic on the LUT-based configurable logic blocks and the routing information on the switch blocks of the FPGA.
As described in Fig. 4, the memory cells in the PIM array hold weights. In Step 1, we can substitute the read operation of the cells with an AND operation. After the weights are AND-ed with the inputs, the partial products are summed by the adder tree, thereby generating the psum value. Considering that the FPGA is programmable and the weights can later be modified for other inference tasks, if the weights can be fixed during inference, the AND operations are replaced by the input enumeration described in Step 2. If the weight value is `1', the corresponding input value is enumerated and passed to the backend adder tree. Otherwise, if the weight value is `0', the AND operation for the binary multiplication by zero can be eliminated, so the corresponding input value can be ignored. Afterwards, the psum from the enumerated inputs is generated at the adder tree.
Fig. 4. PIM mapping method 3: Input enumeration-based dot product.
If the target neural network fits in the area of the PIM system-on-chip (SoC), all the weights are stored in the PIM arrays of the chip and we do not need to modify the weight parameters. Therefore, all the weights with a value of `1' can be mapped to the configurable logic blocks on the FPGA, and the input enumeration-based dot product can be performed seamlessly at a low cost of FPGA resources. However, if the target neural network does not fit in the chip area and the weight size is larger than the capacity of the PIM arrays, we have to modify the weights during the computation. To evaluate such a condition on the FPGA, we would have to reprogram the FPGA, and it is not possible to modify the configurable logic blocks and the switch blocks in real time due to the large latency overhead. As a result, the input enumeration-based dot product method can be used only when we aim to evaluate the condition where all weights fit in the PIM SoC and do not need to be updated.
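The following sketch (illustrative, with the same toy weight matrix as before) separates the input enumeration into a "compile-time" step, analogous to programming the configurable logic blocks, and a run-time step in which the adder tree sums only the enumerated inputs; the psums match those of the AND-based mappings.

```python
# Illustrative sketch of mapping method 3 (input enumeration-based dot product),
# using the same toy n x m binary weight matrix as the earlier examples.

def compile_enumeration(weight_matrix):
    """'Compile time' (FPGA programming): for each column, keep only the row
    indices whose weight bit is 1; zero-weight products are eliminated."""
    n, m = len(weight_matrix), len(weight_matrix[0])
    return [[r for r in range(n) if weight_matrix[r][c] == 1] for c in range(m)]

def pim_psums_method3(enumerated_cols, input_bits):
    """Run time: the adder tree sums only the enumerated inputs of each column,
    so its width equals the number of non-zero weights in that column."""
    return [sum(input_bits[r] for r in col) for col in enumerated_cols]

W = [[1, 0], [1, 1], [0, 1]]
X = [1, 1, 0]
cols = compile_enumeration(W)                    # [[0, 1], [1, 2]]
assert pim_psums_method3(cols, X) == [2, 1]      # same psums as the AND-based methods
```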
4. Evaluation of Large-sized PIM SoC
If the PIM SoC does not fit in a single FPGA chip, multiple FPGA chips are required for the evaluation. Moreover, as modern custom hardware chips become larger, evaluation platforms using a large number of FPGAs must be considered. We first adopted a mapping method of a neural network on an FPGA cluster [5] (Fig. 5(a)). We use TC-ResNet8 as an example. Among the 8 layers, the first 6 layers (Layer #0-5) are mapped on FPGA #0, and the remaining layers (Layer #6-8) are computed on FPGA #1, because all the weights cannot be stored in a single FPGA chip. This is a simple example, and other networks with various mapping methods are possible, because the implementation approach for neural network tiling and its mapping on FPGAs is orthogonal to mimicking the PIM array using FPGA resources.
In addition to the clustering, we added extra components to each FPGA for the PIM array evaluation. Each PIM array consumes and produces wide data words while communicating with other PIM arrays. If the PIM SoC datapath is clustered into several parts and distributed across multiple FPGAs, the communication bandwidth becomes limited. Therefore, the number of bits transferred from/to the PIM array must be reduced for off-chip communication. A parallel-to-serial converter is a well-known, efficient way to reduce/increase the number of data bits (Fig. 5(b)). FPGA chips usually provide many input/output ports, and hence multiple parallel-to-serial converters can be used simultaneously. Furthermore, the limited communication bandwidth leads to an imbalance between the computation and communication performance. If the communication data width is much larger than the input/output interface width, the PIM array datapath is stalled. To analyze the operation of the datapath, the active clock cycles, excluding the stalled clock cycles, need to be counted. Hence, each FPGA includes a clock (CLK) counter to measure the effective computing clock cycles by eliminating the effect of the inter-FPGA communication. The latency degradation due to the stalls does not cause any problem, because the purpose of this multi-FPGA system is not to realize real-time computation but rather to simulate the PIM SoC during the active clock cycles, thereby verifying the functional correctness of the PIM SoC. Additionally, the CLK counter and the parallel-to-serial converters utilize only a small portion of the FPGA resources, so the logic overhead is negligible.
Fig. 5. Evaluation of large-sized PIM SoC: (a) Mapping example of a neural network (TC-ResNet8) on an FPGA cluster; (b) Extra components for the PIM array evaluation.
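To illustrate the extra components in Fig. 5(b), the sketch below is a behavioral stand-in (not the authors' implementation) for a parallel-to-serial converter that splits a wide psum into link-width chunks and for the CLK counter that separates active computing cycles from communication-stall cycles.

```python
# Behavioral sketch (not the authors' RTL) of the extra components in Fig. 5(b):
# a parallel-to-serial converter for the inter-FPGA link and an effective-cycle counter.

def parallel_to_serial(word: int, word_bits: int, link_bits: int):
    """Split a word_bits-wide psum into link_bits-wide chunks, LSB chunk first."""
    chunks = []
    for shift in range(0, word_bits, link_bits):
        chunks.append((word >> shift) & ((1 << link_bits) - 1))
    return chunks

class ClockCounter:
    """Counts active (computing) cycles separately from communication-stall cycles,
    so functional behavior can be judged without the inter-FPGA link penalty."""
    def __init__(self):
        self.active = 0
        self.stalled = 0

    def tick(self, datapath_busy: bool):
        if datapath_busy:
            self.active += 1
        else:
            self.stalled += 1

# Example: send one 64-bit psum over a 16-bit link; the datapath stalls while serializing.
counter = ClockCounter()
chunks = parallel_to_serial(0x1122334455667788, word_bits=64, link_bits=16)
counter.tick(True)                       # one compute cycle produced the psum
for _ in chunks:                         # four link cycles to ship it out
    counter.tick(False)
assert len(chunks) == 4 and (counter.active, counter.stalled) == (1, 4)
```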
5. Top-level Evaluation Flow
This subsection describes the top-level evaluation flow for the SRAM PIM SoC. The evaluation system consists of four parts: 1) Target neural networks are analyzed and trained using widely used machine learning frameworks such as PyTorch and TensorFlow. Graphs indicating the sizes/shapes of tensors, the connections between layers, and the computation types for the inference tasks are extracted. 2) The source PIM SoC is characterized; information on the PIM arrays, clock frequency, and peripheral circuits is analyzed. 3) The resource information of the target FPGA is analyzed, for example, the CLB flip-flops, the CLB LUTs, the BRAM capacity, the DSP slices, and the I/O widths/types. 4) The information extracted from the above three parts is fed to a custom scheduler, and the scheduler finally generates the PIM array model information for the FPGA evaluation. The behavioral model (RTL) and the multi-FPGA mapping information are mapped to the target FPGA chip(s). Afterwards, the dataset and the trained parameters are applied to the target FPGA(s) for the PIM SoC evaluation.
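As a hedged illustration of how the information from these parts might be combined, the compact sketch below uses field names and a selection rule that are our assumptions, not the actual scheduler; the toy numbers mirror the 14-array, 100 MHz setup used later in Section 4.

```python
# Hypothetical sketch of the top-level evaluation flow; the data fields and the
# simple selection rule are illustrative assumptions, not the authors' scheduler.

network = {"layers": 9, "weight_bits": 14 * 256 * 256}        # toy value: fills the PIM arrays
pim_soc = {"array_rows": 256, "array_cols": 256, "arrays": 14, "clock_mhz": 100}
fpga    = {"luts": 1074240, "registers": 2148480, "bram_tiles": 3780}

def choose_mapping(network, pim_soc, fpga, weights_fixed: bool):
    """Pick a PIM-array mapping method for FPGA evaluation (illustrative heuristic)."""
    soc_capacity = pim_soc["array_rows"] * pim_soc["array_cols"] * pim_soc["arrays"]
    fits_on_soc = network["weight_bits"] <= soc_capacity
    if weights_fixed and fits_on_soc:
        return "IE"              # all weights fixed on-chip -> enumerate inputs in LUTs
    if soc_capacity <= fpga["registers"] // 2:
        return "FF"              # weights must stay updatable and fit in flip-flops
    return "BRAM"                # fall back to one BRAM row per PIM row

print(choose_mapping(network, pim_soc, fpga, weights_fixed=True))   # -> "IE"
```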
IV. RESULTS
1. Experimental Setup
In this section, we analyze the various FPGA evaluation methods for SRAM PIM SoCs. For the comparison between the mapping methods, we synthesized the gate-level logic and compiled the BRAMs using the Xilinx Vivado tool. The target clock frequency for synthesis is 100 MHz. A higher clock frequency could be applied, but our main goal is not to realize a high-throughput PIM array but to verify the functional correctness of the PIM SoC. We first analyze the three FPGA mapping methods: 1) BRAM-based approach: Weight mapping on a BRAM row (`BRAM' in the figures and tables), 2) FF-based approach: Weight mapping on flip-flops (`FF' in the figures and tables), and 3) IE-based approach: Input enumeration-based dot product (`IE' in the figures and tables). Then, we extend these approaches to evaluate a large-scale PIM array using multiple FPGA chips. When using the Xilinx Vivado tool, we simply select the specific FPGA board containing the FPGA chip, and thereby avoid complex configuration steps to set up the experimental conditions. For the analysis, we used the Xilinx VCU110 evaluation board (Table 1). In this FPGA chip, a BRAM tile is 36 Kb and is up to 72 bits wide.
Table 1. Resources on FPGA chip

FPGA (Board) | LUTs | Registers | BRAM Tiles
XCVU190 (VCU110) | 1074240 | 2148480 | 3780
2. Results
Resource Breakdown on Mapping Methods: Table 2 analyzes the utilization of FPGA resources depending on three PIM array sizes and the three mapping methods. The BRAM-based approach (`BRAM') consumes a significantly large number of BRAM tiles, because only a single row of each BRAM can be used for the parallel access of all the PIM array cells (Fig. 2). In this approach, the memory read already performs the binary multiplication, and hence LUTs for the AND gates are not required. On the other hand, the FF-based approach (`FF') replaces the memory cells with flip-flops (Fig. 3). The weights stored in the flip-flops are AND-ed with the inputs for the binary multiplication, so slice LUTs for the AND gates are utilized. The BRAM-/FF-based approaches sum the partial products generated by the AND gates using the adder tree. In contrast, the IE-based approach (`IE') replaces the multiplication with selective input enumeration (Fig. 4). Such a method eliminates the flip-flops for weight storage by fixing the weight status (0/1), and it reduces the width of the adder tree by eliminating a number of input/weight pairs. Therefore, the IE method does not use registers or AND gates, and it only requires the adder tree with a reduced popcount width. In the IE method, we assumed that the density of `1's in the weight tensor is 0.5.
Table 2. Breakdown of used FPGA resources depending on PIM array size. Methods - `BRAM': Weight mapping on a BRAM row (Fig. 2). `FF': Weight mapping on flip-flops (Fig. 3). `IE': Input enumeration-based dot product (Fig. 4).

Size | Method | Slice LUTs (AND) | Slice LUTs (Adder Tree) | Slice Registers | BRAM Tiles
128×128 | BRAM | - | 19584 | - | 256
128×128 | FF | 16384 | 19584 | 16384 | -
128×128 | IE | - | 9984 | - | -
256×256 | BRAM | - | 81664 | - | 1024
256×256 | FF | 65536 | 81664 | 65536 | -
256×256 | IE | - | 39168 | - | -
512×512 | BRAM | - | 358400 | - | 4096
512×512 | FF | 262144 | 358400 | 262144 | -
512×512 | IE | - | 166328 | - | -
Input Enumeration-based Dot Product: Fig. 6 analyzes the utilization of the slice LUTs on the FPGA chip. As explained in Section 3.3, the IE-based approach enumerates the input values that are multiplied by weight `1', which eliminates the AND-gate-based multiplication and minimizes the reduction width of the backend adder tree. The number of aggregated inputs is equal to the number of corresponding non-zero weights; in other words, the reduction width depends on the density of `1's in the weight tensor. To study the resource utilization depending on the density of the non-zero weights (Fig. 6), we used a 256×256 PIM array. The adder tree is implemented on the FPGA chip (Section 4.1) using slice LUTs only, and thereby the number of LUTs increases linearly with the density of `1's in the weight tensor.
Fig. 6. The number of slice LUTs for a 256×256 PIM array with input enumeration-based dot product on FPGA. The x-axis indicates the density of `1's in the weights. Method - `IE': Input enumeration-based dot product (Fig. 4).
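The linear trend in Fig. 6 follows directly from counting the adder-tree inputs; a short sketch is shown below (the LUTs-per-input factor is synthesis-dependent and therefore omitted).

```python
# Sketch of the linear trend in Fig. 6: the adder-tree width of the IE method
# scales with the density of '1' weights; LUTs-per-input is tool-dependent and omitted.

N_ROWS, N_COLS = 256, 256

def ie_adder_tree_inputs(density):
    """Total inputs reaching the adder trees of a 256x256 array at a given '1' density."""
    return int(N_ROWS * density) * N_COLS

for density in (0.1, 0.25, 0.5, 0.75, 1.0):
    print(density, ie_adder_tree_inputs(density))
# Doubling the density doubles the adder-tree inputs, hence the linear LUT growth.
```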
Real Benchmark Evaluation: We perform the PIM array evaluation with the simple binary CNN described in [6]. The network consists of 9 layers, but we assumed that the first and the last layers are executed on the host CPU because binary quantization [7,8] is usually applied only to the middle layers. The accelerator in [6] used analog computation with a capacitor-based accumulator near the memory array, but we slightly modified it to a PIM array design where the MAC computation is performed in 256×256 PIM arrays. As a result, fourteen 256×256 PIM arrays, which can hold all the weights of the network, are used for the computation. We applied the IE-based method to the 14 PIM arrays, which account for only 49% of the LUTs on the XCVU190 FPGA chip.
Multi-FPGA Benchmark: In the previous subsection, we assumed that the PIM SoC targets a small and simple neural network inference task, which is indeed a well-known target application of in-memory computing. However, recent PIM SoCs also target complex neural networks, and hence PIM chips have a larger number of PIM arrays compared to previous PIM SoC architectures.
We evaluated a PE of the PIMCA architecture [9], which consists of 18 256×128 PIM arrays. We used 3 XCVU190 FPGA chips with the FF-based method, because the PE size is much larger than the resources of a single FPGA chip. Table 3 shows the implementation results for the multi-FPGA evaluation case. Each FPGA chip includes 6 PIM arrays, a 513 Kb global buffer (GLB) for activations, and peripherals and interfaces (Fig. 5). The inter-PE adder tree sums the psums from the different PIM array groups and FPGAs. The psums are first sent to FPGA chip #1, so only chip #1 needs the inter-PE adder tree and the other chips do not require it.
Table 3. Evaluation of multi-FPGA scenario with 18 PIM arrays

FPGA | Module | Units | LUTs [%] | Registers [%] | BRAM Tiles [%]
1 | PIM 256×128 | 6 | 73.3 | 18.3 | -
1 | GLB | - | - | - | 0.0
1 | Peri. + Interface | - | 0.0 | - | -
1 | Inter-PE Adder | 256 | 0.0 | - | -
2 | PIM 256×128 | 6 | 73.3 | 18.3 | -
2 | GLB | - | - | - | 0.0
2 | Peri. + Interface | - | 0.0 | - | -
3 | PIM 256×128 | 6 | 73.3 | 18.3 | -
3 | GLB | - | - | - | 0.0
3 | Peri. + Interface | - | 0.0 | - | -
V. CONCLUSIONS
In this paper, we analyzed methods to evaluate digital SRAM processing-in-memory hardware accelerators on FPGA. Based on three mapping schemes, 1) the BRAM-based method, 2) the FF-based method, and 3) the IE-based method, we analyzed the resource utilization on an FPGA chip. We moreover extended the mapping methods to a larger PIM SoC case using multiple FPGA chips. Considering that the multi-wordline activation of the PIM array cannot be implemented in the built-in BRAM tiles on FPGAs, our main contribution is to mimic the SRAM PIM array using FPGA resources, thereby enabling the verification of the functional correctness of PIM SoCs on FPGA chips before the expensive fabrication steps.
ACKNOWLEDGMENTS
This work was supported by the Sogang University Research Grant of 2023 (202310030.01)
(10%) and partly supported by the National Research Foundation of Korea (NRF) grant
funded by the Korea government (MSIT) (NRF-2022R1F1A1070414, 90%). The EDA tool was
supported by the IC Design Education Center (IDEC), Korea.
References
[1] Y.-D. Chih et al., "An 89TOPS/W and 16.3TOPS/mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in 22nm for Machine-Learning Edge Applications," IEEE International Solid-State Circuits Conference (2021) 252.
[2] H. Fujiwara et al., "A 5-nm 254-TOPS/W 221-TOPS/mm2 Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write," IEEE International Solid-State Circuits Conference (2022) 186.
[3] S. Ryu et al., "BitBlade: Energy-Efficient Variable Bit-Precision Hardware Accelerator for Quantized Neural Networks," IEEE Journal of Solid-State Circuits (2022).
[4] M. Nazemi et al., "NullaNet: Training Deep Neural Networks for Reduced-Memory-Access Inference," arXiv (2018).
[5] S. Biookaghazadeh et al., "Toward Multi-FPGA Acceleration of the Neural Networks," ACM Journal on Emerging Technologies in Computing Systems (2021) 1.
[6] D. Bankman et al., "An Always-On 3.8 μJ/86% CIFAR-10 Mixed-Signal Binary CNN Processor with All Memory on Chip in 28-nm CMOS," IEEE Journal of Solid-State Circuits (2018) 158.
[7] M. Rastegari et al., "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," European Conference on Computer Vision (2016).
[8] A. Bulat et al., "XNOR-Net++: Improved Binary Neural Networks," British Machine Vision Conference (2019).
[9] B. Zhang et al., "PIMCA: A Programmable In-Memory Computing Accelerator for Energy-Efficient DNN Inference," IEEE Journal of Solid-State Circuits (2022) 1436.
Sungju Ryu is currently an Assistant Professor in the Department of System Semiconductor Engineering at Sogang University, Seoul, Republic of Korea. Before joining Sogang, he was an Assistant Professor in the School of Electronic Engineering and the Department of Next-Generation Semiconductor at Soongsil University from 2021 to 2023. In 2021, he was a Staff Researcher in the AI&SW Research Center of the Samsung Advanced Institute of Technology (SAIT), Suwon, Republic of Korea, where he focused on computer architecture design. He received the B.S. degree in Electrical Engineering from Pusan National University, Busan, Republic of Korea, in 2015, and the Ph.D. degree in Creative IT Engineering from Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea, in 2021. His current research interests include energy-efficient neural processing units and processing-in-memory.