I. INTRODUCTION
With the emergence of the backpropagation algorithm [8] and multilayer perceptron [3], deep neural networks (DNNs) have demonstrated outstanding performance in various
fields [1,2,110]. However, they face rapidly increasing computational loads
as the number of learnable parameters grows. This poses a significant obstacle to
the practical implementation of DNN models in terms of processing speed and power
consumption [113]. To tackle these issues, parallel processing devices such as graphics processing
units (GPUs) and neural processing units (NPUs) [4] are being utilized, and researchers are actively exploring optimized acceleration
algorithms for each device [5]. However, modern computer architectures based on the von Neumann architecture still
have limitations regarding DNN processing. Specifically, a substantial portion of
the power consumption, up to 75%, is attributed to loading parameters for DNN operations
(e.g., feature maps and weights) from external memory, such as dynamic random-access
memory (DRAM), to the processor, or storing them back to memory [9,10,115]. To address this issue, processor-in-memory (PIM) architecture has emerged as a promising
technology [6]. By integrating computing and memory units at the processing element (PE) level,
PIM significantly reduces latency associated with data transmission and enhances data
processing efficiency [112]. This integrated architecture has the potential to significantly reduce energy consumption
during memory access, thereby enhancing the efficiency of applications that require
high-performance computing [7].
This survey explores diverse PIM architectures and methodologies for enhancing PIM
performance in different memory types. It analyzes the characteristics of various
DNN models, including convolutional neural networks (CNNs), graph neural networks
(GNNs), recurrent neural networks (RNNs), and transformer models. The focus is on
optimizing data mapping and dataflows within the context of PIM, providing valuable
insights into efficient handling of DNNs. This comprehensive study aims to deepen
researchers' understanding of the connection between DNNs and PIM, opening up new
avenues for future AI research and advancements.
Section II provides the background of this work. Section III presents the PIM architectures
for DNNs, and Section IV concludes this paper.
III. PIM FOR DEEP NEURAL NETWORKS
1. Technologies and Representative Architectures Needed for PIM
PIM fundamentally offers high throughput because it minimizes data transfer with the
host processor by integrating data processing logic directly into memory, thus resolving
the associated bottleneck [28,29]. In the DNN inference process, the most frequently performed MAC operations are executed
in the PIM core to achieve high energy efficiency. In addition, during the DNN training
process, PIM can reduce both processing time and power consumption by performing the
computations necessary for weight updates directly within the memory [88,94]. However, not all functions benefit from the application of PIM. For instance, it
can be burdensome to process functions with high computational complexity and memory
reusability using in-memory logic. Therefore, to determine where a specific function
should be computed, it is necessary to establish appropriate metrics and analyze them
using a benchmark simulator. DAMOV [30] is a memory simulator that combines the widely used Ramulator [31] with the zsim CPU simulator [32]. It extracts memory traces for each workload [117] using the Intel VTune profiler [33]. From the extracted traces, it calculates temporal/spatial locality and classifies the causes
of memory bottlenecks into six classes using three indicators: the last-to-first miss ratio
(LFMR), last-level-cache misses per kilo-instruction (LLC MPKI), and arithmetic intensity.
Moreover, through an experimental analysis of 77 K functions, its authors demonstrated
its reliability and applicability across various research areas.
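As a rough illustration of how such indicators can be combined into a placement decision, the sketch below classifies a profiled function using the three metrics above. The thresholds and class labels are illustrative assumptions only, not DAMOV's actual decision rules.

```python
# Illustrative sketch only: the thresholds and class names below are assumptions,
# not the decision rules used by DAMOV [30].
from dataclasses import dataclass

@dataclass
class FunctionProfile:
    lfmr: float             # last-to-first miss ratio
    llc_mpki: float         # last-level-cache misses per kilo-instruction
    arith_intensity: float  # operations per byte of DRAM traffic

def classify_bottleneck(p: FunctionProfile) -> str:
    """Return a coarse label suggesting where the function may be better placed."""
    if p.arith_intensity > 10.0:
        return "compute-bound: keep on host processor"
    if p.llc_mpki > 10.0 and p.lfmr > 0.8:
        return "DRAM-bandwidth-bound: strong PIM offload candidate"
    if p.llc_mpki > 10.0:
        return "DRAM-latency-bound: possible PIM offload candidate"
    return "cache-friendly: keep on host processor"

print(classify_bottleneck(FunctionProfile(lfmr=0.9, llc_mpki=25.0, arith_intensity=0.5)))
```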
Current PIM research is largely categorized into commercially accessible DRAM-based
PIM research [52-59, 81-85] and research utilizing next-generation memory [90-99], and the two directions are being pursued competitively. Unlike academic research, mass-producible
PIM products fundamentally utilize the bank-level parallelism of DRAM for computation.
They also prioritize compatibility with existing mass-produced products and cost aspects, such as minimizing the area
occupied by operation logic and addressing heat-dissipation issues. HBM-PIM [58] adds PIM functionality to the high-bandwidth memory (HBM) architecture and is designed to increase
memory bandwidth and energy efficiency by performing computational processing within
the memory. It proposes not only a hardware architecture but also a software stack.
The software stack supports FP16 operations including MAC, general matrix-matrix product (GEMM),
and activation functions, with part of the operation logic loaded onto the HBM realized using
look-up tables (LUTs). In addition, it allows programmers to write PIM microkernels using PIM commands
to maximize performance. The hardware architecture was implemented based on 20 nm
DRAM technology and integrated with an unmodified commercial processor to prove its
practicality and effectiveness at the system level. Furthermore, because it is compatible
with existing HBM, it can be deployed as a drop-in replacement. Implementing the proposed
PIM architecture significantly improved the performance of memory-bound
neural network kernels and applications: neural network
kernels increased by 11.2${\times}$, while applications showed a 3.5${\times}$ improvement.
Additionally, the energy consumption per bit transfer was reduced by 3.5${\times}$,
resulting in an overall enhancement of the system's energy efficiency by 3.2${\times}$
when running applications.
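As a generic illustration of LUT-based operation logic of the kind mentioned above, the following sketch evaluates an activation function by indexing a precomputed table. The input range, table resolution, nearest-entry lookup, and the choice of GELU are assumptions for illustration, not HBM-PIM's specification.

```python
import numpy as np

# Build a small look-up table for an activation function (GELU chosen as an example).
# With LUT-based logic, the table is precomputed once, so evaluating the function at
# run time reduces to an index lookup rather than a full arithmetic pipeline.
X_MIN, X_MAX, ENTRIES = -8.0, 8.0, 1024          # assumed range and resolution
grid = np.linspace(X_MIN, X_MAX, ENTRIES, dtype=np.float32)
gelu_lut = (0.5 * grid * (1.0 + np.tanh(np.sqrt(2 / np.pi) *
            (grid + 0.044715 * grid**3)))).astype(np.float16)

def lut_activation(x: np.ndarray) -> np.ndarray:
    """Approximate GELU by nearest-entry lookup into the precomputed table."""
    idx = np.clip(((x - X_MIN) / (X_MAX - X_MIN) * (ENTRIES - 1)).round().astype(int),
                  0, ENTRIES - 1)
    return gelu_lut[idx]

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0], dtype=np.float32)
print(lut_activation(x))   # close to exact GELU within the table's resolution
```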
Newton [59] was designed as an accelerator-in-memory (AiM) for DNNs. In this design, a minimal set of computing units, consisting of MAC units and buffers, is placed in the DRAM to satisfy the area constraints that complicate hardware design for PIM. It also uses a DRAM-like interface so that
the host can issue commands for PIM computing. The PIM matches the internal DRAM bandwidth
and speed, captures input reuse, and uses a global input vector buffer to divide the
buffer area costs across all channels. The three optimization techniques proposed
by Newton helped the PIM-host interface overcome bottlenecks: 1) grouping multiple
computational tasks within banks and bank groups (a functional sketch of this grouping
appears below); 2) supporting complex, multistep computing commands that process multiple
stages of operations simultaneously; and 3) strengthening the internal low-dropout (LDO)
regulator and DC-DC pump driver to allow higher current and faster voltage recovery. As a result,
Newton applied to HBM2E achieves an average speed improvement of 10${\times}$ over
a system assumed to ideally use the external DRAM bandwidth without applying PIM and
54${\times}$ over a GPU.
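The sketch below models, in plain NumPy, the effect of sharing one broadcast input vector across many banks while each bank multiplies its own slice of the weight matrix, in the spirit of the grouping and global input vector buffer described above. The bank count and matrix shapes are arbitrary, and this is a functional model rather than Newton's hardware.

```python
import numpy as np

# Functional model (not Newton's hardware): a GEMV is split row-wise across banks,
# every bank reuses the same broadcast input vector (the "global input vector buffer"),
# and the per-bank partial results are concatenated into the final output.
BANKS, ROWS_PER_BANK, COLS = 16, 32, 256          # assumed sizes
rng = np.random.default_rng(0)
weights = rng.standard_normal((BANKS, ROWS_PER_BANK, COLS)).astype(np.float32)
x = rng.standard_normal(COLS).astype(np.float32)  # fetched and broadcast only once

def bank_gemv(bank_weights: np.ndarray, shared_x: np.ndarray) -> np.ndarray:
    """MAC units of one bank: multiply the bank's weight slice by the shared input."""
    return bank_weights @ shared_x

partials = [bank_gemv(weights[b], x) for b in range(BANKS)]   # banks work in parallel
y = np.concatenate(partials)

# Same result as the monolithic GEMV, but the input vector was loaded only once.
assert np.allclose(y, weights.reshape(-1, COLS) @ x, atol=1e-4)
```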
The UPMEM PIM architecture [52] was the first commercialized PIM architecture to combine
conventional DRAM memory arrays with a general-purpose core, the DRAM processing unit (DPU).
DPUs are a concept proposed by UPMEM and are used to perform operations within the memory
chips. Each DPU has exclusive access to a 64 MB DRAM bank, known as the main random-access
memory (MRAM), 24 KB of instruction memory (IRAM), and 64 KB of scratchpad memory, called the
working random-access memory (WRAM). This allows programmers to write code that is executed
on the DPU and processes data within the memory. Data transfers between the host processor
and the DPUs can also be controlled, allowing parallel or sequential processing to be selected.
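The sketch below is a conceptual model of this memory hierarchy and of the choice between parallel and sequential host-to-DPU transfers. It is written in plain Python, the class and function names are hypothetical, and it does not use the UPMEM SDK.

```python
# Conceptual model only: mimics the DPU memory hierarchy in plain Python.
# The class and method names are hypothetical and are not the UPMEM SDK API.
MRAM_BYTES = 64 * 1024 * 1024   # per-DPU DRAM bank (MRAM)
WRAM_BYTES = 64 * 1024          # per-DPU scratchpad (WRAM)

class DPU:
    def __init__(self, dpu_id: int):
        self.dpu_id = dpu_id
        self.wram = bytearray(WRAM_BYTES)   # working memory for in-flight data
        self.mram = bytearray(0)            # filled by host transfers, up to MRAM_BYTES

    def copy_in(self, data: bytes) -> None:
        assert len(data) <= MRAM_BYTES, "data must fit in the DPU's MRAM bank"
        self.mram = bytearray(data)

def host_transfer(dpus: list[DPU], chunks: list[bytes], parallel: bool = True) -> None:
    """Distribute one chunk per DPU; a parallel (rank-level) transfer assumes
    equally sized chunks, while sequential transfers may differ in size."""
    if parallel:
        size = len(chunks[0])
        assert all(len(c) == size for c in chunks), "parallel transfer: equal sizes"
    for dpu, chunk in zip(dpus, chunks):
        dpu.copy_in(chunk)

dpus = [DPU(i) for i in range(4)]
host_transfer(dpus, [bytes([i]) * 1024 for i in range(4)], parallel=True)
```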
On the other hand, the most commonly used next-generation memory in PIM architecture
is ReRAM [36, 39, 91, 95, 100, 103, 107]. The ReRAM crossbar array consists of cells arranged in rows and columns. This array
can be used for memory purposes and can efficiently perform computations such as the
general matrix-vector product (GEMV), composed of MAC operations. In addition, the
use of a crossbar array can significantly reduce the overhead and energy related to
memory movement. In particular, as a pioneering study on ReRAM-based PIM, PRIME [91] divides the internal array space of a bank into memory subarrays (MemS), full-function
subarrays (FFS), and buffer subarrays. A MemS stores data only.
An FFS allows the crossbar to be used for both memory and operation logic with minimal
area overhead. To enable this, multiple voltage sources are added to provide
accurate input voltages, the column multiplexer is extended with an analog subtraction unit
and a nonlinear threshold unit, and the sense amplifiers (SAs) are modified to achieve higher precision.
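A minimal numerical sketch of the analog GEMV a crossbar performs: matrix elements are stored as cell conductances, inputs are applied as word-line voltages, and each bit-line current is the corresponding dot product (Ohm's and Kirchhoff's laws). Device non-idealities, ADC/DAC quantization, and negative-weight handling are deliberately ignored in this idealized model.

```python
import numpy as np

# Idealized crossbar GEMV: I = G^T @ V (Ohm's law per cell, Kirchhoff's law per bit line).
# Real designs must also handle negative weights, ADC/DAC precision, and device noise.
rng = np.random.default_rng(1)
W = rng.uniform(0.0, 1.0, size=(4, 3))      # weight matrix to realize (rows x cols)
G = W.copy()                                # cell conductances programmed to the weights
v = np.array([0.2, 0.5, 0.1, 0.9])          # word-line voltages encode the input vector

# Current through cell (i, j) is G[i, j] * v[i]; each bit line j sums its column.
bitline_currents = np.array([np.sum(G[:, j] * v) for j in range(G.shape[1])])

assert np.allclose(bitline_currents, W.T @ v)   # one analog step computes the GEMV
print(bitline_currents)
```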
2. PIM for CNN
Numerous PIM studies primarily support the MAC operations required by CNNs [46, 47, 52-59]. This section, however, focuses on PIM research that employs the data-mapping methods and
dataflows necessary for CNN operations. Efficient data handling in the CNN inference
process is crucial, with particular emphasis on maximizing the reuse of weights as
well as the input and output feature maps used between layers.
1) Inference Phase
Peng et al. [45] proposed an ReRAM-based PIM accelerator that adapted the data-mapping technique proposed
by Fey et al. [44] for the CONV layer. This reduces the use of interconnects and buffers by reusing
the input data and weights. As shown in Fig. 4(a), each 3D kernel of size K${\times}$K${\times}$D is arranged along a vertical column, and
the input feature map (IFM) is arranged in a similar manner, as K${\times}$K submatrices
of 1${\times}$1${\times}$D slices. As shown in Fig. 4(b), each ReRAM subarray performs its computation as a single PE.
This method is designed to maximize the reuse of IFMs and weights as the kernel (i.e.,
weights) slides over them during computation. Consequently, this study achieved a
2.1${\times}$ increase in speed and 17% improvement in energy efficiency (measured
in TOPS/W) during the inference phase with the VGG-16 model compared with [92].
Fig. 4. Processing-in-Memory for CNN proposed in [45]: (a) A basic mapping method of input and weight data, with kernel moving in multiple cycles; (b) An example of IFMs transferred among PEs and how the kernel slides over the input.
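The sketch below reproduces the spirit of this mapping in NumPy: each K${\times}$K${\times}$D kernel is flattened into one crossbar column, each sliding window of the IFM is flattened into one input vector that is reused as the kernel slides, and a matrix product yields the output pixels. This is a functional (im2col-style) model of the mapping in Fig. 4, not the cycle-level dataflow of [45]; all dimensions are arbitrary.

```python
import numpy as np

# Functional model of the mapping in Fig. 4: kernels become crossbar columns and
# IFM windows become input vectors, so convolution reduces to matrix products.
rng = np.random.default_rng(2)
D, H, W_in, K, F = 3, 6, 6, 3, 4               # channels, IFM size, kernel size, #kernels
ifm = rng.standard_normal((D, H, W_in)).astype(np.float32)
kernels = rng.standard_normal((F, D, K, K)).astype(np.float32)

# Each K*K*D kernel is flattened into one column of the "crossbar" weight matrix.
crossbar = kernels.reshape(F, -1).T            # shape: (K*K*D, F)

outputs = np.empty((F, H - K + 1, W_in - K + 1), dtype=np.float32)
for y in range(H - K + 1):
    for x in range(W_in - K + 1):
        window = ifm[:, y:y + K, x:x + K].reshape(-1)   # flattened IFM window, reused
        outputs[:, y, x] = window @ crossbar            # as the kernel slides over it
```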
2) End-to-End Training Phase
Backpropagation in CNNs requires a significant amount of computation because it involves
computing the gradients for each layer and updating the weights to train the model.
Training is also memory-bound because the intermediate features and gradients of all the
layers must be stored and tracked, which is more memory-intensive than CONV-layer inference.
Therefore, higher efficiency can be expected by optimizing the training process in PIM.
T-PIM [88] is a DRAM-based PIM design for end-to-end training of CNN models. Fig. 5 shows the data mapping of T-PIM, which reduces the overhead caused by data rearrangement
in DRAM and optimizes access to the weights. Fig. 5(a) and (b) show the data mapping methods during the forward pass (FWP) and backward
pass (BWP) within the MLP layer, respectively. To maximize the utilization of DRAM's
cell array without rearranging data, the size of the tile is set to $M_{t}\times N_{t}$
and each weight is mapped to DRAM's column addresses. During the FWP process, the
input vector is flattened to size $M_{t}$ (Input$_{\mathrm{L}}$($M_{t}$))
and multiplied with the weights arranged in DRAM. Each column is then accumulated
into an output buffer of size $N_{t}$ (Output$_{\mathrm{L}}$($N_{t}$)).
For the BWP process, to reuse the weights as aligned in the FWP process without additional
rearrangement, the loss (Error$_{\mathrm{L}}$($N_{t}$)) is flattened into $N_{t}$
elements and multiplied with the weights. Each row is then accumulated
into an output buffer of size $M_{t}$ (Output$_{\mathrm{L}}$($M_{t}$)).
Fig. 5(c) and (d) represent the data mapping methods used during the FWP and BWP in the CONV
layer, respectively. Similar to the MLP layer, the weights (Weight$_{\mathrm{L}}$) are
arranged along column addresses by kernel size ($W_{k}\times H_{k}$), so
the weights can be reused without the need for data rearrangement. T-PIM achieves a high
efficiency of 0.84-7.59 TOPS/W for 8-bit input data and 0.25-2.21 TOPS/W for 16-bit
input data in VGG-16 model training, using its non-zero computing and power-off computing
methods.
Fig. 5. Data mapping of T-PIM: (a) FWP layer; (b) BWP layer; (c) FWP, CONV layer; (d) BWP, CONV layer (Reprinted from [88] with permission).
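A functional NumPy sketch of the tile-level idea in Fig. 5(a) and (b): one $M_{t}\times N_{t}$ weight tile laid out once by column address serves the FWP as-is and the BWP without rearrangement, because the backward pass simply accumulates along rows instead of columns. The tile sizes are arbitrary, and T-PIM's bit-serial arithmetic and zero-skipping are not modeled.

```python
import numpy as np

# One M_t x N_t weight tile, mapped once to DRAM column addresses (Fig. 5(a)-(b)).
M_t, N_t = 8, 16                                  # assumed tile dimensions
rng = np.random.default_rng(3)
weight_tile = rng.standard_normal((M_t, N_t)).astype(np.float32)

# FWP: flattened input of size M_t, accumulate along columns -> output of size N_t.
input_l = rng.standard_normal(M_t).astype(np.float32)
output_fwp = np.zeros(N_t, dtype=np.float32)
for j in range(N_t):
    output_fwp[j] = np.dot(weight_tile[:, j], input_l)

# BWP: flattened error of size N_t, accumulate along rows -> output of size M_t,
# using exactly the same weight layout (no data rearrangement).
error_l = rng.standard_normal(N_t).astype(np.float32)
output_bwp = np.zeros(M_t, dtype=np.float32)
for i in range(M_t):
    output_bwp[i] = np.dot(weight_tile[i, :], error_l)

assert np.allclose(output_fwp, input_l @ weight_tile)
assert np.allclose(output_bwp, weight_tile @ error_l)
```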
3. PIM for GCN
The processing steps of a GCN (e.g., aggregation, combination, embedding, message
passing, and readout) are mostly low in operational complexity, data-dependent, and
performed repetitively. Among these, aggregation must process large amounts of data
to combine the information of each node with that of its neighboring nodes. Moreover,
the combination of operations to be performed differs depending on the relationship
between each node and its neighbors. These characteristics demand a large amount of
computation and high memory bandwidth, drawbacks that can be effectively mitigated
using PIM. PIM for GCNs has also been approached by actively utilizing ReRAM crossbars
to perform the processing in an analog computing manner [36].
Two representative techniques are the MAC crossbar and content addressable memory
(CAM) crossbar [37]. Of the two, the CAM crossbar performs content-based searches:
broadcasting a search key across multiple rows enables a parallel associative search,
and more data can be stored on a chip in the same area. TCAM [38] showed that a
2-transistor-2-resistor ReRAM cell can achieve 3${\times}$ higher density than the
conventional 8-transistor SRAM cell. The MAC crossbar can effectively perform
vector-matrix multiplication (VMM) with low energy consumption through bit-line current
accumulation. This process can be described in three steps. 1) The matrix elements are
assigned to the crossbar, with each cell's resistance precisely adjusted to correspond
to its element. 2) The vector elements are converted to voltages and applied to the
word lines. 3) The current of each bit line, the sum of the currents of all cells
connected to that bit line, is measured and read out as the product of the corresponding
matrix column and the vector.
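As a minimal functional illustration of the CAM crossbar's content-based search, the sketch below broadcasts a search key to all stored rows and returns every matching row in one comparison pass. The binary key format and exact-match policy are simplifying assumptions; a ReRAM TCAM additionally supports don't-care bits.

```python
import numpy as np

# Functional model of a CAM search: the key is broadcast to all rows at once and
# every row compares itself against it in parallel (emulated here with NumPy).
stored_rows = np.array([
    [1, 0, 1, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0, 0, 1, 0],   # duplicate of row 0: both must match
    [1, 1, 1, 1, 1, 1, 1, 1],
], dtype=np.uint8)

def cam_search(rows: np.ndarray, key: np.ndarray) -> np.ndarray:
    """Return the indices of all rows that exactly match the broadcast key."""
    match_lines = np.all(rows == key, axis=1)     # one comparison per row, in parallel
    return np.flatnonzero(match_lines)

key = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)
print(cam_search(stored_rows, key))               # -> [0 2]
```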
Fig. 6 shows the overall architecture of PIM-GCN [39], which consists of a central controller, a search engine, and two computing engines.
Each of these comprises a CAM crossbar and a MAC crossbar, and the two computing engines
operate in a typical ping-pong architecture, alternately performing aggregation and
combination. The central controller initially loads the graph data and finally exports
the GCN results back to the external DRAM. It also generates the necessary control
logic for the CAM crossbar, the MAC crossbar, and the special function unit (SFU).
The SFU, composed of a shift-and-add (S&A) unit and scalar arithmetic and logic (sALU)
units, processes the partial results derived from the MAC crossbar. PIM-GCN introduces
not only a hardware architecture that can maximize inter-vertex parallelism, but also
a technique for optimizing node grouping without violating independence, providing
scheduling for these groups to operate independently at each layer. It also proposes
a timing strategy to reduce idle time owing to differences in read/write latency.
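To make the aggregation/combination split concrete, the sketch below computes one GCN layer as a neighbor-sum aggregation followed by a dense weight combination. In an architecture such as PIM-GCN, the neighbor lookup corresponds to the CAM crossbar search and the dense products to the MAC crossbar; the grouping, scheduling, and ping-pong control are intentionally omitted, and the toy graph and dimensions are assumptions.

```python
import numpy as np

# One GCN layer split into the two phases handled by the two computing engines:
# aggregation (gather/sum neighbor features) and combination (dense product with W).
rng = np.random.default_rng(4)
num_nodes, in_dim, out_dim = 5, 8, 4
features = rng.standard_normal((num_nodes, in_dim)).astype(np.float32)
W = rng.standard_normal((in_dim, out_dim)).astype(np.float32)
neighbors = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1, 4], 4: [3]}   # toy graph

def aggregate(feats: np.ndarray, nbrs: dict[int, list[int]]) -> np.ndarray:
    """Aggregation phase: each vertex sums its own and its neighbors' features."""
    agg = np.empty_like(feats)
    for v, vs in nbrs.items():
        agg[v] = feats[v] + feats[vs].sum(axis=0)
    return agg

def combine(agg: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Combination phase: dense vector-matrix products (MAC crossbar work)."""
    return np.maximum(agg @ weight, 0.0)          # ReLU after the linear transform

out = combine(aggregate(features, neighbors), W)
print(out.shape)                                  # (5, 4)
```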
GCIM [40] is an accelerator study that presents a software-hardware co-design approach
and is the first to enable efficient GCN data processing in 3D-stacked memory.
From a hardware design perspective, the GCIM proposes a logic-in-memory (LIM) die
that integrates light computing units near the DRAM bank, fully utilizing the bandwidth
and parallelism at the bank level. The GCIM offloads memory-bound aggregation operations
onto the LIM die. Each LIM bank group is equipped with an LLU consisting of a MAC
array, vertex feature buffer (VFB), look-ahead FIFO, CAM, and a controller to accelerate
the aggregation phase. The MAC array executes the aggregation operations, and the VFB
buffers the output features during this phase. The look-ahead
FIFO is a special edge buffer implemented as a scratch-pad memory that processes the
frontmost edge upon receiving a signal from the controller. The CAM provides key-value
storage that records the ID of nonlocal vertices and the local addresses where their
replicas are buffered. The controller is a data-based control unit that processes
the aggregation operations of local vertices. On the software side, GCIM proposes
a data-mapping algorithm that considers locality. It balances the workload by splitting
the input graph into subgraphs considering the connection strength of the nodes. Here,
if the weight between two vertices is large or if multiple paths exist between them,
the connection strength is considered strong. The resulting subgraphs are assigned to
vaults and mapped to LIM bank groups, which exploits the high internal bandwidth,
reduces unnecessary data movement, and significantly improves computational efficiency
while preventing redundant calculations. In addition, GCIM adopts a sequential mapping strategy
to maximize data locality and minimize the processing delay of the aggregation. This
optimization technique uses dynamic programming [41], a mechanism that saves the optimal solution of a subproblem and reuses it to determine
the optimal solution of the entire problem. In experiments, GCIM demonstrated a remarkable
improvement in inference speed, achieving speedups of 580.02${\times}$ over HyGCN [42],
275.37${\times}$ over CIM-HyGCN, and 272.01${\times}$ over PyG-CPU [43].
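The sketch below illustrates the flavor of locality-aware partitioning: vertices are greedily grouped so that strongly connected vertices land in the same subgraph (and hence the same LIM bank group). The greedy rule, edge weights, and capacities are illustrative stand-ins, not GCIM's dynamic-programming mapping algorithm.

```python
# Illustrative greedy partitioner only; GCIM's actual mapping uses a locality-aware
# algorithm with dynamic programming, which is not reproduced here.
edges = {                      # edge weight ~ connection strength between vertices
    (0, 1): 5, (1, 2): 4, (2, 0): 3,      # strongly connected cluster {0, 1, 2}
    (3, 4): 6, (4, 5): 5,                 # strongly connected cluster {3, 4, 5}
    (2, 3): 1,                            # weak link between the clusters
}
NUM_GROUPS, CAPACITY = 2, 3               # e.g., two LIM bank groups, 3 vertices each

def greedy_partition(edges: dict, num_groups: int, capacity: int) -> list[set]:
    groups = [set() for _ in range(num_groups)]
    # Process the strongest connections first so tightly coupled vertices co-locate.
    for (u, v), _w in sorted(edges.items(), key=lambda kv: -kv[1]):
        for g in groups:
            if len(g | {u, v}) <= capacity and (g & {u, v} or not g):
                g.update((u, v))
                break
    return groups

print(greedy_partition(edges, NUM_GROUPS, CAPACITY))   # -> [{3, 4, 5}, {0, 1, 2}]
```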
Although the two studies above are based on PIM hardware architectures using different
memory technologies, both propose algorithms for grouping and mapping graph nodes in a
memory-friendly manner and effectively handle GCN aggregation and combination operations.
Fig. 6. PIMGCN architecture overview (Reprinted from [39] with permission).
4. PIM for RNN
RNN and LSTM structures can be effectively mapped to PIM owing to their similarity
to CONV layers and their ability to reuse feature maps and weights. ERA-LSTM [103] is a PuM architecture that uses ReRAM crossbars. It optimizes the RNN weight precision
and the digital-to-analog converters (DACs) of the PIM architecture of Long et al. [100] and applies a systolic dataflow to improve computing efficiency and
performance. Fig. 7(a) shows the overall structure of ERA-LSTM. The VMM unit in Fig. 7(b) stores the weights of the four LSTM gates and uses a digital-to-analog converter
to deliver the input data and hidden states from the I/O buffer to the analog ReRAM
crossbar. The computational results of the VMM unit are transmitted to an element-wise
(EW) unit. The EW unit enables EW operation of the LSTM cell in the three feedforward
layers. In addition, the VMM and EW units efficiently handle the four gate
weights (e.g., $W_{f}$, $W_{i}$, $W_{g}$, $W_{o}$) by splitting each weight
into four sub-weights (e.g., $W_{00}$-$W_{11}$) and tiling each sub-weight onto a
separate tile for computation. Furthermore, an approximator is used for the NN operations
to minimize the overhead caused by analog-to-digital converters, achieving a 6.1${\times}$
improvement in computational efficiency compared with Long et al. [100].
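The following sketch shows the arithmetic behind splitting one gate weight into four sub-weights ($W_{00}$-$W_{11}$) and computing the gate from per-tile partial products. The tile sizes are arbitrary, and ERA-LSTM's analog crossbars, DACs, and approximator are not modeled.

```python
import numpy as np

# One LSTM gate computes W @ [x; h]. Splitting W into 2x2 blocks (W00..W11) lets each
# block be mapped to its own crossbar tile; partial products are then summed per row.
rng = np.random.default_rng(5)
hidden, input_dim = 8, 6
W = rng.standard_normal((hidden, input_dim + hidden)).astype(np.float32)   # one gate, e.g. W_f
xh = np.concatenate([rng.standard_normal(input_dim),
                     rng.standard_normal(hidden)]).astype(np.float32)      # [x; h]

# Split the gate weight into four sub-weights and the input vector accordingly.
r, c = hidden // 2, (input_dim + hidden) // 2
W00, W01 = W[:r, :c], W[:r, c:]
W10, W11 = W[r:, :c], W[r:, c:]
x_top, x_bot = xh[:c], xh[c:]

# Each tile performs its own VMM; results are accumulated across tiles per output row.
gate_top = W00 @ x_top + W01 @ x_bot
gate_bot = W10 @ x_top + W11 @ x_bot
gate = np.concatenate([gate_top, gate_bot])

assert np.allclose(gate, W @ xh, atol=1e-5)       # same result as the untiled gate
```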
PSB-RNN [104] is another PuM architecture that uses a ReRAM crossbar. PSB-RNN transforms the MAC
operations required by the RNN model into operations on a single weight matrix using the
fast Fourier transform (FFT). The real ($Re$) and imaginary ($Im$) components of the resulting
matrix are mapped onto the ReRAM crossbar, thereby enabling the retrieval of complex
number operation results from each PE result. This method yielded a computational
efficiency that was 17${\times}$ higher than that of Long et al. [100] for the LSTM model. Although this study requires additional operations and tasks
beyond data mapping for the traditional LSTM model, it proposes an effective method
for ReRAM crossbar PIM by mapping the data for the complex-number operations required by
the MACs and by exploiting the corresponding dataflow.
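A brief sketch of the real/imaginary mapping: a complex matrix-vector product, such as those arising after an FFT-based transform, can be recovered from four real MAC operations, which is what allows the $Re$ and $Im$ components to be placed on ordinary real-valued crossbar arrays. PSB-RNN's FFT/block-circulant construction itself is not reproduced here, and the matrix sizes are arbitrary.

```python
import numpy as np

# A complex MAC y = W x can be computed from real-valued crossbars holding Re(W), Im(W):
#   Re(y) = Re(W) @ Re(x) - Im(W) @ Im(x)
#   Im(y) = Re(W) @ Im(x) + Im(W) @ Re(x)
rng = np.random.default_rng(6)
W = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
x = rng.standard_normal(4) + 1j * rng.standard_normal(4)

Wr, Wi = W.real, W.imag            # two real matrices mapped onto the crossbar
xr, xi = x.real, x.imag

y_re = Wr @ xr - Wi @ xi
y_im = Wr @ xi + Wi @ xr

assert np.allclose(y_re + 1j * y_im, W @ x)   # matches the direct complex product
```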
Fig. 7. ERA-LSTM: (a) architecture overview; (b) Mapping an LSTM cell to multiple tiles.
5. PIM for Transformer
TransPIM [106] is an HBM-based PnM designed to execute transformers efficiently. An arithmetic
control unit (ACU) is allocated to each bank for computation, and a token-based data
sharding scheme is proposed to allow parallel processing by dividing the data required
for the computation and assigning it across the HBM's bank stack. The study also adopts
a token-based transformer computation method, which enables independent operations
between tokens, in contrast to the existing layer-wise transformer execution.
Fig. 8(a) illustrates the encoder process of TransPIM. The input is of size $L{\times}D$,
where $L$ denotes the number of tokens and $D$ the dimension of the embedding vectors.
Input tokens $I_{1}$, $I_{2}$, and $I_{3}$ are allocated to the banks using
a technique that distributes the input tokens across $N$ banks. Based on this, the embedding
values $Q_{i}$, $K_{i}$, and $V_{i}$ corresponding to each input token are calculated
and assigned to the same bank, followed by a
self-attention operation. For multi-head attention (MHA), $K_{i}$ and $V_{i}$ are sequentially transferred
to bank $i+1$ for calculation using the ring-broadcast technique,
thus enabling computation with minimal data transmission between banks. Fig. 8(b) shows a decoder block, where $K$ and $V$ are received from the encoder for reuse,
and only the last bank obtains new $Q$, $K$, and $V$ vectors for the fully connected layer
computation. The new $Q_{new}$ is broadcast to all other banks to calculate the
attention scores, and $K_{new}$ and $V_{new}$ are concatenated with the previous
$K_{i}$ and $V_{i}$ of the last bank. Each bank stores the weights for $Q$, $K$, and
$V$ during this time, and the ring-broadcast technique is employed to reuse the stored
weights and $Q$, $K$, and $V$ values in the other banks, facilitating the efficient processing
of repeated NN operations. To this end, this study incorporates the ACU onto the banks
of HBM memory and adds a ring broadcast unit between the banks. This allows for a
reduction of more than 30.8% in the data movement overhead on average compared with
the existing transformer, with only 4% additional area overhead relative to the original
DRAM. This study ensured that the PIM power remained below the DRAM power budget of
60 W.
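The sketch below models the token-based sharding and ring broadcast at a functional level: each bank holds one token's $Q_{i}$, $K_{i}$, $V_{i}$, and the key/value pairs are rotated bank-to-bank so every bank eventually sees all of them without a global gather. The bank count, single head, and omission of the value-weighted accumulation are simplifications; this is not TransPIM's hardware dataflow.

```python
import numpy as np

# Functional model of token-based sharding with a ring broadcast of (K_i, V_i).
rng = np.random.default_rng(7)
num_banks, d = 4, 8                      # one token per bank, embedding size d (assumed)
Q = rng.standard_normal((num_banks, d)).astype(np.float32)
K = rng.standard_normal((num_banks, d)).astype(np.float32)
V = rng.standard_normal((num_banks, d)).astype(np.float32)

scores = np.zeros((num_banks, num_banks), dtype=np.float32)
k_ring, v_ring = K.copy(), V.copy()      # each bank starts with its own K_i, V_i
owner = np.arange(num_banks)             # which token's K/V each bank currently holds

for _ in range(num_banks):
    for bank in range(num_banks):        # every bank works on whatever K it holds now
        scores[bank, owner[bank]] = Q[bank] @ k_ring[bank]
    # Ring step: bank i forwards its current (K, V) to bank i+1 (with wrap-around).
    k_ring = np.roll(k_ring, 1, axis=0)
    v_ring = np.roll(v_ring, 1, axis=0)
    owner = np.roll(owner, 1)
    # (the attention-weighted accumulation over v_ring is omitted for brevity)

assert np.allclose(scores, Q @ K.T)      # full score matrix, no all-to-all broadcast
```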
ReTransformer [107] proposed and applied optimization techniques to effectively accelerate
the GEMV operations of the transformer inference process and to implement softmax with low
power in ReRAM-based PIM. The study follows a direction similar to existing ReRAM-based PIM
work targeting transformer workloads, implementing MatMul operations inside the ReRAM and
applying optimization techniques so that the latency of the computation can be reduced.
Specifically, it proposes decomposing the operation into two consecutive multiplication
steps to resolve the compute-write-compute dependency that occurs when the MatMul between
Q and $K^{T}$ is implemented in ReRAM during transformer inference. Consequently, the
latency of writing intermediate results into the ReRAM crossbar can be eliminated. In
addition, a modified hybrid softmax formulation that makes the most of the ReRAM crossbar
arrangement was proposed and applied to the softmax operation; as a result, the softmax
consumes only 0.691 mW, compared with 1.023 mW for the conventional implementation.
Finally, this study achieved a 23.21${\times}$
computational efficiency improvement and a 1,086${\times}$ power consumption reduction
compared to NVIDIA TITAN RTX GPUs.
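As a purely algebraic illustration of why such a decomposition can remove the intermediate write, note that $QK^{T} = (XW_{Q})(XW_{K})^{T} = X\,[W_{Q}(XW_{K})^{T}]$, so the product can be evaluated as two consecutive multiplications without programming $Q$ itself into a crossbar. The sketch checks this identity numerically; it illustrates the reordering idea only and is not ReTransformer's exact dataflow.

```python
import numpy as np

# Algebraic check: Q K^T can be computed without ever materializing Q as a crossbar
# operand, by reordering the products into two consecutive multiplications.
rng = np.random.default_rng(8)
L, d = 6, 8                                   # sequence length, embedding size (assumed)
X = rng.standard_normal((L, d))
W_Q = rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))

# Baseline with the compute-write-compute dependency: Q must be formed first.
Q = X @ W_Q
K = X @ W_K
baseline = Q @ K.T

# Reordered: multiply W_Q by K^T first, then multiply by X (Q is never formed).
reordered = X @ (W_Q @ K.T)

assert np.allclose(baseline, reordered)
```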
Fig. 8. Token-based data sharding scheme and the dataflow of Transformer: (a) encoder; (b) decoder in TransPIM (Reprinted from [106] with permission).
6. Discussions
PIM is a new architecture that integrates processing and memory units into the
PE, thereby enabling efficient data processing. However, because computational functions
are integrated into memory, PIM may be limited in handling complex operations and can
suffer performance degradation when computationally intensive operations are required.
Moreover, PIM's complex control structure and limited memory capacity limit the full
and effective handling of increasingly large AI workloads. For
PIM cores to be effectively applied to AI workloads, clear criteria are required to
determine whether operands should be computed in the host processor or the PIM core.
These criteria are typically derived by statistically analyzing the results measured
at the functional level using benchmark simulators [34,35]. In addition, during the PIM design process, the mapping of operations and parameters,
as well as the dataflow for complex operations, must be carefully incorporated.
In previous PIM studies, these considerations were designed heuristically. However,
with increasingly diverse PIM architectures and algorithms, there is an urgent need
for research on compilers that can automatically optimize workload functions, data
mapping, and data flow.