Kim Hee-Tak1,†
Hong Yun-Pyo1
Jeon Seok-Hun1
Hwang Tae-Ho1
Kim Byung-Soo1
(Korea Electronics Technology Institute, Seongnam-si, Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
Spiking neural network, network on chip, spiking convolution, accelerator, data reuse
I. INTRODUCTION
Spiking neural networks (SNNs), which attempt to emulate the mammalian cortex through biologically plausible spiking neuron models, are gaining considerable interest as the third generation of neural networks after artificial neural networks (ANNs). Spiking neurons transmit and receive information as series of spikes via synapses in a brain-inspired spatio-temporal domain. Benefiting from the characteristics of spikes, namely sparse activity and binary representation, a number of training algorithms and hardware architectures have been proposed.
In terms of training algorithms, SNNs have developed along two paths. First, motivated by experimental observations in biological neurons, SNNs were trained with biologically plausible algorithms such as spike-timing dependent plasticity (STDP) [1]. To improve the accuracy of STDP in real-world applications such as classification tasks, various spiking neuron models [2,3] and mathematically formulated STDP mechanisms [4,5] have been applied. Despite noticeable improvements, the biologically plausible training algorithms still suffer from low accuracy in real-life applications. To achieve accuracy comparable to ANNs, back-propagation (BP) has been applied to SNNs through two training methods: 1) ANN-to-SNN conversion [6,7] and 2) BP-on-SNN [8-10]. In the ANN-to-SNN conversion method, the parameters of a pre-trained ANN are normalized so that they fit SNN architectures, which include the spike generating function and the membrane potential calculation. In the BP-on-SNN method, BP is applied directly to SNN architectures. Here, to relieve the non-differentiable nature of the discrete spike activity, which impedes gradient-descent-based BP on SNNs, additional layers such as a postsynaptic current (PSC) layer [10] or an auxiliary layer [9] are added to the SNN. Both methods have shown accuracy comparable to ANNs, which brings about a demand for an efficient hardware accelerator for SNNs.
To accelerate SNNs efficiently, two main processor architectures have been used: event-driven and frame-driven architectures. Fig. 1 illustrates the two architectures. First, exploiting the binary spike representation and sparse activity of spiking neurons, address event representation (AER) based event-driven architectures have been proposed to reduce energy consumption. IBM TrueNorth [11], Intel Loihi [12], and [13,19] proposed 2D mesh network architectures that deploy multiple cores with routers, where every core includes hundreds of spiking neurons. In the mesh network, each core generates spikes in every time-step, and the generated spikes are transferred as AER packets through the routers to the cores containing the corresponding post-synaptic neurons. AER-based event-driven architectures therefore achieve low power by leveraging the binary and sparse spike activity. They also enable emulation of SNNs in more biological settings as well as SNN acceleration [14]. However, event-driven architectures require a large chip area because the entire set of SNN parameters is realized on-chip. In addition, they suffer from low throughput and high latency due to the small number of parallel processing units in the processor. To improve throughput within a small chip area, data-reuse featured frame-driven architectures have been proposed. Similar to ANN accelerators, frame-driven architectures store all SNN parameters in external memory and iteratively process a small part of the SNN operations. At the expense of using external memory, a larger number of parallel processing units can be implemented, which improves the throughput and latency of SNN acceleration. In [15,16], systolic-array and spine-based architectures are used to accelerate SNNs, and multiple time-steps are processed simultaneously in the processing elements (PE) in order to reuse partial sums (psums).
Fig. 1. Two types of SNN processor architecture.
In this work, we have designed ketrion, an SNN accelerator implemented in a 55 nm CMOS process. The main contributions of ketrion are summarized as follows:
1. The energy consumption overhead from the membrane potential calculation in the spiking neuron is minimized by a novel reuse dataflow. In addition, the energy-efficient row stationary dataflow is adopted.
2. A high-bandwidth network-on-chip bus and a pipelined architecture provide high-throughput SNN operation.
3. CNN as well as SNN modes are supported, and a pooling core and activation function cores are implemented to realize end-to-end application support in hardware.
4. ketrion is integrated into a system on chip (SoC) architecture to evaluate its performance with a widely used software platform.
II. BACKGROUND
1. Spiking Neuron Model
SNNs consist of spiking neurons and synapses, where the spiking neurons are connected through the synapses. To formulate the biological neuron as a spiking neuron, a number of neuron and synapse models have been proposed, e.g., the Izhikevich neuron [21] and the conductance synapse model [22]. The most popular and simplest models are the leaky integrate-and-fire (LIF) neuron and the linear synapse. LIF literally indicates that the membrane potential (memb) retained in the neuron integrates the received inputs, in the form of a summation over the connected synapses, leaks as time passes, and the neuron fires a spike when the memb exceeds its threshold value. The equations of a digitized single LIF neuron with the linear synapse are shown below:

$U[t] = \tau U[t-1] + \sum_{i} W_{i}X_{i}[t]$ (1)

$S[t] = 1$ if $U[t] \geq V_{th}$, otherwise $S[t] = 0$ (2)

$U[t] = V_{reset}$ if $S[t] = 1$ (3)

where t is a digitized time-step in the time domain, $V_{th}$ is a threshold voltage, $V_{reset}$ is a reset voltage after a spike generation, $\tau$ is a leakage factor, S[t] is the generated spike at the t-th time-step, and U[t] is the memb of the neuron at the t-th time-step. W and X are the synapse weights and spikes connected to the neuron, respectively. An illustration of the LIF neuron model is shown in Fig. 2. Here, a neuron that produces an incoming spike is called a pre-synaptic neuron (ineu), and a neuron that generates a spike is called a post-synaptic neuron (oneu). Each ineu has an $X_{i}$ value and each oneu has an $S_{i}$ value. When one ineu is connected to several oneus, a spike (S) generated by the ineu is propagated to the connected oneus as $X_{i}$. The LIF operation is repeated for the length of the time-steps, wherein the memb is updated at every time-step. Therefore, the length of the time-steps and the memb should be carefully considered for an SCNN processor to improve energy efficiency.
Fig. 2. An illustration of the LIF spiking neuron model at the $t_{x}$-th time-step.
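For reference, a minimal Python sketch of the digitized LIF update in (1)-(3) is shown below; it is an illustration of the equations only, not the hardware implementation, and the numeric values are hypothetical.

```python
import numpy as np

def lif_step(U_prev, W, X_t, tau=0.9, V_th=1.0, V_reset=0.0):
    """One digitized LIF time-step: leaky integration (1), firing (2), reset (3)."""
    U = tau * U_prev + float(np.dot(W, X_t))  # integrate weighted input spikes
    S = 1 if U >= V_th else 0                 # fire when memb crosses the threshold
    if S:
        U = V_reset                           # reset memb after the spike
    return U, S

# Example: a single neuron with four synapses simulated over T = 5 time-steps.
W = np.array([0.6, -0.2, 0.8, 0.4])           # synapse weights (hypothetical values)
X = np.array([[1, 0, 1, 0],                   # input spikes, one row per time-step
              [0, 1, 0, 0],
              [1, 1, 1, 1],
              [0, 0, 1, 0],
              [1, 0, 0, 1]])
U, spikes = 0.0, []
for t in range(X.shape[0]):
    U, S = lif_step(U, W, X[t])
    spikes.append(S)
print(spikes)
```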
2. Spiking Convolutional Neural Network (SCNN)
When the spiking neuron, which emits spikes in the time domain, is used as the activation function in a CNN, we call the network a spiking CNN (SCNN). SCNNs are composed of the layers of CNNs, such as the pooling layer, the convolutional layer and the fully connected layer, where every neuron in the layers is a spiking neuron operating in the time domain. In SCNNs, the convolution stage takes as inputs a 4-D ineu array (X) and a 4-D synapse array (W), and produces as output a 4-D oneu array (S). The 4-D ineu consists of Hi x Wi sized 2-D feature maps with N channels and T time-steps. The 4-D oneu consists of Ho x Wo sized 2-D feature maps with M channels and T time-steps. The 4-D synapse consists of Hf x Wf sized 2-D filters with N input channels and M output channels. The memb (U) has the same shape as the oneu. All of these parameters are listed in Table 1. Thus, in an SCNN, Eq. (1) of the spiking neuron model is modified as below:

$U_{m}[t] = \tau U_{m}[t-1] + \sum_{n=1}^{N} W_{n,m} \otimes X_{n}[t]$ (4)

where $\otimes$ denotes the 2-D convolution operation, $X_{n}[t]$ is the 2-D ineu of the n-th channel at the t-th time-step, $W_{n,m}$ is the 2-D synapse of the n-th input channel and the m-th output channel, and $U_{m}[t]$ is the 2-D memb of the m-th channel at the t-th time-step. After $U_{m}[t]$ is calculated in (4), (2) and (3) are applied to generate $S_{m}[t]$, a 2-D spike map with the same shape as $U_{m}[t]$. An illustration of the SCNN operation is shown in Fig. 3. From the SCNN equation, we find that the SCNN operation can be reduced to the conventional CNN operation by setting some of its parameters. When $\tau$ is set to one, T is set to one to remove the time-domain processing and the memb carried across time-steps, and the spike generating function (2) is replaced by an activation function such as the rectified linear unit (ReLU) used in CNNs, only the convolution operation in (4) remains. In this way, the SCNN operation can be described as the conventional CNN operation. Based on this algorithmic modification, in this work we propose a hardware architecture that supports both SCNNs and CNNs by sharing identical algorithmic operations.
Fig. 3. An illustration of SCNN operation at the $t_{x}$-th time-step. The notation ‘*’ denotes the convolution operation.
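To make the reduction concrete, the following NumPy sketch (an illustration of Eqs. (2)-(4), not ketrion's dataflow) performs the spiking convolution; setting tau = 1, T = 1, and replacing the threshold with a ReLU recovers a conventional convolution layer. The shapes and values in the example are hypothetical.

```python
import numpy as np

def conv2d(x, w):
    """Valid 2-D cross-correlation of a (Hi, Wi) map with a (Hf, Wf) filter."""
    Hi, Wi = x.shape
    Hf, Wf = w.shape
    Ho, Wo = Hi - Hf + 1, Wi - Wf + 1
    y = np.zeros((Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            y[i, j] = np.sum(x[i:i+Hf, j:j+Wf] * w)
    return y

def scnn_layer(X, W, tau=0.9, V_th=1.0, V_reset=0.0):
    """Spiking convolution layer: X is (T, N, Hi, Wi), W is (N, M, Hf, Wf).
    Returns output spikes S of shape (T, M, Ho, Wo)."""
    T, N, Hi, Wi = X.shape
    _, M, Hf, Wf = W.shape
    Ho, Wo = Hi - Hf + 1, Wi - Wf + 1
    U = np.zeros((M, Ho, Wo))                    # memb, reused across all time-steps
    S = np.zeros((T, M, Ho, Wo))
    for t in range(T):
        for m in range(M):
            psum = sum(conv2d(X[t, n], W[n, m]) for n in range(N))
            U[m] = tau * U[m] + psum             # Eq. (4)
            fired = U[m] >= V_th                 # Eq. (2)
            S[t, m] = fired.astype(float)
            U[m] = np.where(fired, V_reset, U[m])  # Eq. (3)
    return S

# Example: T=5, N=3 input channels of 8x8 spike maps, M=4 output channels, 3x3 filters.
X = (np.random.rand(5, 3, 8, 8) < 0.2).astype(float)
W = np.random.randn(3, 4, 3, 3) * 0.3
S = scnn_layer(X, W)      # shape (5, 4, 6, 6)
```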
Table 1. Parameters for SCNN operation
Parameter | Description
Wi / Hi | Width / Height of pre-synaptic neuron
Wo / Ho | Width / Height of post-synaptic neuron
Wf / Hf | Width / Height of synapse shape
N / M | Channel size of pre-/post-synaptic neuron
T | Length of digitized time-step
In this work, we aimed to design an SCNN processor that can be used for a wide variety of SCNN architectures. From our observation of the two types of SCNN training methods, ANN-to-SNN and BP-on-SNN, we found that a multi-bit representation is essential for the ineu (X) for the following reasons. 1) In the BP-on-SNN training method [8,9], the added PSC layer or auxiliary layer transforms the 1-bit spike into a multi-bit ineu. In the PSC layer, the spike is transformed into a multi-bit smoothed activation that follows inter-/intra-neuron dependencies. In the auxiliary layer, the moving average of the spike activity is added to the spike inputs, which produces a multi-bit ineu. Illustrations of the PSC layer and the auxiliary layer are shown in Fig. 4. 2) In the ANN-to-SNN training method, the first convolution layer always takes multi-bit pixel inputs, not 1-bit spike inputs. 3) A multi-bit representation of the spike occurs after the pooling layer because several spikes are added together in the max-pooling operation. For these reasons, we use 16-bit fixed-point precision for the ineu in this work.
Fig. 4. An occurrence of multi-bit activations in SCNNs: (a) Activation after the PSC layer; (b) Activation after the summation of oneus with the auxiliary layer output.
3. Hardware Dataflow for SCNNs
Since data movement to and from the external memory, mostly DRAM, dominates the increase in energy consumption, reducing DRAM accesses is important for an SCNN processor. In the data-reuse featured frame-driven hardware architecture, all SCNN parameters such as ineu, synapse and oneu are stored in DRAM, and a small part of the whole parameter set is read and computed in the processor at a time. In order to store and reuse this small part of the parameters in the processor, an SRAM-based global buffer (gbuf) is implemented. Here, the number of DRAM accesses is determined by the size of the parameters stored in gbuf and by the hardware dataflow, which should reuse the parameters even with a small gbuf. Unlike CNNs, in SCNNs the DRAM accesses for memb, which is updated at every time-step, must additionally be considered. Reducing DRAM accesses for memb is realized, firstly, by adopting a memb-stationary (MS) dataflow that reuses a particular amount of memb data in gbuf without transferring it to DRAM while the operations across all time-steps are being processed, and secondly, by increasing the size of gbuf to store memb. Without the MS dataflow, DRAM traffic as large as the memb itself is required at every additional time-step, which incurs huge energy consumption. Thus the MS dataflow is necessary for SCNN processors. With the MS dataflow, when gbuf can store only a small amount of memb, the entire memb must be divided into multiple small groups and processed group by group. Since computing each group requires DRAM accesses for the entire synapse set, a smaller gbuf for memb causes more groups and more DRAM accesses. Fig. 5 shows the increasing number of groups according to the size of memb when the MS dataflow is adopted. In the figure, the number of groups is calculated by applying the ceiling function to the size of memb divided by the gbuf size.
Fig. 5. The number of groups increases with the size of memb. Lines with different colors represent the gbuf size for storing memb.
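As a numerical illustration of this relation, the group count and the resulting synapse traffic can be estimated as sketched below; the layer sizes are hypothetical, and the traffic model is a simplification that assumes the entire synapse set is fetched once per group.

```python
import math

def ms_groups_and_traffic(memb_bytes, gbuf_memb_bytes, synapse_bytes):
    """Number of memb groups under MS dataflow and the resulting DRAM synapse reads,
    assuming the whole synapse set is fetched once per group."""
    groups = math.ceil(memb_bytes / gbuf_memb_bytes)
    synapse_read_bytes = groups * synapse_bytes
    return groups, synapse_read_bytes

# Hypothetical layer: 1 MB of memb, 80 KB of gbuf reserved for memb, 500 KB of synapses.
groups, traffic = ms_groups_and_traffic(1 << 20, 80 * 1024, 500 * 1024)
print(groups, traffic // 1024)   # 13 groups -> 13 x 500 KB = 6500 KB of synapse reads
```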
Various SCNN processors adopting data-reuse dataflows have been proposed. SpinalFlow [15] was proposed, where each PE is responsible for producing one oneu value. Therefore, SpinalFlow needs to iteratively read synapses with different input channels from DRAM whenever a PE computation ends. Also, a large gbuf capacity is required for storing synapses with a large number of input channels. Moreover, SpinalFlow targets temporal-coding based SCNNs, which deviates from general SCNN architectures. Next, [16] proposed a systolic-array based SCNN accelerator where ineus with multiple time-steps and synapses with multiple output channels are processed in parallel. However, in this dataflow, since the membs of multiple time-steps are computed simultaneously, a gbuf capacity several times larger is required for storing memb than in a dataflow that computes memb for only a single time-step. In addition, PE utilization degrades when the length of the time-steps is shorter than the systolic array size.
In this work, we propose a hardware dataflow that combines the MS dataflow with the row-stationary (RS) dataflow [17]. In the RS dataflow, as shown in Fig. 6, ineu and synapse are shared in the diagonal and row directions of the PE array, respectively, and partial sums (psums) are accumulated along the same column of the PE array. Also, several rows of the PE array compute one channel of oneu. Therefore, in the RS dataflow, the 3-D convolution is performed in the PE array by sharing ineus and synapses across multiple PEs. The psums computed from the PE array are accumulated in gbuf, as much as gbuf allows, until the psum accumulation is complete. To combine the MS dataflow with the RS dataflow, after the membs of the current time-step are completely computed and stored in gbuf, the membs for the next time-step are computed using the membs stored in gbuf. In the proposed dataflow, thanks to the RS dataflow, a large amount of memb can be stored in gbuf, and thanks to the MS dataflow, memb data are not transferred to DRAM during SCNN operations. A detailed description of the proposed dataflow is given in Section III.2.
Fig. 6. Row stationary dataflow.
III. PROPOSED HARDWARE ARCHITECTURE
In this section, we present a NoC-based architecture, called ketrion, which efficiently processes SCNNs. The proposed architecture is designed to achieve the following features: 1) ketrion supports the operations of not only SCNNs but also CNNs by sharing common computational logic. 2) To remove memb data movement to and from the external memory, a novel dataflow is proposed. In this dataflow, all oneus are computed and stored in gbuf without moving psums to the external memory. In each time-step, new spikes and membs are computed using the oneus in gbuf, and the oneus in gbuf are replaced by the newly updated membs. The updated membs are used as the initial memb for the next time-step. Thus, membs are reused across all time-steps. 3) High throughput is achieved by deploying 240 (15x16) PEs and a NoC that provides high bandwidth (256 bit/cycle) between the 128 KB gbuf and the PEs. 4) Various types of activation functions and spiking neuron models are supported. Moreover, since the activation function is computed while the final calculated oneus are stored in gbuf, other activation functions can easily be added. In this work, a leaky rectified linear unit (leaky ReLU) and a leaky integrate-and-fire (LIF) spiking neuron model are implemented. In CNN operation mode, the memb data are used as the oneu of the layer.
1. Architecture Overview
Fig. 7 shows the overall architecture of ketrion, which is interfaced through advanced microcontroller bus architecture (AMBA) protocols, namely an advanced peripheral bus (APB) slave and an advanced high-performance bus (AHB) master. ketrion includes the PE array, a global controller, APB and AHB interfaces, a 128 KB gbuf, a psum accumulator, an activation function core and a pooling core. All modules in ketrion are controlled by finite state machines in the global controller. 80 KB of gbuf is used for memb, and 48 KB of gbuf is used as a dual buffer for ineu and synapse. The PE array consists of 15 vertical and 16 horizontal PEs. Parameters to conduct SCNN operations are configured through the APB interface, and the external memory is accessed through the AHB interface. Here, a dual buffer is used for the read path of the AHB interface to alleviate the throughput bottleneck between the PEs and the external memory. When the requested ineu and synapse data are stored in gbuf, psums are computed through the PE array and the psum accumulator. Leakage and reset computations for memb are performed in the psum accumulator. Each PE includes two 256-bit buffers for ineu and synapse. To support CNNs as well as the first layer and the pooling layer of SCNNs, ketrion uses 16-bit fixed-point MAC operations in the PEs. When the final oneus are obtained in gbuf, the spiking neuron model or the leaky ReLU function is optionally applied to them. To compute memb within ketrion, the activation core is authorized to write the newly updated membs to gbuf.
Fig. 7. Proposed hardware architecture (ketrion).
2. Proposed Dataflow
ketrion follows the nested loop iteration stated in Algorithm 1. Following the RS dataflow, once ineus and synapses are read from the external memory, multiple channels (M$_{PE}$) of oneu are computed in one pass of the PE array by reusing ineus. Thus, the outermost iteration becomes M$_{iter}$. Since the size of gbuf for storing ineu and synapse is limited, the iterations over Wi and Hi are divided into W$_{iter}$ and H$_{iter}$, respectively, corresponding to the gbuf size (W$_{gbuf}$ and H$_{gbuf}$), when the ineu is too large to be stored in gbuf. The iteration over the time domain is set to T because the time domain is not processed in parallel. In order to reuse membs across all time-steps, psums are accumulated in gbuf until the final oneus are obtained through all time-steps, so the time-domain iteration advances after all channel iterations of ineu end. When the processing of every time-step ends, either the activation function or the spiking neuron model is applied to the final oneu. Thanks to the proposed dataflow, memb data are entirely reused inside ketrion without being transferred to the external memory.
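The loop ordering described above can be summarized by the following Python sketch, a schematic of Algorithm 1 under the assumptions stated here: loop bodies are placeholders, and spatial tiling details such as filter halos are omitted.

```python
import math

def ketrion_loop_nest(M, N, T, Hi, Wi, M_PE, H_gbuf, W_gbuf):
    """Schematic of the proposed dataflow: each memb tile stays in gbuf across all
    time-steps (MS dataflow), while the RS dataflow maps each tile's convolution
    onto the PE array."""
    M_iter = math.ceil(M / M_PE)        # output-channel tiles per PE-array pass
    H_iter = math.ceil(Hi / H_gbuf)     # spatial tiles when ineu exceeds gbuf
    W_iter = math.ceil(Wi / W_gbuf)
    for m in range(M_iter):             # outermost loop: M_iter
        for h in range(H_iter):
            for w in range(W_iter):
                # memb tile (m, h, w) is allocated in gbuf and kept there
                for t in range(T):      # time domain processed sequentially
                    for n in range(N):  # psums accumulated over input channels
                        pass            # RS dataflow: PE array + psum accumulator
                    # leak / threshold / reset applied once the channel loop ends
                # final oneus of the tile are written back to DRAM once
```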
3. Network on Chip
To deliver ineu and synapse from gbuf to the PEs, ketrion follows the single-cycle multicast network proposed in [17]. Since the PE array is organized with X- and Y-buses, input data are delivered as packets that contain an X-ID and a Y-ID. In ketrion, the NoC architecture is modified as follows to improve bus utilization and throughput. First, the bus width is implemented as 256 bits, which is the same size as the ineu and synapse buffers in each PE. Therefore, the NoC bus fills one buffer in a PE in a single cycle. Since synapses are reused for multiple ineus in the convolution operation, a PE performs MACs for 16 cycles or more per delivered buffer. Furthermore, in order to reduce the bus usage for psum operations and to make a pipelined processing path from the PE array output (psum) to the psum accumulation, the psum data path is separated from the NoC bus into a psum accumulator, where psums are accumulated through read and write accesses to gbuf. In the psum accumulator, in order to process psum accumulation and NoC transfers simultaneously, a dual register file that behaves like a ping-pong memory is implemented for storing psums.
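A back-of-the-envelope view of this bus sizing (simple arithmetic on the figures above, not a measured result):

```python
BUS_BITS = 256         # NoC bus width per cycle
WORD_BITS = 16         # fixed-point operand width
PE_BUFFER_BITS = 256   # ineu/synapse buffer inside each PE

words_per_fill = PE_BUFFER_BITS // WORD_BITS   # 16 operands per delivered buffer
fill_cycles = PE_BUFFER_BITS // BUS_BITS       # 1 cycle to fill one PE buffer
min_mac_cycles = words_per_fill                # each fill feeds >= 16 MAC cycles
print(words_per_fill, fill_cycles, min_mac_cycles)   # 16 1 16
```

Since each single-cycle fill sustains at least 16 cycles of MAC work, the multicast bus can keep the whole array busy whenever data are shared across PEs, as in the convolutional layers.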
Fig. 8. Timing diagram of ketrion.
The overall timing diagram is shown in Fig. 8. In the figure, case-A and case-B describe the cases where the NoC processing time and the external memory access time are the slower one, respectively. Since a dual buffer is used for the read path of the AHB interface, the slower of the external memory access and the NoC processing dominantly determines the throughput. As shown in the figure, the processing of the PE array, the psum accumulator, and the gbuf operations for psum are pipelined, which results in high PE utilization. After the processing of all channels of ineu is complete, the final oneus are written to the external memory.
4. Fully Connected Layer
The fully connected layer has three types of data reuse: 1) an ineu is reused for its corresponding synapses; 2) a psum is reused for computing one corresponding oneu; 3) a synapse is reused for ineus in different time-steps. However, the third type requires a large amount of memory to store synapses. Therefore, considering that the on-chip memory capacity of ketrion is limited, ketrion controls the fully connected layer by leveraging the first and second types. As presented in Fig. 9, ineu and psum are reused in the row and column directions of the PE array, respectively. As in the convolutional layer operation, psums are kept stationary in gbuf until the final oneu is computed. In fully connected layer operation mode, the gbuf regions for storing ineu and synapse are switched because the synapse is larger than the ineu. However, since each PE receives different synapses in fully connected layer mode, under the limited NoC bus bandwidth ketrion suffers from low PE utilization: only 16 out of 240 PEs perform MACs simultaneously, because the 256-bit NoC bus activates one PE per cycle and each activated PE performs MACs for 16 cycles.
Fig. 9. Dataflow for fully connected layer.
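The 16-out-of-240 figure follows directly from the bus constraint; a short sanity check (simple arithmetic under the assumptions stated in the text):

```python
TOTAL_PES = 240
BUS_BITS, WORD_BITS = 256, 16

mac_cycles_per_fill = BUS_BITS // WORD_BITS        # 16 MAC cycles per delivered buffer

# In FC mode every PE needs its own synapses, so the bus can start at most one
# new PE per cycle; after 16 cycles the first PE is waiting for data again.
active_pes = min(TOTAL_PES, mac_cycles_per_fill)   # = 16
utilization = active_pes / TOTAL_PES               # = 16 / 240, about 6.7 %
print(active_pes, f"{utilization:.1%}")
```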
IV. IMPLEMENTATION RESULTS AND VERIFICATION
1. System on Chip Architecture and Implementation Results
To verify the ketrion processor, we implemented a system on chip (SoC) architecture that communicates with a host computer through USB and UART interfaces and controls ketrion through an OpenRISC core [20], as shown in Fig. 10. For the external memory, a 64-bit 512 MB SDRAM was used. The SoC architecture was synthesized in a 55 nm CMOS process and fabricated in a 4x4 mm2 chip. As shown in Table 2, ketrion occupies 4,458,997 um2 and achieves a throughput of 38.4 GMAC/s at the maximum clock frequency of 160 MHz. The critical path delay lies in the 16-bit multiplication in the PE. From our area breakdown analysis, the computation modules in ketrion occupy the area as follows: 15x16 PE array (53 %), 128 KB SRAM (22 %), accumulator (20 %) and controller (5 %).
Fig. 10. SoC architecture.
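For reference, the headline throughput and energy-efficiency figures in Table 2 follow from the PE count, the clock frequency, and the simulated power; a quick arithmetic check (the small difference from the 209.83 GMAC/s/W in Table 2 comes from rounding of the power figure):

```python
num_pes = 240            # 15 x 16 PE array
f_clk_hz = 160e6         # maximum clock frequency
power_w = 0.183          # ketrion power from simulation (Table 2)

peak_gmacs = num_pes * f_clk_hz / 1e9        # 38.4 GMAC/s (one MAC per PE per cycle)
efficiency = peak_gmacs / power_w            # roughly 209.8 GMAC/s/W
print(peak_gmacs, round(efficiency, 2))      # 38.4 209.84
```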
Table 2. Hardware specification
Components | Value
Number of PEs | 15x16 (=240)
Global buffer size | 128 KB SRAM
Bit precision | 16-bit fixed point
Throughput | 38.4 GMAC/s
Synthesis area | 4,458,997 um2 (ketrion only), 8,841,492 um2 (SoC)
Chip size | 4x4 mm2 (SoC implementation)
Maximum frequency | 160 MHz (Core 1.2 V applied)
Power | 398 mW (SoC, chip measured), 183 mW (ketrion, simulation result)
Energy efficiency | 209.83 GMAC/s/W
CMOS process | 55 nm
2. Evaluation Results
In order to evaluate the performance of ketrion, the fabricated SoC chip is mounted on a verification board, as shown in Fig. 11. The SoC peripheral interfaces are integrated with a PyTorch software platform [18] on the host computer. We used the TSSL-BP algorithm proposed in [10] to train the SCNN. Single-precision floating-point numbers were used for training, and signed 16-bit integer precision was used for inference. The single-precision floating-point parameters of the trained network are quantized into 16-bit signed integers by adjusting the decimal point. Outliers among the floating-point values are clipped to the maximum and minimum values of the 16-bit integer range. In the inference, the ineu has non-negative values due to the binary characteristic of spikes, and the synapse can have positive or negative values with a dynamic decimal point. The 24-bit MAC results from the PEs are rounded and accumulated into 16-bit psums in gbuf according to the decimal point of the weights.
Fig. 11. SoC verification PCB board. Die photo (left) and package photo (right) are shown in a white box.
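A minimal sketch of the quantization scheme described above is given below; the helper names and the fractional-bit choice are illustrative, not the exact tool flow used in this work.

```python
import numpy as np

def to_fixed16(x, frac_bits):
    """Quantize float values to signed 16-bit integers with a chosen decimal point,
    clipping outliers to the int16 range."""
    q = np.round(np.asarray(x, dtype=np.float64) * (1 << frac_bits))
    return np.clip(q, -32768, 32767).astype(np.int16)

def mac_accumulate(psum16, ineu16, w16, frac_bits):
    """One fixed-point MAC, rounded back to the 16-bit psum scale.
    (The hardware keeps 24-bit products; 64-bit ints are used here for simplicity.)"""
    prod = np.int64(ineu16) * np.int64(w16)             # full-precision product
    realigned = np.round(prod / float(1 << frac_bits))  # realign to the weight's decimal point
    return np.int16(np.clip(int(psum16) + realigned, -32768, 32767))

# Example: quantize hypothetical trained weights with 12 fractional bits,
# then accumulate one product into a psum.
w16 = to_fixed16([0.071, -0.283], frac_bits=12)
psum = np.int16(0)
psum = mac_accumulate(psum, np.int16(3), w16[0], frac_bits=12)
```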
We verified the performance of ketrion by benchmarking the networks described in Table 3. The SCNN achieved 99.5 % and 89.3 % accuracy on the MNIST and Cifar-10 datasets, respectively, on the PyTorch platform. We achieved similar accuracy on the fabricated SoC chip by finding an optimal decimal point for the synapses. The overall SoC execution process is performed as follows:
1) The target SCNN model is trained and tested on the PyTorch-based software platform.
2) The SoC configuration parameters (address offsets for ineu, synapse and oneu, and hardware-specific parameters) are generated and written to an SD card.
3) The SoC communicates with the host computer over UART while the SD card is mounted on the verification board.
4) The SoC receives the input image over USB from the host computer.
5) The SCNN is run using ketrion in the SoC and the results are verified.
The throughput decreases linearly as the length of the time-steps (T) increases because the SCNN operations are repeated over the time-steps. In our work, T is set to 5, as stated in Table 3. From the simulation analysis, ketrion achieved 9.6 fps and 0.04 fps for the MNIST and Cifar-10 datasets, respectively. However, from the SoC simulation, the SoC achieved 5 fps and 0.01 fps for MNIST and Cifar-10, respectively; thus, the SoC has lower throughput than ketrion alone. This is because the SDRAM bandwidth in the SoC was too low to keep up with the PE processing, which means that case-B in Fig. 8 happens frequently. Hence, the SDRAM bandwidth and the length of the time-steps are critical factors for high-throughput SCNN operation.
Table 3. SCNN parameters for verifying various datasets
Layer | (Hi, Wi) | (Hf, Wf) | N | M
MNIST (T=5)
CV0 | (28, 28) | (5, 5) | 1 | 15
PL0 | (24, 24) | (2, 2) | 15 | 15
CV1 | (12, 12) | (5, 5) | 15 | 40
PL1 | (8, 8) | (2, 2) | 40 | 40
FC0 | (4, 4) | (1, 1) | 40 | 300
FC1 | (1, 1) | (1, 1) | 300 | 10
Cifar-10 (T=5)
CV0 | (32, 32) | (3, 3) | 3 | 96
CV1 | (32, 32) | (3, 3) | 96 | 256
PL0 | (32, 32) | (2, 2) | 256 | 256
CV2 | (16, 16) | (3, 3) | 256 | 384
PL1 | (16, 16) | (2, 2) | 384 | 384
CV3 | (8, 8) | (3, 3) | 384 | 384
CV4 | (8, 8) | (3, 3) | 384 | 256
FC0 | (8, 8) | (1, 1) | 256 | 1024
FC1 | (1, 1) | (1, 1) | 1024 | 1024
FC2 | (1, 1) | (1, 1) | 1024 | 10
†CV=convolutional layer, PL=pooling layer, FC=fully connected layer
A comparison with other SNN processors is shown in Table 4. As discussed before, an SNN processor can be implemented with one of two architecture types, event-driven or frame-driven, so ketrion has been compared with both types. Note that since event-driven architectures implement all neurons and synapses in the processor, DRAM is not used, at the expense of a large amount of on-chip buffer. On the other hand, frame-driven architectures use DRAM for storing all SCNN parameters. In addition, the synaptic operation (SOP) in event-driven architectures and the MAC operation in frame-driven architectures are not identical. However, we compare the two architecture types together because, firstly, the throughput can be compared using the SOP/s and MAC/s metrics, which primarily reflect the number of parallel computing units per cycle, and secondly, the comparison is meaningful in that both architecture types support SCNNs.
Table 4. Comparison with the other SNN processors
 | This work | [19] | [12] | [15] | [16]
Architecture | Frame-driven | Event-driven | Event-driven | Frame-driven | Frame-driven
Process (nm) | 55 | 65 | 14 | 28 (Sim.) | 32 (Sim.)
Die / IP area (mm2) | 16 / 4.35 | 107.22 / - | 60 / - | - / 2.09 | - / -
Bit precision | 16 bit (ineu), 16 bit (synapse), 16 bit (memb) | 1 bit (ineu), 11 bit (synapse), 19 bit (memb) | 1 bit (ineu), 1 bit (synapse) | 1 bit (ineu), 8 bit (synapse), 8 bit (memb) | 1 bit (ineu), 8 bit (synapse), 8 bit (memb)
Frequency | 160 MHz | 192 MHz | - | 200 MHz | -
Number of PEs | 240 | 64 K neurons | 128 K neurons | 128 | 128
Global buffer size | 48 KB (ineu+synapse), 80 KB (memb) | 9.625 MB (ineu+synapse+memb+states) | 16 MB (synapse) | 576 KB (synapse), 9 KB (ineu) | 54 KB (ineu+synapse+memb)
Energy efficiency | 209.83 GMAC/s/W | 59.05 GSOP/s/W | 42.37 GSOP/s/W | 157.64 GMAC/s/W | - (normalized results only)
†Sim.=Simulation results. ‡SOP=Synaptic operations
As shown in the comparison table, ketrion achieves better energy efficiency than the event-driven architectures [19] and [12]. Since an SOP-related throughput metric is not reported in [12], we refer to the energy efficiency of [12] reported in [23]. Compared with the frame-driven processors [15,16], ketrion includes a comparable or larger number of PEs and runs at 160 MHz, which means that comparable throughput is achievable if the same PE utilization as ketrion is satisfied. Also, as ketrion achieves 209.83 GMAC/s/W at 183 mW power consumption, ketrion is more energy-efficient than [15]. As discussed in the previous section, the gbuf size for memb is essential for reducing DRAM accesses. Since the processor in [16] computes multiple time-steps together, the memb of multiple time-steps must be stored in gbuf, which creates more groups and increases the DRAM accesses for synapses. Since the processor in [15] stores memb locally in its 128 PEs without using gbuf for memb, it needs to re-read all synapses whenever a PE operation ends, which causes a large number of DRAM accesses. On the other hand, ketrion stores only a single time-step of memb in an 80 KB buffer. Hence, less DRAM traffic is required in ketrion, which makes ketrion energy-efficient.
V. CONCLUSIONS
In this paper, we proposed a dataflow that reduces the energy consumption caused by the membrane potential calculation in spiking neurons, together with a NoC-based SCNN processor that adopts the row stationary dataflow to accelerate SCNN operations. The proposed processor supports not only SCNN operations but also CNN operations, which makes it usable for various applications. In order to verify the processor, an SoC architecture was implemented. After the SoC was fabricated in a 55 nm CMOS process, the chip was tested on the MNIST and Cifar-10 datasets to verify that it can be used for real-life applications.
ACKNOWLEDGMENTS
This work was supported by a Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korea government (No. 20009972).
References
[1] G. Bi and M. Poo, “Synaptic Modifications in Cultured Hippocampal Neurons: Dependence on Spike Timing, Synaptic Strength, and Postsynaptic Cell Type,” J. Neurosci., vol. 18, no. 24, pp. 10464-10472, Dec. 1998.
[2] N. Brunel, “Dynamics of Sparsely Connected Networks of Excitatory and Inhibitory Spiking Neurons,” J. Comput. Neurosci., vol. 8, no. 3, pp. 183-208, May 2000.
[3] R. Brette et al., “Simulation of networks of spiking neurons: a review of tools and strategies,” J. Comput. Neurosci., vol. 23, no. 3, pp. 349-398, Dec. 2007.
[4] S. Song et al., “Competitive Hebbian learning through spike-timing-dependent synaptic plasticity,” Nat. Neurosci., vol. 3, no. 9, Sep. 2000.
[5] M. Mikaitis et al., “Neuromodulated Synaptic Plasticity on the SpiNNaker Neuromorphic System,” Front. Neurosci., vol. 12, 2018.
[6] B. Rueckauer et al., “Conversion of Continuous-Valued Deep Networks to Efficient Event-Driven Networks for Image Classification,” Front. Neurosci., vol. 11, 2017.
[7] S. Kim et al., “Spiking-YOLO: Spiking Neural Network for Energy-Efficient Object Detection,” Proc. AAAI Conf. Artif. Intell., vol. 34, no. 07, Apr. 2020.
[8] J. H. Lee et al., “Training Deep Spiking Neural Networks Using Back-propagation,” Front. Neurosci., vol. 10, 2016.
[9] Y. Wu et al., “Direct training for spiking neural networks: faster, larger, better,” in Proc. Thirty-Third AAAI Conference on Artificial Intelligence, pp. 1311-1318, Jan. 2019.
[10] W. Zhang and P. Li, “Temporal spike sequence learning via backpropagation for deep spiking neural networks,” in Proc. 34th International Conference on Neural Information Processing Systems, pp. 12022-12033, Dec. 2020.
[11] F. Akopyan et al., “TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 34, no. 10, pp. 1537-1557, Oct. 2015.
[12] M. Davies et al., “Loihi: A Neuromorphic Manycore Processor with On-Chip Learning,” IEEE Micro, vol. 38, no. 1, pp. 82-99, Jan. 2018.
[13] G. K. Chen et al., “A 4096-Neuron 1M-Synapse 3.8-pJ/SOP Spiking Neural Network With On-Chip STDP Learning and Sparse Weights in 10-nm FinFET CMOS,” IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 992-1002, Apr. 2019.
[14] S. K. Esser et al., “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proc. Natl. Acad. Sci., vol. 113, no. 41, pp. 11441-11446, Oct. 2016.
[15] S. Narayanan et al., “SpinalFlow: An Architecture and Dataflow Tailored for Spiking Neural Networks,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 349-362, May 2020.
[16] J.-J. Lee et al., “Parallel Time Batching: Systolic-Array Acceleration of Sparse Spiking Neural Computation,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 317-330, Apr. 2022.
[17] Y.-H. Chen et al., “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 367-379, Jun. 2016.
[18] A. Paszke et al., “PyTorch: an imperative style, high-performance deep learning library,” in Proc. 33rd International Conference on Neural Information Processing Systems, pp. 8026-8037, 2019.
[19] Y. Kuang et al., “A 64K-Neuron 64M-1b-Synapse 2.64pJ/SOP Neuromorphic Chip With All Memory on Chip for Spike-Based Models in 65nm CMOS,” IEEE Trans. Circuits Syst. II Express Briefs, vol. 68, no. 7, pp. 2655-2659, Jul. 2021.
[20] J. Tandon et al., “The OpenRISC processor: open hardware and Linux,” Linux J., vol. 2011, no. 6, Dec. 2011.
[21] E. M. Izhikevich, “Simple Model of Spiking Neurons,” IEEE Trans. Neural Netw., vol. 14, no. 6, pp. 1569-1572, Nov. 2003.
[22] A. L. Hodgkin and A. F. Huxley, “A quantitative description of membrane current and its application to conduction and excitation in nerve,” J. Physiol., vol. 117, pp. 500-544, Aug. 1952.
[23] A. Basu, L. Deng, C. Frenkel and X. Zhang, “Spiking Neural Network Integrated Circuits: A Review of Trends and Future Directions,” in 2022 IEEE Custom Integrated Circuits Conference (CICC), pp. 1-8, 2022.
Hee-Tak Kim received the B.S. degree in electronics engineering from Kyunghee University, Yongin, Korea, in 2017, and the M.S. degree in electronics engineering from Korea University in 2019, where he is currently pursuing the Ph.D. degree. He has been a senior researcher with the Korea Electronics Technology Institute (KETI), Gyeonggi-do, Korea, since 2019.
Yun-Pyo Hong received the B.S. and M.S. degrees in Electronics Engineering from Yonsei University, Seoul, Korea, in 2012 and 2014, respectively. He was with Samsung Electronics, where he was a Staff Engineer in the Display Research Group of the Mobile Division. In 2020, he joined the Korea Electronics Technology Institute (KETI), Korea, where he is presently a senior researcher.
Seok-Hun Jeon received the B.S. and M.S. degrees in Electronics Engineering from Soongsil University, Seoul, Korea, in 2010 and 2012, respectively. He was with Seagate, where he was a Senior Engineer for the R/W Channel at the Korea Design Center. In 2017, he joined the Korea Electronics Technology Institute (KETI), Korea, where he is presently a senior researcher.
Tae-Ho Hwang received the B.S., M.S. and Ph.D. degrees in Computer Engineering from the Hankuk University of Foreign Studies in 1998, 2000, and 2013, respectively. Since
2000, he has worked as a system software researcher at the Korea Electronics Technology
Institute (KETI), Gyeonggi-do, Korea. He is currently the vice president of KETI's
Semiconductor Display R&D Division. His research interests are AI system, Neuromorphic
chip, Processor In-Memory and system architecture design.
Byung-Soo Kim was born in Seoul, Korea, in 1984. He received the B.S. and M.S. degrees from the School of Information and Communication Engineering at Inha University, Korea, in 2006 and 2008, respectively. In 2013, he received the Ph.D. degree from the School of Information and Communication Engineering at Inha University, Korea. Since 2013, he has been with
the Korea Electronics Technology Institute (KETI), Gyeonggi-do, Republic of Korea.
He is currently the Director of the SoC Platform Research Center. His research interests
are in the areas of VLSI and SoC design, Neuromorphic chip design, Processor In-Memory,
and Quantum computing systems.