
  1. (Korea Electronics Technology Institute, Seongnam-si, Korea)



Spiking neural network, network on chip, spiking convolution, accelerator, data reuse

I. INTRODUCTION

Spiking neural networks (SNNs), which attempt to emulate the mammalian cortex through biologically plausible spiking neuron models, are gaining considerable interest as the third generation of neural networks, following artificial neural networks (ANNs). Spiking neurons transmit and receive information as series of spikes via synapses in a brain-inspired spatio-temporal domain. Benefiting from the characteristics of spikes, namely sparse activity and binary representation, a number of training algorithms and hardware architectures have been proposed.

In terms of training algorithms, SNNs have been developed in two ways. First, motivated by experimental observations in biological neurons, SNNs were trained by biologically plausible training algorithms such as spike-timing dependent plasticity (STDP) [1]. In order to improve the accuracy of STDP in real-world applications such as classification tasks, various types of spiking neuron models [2,3] or mathematically formulated STDP mechanisms [4,5] have been applied. Despite noticeable improvements, biologically plausible training algorithms still suffer from low accuracy in real-life applications. In order to achieve accuracy comparable to ANNs, back-propagation (BP) has been applied to SNNs through two training methods: 1) ANN-to-SNN conversion [6,7] and 2) BP-on-SNN [8-10]. In the ANN-to-SNN conversion method, the parameters of a pre-trained ANN are normalized so that they fit SNN architectures, which include the spike generating function and the membrane potential calculation. In the BP-on-SNN method, BP is applied directly to SNN architectures. To relieve the non-differentiability of the discrete spike activity, which impedes gradient-descent-based BP on SNNs, additional layers such as a postsynaptic current (PSC) layer [10] or an auxiliary layer [9] are added to the SNN. Both methods have shown accuracy comparable to ANNs, which brings about a demand for an efficient hardware accelerator for SNNs.

In order to accelerate SNNs efficiently, two main processor architectures have been used: event-driven architectures and frame-driven architectures. Fig. 1 illustrates the two architectures. First, exploiting the binary spike representation and sparse activity of spiking neurons, address event representation (AER) based event-driven architectures have been proposed to reduce energy consumption. IBM TrueNorth [11], Intel Loihi [12], and [13,19] proposed 2D mesh network architectures that deploy multiple cores with routers, where every core includes hundreds of spiking neurons. In the mesh network, each core generates spikes in every time-step, and the generated spikes are transferred as AER packets through the routers to the cores containing the corresponding post-synaptic neurons. Thus, AER-based event-driven architectures achieve low-power operation by leveraging the binary and sparse characteristics of spike activity. Event-driven architectures also enable emulation of SNNs in a more biologically realistic environment as well as SNN acceleration [14]. However, event-driven architectures require a large chip size because the entire set of SNN parameters is kept on-chip. In addition, they suffer from low throughput and high latency due to the small number of parallel processing units in the processor. In order to improve throughput with a small chip size, data-reuse featured frame-driven architectures have been proposed. Similar to ANN accelerators, frame-driven architectures store all SNN parameters in the external memory and iteratively process a small part of the SNN operations. At the expense of using the external memory, a larger number of parallel processing units can be implemented, which improves the throughput and latency of SNN acceleration. In [15,16], systolic-array and spine-based architectures are used to accelerate SNNs, and multiple time-steps are processed simultaneously in the processing elements (PE) in order to reuse partial sums (psum).

Fig. 1. Two types of SNN processor architecture.
../../Resources/ieie/JSTS.2024.24.2.84/fig1.png

In this work, we have designed ketrion, which accelerates SNN operations and is implemented in a 55 nm CMOS process. The main contributions of ketrion are summarized as follows:

1. The energy consumption overhead from the membrane potential calculation in the spiking neuron is minimized by a novel reuse dataflow. In addition, the energy-efficient row-stationary dataflow is adopted.

2. A high-bandwidth network-on-chip bus and a pipelined architecture provide high-throughput SNN operation.

3. Both CNN and SNN modes are supported, and a pooling core and activation function cores are implemented to realize hardware that supports end-to-end applications.

4. ketrion is integrated into a system-on-chip (SoC) architecture to evaluate its performance with a widely used software platform.

II. BACKGROUND

1. Spiking Neuron Model

SNNs consist of spiking neurons and synapses, where the spiking neurons are connected through the synapses. To formulate the biological neuron as a spiking neuron, a number of neuron and synapse models exist, e.g., the Izhikevich neuron [21] and the conductance synapse model [22]. The most popular and simple models are the leaky integrate-and-fire (LIF) neuron and the linear synapse. LIF literally indicates that the membrane potential (memb) retained in the neuron integrates the inputs received as a weighted sum over the connected synapses, leaks as time passes, and the neuron fires a spike when memb exceeds its threshold value. The equations of a digitized single LIF neuron with linear synapses are shown below:

(1)
$ U[t]=\left(1-\frac{1}{\tau}\right)U[t-1]+\sum_{i}\boldsymbol{X}_{i}[t]\,\boldsymbol{W}_{i} $
(2)
$ S[t]=\begin{cases}1, & \text{if } U[t]>V_{th}\\ 0, & \text{otherwise}\end{cases} $
(3)
$ U[t]=\begin{cases}V_{reset}, & \text{if } S[t]=1\\ U[t], & \text{otherwise}\end{cases} $

where t is a digitized time-step in the time domain, Vth is the threshold voltage, Vreset is the reset voltage after a spike generation, ${\tau}$ is the leakage factor, S[t] is the spike generated at the t-th time-step, and U[t] is the memb of the neuron at the t-th time-step. W and X are the synaptic weights and the input spikes connected to the neuron, respectively. An illustration of the LIF neuron model is shown in Fig. 2. Here, the neuron that produces an incoming spike is called a pre-synaptic neuron (ineu), and the neuron that generates an output spike is called a post-synaptic neuron (oneu). Each ineu has an Xi value and each oneu has an Si value. When one ineu is connected to several oneus, a spike (S) generated by the ineu is propagated to the connected oneus as Xi. The LIF operation is performed repeatedly over the full length of time-steps, and memb is updated at every time-step. Therefore, the length of time-steps and memb should be carefully considered in an SCNN processor to improve energy efficiency.
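As a concrete illustration, the following Python sketch (a minimal software model, not part of the hardware) implements Eqs. (1)-(3) for a single digitized LIF neuron; the parameter values and the random inputs are placeholders chosen only for the example.

    import numpy as np

    def lif_step(u_prev, x, w, tau=2.0, v_th=1.0, v_reset=0.0):
        # Eq. (1): leak the previous memb and integrate the weighted input spikes
        u = (1.0 - 1.0 / tau) * u_prev + np.dot(x, w)
        # Eq. (2): fire a spike when memb exceeds the threshold
        s = 1 if u > v_th else 0
        # Eq. (3): reset memb after a spike, otherwise keep it
        if s == 1:
            u = v_reset
        return u, s

    # Example: one LIF neuron with 4 input synapses over 10 time-steps
    rng = np.random.default_rng(0)
    w = rng.normal(0.0, 0.5, size=4)        # synaptic weights W_i
    u, spikes = 0.0, []
    for t in range(10):
        x = rng.integers(0, 2, size=4)      # binary input spikes X_i[t]
        u, s = lif_step(u, x, w)
        spikes.append(s)
    print(spikes)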

Fig. 2. An illustration of the LIF spiking neuron model at the tx-th time-step.
../../Resources/ieie/JSTS.2024.24.2.84/fig2.png

2. Spiking Convolutional Neural Network (SCNN)

When the spiking neuron, which emits spikes in the time domain, is used as the activation function in a CNN, we call the network a spiking CNN (SCNN). SCNNs are composed of the layers of CNNs, such as the pooling layer, the convolutional layer, and the fully connected layer, where every neuron in the layers is a spiking neuron operating in the time domain. In SCNNs, the convolution stage takes 4-D arrays of ineu (X) and synapse (W) as inputs and produces a 4-D array of oneu (S) as output. The 4-D ineu consists of Hi x Wi sized 2-D feature maps with N channels and T time-steps. The 4-D oneu consists of Ho x Wo sized 2-D feature maps with M channels and T time-steps. The 4-D synapse consists of Hf x Wf sized 2-D filters with N input channels and M output channels. memb (U) has the same shape as oneu. All the parameters are listed in Table 1. Thus, in an SCNN, Eq. (1) of the spiking neuron model is modified as below:

(4)
$ \boldsymbol{U}_{m}[t]=\left(1-\frac{1}{\tau}\right)\boldsymbol{U}_{m}[t-1]+\sum_{n=0}^{N-1}\boldsymbol{X}_{n}[t]\otimes \boldsymbol{W}_{n,m} $

where $\otimes$ denotes the 2-D convolution operation, Xn[t] is the 2-D ineu of the n-th channel at the t-th time-step, Wn,m is the 2-D synapse of the n-th input channel and m-th output channel, and Um[t] is the 2-D memb of the m-th channel at the t-th time-step. After Um[t] is calculated by (4), (2) and (3) are applied to generate Sm[t], a 2-D spike map with the same shape as Um[t]. An illustration of the SCNN operation is shown in Fig. 3. From the SCNN equation, we find that the SCNN operation reduces to the conventional CNN operation by setting certain parameters. When ${\tau}$ is set to one to eliminate the memb carry-over, T is set to one to remove time-domain processing, and the spike generating function (2) is replaced by a CNN activation function such as the rectified linear unit (ReLU), only the convolution operation in (4) remains. In this way, the SCNN operation can be described as the conventional CNN operation. Based on this algorithmic similarity, in this work we propose a hardware architecture that supports both SCNNs and CNNs by sharing identical computational operations.
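As a minimal sketch of Eq. (4), the following PyTorch code performs the spiking convolution over T time-steps and then shows the CNN-mode reduction described above; the tensor shapes and parameter values are illustrative assumptions, not the hardware dataflow itself.

    import torch
    import torch.nn.functional as F

    def scnn_conv(x_seq, weight, tau=2.0, v_th=1.0, v_reset=0.0):
        # x_seq: ineu of shape (T, N, Hi, Wi); weight: synapse of shape (M, N, Hf, Wf)
        T = x_seq.shape[0]
        u = None
        s_seq = []
        for t in range(T):
            # Eq. (4): 2-D convolution summed over input channels, plus leaky memb
            psum = F.conv2d(x_seq[t].unsqueeze(0), weight).squeeze(0)
            u = psum if u is None else (1.0 - 1.0 / tau) * u + psum
            s = (u > v_th).float()                                     # Eq. (2)
            u = torch.where(s.bool(), torch.full_like(u, v_reset), u)  # Eq. (3)
            s_seq.append(s)
        return torch.stack(s_seq)   # oneu spikes S of shape (T, M, Ho, Wo)

    # SCNN mode: T = 5 time-steps of binary ineu
    x = torch.randint(0, 2, (5, 3, 32, 32)).float()
    w = torch.randn(16, 3, 3, 3)
    s = scnn_conv(x, w)

    # CNN mode: tau = 1 removes the memb carry-over, T = 1 removes the time
    # domain, and ReLU replaces the spike generating function of Eq. (2)
    y = F.relu(F.conv2d(x[:1], w))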

Fig. 3. An illustration of the SCNN operation at the tx-th time-step. The notation ‘*’ denotes the convolution operation.
../../Resources/ieie/JSTS.2024.24.2.84/fig3.png
Table 1. Parameters for SCNN operation

Parameter | Description
Wi / Hi | Width / Height of pre-synaptic neuron
Wo / Ho | Width / Height of post-synaptic neuron
Wf / Hf | Width / Height of synapse shape
N / M | Channel size of pre-/post-synaptic neuron
T | Length of digitized time-steps

In this work, we aimed to design an SCNN processor that can be used for a wide variety of SCNN architectures. From our observation of the two SCNN training methods, ANN-to-SNN and BP-on-SNN, we found that a multi-bit representation is essential for the ineu (X) for the following reasons. 1) In the BP-on-SNN training method [8,9], the added PSC layer or auxiliary layer transforms the 1-bit spike into a multi-bit ineu. In the PSC layer, the spike is transformed into a multi-bit smoothed activation following inter/intra-neuron dependencies. In the auxiliary layer, the moving average of the spike activity is added to the spike inputs, which results in a multi-bit ineu. The PSC layer and the auxiliary layer are illustrated in Fig. 4. 2) In the ANN-to-SNN training method, the first convolution layer always has multi-bit pixel inputs, not 1-bit spike inputs. 3) A multi-bit representation of the spike occurs after the pooling layer because spikes are added together during the max-pooling operation. For these reasons, we use 16-bit fixed-point precision for the ineu in this work.

Fig. 4. An occurrence of multi-bit activations in SCNNs: (a) Activation after the PSC layer; (b) Activation after the summation of oneus with the auxiliary layer output.
../../Resources/ieie/JSTS.2024.24.2.84/fig4.png

3. Hardware Dataflow for SCNNs

Since data movement to and from the external memory, mostly DRAM, significantly increases energy consumption, reducing DRAM accesses is important for an SCNN processor. In a data-reuse featured frame-driven hardware architecture, the entire set of SCNN parameters, such as ineu, synapse, and oneu, is stored in DRAM, and a small part of the parameters is read and computed in the processor at a time. In order to store and reuse this small part of the parameters in the processor, an SRAM-based global buffer (gbuf) is implemented. Here, the number of DRAM accesses is determined by the size of the parameters stored in gbuf and by the hardware dataflow, which should reuse the parameters even with a small gbuf. Unlike CNNs, in SCNNs, DRAM accesses for memb, which is updated at every time-step, must additionally be considered. Reducing DRAM accesses for memb is realized by, firstly, adopting a memb-stationary (MS) dataflow that keeps a certain amount of memb data in gbuf without transferring it to DRAM while the operations across all time-steps are being processed, and secondly, increasing the size of gbuf allocated to memb. Without the MS dataflow, DRAM accesses as large as the entire memb are required at every time-step, which incurs huge energy consumption. Thus, the MS dataflow is necessary for SCNN processors. With the MS dataflow, when gbuf can store only a small memb, the entire memb must be divided into multiple small groups and processed group by group. Since computing each group requires DRAM accesses for the entire set of synapses, a smaller gbuf for memb results in more groups and more DRAM accesses. Fig. 5 shows the increasing number of groups according to the size of memb when the MS dataflow is adopted. In the figure, the number of groups is calculated by applying the ceiling function to the size of memb divided by the gbuf size.
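The relation plotted in Fig. 5 can be expressed directly; the small Python example below uses an illustrative layer shape (not one of the benchmark layers) together with ketrion's 80 KB memb buffer.

    import math

    def num_groups(memb_bytes, gbuf_memb_bytes):
        # Under the MS dataflow, memb is split into groups that each fit in gbuf,
        # and every group requires a separate DRAM pass over all synapses.
        return math.ceil(memb_bytes / gbuf_memb_bytes)

    # Illustrative layer: 256 channels of 32x32 memb values at 16-bit precision
    memb_bytes = 256 * 32 * 32 * 2
    print(num_groups(memb_bytes, 80 * 1024))   # -> 7 groups, i.e., 7 synapse passes over DRAM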

Fig. 5. The number of groups increases with the size of memb. Lines with different colors represent the gbuf size for storing memb.
../../Resources/ieie/JSTS.2024.24.2.84/fig5.png

Various SCNN processors adopting data-reuse dataflows have been proposed. SpinalFlow [15] was proposed, where each PE is responsible for producing one oneu value. Therefore, SpinalFlow needs to iteratively read synapses with different input channels from DRAM whenever a PE computation ends. Also, a large gbuf capacity is required for storing synapses over a number of input channels. In addition, SpinalFlow targets temporal-coding based SCNNs, which deviates from general SCNN architectures. Next, [16] proposed a systolic-array based SCNN accelerator where ineus of multiple time-steps and synapses of multiple output channels are processed in parallel. However, in this dataflow, since the membs of multiple time-steps are computed simultaneously, a gbuf capacity several times larger is required for storing memb than in a dataflow that computes memb for only a single time-step. Also, PE utilization degrades when the length of time-steps is shorter than the systolic array size.

In this work, we propose a hardware dataflow that combines the MS dataflow with the row-stationary (RS) dataflow [17]. In the RS dataflow, as shown in Fig. 6, ineu and synapse are shared in the diagonal and row directions of the PE array, respectively, and partial sums (psum) are accumulated in the same column of the PE array. Also, several rows of the PE array compute one channel of oneu. Therefore, in the RS dataflow, the 3-D convolution is performed in the PE array by sharing ineus and synapses across multiple PEs. The psums computed by the PE array are accumulated in gbuf, as far as its capacity allows, until the psum accumulation is completed. In order to combine the MS dataflow with the RS dataflow, after the membs of the current time-step are completely computed and stored in gbuf, the membs for the next time-step are computed using the membs stored in gbuf. In the proposed dataflow, thanks to the RS dataflow, a large amount of memb can be stored in gbuf, and thanks to the MS dataflow, memb data are not transferred to DRAM during SCNN operations. A detailed description of the proposed dataflow is given in Section III.2.

Fig. 6. Row stationary dataflow.
../../Resources/ieie/JSTS.2024.24.2.84/fig6.png

III. PROPOSED HARDWARE ARCHITECTURE

In this section, we present an NoC-based architecture, called ketrion, which efficiently processes SCNNs. The proposed architecture is designed to achieve the following features: 1) ketrion supports the operations of not only SCNNs but also CNNs by sharing common computational logic. 2) To remove memb data movement to and from the external memory, a novel dataflow is proposed. With this dataflow, all oneus are computed and stored in gbuf without moving psums to the external memory. In each time-step, new spikes and membs are computed using the oneus in gbuf, and the oneus in gbuf are replaced by the newly updated membs. The updated membs are used as the initial memb for the next time-step. Thus, membs are reused across all time-steps. 3) High throughput is achieved by deploying 240 (15x16) PEs and an NoC that provides high bandwidth (256 bits/cycle) between the 128 KB gbuf and the PEs. 4) Various types of activation functions and spiking neuron models are supported. Moreover, since the activation function is computed while the finally calculated oneus are stored in gbuf, other activation functions can easily be added. In this work, a leaky rectified linear unit (leaky ReLU) and a leaky integrate-and-fire (LIF) spiking neuron model are implemented. In the CNN operation mode, the memb data are used as the oneu of the layer.

1. Architecture Overview

Fig. 7 shows the overall architecture of ketrion, which is interfaced through advanced microcontroller bus architecture (AMBA) protocols, namely an advanced peripheral bus (APB) slave and an advanced high-performance bus (AHB) master. ketrion includes the PE array, global controller, APB and AHB interfaces, 128 KB gbuf, psum accumulator, activation function core, and pooling core. All modules in ketrion are controlled by finite state machines in the global controller. 80 KB of gbuf is used for memb, and 48 KB of gbuf is used as a dual buffer for ineu and synapse. The PE array consists of 15 PEs in the vertical direction and 16 PEs in the horizontal direction. The parameters for SCNN operation are configured through the APB interface, and the external memory is accessed through the AHB interface. Here, a dual buffer is used for the read path of the AHB interface to alleviate the throughput bottleneck between the PEs and the external memory. When the requested ineu and synapse data are stored in gbuf, psums are computed through the PE array and the psum accumulator. The leakage and reset computations for memb are performed in the psum accumulator. Each PE includes two 256-bit buffers for ineu and synapse. To support CNNs as well as the first layer and the pooling layer of SCNNs, ketrion uses 16-bit fixed-point MAC operations in the PEs. When the final oneus are obtained in gbuf, the spiking neuron model or the leaky ReLU function is optionally applied to them. To compute memb within ketrion, the activation function core is allowed to write the newly updated membs to gbuf.

Fig. 7. Proposed hardware architecture (ketrion).
../../Resources/ieie/JSTS.2024.24.2.84/fig7.png

2. Proposed Dataflow

ketrion follows the nested loop iteration stated in Algorithm 1. Following the RS dataflow, once ineus and synapses are read from the external memory, multiple channels (MPE) of oneu are computed in one pass of the PE array by reusing ineus. Thus, the outermost iteration becomes Miter. Since the size of gbuf for storing ineu and synapse is limited, the iterations over Wi and Hi are divided into Witer and Hiter, respectively, corresponding to the gbuf tile size (Wgbuf and Hgbuf), when the ineu is too large to be stored in gbuf. The iteration over the time domain is T because the time domain is not processed in parallel. In order to reuse membs across all time-steps, psums are accumulated in gbuf until the final oneus are obtained through all time-steps. Therefore, the time-domain iteration advances only after all channel iterations of ineu end. When the processing of every time-step ends, either the activation function or the spiking neuron model is applied to the final oneu. Thanks to the proposed dataflow, memb data are entirely reused within ketrion without being transferred to the external memory, as sketched after Algorithm 1.

../../Resources/ieie/JSTS.2024.24.2.84/al1.png
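The loop nest of Algorithm 1, as described above, can be summarized by the following Python-style sketch; the helper names (dram_read_*, pe_array_conv, and so on) are placeholders for the corresponding hardware operations, and the tiling details are simplified.

    def ketrion_loop_nest(Miter, Hiter, Witer, T, Niter):
        for m in range(Miter):              # outermost: output-channel tiles (RS reuse of ineu)
            for h in range(Hiter):          # spatial tiles sized to fit Wgbuf x Hgbuf
                for w in range(Witer):
                    memb = gbuf_alloc()     # memb tile stays in gbuf for all T (MS dataflow)
                    for t in range(T):      # time domain is processed sequentially
                        for n in range(Niter):               # input-channel tiles of ineu
                            x = dram_read_ineu(t, n, h, w)
                            k = dram_read_synapse(n, m)
                            memb += pe_array_conv(x, k)      # RS dataflow inside the PE array
                        memb = neuron_update(memb)           # leak/fire/reset (or activation) per time-step
                    dram_write_oneu(m, h, w)                 # results leave the chip only after all T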

3. Network on Chip

To deliver ineu and synapse from gbuf to the PEs, ketrion follows the single-cycle multicast network proposed in [17]. Since the PE array is organized with an X-bus and Y-buses, input data are delivered as packets that contain an X-ID and a Y-ID. In ketrion, the NoC architecture is modified as follows to improve bus utilization and throughput. First, the bus bit-width is implemented as 256 bits, which is the same size as the ineu and synapse buffers in a PE. Therefore, the NoC bus fills one buffer of a PE in one cycle. Since synapses are reused for multiple ineus in the convolution operation, each PE performs MAC operations for more than 16 cycles per buffer fill. Furthermore, in order to reduce the bus usage for psum operations and to form a pipelined processing path from the PE array output (psum) to the psum accumulation, the psum data path is separated from the NoC bus as a psum accumulator, where psums are accumulated through read and write accesses to gbuf. In the psum accumulator, in order to process both the psum accumulation and the NoC transfers simultaneously, a dual register file, which behaves like a ping-pong memory, is implemented for storing psums.

Fig. 8. Timing diagram of ketrion.
../../Resources/ieie/JSTS.2024.24.2.84/fig8.png

The overall timing diagram is shown in Fig. 8. In the figure, case-A and case-B describe the cases in which the NoC processing time and the external memory access time are the slower one, respectively. Since a dual buffer is used for the read path of the AHB interface, the slower of the external memory access and the NoC processing dominantly determines the throughput. As shown in the figure, the PE array, the psum accumulator, and the gbuf operations for psum are pipelined, which results in high PE utilization. After the processing of all channels of ineu is completed, the final oneus are written to the external memory.

4. Fully Connected Layer

The fully connected layer has three types of data reuse: 1) an ineu is reused for the corresponding synapses, 2) a psum is reused for computing one corresponding oneu, and 3) a synapse is reused for ineus at different time-steps. However, the third type requires a large amount of memory to store synapses. Therefore, considering that the on-chip memory capacity of ketrion is limited, ketrion controls the fully connected layer using the first and second types. As presented in Fig. 9, ineu and psum are reused in the row and column directions of the PE array, respectively. Similar to the convolutional layer operation, psums are kept stationary in gbuf until the final oneu is computed. In the fully connected layer operation mode, the gbuf regions for storing ineu and synapse are switched because the synapse is larger than the ineu. However, since each PE receives different synapses in the fully connected layer mode, ketrion suffers from low PE utilization under the limited NoC bus bandwidth: only 16 out of 240 PEs perform MACs simultaneously because the 256-bit NoC bus activates one PE per cycle and the activated PE performs MACs for 16 cycles.
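The PE utilization figure quoted above follows from a simple steady-state count, sketched below under the stated assumptions (one 256-bit buffer fill per cycle, 16 MAC cycles per fill); this is a simplification, not a cycle-accurate model.

    total_pes = 240                 # 15 x 16 PE array
    fills_per_cycle = 1             # the 256-bit NoC bus fills one PE buffer per cycle
    mac_cycles_per_fill = 16        # a filled PE then performs MACs for 16 cycles

    # In steady state, one new PE is filled each cycle while previously filled
    # PEs are still working, so about 16 PEs are active at any moment.
    active_pes = fills_per_cycle * mac_cycles_per_fill
    print(f"{active_pes} of {total_pes} PEs active ({active_pes / total_pes:.1%})")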

Fig. 9. Dataflow for fully connected layer.
../../Resources/ieie/JSTS.2024.24.2.84/fig9.png

IV. IMPLEMENTATION RESULTS AND VERIFICATION

1. System on Chip Architecture and Implementation Results

To verify the ketrion processor, we implemented a system-on-chip (SoC) architecture that communicates with a host computer through USB and UART interfaces and controls ketrion through an OpenRISC core [20], as shown in Fig. 10. For the external memory, a 64-bit 512 MB SDRAM was used. The SoC architecture was synthesized in a 55 nm CMOS process and fabricated in a 4x4 mm2 chip. As shown in Table 2, ketrion occupies 4,458,997 um2 and achieves a throughput of 38.4 GMAC/s at the maximum clock frequency of 160 MHz. The critical path delay lies in the 16-bit multiplication in the PE. From our area breakdown analysis, the computation modules of ketrion occupy the area as follows: 15x16 PE array (53%), 128 KB SRAM (22%), accumulator (20%), and controller (5%).

Fig. 10. SoC architecture.
../../Resources/ieie/JSTS.2024.24.2.84/fig10.png
Table 2. Hardware specification

Components | Value
Number of PEs | 15x16 (=240)
Global buffer size | 128 KB SRAM
Bit precision | 16-bit fixed point
Throughput | 38.4 GMAC/s
Synthesis area | 4,458,997 um2 (ketrion only); 8,841,492 um2 (SoC)
Chip size | 4x4 mm2 (SoC implementation)
Maximum frequency | 160 MHz (Core 1.2 V applied)
Power | 398 mW (SoC, chip measured); 183 mW (ketrion, simulation result)
Energy efficiency | 209.83 GMAC/s/W
CMOS process | 55 nm

2. Evaluation Results

In order to evaluate the performance of ketrion, the fabricated SoC chip is mounted on a verification board, as shown in Fig. 11. The SoC peripheral interfaces are integrated with the PyTorch software platform [18] on the host computer. We used the TSSL-BP algorithm proposed in [10] to train the SCNN. Single-precision floating-point numbers were used to train the SCNN, and signed 16-bit integer precision was used for inference. The single-precision floating-point parameters of the trained network are quantized into 16-bit signed integers by adjusting the decimal point. Outliers in the floating-point values are clipped to the maximum and minimum values of the 16-bit integer range. In inference, the ineu has non-negative values due to the binary characteristics of spikes, and the synapse can have positive or negative values with a dynamic decimal point. The 24-bit MAC results from the PEs are rounded and accumulated into 16-bit psums in gbuf according to the decimal point of the weights.
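A minimal sketch of this quantization step is shown below; the choice of fractional bits and the example values are hypothetical, while the rounding and the clipping to the signed 16-bit range follow the description above.

    import numpy as np

    def quantize_int16(x_float, frac_bits):
        # Scale by the chosen decimal point, round, and clip outliers to the
        # signed 16-bit range, as described for the inference parameters.
        scaled = np.round(x_float * (1 << frac_bits))
        return np.clip(scaled, -32768, 32767).astype(np.int16)

    w = np.array([0.031, -1.25, 5.0])            # example trained weights
    print(quantize_int16(w, frac_bits=13))       # -> [   254 -10240  32767]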

Fig. 11. SoC verification PCB board. Die photo (left) and package photo (right) are shown in a white box.
../../Resources/ieie/JSTS.2024.24.2.84/fig11.png

We verified the performance of ketrion using the benchmark networks described in Table 3. The SCNNs achieved 99.5% and 89.3% accuracy for the MNIST and Cifar-10 datasets, respectively, on the PyTorch platform. We achieved similar accuracy on the fabricated SoC chip by finding an optimal decimal point for the synapses. The overall SoC execution process is performed as follows:

1) The target SCNN model is trained and tested on the PyTorch-based software platform.

2) The SoC configuration parameters (address offsets for ineu, synapse, and oneu, and hardware-specific parameters) are generated and written to an SD card.

3) The SoC communicates with the host computer over UART while the SD card is mounted on the verification board.

4) The SoC receives the input image from the host computer over USB.

5) The SCNN is run on ketrion in the SoC and the results are verified.

The throughput decreases linearly as the length of time-steps (T) increases because the SCNN operations are repeated over the time-steps. In our work, T is set to 5, as stated in Table 3. From the simulation analysis, ketrion achieved 9.6 fps and 0.04 fps for the MNIST and Cifar-10 datasets, respectively. From the SoC simulation, however, the SoC achieved 5 fps and 0.01 fps for the MNIST and Cifar-10 datasets, respectively. Thus, the SoC shows lower throughput than ketrion alone. This is because the SDRAM bandwidth in the SoC could not keep up with the PE processing, which means that case-B in Fig. 8 frequently occurs. Therefore, the SDRAM bandwidth and the length of time-steps are critical factors for high-throughput SCNN operation.

Table 3. SCNN parameters for verifying various datasets

Layer | (Hi, Wi) | (Hf, Wf) | N | M

MNIST (T=5)
CV0 | (28, 28) | (5, 5) | 1 | 15
PL0 | (24, 24) | (2, 2) | 15 | 15
CV1 | (12, 12) | (5, 5) | 15 | 40
PL1 | (8, 8) | (2, 2) | 40 | 40
FC0 | (4, 4) | (1, 1) | 40 | 300
FC1 | (1, 1) | (1, 1) | 300 | 10

Cifar-10 (T=5)
CV0 | (32, 32) | (3, 3) | 3 | 96
CV1 | (32, 32) | (3, 3) | 96 | 256
PL0 | (32, 32) | (2, 2) | 256 | 256
CV2 | (16, 16) | (3, 3) | 256 | 384
PL1 | (16, 16) | (2, 2) | 384 | 384
CV3 | (8, 8) | (3, 3) | 384 | 384
CV4 | (8, 8) | (3, 3) | 384 | 256
FC0 | (8, 8) | (1, 1) | 256 | 1024
FC1 | (1, 1) | (1, 1) | 1024 | 1024
FC2 | (1, 1) | (1, 1) | 1024 | 10

†CV=convolutional layer, PL=pooling layer, FC=fully connected layer

A comparison with other SNN processors is shown in Table 4. As discussed before, an SNN processor can be implemented using one of two architecture types, event-driven and frame-driven. Therefore, ketrion is compared with both types. Note that because event-driven architectures implement all neurons and synapses on the processor, DRAM is not used, at the expense of a large amount of on-chip buffer. On the other hand, frame-driven architectures use DRAM for storing all SCNN parameters. In addition, the synaptic operation (SOP) in event-driven architectures and the MAC operation in frame-driven architectures are not identical. Nevertheless, we compare the two architecture types together because, firstly, the throughput can be compared using the SOP/s and MAC/s metrics, which primarily reflect the number of parallel computing units active per cycle, and secondly, the comparison is meaningful in that both architecture types support SCNNs.

Table 4. Comparison with the other SNN processors

 | This work | [19] | [12] | [15] | [16]
Architecture | Frame-driven | Event-driven | Event-driven | Frame-driven | Frame-driven
Process (nm) | 55 | 65 | 14 | 28 (Sim.) | 32 (Sim.)
Die / IP area (mm2) | 16 / 4.35 | 107.22 / - | 60 / - | - / 2.09 | - / -
Bit precision | 16 bit (ineu), 16 bit (synapse), 16 bit (memb) | 1 bit (ineu), 11 bit (synapse), 19 bit (memb) | 1 bit (ineu), 1 bit (synapse) | 1 bit (ineu), 8 bit (synapse), 8 bit (memb) | 1 bit (ineu), 8 bit (synapse), 8 bit (memb)
Frequency | 160 MHz | 192 MHz | - | 200 MHz | -
Number of PEs | 240 | 64 K neurons | 128 K neurons | 128 | 128
Global buffer size | 48 KB (ineu+synapse), 80 KB (memb) | 9.625 MB (ineu+synapse+memb+states) | 16 MB (synapse) | 576 KB (synapse), 9 KB (ineu) | 54 KB (ineu+synapse+memb)
Energy efficiency | 209.83 GMAC/s/W | 59.05 GSOP/s/W | 42.37 GSOP/s/W | 157.64 GMAC/s/W | - (Normalized results only)

†Sim.=Simulation results. ‡SOP=Synaptic operations

As shown in the comparison table, ketrion achieves better energy efficiency than the event-driven architectures [19] and [12]. Since no SOP-related throughput metric is reported in [12], we refer to the energy efficiency of [12] reported in [23]. Compared with the frame-driven processors [15,16], ketrion includes at least as many PEs as the other works and operates at 160 MHz, which means that comparable throughput is achieved if the same PE utilization as ketrion is maintained. Also, since ketrion achieves 209.83 GMAC/s/W with 183 mW power consumption, ketrion is more energy-efficient than [15]. As discussed in the previous section, the gbuf size for memb is essential for reducing DRAM accesses. Since the processor in [16] computes multiple time-steps together, the membs of multiple time-steps must be stored in gbuf, which creates more groups and increases DRAM accesses for synapses. Meanwhile, since the processor in [15] stores memb locally in its 128 PEs without using gbuf for memb, it needs to read all synapses whenever a PE operation ends, which causes a large number of DRAM accesses. On the other hand, ketrion stores only a single time-step of memb in the 80 KB buffer. Hence, fewer DRAM accesses are required in ketrion, which makes ketrion energy-efficient.

V. CONCLUSIONS

In this paper, we propose a dataflow that reduces the energy consumption of the membrane potential calculation in spiking neurons and an NoC-based SCNN processor that adopts the row-stationary dataflow to accelerate SCNN operations. The proposed processor supports not only SCNN operations but also CNN operations, so it can be utilized for various applications. Also, in order to verify the processor, an SoC architecture was implemented. After the SoC was fabricated in a 55 nm CMOS process, the chip was tested with the MNIST and Cifar-10 datasets to verify that it can be used for real-life applications.

ACKNOWLEDGMENTS

This work was supported by a Korea Evaluation Institute of Industrial Technology (KEIT) grant funded by the Korea government (No. 20009972).

References

1 
G. Bi and M. Poo, “Synaptic Modifications in Cultured Hippocampal Neurons: Dependence on Spike Timing, Synaptic Strength, and Postsynaptic Cell Type,” J. Neurosci., vol. 18, no. 24, pp. 10464-10472, Dec. 1998.DOI
2 
N. Brunel, “Dynamics of Sparsely Connected Networks of Excitatory and Inhibitory Spiking Neurons,” J. Comput. Neurosci., vol. 8, no. 3, pp. 183-208, May 2000.DOI
3 
R. Brette et al., “Simulation of networks of spiking neurons: a review of tools and strategies,” J. Comput. Neurosci., vol. 23, no. 3, pp. 349-398, Dec. 2007.DOI
4 
S. Song, et al, “Competitive Hebbian learning through spike-timing-dependent synaptic plasticity,” Nat. Neurosci., vol. 3, no. 9, Art. no. 9, Sep. 2000.DOI
5 
M. Mikaitis, et al, “Neuromodulated Synaptic Plasticity on the SpiNNaker Neuromorphic System,” Front. Neurosci., vol. 12, 2018.DOI
6 
Bodo Rueckauer, et al, “Conversion of Continuous-Valued Deep Networks to Efficient Event-Driven Networks for Image Classification,” Front. Neurosci., vol. 11, 2017.DOI
7 
S. Kim, et al, “Spiking-YOLO: Spiking Neural Network for Energy-Efficient Object Detection,” Proc. AAAI Conf. Artif. Intell., vol. 34, no. 07, Art. no. 07, Apr. 2020.DOI
8 
J. H. Lee, et al, “Training Deep Spiking Neural Networks Using Back-propagation,” Front. Neurosci., vol. 10, 2016.DOI
9 
Y. Wu, et al, “Direct training for spiking neural networks: faster, larger, better,” in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, pp. 1311-1318, Jan. 2019.DOI
10 
W. Zhang and P. Li, “Temporal spike sequence learning via backpropagation for deep spiking neural networks,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, pp. 12022-12033, Dec. 2020.URL
11 
F. Akopyan, et al., “TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip,” IEEE Trans. Comput.Aided Des. Integr. Circuits Syst., vol. 34, no. 10, pp. 1537-1557, Oct. 2015.DOI
12 
M. Davies, et al., “Loihi: A Neuromorphic Manycore Processor with On-Chip Learning,” IEEE Micro, vol. 38, no. 1, pp. 82-99, Jan. 2018.DOI
13 
G. K. Chen, et al, “A 4096-Neuron 1M-Synapse 3.8-pJ/SOP Spiking Neural Network With On-Chip STDP Learning and Sparse Weights in 10-nm FinFET CMOS,” IEEE J. Solid-State Circuits, vol. 54, no. 4, pp. 992-1002, Apr. 2019.DOI
14 
S. K. Esser, et al., “Convolutional networks for fast, energy-efficient neuromorphic computing,” Proc. Natl. Acad. Sci., vol. 113, no. 41, pp. 11441-11446, Oct. 2016.DOI
15 
S. Narayanan, et al, “SpinalFlow: An Architecture and Dataflow Tailored for Spiking Neural Networks,” in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) , pp. 349-362, May 2020.DOI
16 
J.-J. Lee, et al, “Parallel Time Batching: Systolic-Array Acceleration of Sparse Spiking Neural Computation,” in 2022 IEEE International Sympo-sium on High-Performance Computer Architecture (HPCA), pp. 317-330, Apr. 2022.DOI
17 
Y.-H. Chen, et al, “Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 367-379, Jun. 2016.DOI
18 
A. Paszke, et al., “PyTorch: an imperative style, high-performance deep learning library,” in Proceedings of the 33rd International Conference on Neural Information Processing Systems, pp. 8026-8037, 2019.URL
19 
Y. Kuang, et al., “A 64K-Neuron 64M-1b-Synapse 2.64pJ/SOP Neuromorphic Chip With All Memory on Chip for Spike-Based Models in 65nm CMOS,” IEEE Trans. Circuits Syst. II Express Briefs, vol. 68, no. 7, pp. 2655-2659, Jul. 2021.DOI
20 
J. Tandon, et al, “The OpenRISC processor: open hardware and Linux,” Linux J., vol. 2011, no. 6, Dec. 2011.URL
21 
E. M. Izhikevich, “Simple Model of Spiking Neurons”, IEEE Trans. on Neural Networks, vol. 14, no. 6, pp. 1569-1572, Nov. 2003.DOI
22 
A. L. Hodgkin and A. F. Huxley, “A quantitative description of membrane current and its application to conduction and excitation in nerve”, The J. of Physiology, vol. 117, pp. 500-544, Aug. 1952.DOI
23 
A. Basu, L. Deng, C. Frenkel and X. Zhang, "Spiking Neural Network Integrated Circuits: A Review of Trends and Future Directions," 2022 IEEE Custom Integrated Circuits Conference (CICC), pp. 1-8, 2022.DOI
Hee-Tak Kim
../../Resources/ieie/JSTS.2024.24.2.84/au1.png

Hee-Tak Kim received the B.S. degree in electronics engineering from Kyunghee University, Yongin, Korea, in 2017. He received M.S. degree in electronics engineering from Korea University, in 2019, where he is currently pursuing Ph.D. degree. He has been a senior researcher with Korea Electronics Technology Institute (KETI), Gyeonggi-do, Korea, since 2019.

Yun-Pyo Hong
../../Resources/ieie/JSTS.2024.24.2.84/au2.png

Yun-Pyo Hong received the B.S. degree, and the M.S. degree in Electronics Engineering from Yonsei University, Seoul, Korea, in 2012, and 2014, respectively. He was with Samsung Electronics, where he was a Staff Engineer at Display Research Group in Mobile Division. In 2020, he joined in Korea Electronics Technology Institute (KETI), Korea, where he is presently a senior researcher.

Seok-Hun Jeon
../../Resources/ieie/JSTS.2024.24.2.84/au3.png

Seok-Hun Jeon received the B.S. degree, and the M.S. degree in Electronics Engineering from Soongsil University, Seoul, Korea, in 2010, and 2012, respectively. He was with Seagate, where he was a Senior Engineer at R/W Channel in Korea Design Center. In 2017, he joined in Korea Electronics Technology Institute (KETI), Korea, where he is presently a senior researcher.

Tae-Ho Hwang
../../Resources/ieie/JSTS.2024.24.2.84/au4.png

Tae-Ho Hwang received the B.S., M.S. and Ph.D. degrees in Computer Engineering from the Hankuk University of Foreign Studies, in 1998, 2000, 2013 respectively. Since 2000, he has worked as a system software researcher at the Korea Electronics Technology Institute (KETI), Gyeonggi-do, Korea. He is currently the vice president of KETI's Semiconductor Display R&D Division. His research interests are AI system, Neuromorphic chip, Processor In-Memory and system architecture design.

Byung-Soo Kim
../../Resources/ieie/JSTS.2024.24.2.84/au5.png

Byung-Soo Kim was born in Seoul, Korea, in 1984. He received B.S. and M.S degrees from the School of Information and Communication Engineering in 2006 and 2008, respectively, at Inha University, Korea. In 2013, he received Ph.D. degree from the School Information and Communication Engineering at Inha University, Korea. Since 2013, he has been with the Korea Electronics Technology Institute (KETI), Gyeonggi-do, Republic of Korea. He is currently the Director of the SoC Platform Research Center. His research interests are in the areas of VLSI and SoC design, Neuromorphic chip design, Processor In-Memory, and Quantum computing systems.