Park Chunmyung1
Kim Jicheon1
Hyun Eunjae1
Nguyen Xuan Truong1
Lee Hyuk-Jae1
Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, Korea
{lukpcm, jckim, silverhj, truongnx, hyuk_jae_lee}@capp.snu.ac.kr
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
CNN accelerator, Processing element, Hardware utilization, FPGA, YOLO-v3
1. Introduction
Object detection has been actively studied for a broad range of applications across various domains, such as traffic monitoring [2], unmanned stores [3], and autonomous driving [4]. Many of these applications require low-latency, real-time responses (i.e., more than 30 frames per second). Owing to the rapid evolution of deep neural networks (DNNs), object detection models have improved rapidly in both accuracy and execution time. In particular, single-stage object detectors, such as YOLOv3 [5] and EfficientDet [6], achieve a good tradeoff between model accuracy and real-time execution on graphics processing units (GPUs). Unfortunately, GPUs consume considerable power, making them unsuitable for many energy- and power-constrained applications.
In recent years, alternative solutions, such as field-programmable gate arrays (FPGAs), have received increasing attention for DNN acceleration because of their low latency, good power efficiency, high configurability, and rapid prototyping. In particular, many FPGA implementations of YOLO accelerators [7-9] have been proposed. [7] proposed a streaming architecture for YOLO-v2 and its tiny version in which each layer has its own processing elements (PEs). Despite the fast inference speed due to high parallelism and pipelining among multiple layers, the architecture requires a huge buffer (BRAM). As a result, it is only suitable for highly customized networks, such as those with binarized weights or activations, which generally suffer from a large accuracy drop. ShortcutFusion [1] proposes a generic CNN accelerator that effectively supports various networks, including MobileNet-v2, EfficientNet-B0, ResNet-50, and YOLO-v3. The accelerator consists of 4096 eight-bit multipliers and adder trees, which work in parallel to achieve high accuracy and performance. Unfortunately, its processing elements are not fully utilized, leading to relatively low hardware utilization (e.g., 68.42% for YOLO-v3).
To address this problem, this paper proposes ShortcutFusion++, an improved version of ShortcutFusion that specifically optimizes the PE utilization of the baseline. In particular, two common and high-impact PE under-utilization cases were observed, and methods to resolve them are proposed. The contributions of this paper are as follows:
1) Under-utilization: two common cases of low PE utilization were observed when mapping YOLO-v3 onto ShortcutFusion. Specifically, the baseline dataflow showed low utilization for stride = 2 convolutions (i.e., 34.01%) and for the row-based weight reuse scheme (i.e., 33.89%).
2) Proposed method: this paper proposes a flexible prefetching scheme and redesigns the output buffer to address the abovementioned cases. With the proposed approaches, ShortcutFusion++ avoids unnecessary stall cycles when feeding data to the PEs and when writing the results to external memory.
3) Experiments: the experimental results show that ShortcutFusion++ achieves 80.95% hardware utilization for YOLO-v3, outperforming its baseline by 12.53 percentage points.
The remainder of this paper is organized as follows. Section 2 introduces the background
related to ShortcutFusion. In Section 3, the optimization methods are described. Section
4 presents the evaluation method and experimental results, and Section 5 concludes
the paper.
2. Related Works
2.1 CNN Accelerators and Processing Elements
Comprising millions of multiply-accumulate (MAC) operations, a convolutional (CONV) layer can be expressed as six or seven nested loops [10]. Alternatively, a CONV layer can be transformed into a general matrix-matrix or matrix-vector multiplication using the im2col transform [11]. As a result, two typical PE designs for generic CNN accelerators are systolic arrays [12,13] and inner-product multipliers with an adder tree [14,15]. ShortcutFusion [1] consists of $T_{o}$ CONV kernels, each of which consists of $T_{i}$ multipliers and an adder tree. In particular, $T_{i}$ and $T_{o}$ were set to 64, resulting in 4096 multipliers in total.
Assume that $C_{i}$ and $C_{o}$ are the number of input feature channels and output filters, respectively, for a given CONV layer. The number of computing cycles is as follows:

$$cycle=K\times K\times \left\lceil \frac{C_{i}}{T_{i}}\right\rceil \times \left\lceil \frac{C_{o}}{T_{o}}\right\rceil \times H\times W\,\,\,\,\,\,\,\,(1)$$

where $K$ is the filter size, and $H$ and $W$ are the height and width of the output feature maps, respectively.
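As a quick illustration, the following minimal Python sketch (a hypothetical helper, assuming the tiled dataflow described above) evaluates Eq. (1); note how any $C_{i}$ or $C_{o}$ that is not a multiple of 64 wastes cycles through the ceiling terms.

```python
import math

def conv_cycles(K, Ci, Co, H, W, Ti=64, To=64):
    """Cycle count of a CONV layer per Eq. (1): each of the H*W output
    positions needs K*K cycles per input-channel tile, repeated for
    every ceil(Ci/Ti) input tile and ceil(Co/To) output tile."""
    return K * K * math.ceil(Ci / Ti) * math.ceil(Co / To) * H * W

# A 3x3, 256-in/512-out YOLO-v3 layer on a 26x26 output map:
print(conv_cycles(K=3, Ci=256, Co=512, H=26, W=26))  # 194688 cycles
```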
2.2 ShortcutFusion
Fig. 1 shows the architecture of the CNN accelerator in [1]. It consists of a controller, two DMA modules for loading the weights and the input feature maps (IFMs) of the model, and one DMA module for writing the output feature maps (OFMs). The controller selects either a row-based weight reuse scheme or a frame-based weight reuse scheme. Notably, although there are many loop interchange or tiling options for the six nested loops, ShortcutFusion utilizes only these two weight reuse schemes, based on the observation that a CONV layer typically has either (1) large IFMs with a small number of weights or (2) small IFMs with a large number of weights.
The two reuse schemes are described in Fig. 2. The frame-based reuse scheme is utilized when the IFMs are small enough to be stored in an on-chip buffer. In particular, it reuses the weight blocks (i.e., $K\times K\times T_{i}$ weights each) while the input sliding cube (i.e., $K\times K\times T_{i}$ pixels) passes through a single frame of the IFM, as shown in Fig. 2(a). The input data in the sliding cube are convolved with $T_{o}$ weight blocks, generating the partial sum of the OFM (i.e., $H\times W\times T_{o}$ pixels). When the input sliding cube hits the end of the frame, it moves along the channel direction to generate the next partial sum, which is accumulated with the previous result.
The row-based weight reuse scheme is utilized when the IFMs are relatively large and the number of weights is small. In particular, the weights are preloaded and reused while the input sliding cube passes through a single row, as shown in Fig. 2(b). This generates the partial sum of the OFM (i.e., $1\times W\times T_{o}$ pixels). When the sliding cube hits the end of the row, it moves along the channel direction while remaining in the same row. Because it remains in the same row, the generated output can be accumulated with the previous output; the accumulated value becomes the final output once the input sliding cube reaches the end of the channel dimension.
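To make the two schemes concrete, the following simplified Python sketch (an assumption-laden software analogue: 1×1 windows, one channel tile per index, no $T_{i}$/$T_{o}$ tiling) contrasts the loop orders; only the position of the channel loop differs.

```python
import numpy as np

def frame_based(ifm, w, C, H, W):
    """Frame-based reuse: each weight block w[c] stays resident while the
    sliding cube sweeps the whole H x W frame, then the channel advances."""
    psum = np.zeros((H, W))
    for c in range(C):                # channel direction (outermost)
        for y in range(H):
            for x in range(W):
                psum[y, x] += ifm[c, y, x] * w[c]
    return psum

def row_based(ifm, w, C, H, W):
    """Row-based reuse: preloaded weights are reused across one row; the
    cube then advances along the channel direction within the same row."""
    psum = np.zeros((H, W))
    for y in range(H):                # row (outermost)
        for c in range(C):            # channels finish within the row
            for x in range(W):
                psum[y, x] += ifm[c, y, x] * w[c]
    return psum
```

Both orders produce the same result; they differ in which operand stays on-chip and in how much partial-sum storage is needed (a full $H\times W\times T_{o}$ frame versus a single $1\times W\times T_{o}$ row).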
The main controller selects either a frame-based dataflow or a row-based dataflow depending on the weight reuse scheme. In the frame-based dataflow (red arrow in Fig. 1), the IFMs are fetched from the on-chip buffer and pass through the line buffer and the CONV window module. When the CONV kernel generates the OFMs, it writes them to the on-chip buffer. In the row-based dataflow (blue arrow in Fig. 1), however, the IFMs are fetched from off-chip memory: the input loader module loads the IFMs from off-chip memory using DMA. When the CONV kernel generates the OFMs, it writes them to the output buffer, and the output writer module then stores the OFMs to off-chip memory using DMA.
Fig. 1. Block diagram of the CNN accelerator in [1].
Fig. 2. Weight reuse schemes in ShortcutFusion: (a) Frame-based weight reuse; (b) Row-based weight reuse. Borrowed from Fig. 3 in [1].
2.3 Motivations
YOLO-v3 [5] is a well-known object detector that has achieved a good tradeoff between model accuracy
and real-time execution. In particular, the number of input and output channels $C_{i}$
and $C_{o}$ in YOLO-v3 are 64, 128, 256, 512, and 1024. Therefore, it is likely for
the PEs of ShortcutFusion to be fully utilized according to Eq. (1). Unfortunately, as reported in [1], ShortcutFusion only achieves a utilization of 68.42% for YOLO-v3. This phenomenon
prompted this study to determine the sources of underutilization in which PEs are
forced to IDLE during data movement.
3. Proposed Work
3.1 Under-utilization of PEs
This subsection quantifies the layer-wise PE utilization of YOLO-v3 on ShortcutFusion. Although the frame-based weight reuse scheme generally achieves higher utilization than the row-based one, it may require a huge on-chip buffer to store the IFMs. Therefore, following [1], the cutpoint is set to 9 to meet the on-chip buffer size constraint. As a result, the row-based weight reuse scheme is applied to CONV layers 0-8, while the frame-based weight reuse scheme is applied to the remaining CONV layers 9-76. Table 1 lists the profiling results. As shown in Table 1, the ‘Frame-based’ category (i.e., layers with the frame-based dataflow) shows a relatively high PE utilization of 81.15%. The ‘Row-based’ category (i.e., layers with the row-based dataflow) shows a poor PE utilization of 33.89%. In particular, the ‘Stride = 2’ category (i.e., layers 1, 4, 9, 26, and 43 with stride = 2) suffers from severe under-utilization, with a PE utilization of 34.01%. This phenomenon occurs for both the row-based scheme (e.g., layers 1 and 4) and the frame-based scheme (e.g., layers 9, 26, and 43). Because these stride = 2 layers account for 24.35% of the overall execution time, their low utilization drags down the utilization of the entire network.
The following subsections analyze the sources of low utilization in those layers and propose methods to enhance the utilization.
Table 1. Average PE utilization of CNN accelerator in ShortcutFusion.

Category | Layer # | Runtime Ratio | Avg. Utilization
Row-based | 0–8 | 26.95% | 33.89%
Frame-based | 9–76 | 73.05% | 81.15%
Stride = 2 | 1, 4, 9, 26, 43 | 24.35% | 34.01%
Overall | 0–76 | 100% | 68.42%
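As a consistency check, the overall figure in Table 1 is (up to rounding) the runtime-weighted average of the two dataflow categories:

$$0.2695\times 33.89\%+0.7305\times 81.15\%\approx 9.13\%+59.28\%\approx 68.42\%$$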
3.2 Optimization on Stride = 2 Convolution
Fig. 3 shows the dataflow during a $3\times 3$ convolution. The CONV kernel needs nine cycles to consume one window of data (i.e., $3\times 3\times T_{i}$ pixels). Therefore, to synchronize with the CONV kernel, the controller must fetch one window of data every nine cycles.
In the case of stride = 1 convolution, two columns of data (i.e., $3\times 2\times T_{i}$ pixels) can be reused from the previous window, as shown in Fig. 3(a). As a result, only one new column of data (i.e., $3\times 1\times T_{i}$ pixels) needs to be fetched by the controller every nine cycles.
Fig. 3(b) shows the dataflow during stride = 2 convolution. In this case, only one column of data can be reused from the previous window, so two new columns must be fetched by the controller every nine cycles. However, fetching two columns takes 18 cycles because the fetching speed of the original accelerator is fixed. This leaves the CONV kernel idle for nine cycles, which abruptly decreases the PE utilization.
To resolve this issue, a flexible prefetching scheme is proposed (Fig. 3(c)). When executing stride = 2 convolutions, the controller increases the data fetching speed, fetching two columns of data every nine cycles. As a result, the window data are ready every nine cycles, and the CONV kernel never waits for the next window, which avoids the unnecessary stall cycles.
One more optimization opportunity exists to increase PE utilization during stride = 2 convolution. The controller requires three rows of data (i.e., $3\times W\times T_{i}$ pixels) to fetch the column data. Therefore, three rows must be prefetched into the line buffer before the computation starts. After prefetching the first three rows, however, the amount of data that needs to be prefetched becomes smaller because the dataflow can reuse row data already in the line buffer.
In the case of stride = 1 convolution, prefetching only a single row of data (i.e., $1\times W\times T_{i}$ pixels) makes all three rows ready, because the other two can be reused from the line buffer. For stride = 2 convolution, however, two rows must be prefetched because only one can be reused. The original accelerator always prefetches the same amount of row data (i.e., a single row) regardless of the layer type. Therefore, it suffers from under-utilization during stride = 2 convolution because of insufficient prefetching.
This issue is also resolved by the flexible prefetching scheme, which adjusts the amount of prefetched data to the stride. With the proposed method, two rows are prefetched for stride = 2 layers, so this source of under-utilization is eliminated.
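The same rule applies in the row direction, shown here as a one-line sketch (assuming a $K$-row window of which $K-stride$ rows are reusable from the line buffer):

```python
def rows_to_prefetch(K, stride):
    """Rows the controller must prefetch before a new output row:
    K rows are needed, and K - stride of them are already in the
    line buffer, leaving 'stride' rows to fetch."""
    return K - (K - stride)  # == stride

print(rows_to_prefetch(3, 1))  # 1: two rows reused from the line buffer
print(rows_to_prefetch(3, 2))  # 2: only one row can be reused
```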
Fig. 3. Dataflow during a 3×3 convolution: (a) stride = 1; (b) stride = 2; (c) stride = 2 with the optimization method.
3.3 Optimization on Row-based Dataflow
This subsection analyzes the source of under-utilization in row-based dataflow and
presents an optimization method to increase PE utilization in the row-based dataflow.
During the row-based dataflow, the results showed lower PE utilization than frame-based
dataflow. This is because the location of the feature map is different. In frame-based
dataflow, the accelerator reads the IFMs from the on-chip buffer and writes the OFMs
to the on-chip buffer. Using the on-chip buffer, the data bandwidth is very high,
which supports the rapid movement of input and output data. This makes it easy to
pipeline the whole computation, which can easily utilize the PEs.
In row-based dataflow, however, the accelerator reads the IFMs from off-chip memory
and writes the OFMs back to off-chip memory. Because the data movement of input and
output data is slow, it is difficult to pipeline the computation fully in row-based
dataflow. Thus, PE utilization is lower than frame-based dataflow.
Fig. 4 shows a timing diagram of the CONV operation and the DMA operation in the row-based dataflow. As shown in Fig. 4(a), the CONV and DMA operations are not pipelined, leading to low PE utilization. The two operations cannot be pipelined because of a data hazard caused by concurrent access to the output buffer: during the CONV operation, the CONV kernel writes the OFM to the output buffer, while during the DMA operation, the output writer reads the OFM and transfers the data to off-chip memory. If both operations were pipelined, the OFM data could be overwritten by the next CONV operation before the DMA transfer completes. Although the next CONV operation starts later than the DMA, a data hazard can still occur because of the low bandwidth of the off-chip memory.
In the proposed method, the output buffer is reconstructed to enable pipelining between the two operations. The reconstructed output buffer consists of two separate buffers, so the two operations can switch between them in a ping-pong manner (Fig. 4(b)). While the output writer reads the OFM data from one buffer, the CONV kernel writes the next OFM data to the other buffer. Consequently, the data hazard is removed, and the two operations are well-pipelined.
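The reconstructed buffer is, in effect, a double-buffering scheme. A minimal software analogue (hypothetical callables standing in for the RTL modules; in hardware the two calls inside the loop run concurrently) looks as follows:

```python
def process_output_rows(n_rows, conv_write, dma_read):
    """Ping-pong output buffering: the CONV kernel fills one buffer while
    the DMA drains the other, so the two never touch the same buffer."""
    buf = [bytearray(1024), bytearray(1024)]   # two separate output buffers
    for i in range(n_rows):
        conv_write(buf[i % 2], i)              # CONV writes OFM row i
        if i > 0:
            dma_read(buf[(i - 1) % 2], i - 1)  # DMA drains the previous row
    dma_read(buf[(n_rows - 1) % 2], n_rows - 1)  # drain the final row
```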
Fig. 4. Timing diagram of the CONV operation and DMA operation in the row-based dataflow: (a) Before optimization, the operations are not pipelined; (b) After optimization, the operations are well-pipelined.
4. Performance Evaluation
4.1 Evaluation Method
This subsection presents the evaluation method for PE utilization. PE utilization is measured as the total operation count divided by the maximum number of operations achievable with a given number of PEs over a given number of cycles. Since each PE can execute two operations (i.e., one multiplication and one addition) per cycle, the maximum number of operations is $2\times \left(PE\,\,count\right)\times \left(cycle\,\,count\right)$. Therefore, PE utilization can be formulated as follows:

$$Utilization=\frac{OP_{MUL}+OP_{ADD}}{2\times \left(PE\,\,count\right)\times \left(cycle\,\,count\right)}\,\,\,\,\,\,\,\,(2)$$

where $OP_{MUL}$ is the number of multiplications, $OP_{ADD}$ is the number of additions, and $cycle\,\,count$ is the number of execution cycles.
In addition, $OP_{MUL}$ and $OP_{ADD}$ are formulated as follows:

$$OP_{MUL}=K\times K\times C_{i}\times C_{o}\times H\times W\,\,\,\,\,\,\,\,(3)$$

$$OP_{ADD}=K\times K\times C_{i}\times C_{o}\times H\times W\,\,\,\,\,\,\,\,(4)$$

where $K$ is the width of the convolution kernel; $C_{i}$ is the number of input feature channels; $C_{o}$ is the number of output filters; $H$ is the height of the OFMs; $W$ is the width of the OFMs. Because every multiplication is paired with one accumulation in the adder tree, $OP_{ADD}$ equals $OP_{MUL}$.
The number of cycles is obtained from RTL simulation. Therefore, PE utilization can be measured using these equations.
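Putting Eqs. (2)-(4) together, per-layer utilization can be computed as in this sketch (a hypothetical helper; the cycle count itself comes from the RTL simulation):

```python
def pe_utilization(K, Ci, Co, H, W, cycles, pe_count=4096):
    """PE utilization per Eq. (2): executed operations divided by the
    peak 2 * PE_count * cycles (one MUL plus one ADD per PE per cycle)."""
    op_mul = K * K * Ci * Co * H * W  # Eq. (3)
    op_add = K * K * Ci * Co * H * W  # Eq. (4): one ADD pairs each MUL
    return (op_mul + op_add) / (2 * pe_count * cycles)

# With the ideal cycle count of Eq. (1), a layer whose Ci and Co are
# multiples of 64 reaches 100% utilization:
ideal = 3 * 3 * (256 // 64) * (512 // 64) * 26 * 26
print(pe_utilization(3, 256, 512, 26, 26, cycles=ideal))  # 1.0
```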
4.2 Evaluation Result
This section compares the results of the two optimization methods against the original CNN accelerator [1]. All results are based on the YOLO-v3 network. As shown in Table 2, with the stride = 2 convolution optimization, the average PE utilization of the stride = 2 convolutions reaches 67.21%, an improvement of 33.20 percentage points over the baseline. This is because the proposed flexible data prefetching scheme removes the unnecessary stall cycles. In addition, with the row-based dataflow optimization, the row-based layers achieve a PE utilization of 52.53%, an improvement of 8.44 percentage points (19.1% in relative terms) from 44.09%. This shows that the reconstructed output buffer successfully increases PE utilization by pipelining the PE operation and the writing of OFMs.
Fig. 5 shows the layer-wise PE utilization at the three optimization steps. Layers 1, 4, 9, 26, and 43 perform stride = 2 convolutions, so the stride = 2 convolution optimization (yellow in Fig. 5) shows the improvement in those layers. For layers 0 to 8, the accelerator uses the row-based dataflow, so the row-based dataflow optimization (blue in Fig. 5) shows the improvement there. Table 3 lists the overall PE utilization after applying the optimizations. The results indicate that ShortcutFusion++ achieves a PE utilization of 80.95%, a 12.53-percentage-point improvement over the baseline.
Fig. 5. Layer-wise PE utilization with the optimization method.
Table 2. PE utilization of target layer with optimization method.

Optimization | PE Utilization (Before Opt.) | PE Utilization (After Opt.) | Improvement
Stride = 2 | 34.01% | 67.21% | +33.20%
Row-based | 44.09% | 52.53% | +8.44%
Table 3. Overall PE utilization with optimization method.

Optimization | PE Utilization
Baseline | 68.42%
Stride=2 Conv. Optimization | 77.75%
Stride=2 Conv. Optimization + Row-based Dataflow Optimization | 80.95%
5. Conclusion
This paper reported the under-utilization of the processing elements in ShortcutFusion and proposed two optimization methods to increase PE utilization. By applying both optimizations, ShortcutFusion++ highly utilizes the processing elements, achieving a PE utilization of 80.95% for YOLO-v3 and outperforming the 68.42% of the baseline.
ACKNOWLEDGMENTS
This work was supported in part by the R&D Program of MOTIE/KEIT (No. 20010582,
Development of deep learning based low power HW IP design technology for image processing
of CMOS image sensors) and in part by the Technology Innovation Program (or Industrial
Strategic Technology Development Program – No. 20014490, Development of Technology
for Commercializing Lv.4 Self-driving Computing Platform Based on Centralized Architecture)
funded by the Ministry of Trade, Industry & Energy (MOTIE, Korea).
REFERENCES
[1] Nguyen Duy Thanh, Je Hyeonseung, Nguyen Tuan Nghia, Ryu Soojung, Lee Kyujoong, Lee Hyuk-Jae, 2022, ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 69, No. 6, pp. 2477-2489.
[2] Yadav Satya Prakash, 2020, Vision-based detection, tracking, and classification of vehicles, IEIE Transactions on Smart Processing & Computing, Vol. 9, No. 6, pp. 427-434.
[3] Zhang Haijun, et al., 2019, Toward new retail: A benchmark dataset for smart unmanned vending machines, IEEE Transactions on Industrial Informatics, Vol. 16, No. 12, pp. 7722-7731.
[4] Choi Jiwoong, et al., 2022, Efficient object detection acceleration methods for autonomous-driving embedded platforms, IEIE Transactions on Smart Processing & Computing, Vol. 11, No. 4, pp. 255-261.
[5] Redmon Joseph, Farhadi Ali, 2018, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767.
[6] Tan Mingxing, Le Quoc V., 2019, EfficientNet: Rethinking model scaling for convolutional neural networks, in Proceedings of the International Conference on Machine Learning (ICML).
[7] Nguyen Duy Thanh, et al., 2019, A high-throughput and power-efficient FPGA implementation of YOLO CNN for object detection, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 27, No. 8, pp. 1861-1873.
[8] Nguyen Duy Thanh, Kim Hyun, Lee Hyuk-Jae, 2020, Layer-specific optimization for mixed data flow with mixed precision in FPGA design for CNN-based object detectors, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 31, No. 6, pp. 2450-2464.
[9] Zhang Xiaofan, et al., 2018, DNNBuilder: An automated tool for building high-performance DNN hardware accelerators for FPGAs, in 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[10] Dave Shail, et al., 2019, dMazeRunner: Executing perfectly nested loops on dataflow accelerators, ACM Transactions on Embedded Computing Systems (TECS), Vol. 18, No. 5s, pp. 1-27.
[11] Lai Liangzhen, Suda Naveen, Chandra Vikas, 2018, CMSIS-NN: Efficient neural network kernels for Arm Cortex-M CPUs, arXiv preprint arXiv:1801.06601.
[12] Chen Yu-Hsin, et al., 2016, Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks, IEEE Journal of Solid-State Circuits, Vol. 52, No. 1, pp. 127-138.
[13] Kung H. T., 1980, Algorithms for VLSI processor arrays, in Introduction to VLSI Systems, pp. 271-292.
[14] Bai Lin, Zhao Yiming, Huang Xinming, 2018, A CNN accelerator on FPGA using depthwise separable convolution, IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 65, No. 10, pp. 1415-1419.
[15] Ma Yufei, et al., 2016, Scalable and modularized RTL compilation of convolutional neural networks onto FPGA, in 2016 26th International Conference on Field Programmable Logic and Applications (FPL).
Author
Chunmyung Park received his B.S. degree in electrical and computer engineering
from Seoul National University, Seoul, South Korea, in 2020. He is currently working
toward an integrated M.S. and Ph.D. degree in electrical and computer engineering
at Seoul National University, Seoul, South Korea. His current research interests include
computer architecture and SoC for neural network processing.
Jicheon Kim received his B.S. degree in electrical and computer engineering from
the University of Seoul in 2011, and M.S. degree from Seoul National University, Seoul,
South Korea, in 2013. From 2013 to 2017, he was with the SoC Division, GCT Semiconductor,
Seoul, South Korea. In 2017, he joined the S. LSI Division, Samsung Electronics Corporation.
His current research interests include computer architecture and SoC for machine learning.
Eunjae Hyun received his B.S. degree in biosystems engineering and M.S. degree in bioengineering from Seoul National University, Seoul, South Korea, in 2010 and 2014, respectively. He is currently working toward a Ph.D. degree in electrical and computer engineering at Seoul National University, Seoul, South Korea. He has participated in various projects at Samsung Electronics' DMC Research Center and S.LSI Division since 2014 and participated in the development of image signal processing algorithms integrated into commercial image sensors for the four years until 2021. His current research interests include computer architecture and SoC for neural network processing.
Xuan Truong Nguyen received his B.S. degree in Electrical Engineering from Hanoi University of Science and Technology, Hanoi, Vietnam, in 2011, and his M.S. and Ph.D. degrees in Electrical Engineering and Computer Science from Seoul National University, Seoul, Korea, in 2015 and 2019, respectively. He is working as a postdoctoral fellow with the BK21+ program at the Department of Electrical and Computer Engineering, Seoul National University. His research interests include algorithm and SoC design for low-complexity computer vision and multimedia applications.
Hyuk-Jae Lee received his B.S. and M.S. degrees in electronics engineering from Seoul National University, Seoul, South Korea, in 1987 and 1989, respectively, and his Ph.D. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, USA, in 1996. From 1996 to 1998, he was a Faculty Member at the Department of Computer Science, Louisiana Tech University, Ruston, LA, USA. From 1998 to 2001, he was a Senior Component Design Engineer at the Server and Workstation Chipset Division, Intel Corporation, Hillsboro, OR, USA. In 2001, he joined the School of Electrical Engineering and Computer Science at Seoul National University, where he is a Professor. He is the Founder of Mamurian Design, Inc., Seoul, a fabless SoC design house for multimedia applications. His current research interests include computer architecture and SoC for multimedia applications.