Sungju Ryu
(Sungju Ryu is with the School of Electronic Engineering, Soongsil University, Seoul,
Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Index Terms
Hardware accelerator, MAC unit, neural processing unit, quantized neural networks, variable bit-precision
I. INTRODUCTION
The model complexity of neural networks has been rapidly increasing to meet the target
accuracy of neural network applications. However, edge devices usually have limited
computing capability due to power constraints, so meeting the real-time latency targets
required by modern complex network models is challenging.
Various methods for making deep neural networks compact, including quantization, weight
pruning, and separable convolution, have been explored to ease the burden of real-time
computing. Among them, quantization makes deep neural networks lighter by expressing
inputs and weights with fewer bits. The disadvantage of approximating the network
parameters and activations via quantization is inference accuracy loss. As a result,
many quantization techniques have been proposed to reduce the error from this approximation.
Recently, open-source machine learning frameworks such as PyTorch [1] and TensorFlow [2] have started to provide quantization APIs that make quantizing neural networks easier
for researchers, thereby reducing service development time. As a result, quantization
as a method to compress neural networks is becoming more popular.
Meanwhile, mapping quantized neural networks onto conventional fixed bit-width
hardware cannot maximize computational efficiency. For example, an 8-bit input/weight
multiplication on a 32-bit multiplier circuit has the same throughput and similar
energy-efficiency as a 32-bit input/weight multiplication. To maximize the performance
of quantized neural networks, previous works have proposed variable-bit
multiply-accumulate (MAC) units. However, these variable-bit MAC microarchitectures
have been implemented under different experimental conditions, so it is difficult to
select the most suitable scheme for a target design space. Previous work [3] studied precision-scalable MACs, but it did not evaluate them on
real benchmarks; only ideal workloads were used for the simulation.
Our contributions to analyze and compare these variable-bit MAC units are as follows.
1) We review variable bit-precision MAC microarchitectures. Subword-parallel and
one-/two-sided bit-width flexible MAC arrays are studied.
2) We synthesize the MAC arrays using 28 nm standard library cells. Area, energy
consumption, and throughput are analyzed using real neural network benchmarks.
II. REVIEW OF PRECISION-SCALABLE MAC MICROARCHITECTURES
1. One-sided Flexible Bit-width Designs
1) Stripes
In neural networks, the required bit-precision of neurons varies across layers.
The main concept of Stripes [4] is that performance can be improved linearly if the computation time is scaled according
to the bit-width of the neurons. Fig. 1(a) shows the baseline fixed bit-width MAC array. In the baseline design, inputs and
weights are first stored in the input/weight buffers. Inputs are multiplied by weights
after being loaded from the buffers. Then, the partial sums are added and accumulated until
an output value is constructed. The Stripes accelerator proposed a serial inner product
(SIP) unit which includes input/weight buffers, AND gates, an adder tree, an accumulator,
and bit-shift logic. Considering that multipliers usually generate partial products
using AND gates, the SIP multiplies the weights by the input bits using AND logic. Fig. 1(b) shows a 2-bit multiplication example. First, the 2-bit weights are AND-ed with the LSBs of
the 2-bit input numbers. The two partial sums are added and accumulated in the buffer.
Second, the 2-bit weights are AND-ed with the second bits of the 2-bit inputs. After these AND
and accumulate operations, the SIP unit finishes the 2-bit dot product computation.
The numerical result is exactly the same as in the baseline. If the SIP has the same
number of AND gates as the baseline multiplier, the throughput of the SIP equals
that of the baseline inner product unit. For example, a baseline inner product
unit with two 2-bit multipliers contains eight AND gates; an SIP with eight AND gates
therefore achieves the same throughput as the baseline design.
Fig. 1. (a) Baseline fixed bit-width MAC unit; (b) serial inner product unit of Stripes[4].
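As an illustration of this bit-serial scheme, the short Python sketch below (author's illustration for unsigned operands only; the function and variable names are not from the Stripes paper) consumes one input bit-plane per cycle, masks the weights with that bit, sums them through an adder tree, and shift-accumulates the result:

```python
def stripes_sip_dot(inputs, weights, in_bits):
    """Bit-serial inner product in the spirit of the SIP unit (unsigned)."""
    acc = 0
    for b in range(in_bits):                       # one clock cycle per input bit
        bit_lane = [(x >> b) & 1 for x in inputs]  # current bit of every input
        partial = sum(w * bit for w, bit in zip(weights, bit_lane))  # AND gates + adder tree
        acc += partial << b                        # shift-accumulate by bit position
    return acc

# Matches the ordinary dot product for unsigned operands.
inputs, weights = [2, 3], [1, 2]
assert stripes_sip_dot(inputs, weights, in_bits=2) == sum(x * w for x, w in zip(inputs, weights))
```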
2) UNPU
The processing engine in UNPU [5] handles fully variable weight bits from 1- to 16-bit precision (Fig. 2). An input number is stored in the buffer and AND-ed with the weight bits for W clock cycles
(W: the number of weight bits). After the processing engine finishes the multiplications between
input/weight pairs, the results are sent to the adder/subtractor tree. Furthermore,
lookup table (LUT)-based bit-serial computation is adopted for energy-efficient matrix
multiplication. Possible partial products are pre-stored in the partial product table.
If the same bit-pattern is repeated, the partial product is simply fetched from the
table, thereby maximizing energy-efficiency.
Fig. 2. Processing engine with fully variable weight bits in UNPU[5].
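The LUT-based reuse can be pictured with the simplified Python sketch below; this is a hypothetical model rather than the actual UNPU datapath, and the group size of four inputs and the table layout are assumptions made only for illustration. Per weight-bit plane, the bit-pattern across a group of inputs selects a pre-computed sum instead of re-adding the inputs:

```python
from itertools import product

def lut_bit_serial_dot(inputs, weights, w_bits, group=4):
    """Weight-bit-serial dot product with a pre-stored partial-product table
    (simplified sketch of LUT-based bit-serial computation, unsigned operands)."""
    acc = 0
    for g in range(0, len(inputs), group):
        xs = inputs[g:g + group]
        # Pre-store every possible partial sum of this input group.
        table = {bits: sum(x for x, b in zip(xs, bits) if b)
                 for bits in product((0, 1), repeat=len(xs))}
        for b in range(w_bits):                                   # one cycle per weight bit
            pattern = tuple((w >> b) & 1 for w in weights[g:g + group])
            acc += table[pattern] << b                            # fetch instead of re-adding
    return acc

inputs, weights = [3, 1, 2, 2], [2, 3, 1, 2]
assert lut_bit_serial_dot(inputs, weights, w_bits=2) == sum(x * w for x, w in zip(inputs, weights))
```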
2. Two-sided Flexible Bit-width Designs
1) Envision
Envision [6] introduced a subword-parallel MAC design scheme. The MAC unit consists of 16 submultipliers.
In the high bit-precision mode (16-bit, Fig. 3(a)), all the submultipliers are turned on to construct the high-bit multiplication result.
On the other hand, when targeting low bit-widths, some submultipliers are turned
off by masking the input signals of part of the MAC unit. To improve the throughput
and energy-efficiency of the MAC, the scalable arithmetic unit reuses the inactive
submultiplier cells. In the 8-bit precision mode (Fig. 3(b)), four 4x4 submultipliers are used for one 8-bit multiplication. In this case, two 8-bit
multiplications are performed in parallel, so 8 out of 16 submultipliers are used
in total. Moreover, when targeting 4-bit precision (Fig. 3(c)), only one 4x4 submultiplier is used per multiplication. Four 4-bit multiplications are handled at the
same time; hence, 4 out of 16 submultipliers are used in this case.
When the bit-width is scaled, the critical-path delay is shortened (Fig. 3(d)). By combining the subword-parallel MAC microarchitecture with voltage scaling, the
precision-scaled arithmetic blocks show much higher energy-efficiency while maintaining
the same throughput as the high bit-precision mode.
Fig. 3. Subword-parallel MAC engine proposed in Envision[6]: (a) 16-bit; (b) 8-bit; (c) 4-bit multiplication modes; (d) Critical paths at different bit-precision modes.
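A minimal sketch of the subword-parallel composition is given below (unsigned operands and Python used purely for illustration; the real design handles signed data and masks the idle submultipliers in hardware). An 8x8 product is assembled from four 4x4 submultiplier results, while the 4-bit mode simply runs independent 4x4 products in parallel:

```python
def mul4x4(a, b):
    """A 4x4 submultiplier (unsigned), the building block assumed here."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a * b

def subword_mul8x8(a, b):
    """8x8 multiplication composed from four 4x4 submultipliers."""
    a_hi, a_lo = a >> 4, a & 0xF
    b_hi, b_lo = b >> 4, b & 0xF
    return ((mul4x4(a_hi, b_hi) << 8) +
            ((mul4x4(a_hi, b_lo) + mul4x4(a_lo, b_hi)) << 4) +
            mul4x4(a_lo, b_lo))

def subword_mul4x4_parallel(a_vec, b_vec):
    """4-bit mode: independent 4x4 products run in parallel on a subset of
    the submultipliers; the remaining submultipliers stay idle."""
    return [mul4x4(a, b) for a, b in zip(a_vec, b_vec)]

assert subword_mul8x8(200, 59) == 200 * 59
```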
2) Bit Fusion
Bit Fusion [7] proposed a bit-level dynamically composable MAC unit called a fusion unit (Fig. 4(a)). Bit Fusion performs a two-dimensional physical grouping of its submultipliers,
called BitBricks. The grouped BitBricks become a fused processing engine (fused-PE)
that executes a multiplication with the required bit-width. Depending on the target bit-precision,
the fusion unit can have various numbers of fused-PEs. When an 8x8 multiplication
is performed (Fig. 4(b)), all the BitBricks in the fusion unit constitute one fused-PE. For an 8x4 multiplication
(Fig. 4(c)), 8 BitBricks are required. Considering that a fusion unit consists of 16 BitBricks,
two 8x4 multiplications are performed in parallel within the fusion unit. In the case
of a 2x2 multiplication (Fig. 4(d)), only one BitBrick is used for each multiplication, so 16 2x2 multiplications are computed
in a clock cycle using the fusion unit. After the 2-bit multiplications in the BitBricks,
the partial multiplication results are shifted depending on the target bit-precision.
For example, to construct an 8x8 multiplication, the 2-bit multiplication results
from the 16 BitBricks are shifted by 0 to 12 bits depending on their bit positions. In
the same manner, for an 8x4 multiplication, the outputs from the 8 BitBricks are shifted
by 0 to 8 bits. However, no shift operations are performed in a 2x2 multiplication, because
a BitBrick can fully express the 2-bit multiplication by itself. Once the shift operations
are finished, the results are added through the adder tree to complete the dot product
computation.
Fig. 4. (a) Dynamically composable fusion unit of Bit Fusion[7]; (b) 8x8 multiplication; (c) 8x4 multiplications (2x parallelism); (d) 2x2 multiplications (16x parallelism).
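The fusion-unit composition can be sketched as follows (unsigned operands, author's illustration; helper names are not from the Bit Fusion paper). Each of the 16 2x2 BitBrick products receives its own shift of 0 to 12 bits before the adder tree, which is exactly the per-BitBrick shifter cost discussed in Section III:

```python
def bitbricks(x, chunks=4):
    """Split an operand into 2-bit chunks (LSB first)."""
    return [(x >> (2 * i)) & 0x3 for i in range(chunks)]

def fused_mul8x8(a, b):
    """8x8 multiplication built from 16 2x2 BitBrick products (unsigned)."""
    total = 0
    for i, a_chunk in enumerate(bitbricks(a)):
        for j, b_chunk in enumerate(bitbricks(b)):
            total += (a_chunk * b_chunk) << (2 * i + 2 * j)  # per-BitBrick variable shift
    return total

assert fused_mul8x8(173, 94) == 173 * 94
```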
3) BitBlade
To enable bit-precision flexibility in the Bit Fusion architecture, each BitBrick
in the fusion unit requires dedicated variable bit-shift logic. However, this variable
bit-shift logic leads to a large area overhead. To mitigate the logic complexity, the BitBlade
[8] architecture proposed a bitwise summation method. When a dot product computation
is performed, the inputs and weights are first divided into 2-bit numbers. The divided
2-bit input/weight pairs with the same index position from the different input/weight
numbers are grouped. The grouped input/weight pairs always share the same bit-shift
parameters. When each processing element is dedicated to one group, it needs
only one variable shift logic block. As a result, the area overhead of realizing the variable-bit
MAC unit is greatly reduced compared to the Bit Fusion architecture, where each BitBrick
requires its own shift logic.
Fig. 5(a) and (b) illustrate how the bitwise summation method works. For a simple description, assume
that a PE includes 4 BitBricks. In the 4x4 case (Fig. 5(a)), the 4-bit numbers are divided into 2-bit partial numbers. The 2-bit partial numbers
from the same index position of the different input/weight numbers are grouped and
placed in the same PE. Then, the 2-bit partial inputs are multiplied by
the 2-bit partial weights. The multiplication results are added using the intra-PE
adder. The added numbers are shifted according to their bit positions in each PE,
yielding the dot product result. Considering that the 16 BitBricks are organized into 4 PEs
in this example, four 4x4 multiplications are performed in parallel. In the same manner,
the PE array achieves 8x parallelism in the 4x2 multiplication mode.
Fig. 5. Bitwise summation scheme proposed in BitBlade[8]. For a simple explanation, it is assumed that a PE consists of 4 BitBricks. Examples of (a) 4x4 multiplication; (b) 4x2 multiplication.
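The bitwise summation can be sketched as follows (unsigned operands, author's illustration; helper names are not from the BitBlade paper). 2-bit partial products that share the same chunk-index pair, and hence the same shift amount, are summed first inside a PE and shifted only once:

```python
def chunks2(x, n):
    """Split x into n 2-bit chunks (LSB first)."""
    return [(x >> (2 * i)) & 0x3 for i in range(n)]

def bitwise_summation_dot(inputs, weights, in_bits=4, w_bits=4):
    """Dot product with bitwise summation: one variable shift per PE
    instead of one per BitBrick (unsigned operands)."""
    n_i, n_w = in_bits // 2, w_bits // 2
    acc = 0
    for i in range(n_i):                # each (i, j) chunk-index pair maps to one PE
        for j in range(n_w):
            group_sum = sum(chunks2(x, n_i)[i] * chunks2(w, n_w)[j]
                            for x, w in zip(inputs, weights))   # intra-PE adder
            acc += group_sum << (2 * (i + j))                   # single shift per PE
    return acc

inputs, weights = [5, 9, 14], [3, 7, 11]
assert bitwise_summation_dot(inputs, weights) == sum(x * w for x, w in zip(inputs, weights))
```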
III. ANALYSIS ON VARIABLE BIT-PRECISION MAC ARRAYS
In this section, we analyze the precision-scalable MAC microarchitectures.
One-sided and two-sided flexible bit-width designs, the utilization of the submultipliers,
and the variable bit-shift logic are compared.
1. Under-utilization of Submultipliers
1) Two-sided Bit-width Scaling on One-sided Flexible Bit-Width Designs
Stripes and UNPU support bit-width flexibility only for either inputs or weights.
However, most recent quantized neural networks require bit-width scaling for
both inputs and weights. When low bit-widths are used for both operands,
a large portion of the multiplier logic remains idle. Fig. 6 shows an example of a
2x2 multiplication on the UNPU hardware. Since one operand of UNPU is expressed
in 16 bits, a 16-bit accumulation is repeated for 2 clock cycles to complete the 2x2 multiplication.
During the computation, 14 out of the 16 bit positions are not used, so a large part of the
MAC unit remains idle.
Fig. 6. Two-sided low-bit quantized neural network on one-sided flexible bit-width design[5]. 14 out of 16 AND gates are not used.
2) Performance Loss at Low-bit Precision
The subword-parallel multiplier proposed in Envision turns its submultiplier
blocks on or off depending on the target bit-width. In the case of a 16-bit multiplication (Fig. 3(a)), 16 out of 16 submultipliers are turned on. For an 8-bit multiplication (Fig. 3(b)), 4 out of 16 submultipliers are required per product; to perform two 8-bit operations in parallel,
8 submultipliers are used, and the other 8 submultipliers are idle. When a 4-bit
multiplication is computed (Fig. 3(c)), only 1 out of 16 submultipliers is required per product; to maximize the throughput of the
MAC unit, four 4-bit multiplications are performed in parallel, so 4 submultipliers
are used. At 16-bit multiplication, all the submultipliers are fully used. However,
only half of the submultipliers are utilized for 8-bit multiplication, and at 4-bit
operation only 1/4 of the submultipliers are used while the other 3/4
remain idle. As the bit-precision is scaled down, the subword-parallel multiplier of
Envision linearly loses throughput due to the under-utilization of the submultipliers.
3) Asymmetric Bit-width Between Operands
A limited set of input/weight precisions is supported in Envision. For example,
the bit-width of the inputs must be equal to the bit-width of the weights, such as 4(input)/4(weight)-bit,
8/8-bit, or 16/16-bit. However, the optimal bit-width varies depending on the target
accuracy of the neural network application. When the target neural network requires
8x4 multiplications (Fig. 7(a)), both the 8-bit and 4-bit operands are mapped to the 8x8 multiplication mode. The MAC performance
of the 8x4 multiplication is then equal to that of an 8x8 multiplication, which leads to under-utilization
of the submultipliers and a 2x performance degradation compared with the ideal case.
In the same manner, when a 16x4 multiplication is necessary (Fig. 7(b)), it is mapped to the 16x16 multiplication mode, which leads to a 4x under-utilization
of the arithmetic resources.
Fig. 7. Asymmetric bit-width between operands on subword-parallel MAC of Envision: (a) 8x4 MULs at 8x8 computation mode; (b) 16x4 MULs at 16x16 computation mode.
2. Logic Complexity of Bit-shift Logic
The fusion unit of the Bit Fusion architecture can handle 2-bit to 8-bit configurations
for both inputs and weights. To implement such a dynamically composable scheme, dynamic
bit-shift logic must be dedicated to each BitBrick. As a simple example, if 4 BitBricks
are included in a fusion unit and 4 fusion units are used to perform a dot product,
4 variable bit-shift blocks are required in each fusion unit and 16 shift blocks are
used in total. On the other hand, the BitBlade design groups the BitBricks with the same
variable-shift parameter from different input/weight pairs into a processing element.
By doing so, each processing element requires only 1 bit-shift block and 4 shift blocks
are used in total, which is only 1/4 of the Bit Fusion design.
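The shifter counts in this example can be made explicit with a small illustrative calculation (the function below is an author's sketch, not from either paper):

```python
def variable_shifters(bitbricks_per_unit, num_units, per_bitbrick):
    """Variable bit-shift blocks needed for one dot product: one per BitBrick
    (Bit Fusion style) or one per processing element / group (BitBlade style)."""
    return bitbricks_per_unit * num_units if per_bitbrick else num_units

# 4 BitBricks per unit and 4 units per dot product, as in the example above.
assert variable_shifters(4, 4, per_bitbrick=True) == 16   # Bit Fusion
assert variable_shifters(4, 4, per_bitbrick=False) == 4   # BitBlade
```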
IV. EXPERIMENTAL RESULTS
Simulation Setup: We compare the variable-bit MAC microarchitectures in this section.
For a fair comparison, we fixed the bit-width of the submultipliers to 2-bit. We assumed
that 16384 dot product units (= 4096 2-bit submultipliers) were used in the designs.
All the microarchitectures were synthesized using a 28 nm standard cell library targeting
a clock frequency of 500 MHz. We did not consider voltage scaling on the subword-parallel
MAC array. For the evaluation (Fig. 8), we first extracted the area and power consumption of each design. Depending on
the bit-precision, the MAC array consumes different switching power; therefore, we
performed the power simulation for all the bit-width modes and stored the results
in a look-up table (LUT). Our simulator can read PyTorch-based [1] model definitions, and hence we directly utilized the model definition classes for
the analysis. For the low-bit quantization models, the first and last layers still
used 8-bit precision, and the remaining layers used low bit-widths.
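The evaluation flow can be pictured with the hypothetical sketch below; the power values, function name, and parameters are placeholders for illustration only, not the actual simulator. Per-precision power values obtained from the power simulation are stored in a LUT and combined with each layer's cycle count:

```python
# Placeholder power LUT indexed by (input bits, weight bits); values are illustrative.
POWER_MW = {(8, 8): 120.0, (4, 4): 70.0, (2, 2): 45.0}

def layer_energy_uj(num_macs, in_bits, w_bits, macs_per_cycle, freq_hz=500e6):
    """Energy of one layer in microjoules: cycles needed for its MACs times
    the power of the selected bit-width mode (hypothetical helper)."""
    cycles = num_macs / macs_per_cycle
    runtime_s = cycles / freq_hz
    return POWER_MW[(in_bits, w_bits)] * 1e-3 * runtime_s * 1e6  # mW -> W, J -> uJ
```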
We performed the analysis using the weight-stationary dataflow. Depending on the dataflow,
loops over tiled matrices show different performance. However, we focused on the MAC array
microarchitecture, which is orthogonal to the dataflow, so we did not use other
dataflows in this work. Both Stripes and UNPU target one-sided bit-width
flexibility, so we analyzed only the Stripes design.
Area: Fig. 9 shows the area comparison between the variable-bit MAC microarchitectures. Envision and
Bit Fusion show a large area for the bit-shift and accumulation logic needed to implement variable-bit
MAC units. Envision supports a smaller number of bit-width modes, but its subword-parallel
MAC scheme leads to a larger area for accumulators. BitBlade introduced the bitwise summation
scheme, thereby reducing the number of bit-shift circuits per processing element. Meanwhile,
Stripes uses a bit-serial computing method that is typically adopted in area-constrained
small chip designs. Stripes shows the smallest logic area, but it cannot achieve the
maximum performance due to its one-sided bit-width flexibility, which is discussed
in the throughput and energy analysis.
Energy Consumption: Fig. 10 shows the energy consumption of the MAC designs. To handle the variable-bit cases,
the shift-add-accumulate logic accounts for the largest part of the energy consumption. The optimized
versions of Bit Fusion and BitBlade (BitFusion_opt and BitBlade_opt) reduce the switching
power of the unused input buffers at high precisions by gating the clock signals, and they
also show reduced energy consumption in the low-bit (2-bit) mode because 8-bit precision
is still used in the first and last layers. The reconfigurable logic of BitBlade
is much smaller than that of Bit Fusion thanks to the bitwise summation scheme. Stripes
achieves energy-efficiency comparable to BitBlade in the 8-bit mode because its
one-sided flexibility keeps the reconfigurable logic light, but it becomes energy-inefficient
at low bit-precisions (especially at low weight bit-widths) because it always operates
in the 8-bit mode for the weights.
Throughput: Fig. 11 compares the throughput/area of the MAC units. The optimized versions of
Bit Fusion and BitBlade (BitFusion_opt and BitBlade_opt) do not show any improvement
in throughput over the original Bit Fusion and BitBlade designs, because the clock
gating technique does not affect throughput/area efficiency. BitBlade shows higher
throughput/area than Bit Fusion and Envision. Stripes supports variable bit-precision
only for inputs, so it cannot maximize performance when low bit-widths are used for
the weights. Envision suffers from processing-element-level under-utilization at low precisions,
so its throughput/area is lower than that of the other schemes. Bit-serial computing
with one-sided bit-flexibility shows energy efficiency similar to BitBlade_opt at
extremely asymmetric bit-widths, but its performance degrades in the other modes because
it cannot support low weight bits, so the MAC units always operate in the 8-bit
weight mode.
Selection of Microarchitecture: When a chip has to be designed under a very tight area
constraint (Fig. 9), Stripes can be an attractive solution, with a 27-57% smaller area than the other microarchitectures
thanks to its bit-serial computation. Furthermore, in the extremely asymmetric bit-width
case (2x8b), Stripes shows higher throughput/area (1.37-4.46x, Fig. 11) than the others. In terms of energy consumption (Fig. 10), Stripes at 2x8b outperforms the other microarchitectures by 14-83%, but it is only comparable
to BitBlade_opt. On the other hand, BitBlade shows the highest performance on typical
workloads due to the light circuit overhead of the variable-shift logic enabled by
the bitwise summation method.
Fig. 8. Experimental Setup.
Fig. 9. Area comparison between variable-bit MAC microarchitectures.
Fig. 10. Energy consumption of variable-bit MACs. Symmetric bit-width cases (left) and asymmetric bit-width cases (right).
Fig. 11. Comparison of throughput/area. Symmetric bit-width cases (top) and asymmetric bit-width cases (bottom).
V. CONCLUSION
In this paper, we reviewed and analyzed various variable bit-precision MAC units,
including the subword-parallel scheme and one-/two-sided flexible bit-width designs. These
designs had been implemented under different experimental conditions, making a direct
comparison of the microarchitectures difficult. We synthesized the MAC designs under the
same design conditions and constraints, and analyzed their area, effective throughput, and
energy consumption. Our main contribution is to help researchers choose the most suitable
microarchitecture for various design conditions.
ACKNOWLEDGMENTS
This work was supported by the Soongsil University Research Fund (New Professor
Support Research) of 2021 (100%). The EDA tool was supported by the IC Design Education
Center (IDEC), Korea.
References
[1] PyTorch
[2] TensorFlow
[3] Ibrahim E. M., et al., 2022, Taxonomy and benchmarking of precision-scalable MAC arrays under enhanced DNN dataflow representation, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 69, No. 5, pp. 2013-2024.
[4] Judd P., et al., 2016, Stripes: Bit-serial deep neural network computing, in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, pp. 1-12.
[5] Lee J., et al., 2018, UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision, IEEE Journal of Solid-State Circuits, Vol. 54, No. 1, pp. 173-185.
[6] Moons B., et al., 2017, Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI, in 2017 IEEE International Solid-State Circuits Conference (ISSCC). IEEE, pp. 246-247.
[7] Sharma H., et al., 2018, Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural networks, in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). IEEE, pp. 764-775.
[8] Ryu S., et al., 2019, BitBlade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation, in Proceedings of the 56th Annual Design Automation Conference, pp. 1-6.
Sungju Ryu is an assistant professor at Soongsil University, Seoul, Korea. He was
a Staff Researcher at the Samsung Advanced Institute of Technology (SAIT), where he focused
on high-performance computer architecture design. He received the B.S. degree
from Pusan National University in 2015, and the Ph.D. degree from POSTECH in 2021.
His current research interests include energy-efficient hardware accelerators for
deep neural networks.