Sungju Ryu

(Sungju Ryu is with the School of Electronic Engineering, Soongsil University, Seoul, Korea)
                        
 
               
             
            
            
            
            
            
            
            
               
                  
Index Terms

Hardware accelerator, MAC unit, neural processing unit, quantized neural networks, variable bit-precision
             
            
          
         
            
                  I. INTRODUCTION
The model complexity of neural networks has been increasing rapidly to meet the target accuracy of neural network applications. However, edge devices usually have limited computing capability due to power constraints, so meeting the real-time latency targets required by modern complex network models is challenging.
               
Various methods for making deep neural networks compact have been explored to ease the burden of real-time computing, including quantization, weight pruning, and separable convolution. Among them, quantization makes deep neural networks lighter by expressing the inputs and weights in lower bit-widths. A disadvantage of approximating the neural network parameters and activations via quantization is the loss of inference accuracy. As a result, many quantization techniques have been suggested to reduce the error from the approximation.
               
Recently, open-source machine learning frameworks such as PyTorch [1] and TensorFlow [2] started to provide quantization APIs that make the quantization of neural networks easier for researchers, thereby reducing service development time. As a result, quantization as a method to compress neural networks is becoming more popular.
               
Meanwhile, mapping quantized neural networks onto conventional fixed bit-width hardware cannot maximize computational efficiency. For example, an 8-bit input/weight multiplication on a 32-bit multiplier circuit has the same throughput and similar energy-efficiency as a 32-bit input/weight multiplication. To maximize the performance of quantized neural networks, previous works have proposed variable-bit multiply-accumulate (MAC) units. However, such variable-bit MAC microarchitectures have been implemented under different experimental conditions, so it is difficult to select the most suitable scheme for a target design space. Previous work [3] studied precision-scalable MACs, but it did not evaluate them on real benchmarks; only ideal workloads were used in the simulation.
               
Our contributions to analyzing and comparing these variable-bit MAC units are as follows.
1) We review the variable bit-precision MAC microarchitectures. Subword-parallel and one-/two-sided bit-width flexible MAC arrays are studied.

2) We synthesize the MAC arrays using 28-nm standard library cells. Area, energy consumption, and throughput are analyzed using real neural network benchmarks.
               
             
            
                  II. REVIEW OF PRECISION-SCALABLE MAC MICROARCHITECTURES
               
                     1. One-sided Flexible Bit-width Designs
                  
                        1) Stripes
In neural networks, the required bit-precision of neurons varies across layers. The main concept of Stripes [4] is that performance can be improved linearly if the computation time is scaled according to the bit-width of the neurons. Fig. 1(a) shows the baseline fixed bit-width MAC array. In the baseline design, inputs and weights are first stored in the input/weight buffers. Inputs are multiplied by weights after being loaded from the buffers. Then, the partial sums are added and accumulated until an output number is constructed. The Stripes accelerator proposed a serial inner product (SIP) unit which includes input/weight buffers, AND gates, an adder tree, an accumulator, and bit-shift logic. Considering that multipliers usually generate partial products using AND gates, the SIP multiplies the weights by input bits using the AND logic. Fig. 1(b) shows a 2-bit multiplication example. First, the 2-bit weights are AND-ed with the LSBs of the 2-bit input numbers. The two partial sums are added and accumulated in the buffer. Second, the 2-bit weights are AND-ed with the second bits of the 2-bit inputs. After the AND and accumulate operations, the SIP unit finishes the 2-bit dot product computation. The numerical result is exactly the same as in the baseline. If the SIP has the same number of AND gates as the baseline multiplier, the throughput of the SIP becomes the same as that of the baseline inner product unit. For example, if the baseline inner product unit includes two 2-bit multipliers, the SIP can have 8 AND gates, thereby achieving the same throughput as the baseline design.
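For clarity, the following short Python sketch models the SIP computation functionally rather than at the register-transfer level: one input bit-plane is AND-ed with the weights per cycle, and the gated partial sums are shift-accumulated. It assumes unsigned operands; the function and variable names are illustrative and are not taken from the Stripes design.

# Functional sketch (not cycle-accurate) of the bit-serial inner product in Stripes [4].
# Assumes unsigned inputs/weights; names are illustrative.
def stripes_sip_dot(inputs, weights, input_bits):
    acc = 0
    for b in range(input_bits):                               # one clock cycle per input bit
        bit_plane = [(x >> b) & 1 for x in inputs]            # current input bit-plane
        partial = sum(w * bit for w, bit in zip(weights, bit_plane))  # AND gates + adder tree
        acc += partial << b                                   # bit-shift logic + accumulator
    return acc

# 2-bit example corresponding to Fig. 1(b): finished in 2 cycles.
inputs, weights = [2, 3], [1, 3]
assert stripes_sip_dot(inputs, weights, input_bits=2) == sum(x * w for x, w in zip(inputs, weights))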
                     
                     
Fig. 1. (a) Baseline fixed bit-width MAC unit; (b) serial inner product unit of Stripes [4].
 
                   
                  
                        2) UNPU
The processing engine in UNPU [5] deals with fully variable weight bits from 1- to 16-bit precision (Fig. 2). An input number is stored in the buffer and AND-ed with weight bits for W clock cycles (W: number of weight bits). After the processing engine finishes the multiplications between input/weight pairs, the results are sent to the adder/subtractor tree. Furthermore, lookup table (LUT)-based bit-serial computation is adopted for energy-efficient matrix multiplication. Possible partial products are pre-stored in the partial product table. If the same bit-pattern is repeated, the partial product is simply fetched from the table, thereby maximizing energy-efficiency.
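As a rough illustration of this table-based reuse, the sketch below pre-stores every possible partial sum of a small input group and indexes it with the weight bit-pattern of each cycle. It assumes unsigned operands and a 4-input group; the names are illustrative and do not reflect the actual UNPU datapath.

# Sketch of LUT-based bit-serial computation in the spirit of UNPU [5].
# Assumes unsigned operands and a small input group; names are illustrative.
def lut_bit_serial_dot(inputs, weights, weight_bits):
    n = len(inputs)
    # Pre-store every possible partial product: sum of the inputs selected by a bit-pattern.
    table = [sum(x for i, x in enumerate(inputs) if (pattern >> i) & 1)
             for pattern in range(1 << n)]
    acc = 0
    for b in range(weight_bits):                              # one cycle per weight bit
        pattern = sum(((w >> b) & 1) << i for i, w in enumerate(weights))
        acc += table[pattern] << b                            # fetch from table, shift, accumulate
    return acc

inputs, weights = [5, 1, 3, 2], [2, 3, 1, 2]
assert lut_bit_serial_dot(inputs, weights, weight_bits=2) == sum(x * w for x, w in zip(inputs, weights))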
                     
                     
Fig. 2. Processing engine with fully variable weight bits in UNPU [5].
 
                   
                
               
                     2. Two-sided Flexible Bit-width Designs
                  
                        1) Envision
Envision [6] introduced a subword-parallel MAC design scheme. The MAC unit consists of 16 submultipliers. In the high bit-precision mode (16-bit, Fig. 3(a)), all the submultipliers are turned on to construct the high-bit multiplication result. On the other hand, when targeting low bit-widths, some submultipliers are turned off by masking the input signals of part of the MAC unit. To improve the throughput and energy-efficiency of the MAC, the scalable arithmetic unit reuses the inactive submultiplier cells. In the 8-bit precision mode (Fig. 3(b)), 4 4x4 submultipliers are used for one 8-bit multiplication. In this case, 2 8-bit multiplications are performed in parallel, so 8 out of 16 submultipliers are used in total. Moreover, when targeting 4-bit precision (Fig. 3(c)), only 1 4x4 submultiplier is used per multiplication, and 4 4-bit multiplications are handled at the same time. Hence, 4 out of 16 submultipliers are used in this case.

When the bit-width is scaled down, the critical-path delay is shortened (Fig. 3(d)). By combining the subword-parallel MAC microarchitecture with voltage scaling, the precision-scaled arithmetic blocks show much higher energy-efficiency while maintaining the same throughput as the high bit-precision mode.
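A small behavioural sketch of the subword-parallel idea is given below: a wide product is composed from 4x4-bit sub-products, so the number of busy submultipliers per product grows quadratically with the operand width. The decomposition is generic; the names and the 4-bit subword size are assumptions for illustration, not the actual Envision datapath.

# Behavioural sketch of subword-parallel multiplication as in Envision [6].
# Assumes unsigned operands and 4-bit submultipliers; names are illustrative.
def subword_multiply(a, b, op_bits, sub_bits=4):
    digits = op_bits // sub_bits
    mask = (1 << sub_bits) - 1
    a_parts = [(a >> (sub_bits * i)) & mask for i in range(digits)]
    b_parts = [(b >> (sub_bits * j)) & mask for j in range(digits)]
    product, used_submuls = 0, 0
    for i, ai in enumerate(a_parts):
        for j, bj in enumerate(b_parts):
            product += (ai * bj) << (sub_bits * (i + j))      # one 4x4 submultiplier per pair
            used_submuls += 1
    return product, used_submuls

p, used = subword_multiply(40000, 51234, op_bits=16)   # 16-bit mode: 16 submultipliers busy
assert p == 40000 * 51234 and used == 16
p, used = subword_multiply(200, 97, op_bits=8)         # 8-bit mode: 4 submultipliers per product
assert p == 200 * 97 and used == 4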
                     
                     
Fig. 3. Subword-parallel MAC engine proposed in Envision [6]: (a) 16-bit; (b) 8-bit; (c) 4-bit multiplication modes; (d) critical paths at different bit-precision modes.
 
                   
                  
                        2) Bit Fusion
Bit Fusion [7] proposed a bit-level dynamically composable MAC unit called a fusion unit (Fig. 4(a)). Bit Fusion performs 2-dimensional physical grouping of its submultipliers, called BitBricks. The grouped BitBricks become a fused processing engine (fused-PE) that executes a multiplication with the required bit-width. Depending on the target bit-precision, the fusion unit can have various numbers of fused-PEs. When an 8x8 multiplication is performed (Fig. 4(b)), all the BitBricks in the fusion unit constitute 1 fused-PE. For an 8x4 multiplication (Fig. 4(c)), 8 BitBricks are required. Considering that a fusion unit consists of 16 BitBricks, 2 8x4 multiplications are performed in parallel with one fusion unit. In the case of a 2x2 multiplication (Fig. 4(d)), only 1 BitBrick is used for each multiplication, so 16 2x2 multiplications are computed in a clock cycle using the fusion unit. After the 2-bit multiplications using the BitBricks, the partial multiplication results are shifted depending on the target bit-precision. For example, to construct an 8x8 multiplication, the 2-bit multiplication results from the 16 BitBricks are shifted by 0 to 12 bits depending on the bit position. In the same manner, for an 8x4 multiplication, the outputs from the 8 BitBricks are shifted by 0 to 8 bits. However, no shift operations are performed in a 2x2 multiplication, because a BitBrick can fully express the 2-bit multiplication by itself. Once the shift operations are finished, the results are added through the adder tree to complete the dot product computation.
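The following behavioural sketch decomposes one multiplication into 2-bit BitBrick products, each with its own shift amount before the adder tree, which is the point the next subsection builds on. It assumes unsigned operands; the names are illustrative.

# Behavioural sketch of a fusion-unit multiplication in Bit Fusion [7].
# Assumes unsigned operands; names are illustrative.
BRICK_BITS = 2

def bitbrick_products(x, w, x_bits, w_bits):
    """Return one (partial_product, shift) pair per BitBrick used for x * w."""
    pairs = []
    for i in range(x_bits // BRICK_BITS):
        for j in range(w_bits // BRICK_BITS):
            xi = (x >> (BRICK_BITS * i)) & 0b11
            wj = (w >> (BRICK_BITS * j)) & 0b11
            pairs.append((xi * wj, BRICK_BITS * (i + j)))     # per-BitBrick shift amount
    return pairs

def fused_multiply(x, w, x_bits, w_bits):
    return sum(p << s for p, s in bitbrick_products(x, w, x_bits, w_bits))

assert fused_multiply(173, 201, 8, 8) == 173 * 201            # 8x8: 16 BitBricks, shifts 0..12
assert fused_multiply(173, 9, 8, 4) == 173 * 9                # 8x4: 8 BitBricks, shifts 0..8
assert len(bitbrick_products(173, 201, 8, 8)) == 16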
                     
                     
Fig. 4. (a) Dynamically composable fusion unit of Bit Fusion [7]; (b) 8x8 multiplication; (c) 8x4 multiplications (2x parallelism); (d) 2x2 multiplications (16x parallelism).
 
                   
                  
                        3) BitBlade
To enable bit-precision flexibility in the Bit Fusion architecture, each BitBrick in the fusion unit requires dedicated variable bit-shift logic. However, the variable bit-shift logic leads to a large area overhead. To mitigate the logic complexity, the BitBlade [8] architecture proposed a bitwise summation method. When a dot product computation is performed, the inputs and weights are first divided into 2-bit numbers. The divided 2-bit input/weight pairs with the same index position from the different input/weight numbers are grouped. The grouped input/weight pairs always share the same bit-shift parameter. When a processing element is dedicated to one group, each processing element needs only 1 variable shift block. As a result, the area overhead to realize the variable-bit MAC unit is largely mitigated compared to the Bit Fusion architecture, where each BitBrick requires its own shift logic.
                     
Fig. 5(a) and (b) illustrate how the bitwise summation method works. For a simple description, a PE includes 4 BitBricks. In the 4x4 case (Fig. 5(a)), the 4-bit numbers are divided into 2-bit partial numbers. The 2-bit partial numbers from the same index position of the different input/weight numbers are grouped and located at the same PE. Then, the 2-bit partial inputs are multiplied by the 2-bit partial weight numbers. The multiplication results are added using the intra-PE adder. The added numbers are shifted depending on the bit positions in each PE, and they form the dot product result. Considering that 16 BitBricks are used for 4 PEs in the example, 4 4x4 multiplications are performed in parallel. In the same manner, the PE array achieves 8x parallelism in the 4x2 multiplication mode.
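A self-contained sketch of the bitwise summation is shown below: across all input/weight pairs of a dot product, the 2-bit sub-products from the same digit-index position are accumulated first and shifted only once per group, so one variable shifter per PE suffices. Unsigned operands and illustrative names are assumed.

# Behavioural sketch of the bitwise summation scheme in BitBlade [8].
# Assumes unsigned operands; names are illustrative.
from collections import defaultdict

BRICK = 2  # BitBrick operand width

def bitblade_dot(inputs, weights, x_bits, w_bits):
    pe = defaultdict(int)                       # (i, j) digit position -> one PE accumulator
    for x, w in zip(inputs, weights):
        for i in range(x_bits // BRICK):
            for j in range(w_bits // BRICK):
                xi = (x >> (BRICK * i)) & 0b11
                wj = (w >> (BRICK * j)) & 0b11
                pe[(i, j)] += xi * wj           # intra-PE adder: no shift yet
    # One variable shifter per PE (group) instead of one per BitBrick.
    return sum(acc << (BRICK * (i + j)) for (i, j), acc in pe.items())

inputs, weights = [13, 7, 2, 9], [5, 11, 14, 3]
assert bitblade_dot(inputs, weights, x_bits=4, w_bits=4) == sum(x * w for x, w in zip(inputs, weights))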
                     
                     
Fig. 5. Bitwise summation scheme proposed in BitBlade [8]. For a simple explanation, it is assumed that a PE consists of 4 BitBricks. Examples of (a) 4x4 multiplication; (b) 4x2 multiplication.
 
                   
                
             
            
                  III. ANALYSIS ON VARIABLE BIT-PRECISION MAC ARRAYS
In this section, we analyze the precision-scalable MAC microarchitectures. One-sided and two-sided flexible bit-width designs, the utilization of the submultipliers, and the variable bit-shift logic are compared.
               
               
                     1. Under-utilization of Submultipliers
                  
                        1) Two-sided Bit-width Scaling on One-sided Flexible Bit-Width Designs
Stripes and UNPU support bit-width flexibility for only one of the operands (either inputs or weights). However, most recent quantized neural networks require bit-width scaling for both inputs and weights. When low bit-widths are used for both operands, a large portion of the multiplier logic remains idle. Fig. 6 shows an example of a 2x2 multiplication on the UNPU hardware. Considering that one operand of the UNPU is expressed in 16 bits, the 16-bit accumulation is repeated for 2 clock cycles for the 2x2 multiplication. During the computation, 14 out of 16 bit positions are not used, so a large part of the MAC unit remains idle.
                     
                     
Fig. 6. Two-sided low-bit quantized neural network on a one-sided flexible bit-width design [5]. 14 out of 16 AND gates are not used.
 
                   
                  
                        2) Performance Loss at Low-bit Precision
The subword-parallel multiplier proposed in Envision turns its submultiplier blocks on or off depending on the target bit-width. For a 16-bit multiplication (Fig. 3(a)), 16 out of 16 submultipliers are turned on. For an 8-bit multiplication (Fig. 3(b)), 4 out of 16 submultipliers are required; to perform 2 8-bit operations in parallel, 8 submultipliers are used and the other 8 remain idle. When a 4-bit multiplication is computed (Fig. 3(c)), only 1 out of 16 submultipliers is required. To maximize the throughput of the MAC unit, 4 4-bit multiplications are performed in parallel, hence 4 submultipliers are used. At 16-bit precision, all the submultipliers are fully used. However, only half of the submultipliers are utilized at 8-bit precision, and at 4-bit precision, only 1/4 of the submultipliers are used while the other 3/4 remain idle. As the bit-precision is scaled down, the subword-parallel multiplier of Envision linearly loses throughput due to the under-utilization of the submultipliers.
                     
                   
                  
                        3) Asymmetric Bit-width Between Operands
Envision supports only a limited set of input/weight precisions. For example, the bit-width of the inputs must be equal to the bit-width of the weights, such as 4-bit (input)/4-bit (weight), 8/8-bit, or 16/16-bit. However, the optimal bit-width varies depending on the target accuracy of the neural network application. When the target neural network requires 8x4 multiplications (Fig. 7(a)), both the 8-bit and 4-bit operands are mapped to the 8x8 multiplication mode. The MAC performance of the 8x4 multiplication is then equal to that of the 8x8 multiplication, which leads to under-utilization of the submultipliers and a 2x performance degradation compared with the ideal case. In the same manner, when a 16x4 multiplication is necessary (Fig. 7(b)), it is mapped to the 16x16 multiplication mode, which leads to 4x under-utilization of the arithmetic resources.
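These degradation factors follow directly from the mode mapping; a small illustrative calculation (not taken from the paper's simulator, with assumed names) is shown below.

# Illustrative calculation of the under-utilization caused by symmetric-only modes.
# SYMMETRIC_MODES and the function names are assumptions for illustration.
SYMMETRIC_MODES = (4, 8, 16)                    # Envision-style: input bits == weight bits

def mapped_mode(in_bits, w_bits):
    need = max(in_bits, w_bits)
    return min(m for m in SYMMETRIC_MODES if m >= need)

def underutilization(in_bits, w_bits):
    m = mapped_mode(in_bits, w_bits)
    return (m * m) / (in_bits * w_bits)         # mapped work / ideal work

print(underutilization(8, 4))    # 2.0x degradation, as in Fig. 7(a)
print(underutilization(16, 4))   # 4.0x degradation, as in Fig. 7(b)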
                     
                     
                           Fig. 7. Asymmetric bit-width between operands on subword-parallel MAC of Envision: (a) 8x4 MULs at 8x8 computation mode; (b) 16x4 MULs at 16x16 computation mode.
 
                   
                
               
                     2. Logic Complexity of Bit-shift Logic
The fusion unit of the Bit Fusion architecture can handle 2-bit to 8-bit configurations for both inputs and weights. To implement such a dynamically composable scheme, dynamic bit-shift logic must be dedicated to each BitBrick. As a simple example, if 4 BitBricks are included in a fusion unit and 4 fusion units are used to perform a dot product, 4 variable bit-shift blocks are required for each fusion unit and 16 shift blocks are used in total. On the other hand, the BitBlade design groups the BitBricks with the same variable-shift parameter from different input/weight pairs into a processing element. By doing so, each processing element requires 1 bit-shift block, and 4 shift blocks are used in total, which is only 1/4 of the Bit Fusion design.
                  
                
             
            
                  IV. EXPERIMENTAL RESULTS
Simulation Setup: We compare the variable-bit MAC microarchitectures in this section. For a fair comparison, we fixed the bit-width of the submultipliers to 2-bit. We assumed that 16384 dot product units (=4096 2-bit submultipliers) were used in the designs. All the microarchitectures were synthesized using a 28-nm standard cell library targeting a clock frequency of 500 MHz. We did not consider voltage scaling on the subword-parallel MAC array. For the evaluation (Fig. 8), we first extracted the area and power consumption of each design. Depending on the bit-precision, the MAC array consumes different switching power; therefore, we performed the power simulation for all the bit-width modes, and the results were stored in a look-up table (LUT). Our simulator can read PyTorch-based [1] model definitions, so we directly utilized the model definition classes for the analysis. For the low-bit quantization models, the first and the last layers still used 8-bit, and the remaining layers were applied with low bit-widths.
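A minimal sketch of this evaluation flow, under assumed names, is shown below: the per-mode energy values in energy_lut are placeholders rather than the measured numbers of this work, the model is walked with forward hooks to count MACs per layer, and the first and last compute layers keep 8-bit while the rest use the low-bit mode.

# Sketch of the LUT-driven energy estimation flow; energy_lut values are
# hypothetical placeholders (pJ/MAC), not the measured results of this work.
import torch
import torch.nn as nn
import torchvision.models as models

energy_lut = {(8, 8): 1.00, (4, 4): 0.35, (2, 2): 0.15, (2, 8): 0.55}

def layer_macs(model, input_shape=(1, 3, 224, 224)):
    """Count MACs of every Conv2d/Linear layer using forward hooks."""
    macs, hooks = [], []
    def hook(module, inputs, output):
        if isinstance(module, nn.Conv2d):
            per_out = (module.kernel_size[0] * module.kernel_size[1]
                       * module.in_channels // module.groups)
            macs.append(per_out * output.numel())
        elif isinstance(module, nn.Linear):
            macs.append(module.in_features * output.numel())
    for m in model.modules():
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            hooks.append(m.register_forward_hook(hook))
    model.eval()
    with torch.no_grad():
        model(torch.zeros(input_shape))
    for h in hooks:
        h.remove()
    return macs

def model_energy(model, low_mode=(2, 2)):
    macs = layer_macs(model)
    total = 0.0
    for idx, m in enumerate(macs):
        mode = (8, 8) if idx in (0, len(macs) - 1) else low_mode  # first/last layers stay 8-bit
        total += m * energy_lut[mode]
    return total  # pJ

print(model_energy(models.resnet18(weights=None), low_mode=(2, 2)))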
               
We performed the analysis using a weight-stationary dataflow. Depending on the dataflow, loops with tiled matrices show different performance. However, we focused on the MAC array microarchitecture, which is orthogonal to the dataflow, so we did not use other dataflows in this work. Both Stripes and UNPU target one-sided bit-width flexibility, so we analyzed only the Stripes design.
               
Area: Fig. 9 shows the area comparison between the variable-bit MAC microarchitectures. Envision and Bit Fusion show a large area for the bit-shift and accumulation logic needed to implement variable-bit MAC units. Envision supports a smaller number of bit-width modes, but its subword-parallel MAC scheme leads to a larger area for accumulators. BitBlade introduced the bitwise summation scheme, thereby reducing the number of bit-shift circuits per processing element. Meanwhile, Stripes used a bit-serial computing method which is typically adopted in area-constrained small chip designs. Stripes shows the smallest logic area, but it cannot achieve the maximum performance due to its one-sided bit-width flexibility, which will be discussed in the throughput and energy analysis.
               
Energy Consumption: Fig. 10 shows the energy consumption of the MAC designs. To deal with the variable-bit cases, the shift-add-accumulate logic accounts for the largest part of the energy consumption. The optimized versions of Bit Fusion and BitBlade (BitFusion_opt and BitBlade_opt) reduce the switching power of the unused input buffers at high precisions by gating clock signals, and they also show reduced energy consumption in the low-bit (2-bit) mode because we still used 8-bit precision in the first and last layers. The reconfigurable logic of BitBlade is much smaller than that of Bit Fusion thanks to the bitwise summation scheme. Stripes achieves energy-efficiency comparable to BitBlade in the 8-bit mode because it has light reconfigurable logic due to its one-sided flexibility, but it shows energy-inefficiency at low bit-precisions (especially at low weight bit-widths) because it always operates in the 8-bit mode for the weights.
               
Throughput: Fig. 11 shows the comparison of throughput/area between the MAC units. The optimized versions of Bit Fusion and BitBlade (BitFusion_opt and BitBlade_opt) do not show any throughput improvement over the original Bit Fusion and BitBlade designs, because the clock gating technique is not related to throughput/area efficiency. BitBlade shows higher throughput/area than Bit Fusion and Envision. Stripes supports variable bit-precision only for inputs, so it cannot maximize performance in the low-bit cases for weights. Envision suffers processing element-level under-utilization at low precisions, and thereby its throughput/area is smaller than that of the other schemes. Bit-serial computing with one-sided bit-flexibility shows energy efficiency similar to BitBlade_opt at extremely asymmetric bit-widths, but the performance degrades in other modes because it cannot support low weight bits, so the MAC units always operate in the 8-bit weight mode.
               
Selection of Microarchitecture: When a chip has to be designed under a very tight area constraint (Fig. 9), Stripes can be an attractive solution with a smaller (27-57%) area than the other microarchitectures thanks to the bit-serial computation. Furthermore, in the extremely asymmetric bit-width case (2x8b), Stripes shows higher throughput/area (1.37-4.46x, Fig. 11) than the others. In terms of energy consumption (Fig. 10), Stripes at 2x8b outperforms the other microarchitectures by 14-83%, but it is comparable to BitBlade_opt. On the other hand, BitBlade shows the highest performance for the usual workloads due to the light circuit overhead of the variable-shift logic enabled by the bitwise summation method.
               
               
                     Fig. 8. Experimental Setup.
 
               
                     Fig. 9. Area comparison between variable-bit MAC microarchitectures.
 
               
                     Fig. 10. Energy consumption of variable-bit MACs. Symmetric bit-width cases (left) and asymmetric bit-width cases (right).
 
               
                     Fig. 11. Comparison of throughput/area. Symmetric bit-width cases (up) and asymmetric bit-width cases (down).
 
             
            
                  V. CONCLUSION
               
In this paper, we reviewed and analyzed various variable bit-precision MAC units, including the subword-parallel scheme and one-/two-sided flexible bit-width designs. These designs had been implemented under different experimental conditions, making a direct comparison of the microarchitectures difficult. We synthesized the MAC designs under the same design conditions and constraints, and analyzed the area, effective throughput, and energy consumption. Our main contribution is to help researchers choose the most suitable microarchitecture for their design conditions.
               
               
               		
            
 
          
         
            
                  ACKNOWLEDGMENTS
               
This work was supported by the Soongsil University Research Fund (New Professor Support Research) of 2021 (100%). The EDA tool was supported by the IC Design Education Center (IDEC), Korea.
                  			
               
             
            
                  
                     References
                  
                     
                        
[1] PyTorch.

[2] TensorFlow.

[3] E. M. Ibrahim et al., "Taxonomy and benchmarking of precision-scalable MAC arrays under enhanced DNN dataflow representation," IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 69, no. 5, pp. 2013-2024, 2022.

[4] P. Judd et al., "Stripes: Bit-serial deep neural network computing," in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), IEEE, 2016, pp. 1-12.

[5] J. Lee et al., "UNPU: An energy-efficient deep neural network accelerator with fully variable weight bit precision," IEEE Journal of Solid-State Circuits, vol. 54, no. 1, pp. 173-185, 2018.

[6] B. Moons et al., "Envision: A 0.26-to-10 TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable convolutional neural network processor in 28 nm FDSOI," in 2017 IEEE International Solid-State Circuits Conference (ISSCC), IEEE, 2017, pp. 246-247.

[7] H. Sharma et al., "Bit Fusion: Bit-level dynamically composable architecture for accelerating deep neural network," in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), IEEE, 2018, pp. 764-775.

[8] S. Ryu et al., "BitBlade: Area and energy-efficient precision-scalable neural network accelerator with bitwise summation," in Proceedings of the 56th Annual Design Automation Conference, 2019, pp. 1-6.

 
                      
                   
                
             
            
            
Sungju Ryu is an assistant professor at Soongsil University, Seoul, Korea. He was a Staff Researcher at the Samsung Advanced Institute of Technology (SAIT), where he focused on high-performance computer architecture design. He received the B.S. degree from Pusan National University in 2015 and the Ph.D. degree from POSTECH in 2021. His current research interests include energy-efficient hardware accelerators for deep neural networks.