Seok Hyelin1
Seo Hyoju1
Lee Jungwon1
Kim Yongtae1*
-
(School of Computer Science and Engineering, Kyungpook National University, Daegu, Korea)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
Approximate computing, Approximate multiplier, Approximate compressor, 4-2 compressor, Design optimization, Low-cost
1. Introduction
As battery-based devices have become more common in recent years, the volume of data
that needs to be handled has also been growing quickly. This has led to an increase
in the energy consumption of computing devices, but the capacity of batteries is limited.
Hence, energy efficiency becomes the primary design consideration. Approximate computing
can offer energy efficiency by sacrificing computation accuracy [1]. The goal of approximate computing is to reduce hardware costs, such as area, latency,
and power, while maintaining adequate accuracy. Because this technique degrades
accuracy and processing quality, it is suited to error-tolerant applications.
Many of these applications typically involve human or application resilience [2]. For example, even if there is some noise or pixel loss in an image, humans may not
detect the errors because the brain reconstructs the image to be as close as possible to
the original. Taking advantage of this fact, we can make resource-saving
hardware designs by sacrificing accuracy [3].
Approximate computing techniques are applicable to basic arithmetic circuits such as adders and
multipliers. One of the most efficient ways to design approximate arithmetic circuits
is to split them into two parts: an accurate part and an inaccurate part. Higher-order bits are calculated
in the accurate part because errors in them have a more significant impact
on the results. In an approximate adder, for example, the accurate part is implemented
by a conventional exact adder, such as a ripple carry adder (RCA) or a carry lookahead
adder (CLA), and many approximation techniques for the lower-order part have been
presented in the literature [16-23].
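The accurate/inaccurate split can be illustrated with a minimal Python sketch. It assumes one simple choice for the inaccurate part, a carry-free bitwise OR of the low-order bits; this is only one of many possible lower-part approximations and is not claimed to match any specific design in [16-23]. The names `split_adder`, `error_distance`, and the parameter `k` (width of the inaccurate part) are hypothetical.

```python
def split_adder(a: int, b: int, k: int = 4) -> int:
    """Approximate a + b for non-negative integers.

    Assumption for illustration: the k low-order bits are approximated by a
    bitwise OR (no carries), and the remaining high-order bits are added exactly.
    """
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)        # inaccurate part: carry-free OR
    high = ((a >> k) + (b >> k)) << k    # accurate part: exact addition
    return high | low                    # low < 2**k, so OR just concatenates


def error_distance(a: int, b: int, k: int = 4) -> int:
    """Exact sum minus approximate sum."""
    return (a + b) - split_adder(a, b, k)
```

Because x + y = (x | y) + (x & y), the error of this sketch equals the AND of the two low parts, so it always stays below 2^k while the high-order bits are computed exactly.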
Multipliers can be implemented using a parallel configuration of compressors [4]. Compressor-based multiplication is composed of three steps: 1) generating partial
products, 2) reducing partial products, and 3) adding final partial products. The
second step is a computation-intensive one where compressors are mainly used to compress
partial products and generate the final two terms. The hardware cost of multipliers
can be determined by the compressors' complexity and bit-per-bit compression rate.
For example, an accurate multiplier requires significant resources because it consists
of a large number of full adders (3-2 compressors) and half adders (2-2 compressors).
An approximate multiplier, on the other hand, is built from approximate compressors,
which enables a more efficient design. Among differently sized compressors, 4-2 compressors
are widely used to build approximate multipliers.
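The three steps of compressor-based multiplication can be sketched as follows. This is a minimal exact model that reduces the partial products greedily with full adders (3-2 compressors) only; the greedy schedule and the function names are assumptions made for illustration and do not reproduce the reduction scheme of any multiplier discussed in this paper.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """3-2 compressor: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def multiply(a: int, b: int, n: int = 8) -> int:
    """Exact unsigned n x n multiplication in three steps."""
    # Step 1: generate partial products; cols[k] holds the bits of weight 2**k.
    cols = [[] for _ in range(2 * n)]
    for i in range(n):
        for j in range(n):
            cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))

    # Step 2: reduce with 3-2 compressors until every column has at most two bits.
    while any(len(col) > 2 for col in cols):
        nxt = [[] for _ in range(len(cols) + 1)]
        for k, col in enumerate(cols):
            bits = list(col)
            while len(bits) >= 3:
                s, c = full_adder(bits.pop(), bits.pop(), bits.pop())
                nxt[k].append(s)        # sum keeps the column weight
                nxt[k + 1].append(c)    # carry moves one column up
            nxt[k].extend(bits)         # leftover bits pass through unchanged
        cols = nxt

    # Step 3: add the final two rows.
    row0 = sum(col[0] << k for k, col in enumerate(cols) if len(col) > 0)
    row1 = sum(col[1] << k for k, col in enumerate(cols) if len(col) > 1)
    return row0 + row1
```

Each full adder preserves the weighted sum of its three inputs, so the final two rows always add up to the exact product. An approximate multiplier would replace some of these exact compressors with cheaper, occasionally wrong ones.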
In this paper, we propose an optimized 4-2 approximate compressor based on an existing
compressor. The original compressor was implemented using OR, NOR, and XNOR gates;
in our design, we rework the Boolean expressions of the compressor into
optimizable forms. We then replace the gates of the compressor with compound gates,
such as AO and OA gates, to reduce hardware costs. As a result, the area and power of the proposed
design are reduced by 62.5% and 65.7%, respectively, compared with the original compressor. The main contributions
of this paper are as follows:
· We systematically analyze an existing approximate 4-2 compressor and optimize it
by exploiting compound gates to reduce hardware costs using logic optimization techniques.
· We compare the hardware performance of the traditional and the proposed compressors
to show the improvement and use them in approximate multiplier designs to demonstrate
the efficacy of the proposed compressor.
· We compare the proposed compressor with other compressors in hardware aspects and
show the benefit of the proposed design.
2. Related Work
Various types of compressors have been presented in the literature, such as 4-2, 5-2,
and 7-3 compressors [5-15]. 4-2 approximate compressors are commonly used, and some are illustrated in Fig. 1. All the compressors except for the exact one belong to the low-accuracy compressor
category, which offers considerable hardware cost benefits [9]. The exact compressor is composed of two full adders, as illustrated in Fig. 1(a). With five bits of input (X$_{4}$-X$_{1}$ and C$_{IN}$), two full adders are connected
to generate three output bits (C$_{OUT}$, Carry, and Sum). Each input has equal binary
weight, and Sum has the same weight as the inputs, while the weights of C$_{OUT}$
and Carry are one bit higher. C$_{OUT}$ is independent of C$_{IN}$. Each output
is derived through the following equations.
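Assuming the conventional wiring, in which the first full adder takes X$_{1}$ to X$_{3}$ and the second takes its sum together with X$_{4}$ and C$_{IN}$ (this wiring is an assumption, not read from Fig. 1(a)), the outputs take the textbook form $Sum = X_{1}\oplus X_{2}\oplus X_{3}\oplus X_{4}\oplus C_{IN}$, $C_{OUT} = (X_{1}\oplus X_{2})X_{3}+\overline{(X_{1}\oplus X_{2})}X_{1}$, and $Carry = (X_{1}\oplus X_{2}\oplus X_{3}\oplus X_{4})C_{IN}+\overline{(X_{1}\oplus X_{2}\oplus X_{3}\oplus X_{4})}X_{4}$. The Python sketch below models both the cascaded full adders and this Boolean form so that the properties stated above can be checked exhaustively; the function names are hypothetical.

```python
from itertools import product


def full_adder(a, b, c):
    """Returns (sum, carry) of three equally weighted bits."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def exact_4_2(x1, x2, x3, x4, cin):
    """Exact 4-2 compressor as two cascaded full adders (assumed wiring)."""
    s1, cout = full_adder(x1, x2, x3)        # C_OUT never sees C_IN
    total, carry = full_adder(s1, x4, cin)
    return cout, carry, total


def exact_4_2_boolean(x1, x2, x3, x4, cin):
    """Same outputs written in the multiplexer-style Boolean form."""
    p = x1 ^ x2
    q = p ^ x3 ^ x4
    cout = (p & x3) | ((1 - p) & x1)
    carry = (q & cin) | ((1 - q) & x4)
    return cout, carry, q ^ cin


def check_exact():
    """True if both models agree and preserve the weighted input count."""
    for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
        cout, carry, total = exact_4_2(x1, x2, x3, x4, cin)
        if (cout, carry, total) != exact_4_2_boolean(x1, x2, x3, x4, cin):
            return False
        if x1 + x2 + x3 + x4 + cin != total + 2 * (carry + cout):
            return False
    return True
```

The check confirms that Sum carries the same weight as the inputs while Carry and C$_{OUT}$ each count double, which is the weighting described above.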
As one of the earliest approximate compressor designs, Momeni et al. presented an
approximate 4-2 compressor (referred to as "Momeni"), which is shown in Fig. 1(b) [5]. In contrast to the exact compressor, the Momeni design removes the C$_{IN}$ and C$_{OUT}$
signals so that partial products are reduced effectively through simpler carry propagation between
compressors. Furthermore, it reduces the hardware design complexity. Akbari et al. proposed dual-quality
approximate 4-2 compressors [6]. These designs can switch between exact and approximate operating modes.
Each is composed of an approximate part and a supplementary part, and each part is activated
according to the mode. The two designs are illustrated in Figs. 1(c)
and (d), respectively. The carry prediction in the first design (referred to here
as "Akbar1") is enhanced in the second design (referred to as "Akbar2"). Subsequently,
further approximate 4-2 compressors were proposed, referred to here as "Venka"
[7] and "Ahma" [8]. Like the Momeni compressor, these compressors were designed based on truth tables.
A truth table is a common way to check the error distance
for each input combination. The Venka compressor in Fig. 1(e) replaces several XOR gates with OR gates to decrease hardware costs, since XOR gates
have a significant impact on hardware costs. The Ahma compressor has reasonable accuracy
and is hardware-efficient since it consists of only three NOR gates and one
NAND gate, as illustrated in Fig. 1(f).
Fig. 1. Schematic of 4-2 compressors: (a) Exact 4-2; (b) Momeni et al. [5]; (c)-(d) Akbari et al. [6]; (e) Venkatachalam et al. [7]; (f) Ahmadinejad et al. [8].
3. Design Optimization of Approximate Compressor
Among the approximate 4-2 compressors presented in Section 2, we focus on the Momeni
design. In this section, we briefly review the Momeni compressor and optimize it.
We use Boolean equations and De Morgan’s law for optimization.
3.1 Momeni Compressor
The Momeni compressor was designed to enhance design efficiency by eliminating the
carry input and output signals (C$_{IN}$ and C$_{OUT}$) of the exact 4-2 compressor.
Fig. 1(b) shows the circuit of the Momeni approximate 4-2 compressor. As shown there, the compressor consists of one OR, two XNOR, and three NOR gates. The
expressions for the output Carry and Sum can be written as follows.
$Carry = \overline{\overline{X_{1}+X_{2}}+\overline{X_{3}+X_{4}}}$
$Sum = \overline{X_{1}\oplus X_{2}}+\overline{X_{3}\oplus X_{4}}$
Table 1 gives the truth table of the Momeni design for all possible input combinations. Errors occur for four input combinations (0000, 0011, 1100, and 1111), and the
error distance is limited to ±1.
Table 1. Truth table of the Momeni compressor.
| X$_{4}$ | X$_{3}$ | X$_{2}$ | X$_{1}$ | Carry | Sum | Difference |
| 0 | 0 | 0 | 0 | 0 | 1 | +1 |
| 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 0 | 0 | 1 | 1 | 0 | 1 | -1 |
| 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | -1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | -1 |
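The truth table can be regenerated with a short Python sketch. It assumes the gate-level structure described in this section (two XNOR gates feeding an OR gate for Sum, and two NOR gates feeding a third NOR gate for Carry); the function names are hypothetical.

```python
from itertools import product


def momeni(x4, x3, x2, x1):
    """Approximate 4-2 compressor; returns (carry, sum)."""
    nor12 = 1 - (x1 | x2)
    nor34 = 1 - (x3 | x4)
    carry = 1 - (nor12 | nor34)                  # NOR of the two NOR outputs
    total = (1 - (x1 ^ x2)) | (1 - (x3 ^ x4))    # OR of the two XNOR outputs
    return carry, total


def error_distances():
    """Map each input pattern 'X4X3X2X1' to approximate minus exact value."""
    table = {}
    for x4, x3, x2, x1 in product((0, 1), repeat=4):
        carry, total = momeni(x4, x3, x2, x1)
        table[f"{x4}{x3}{x2}{x1}"] = (2 * carry + total) - (x4 + x3 + x2 + x1)
    return table


if __name__ == "__main__":
    for pattern, diff in error_distances().items():
        print(pattern, f"{diff:+d}" if diff else "0")
```

Under these assumptions, the only nonzero differences are +1 for 0000 and -1 for 0011, 1100, and 1111, matching the Difference column of Table 1.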
3.2 Proposed Optimized Compressor
Digital logic circuits can be expressed as Boolean equations, and equations written
in certain forms can be mapped directly to compound gates. Therefore, we examine the Boolean
form of an existing logic design. By optimizing the Boolean equations, we obtain
an optimized compressor built from compound gates.
Compound gates are mainly composed of AND and OR gates. If a circuit is in sum-of-products
(SOP) or product-of-sums (POS) form, then its gates can be merged into a compound gate
of the AO type or OA type, respectively. The exact type of compound gate depends
on the order of the gates and the number of inputs. For example, consider the following
expression, written here with generic inputs A, B, C, and D:
$Y = (A+B)(C+D)$
This expression is made up of two OR gates and one AND gate, and it is in
POS form. Since each OR gate has two inputs, the circuit can be replaced with one compound
gate, OA22. Compound gates can be implemented very efficiently in hardware
as a single combination of transistor connections, and most CMOS standard-cell
libraries include them. Therefore, the hardware cost of OA22 is lower than that of
the original circuit, while the result of the logic operation is identical.
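The signals of the Momeni compressor can be optimized in the same way. First,
applying De Morgan’s law to the Carry signal yields a POS form. The derivation
is as follows.
$Carry = \overline{\overline{X_{1}+X_{2}}+\overline{X_{3}+X_{4}}}$
$= \overline{\overline{X_{1}+X_{2}}}\cdot\overline{\overline{X_{3}+X_{4}}}$
$= (X_{1}+X_{2})(X_{3}+X_{4})$
Because it is in POS form, the Carry signal can be implemented as one compound gate. Also, the
generated Carry signal can be reused when generating the Sum signal. The Boolean equation
of the Sum signal is derived as follows.
$Sum = \overline{X_{1}\oplus X_{2}}+\overline{X_{3}\oplus X_{4}}$
$= \overline{X_{1}\overline{X_{2}}+\overline{X_{1}}X_{2}}+\overline{X_{3}\overline{X_{4}}+\overline{X_{3}}X_{4}}$
$= \overline{X_{1}\overline{X_{2}}}\cdot\overline{\overline{X_{1}}X_{2}}+\overline{X_{3}\overline{X_{4}}}\cdot\overline{\overline{X_{3}}X_{4}}$
$= (\overline{X_{1}}+X_{2})(X_{1}+\overline{X_{2}})+(\overline{X_{3}}+X_{4})(X_{3}+\overline{X_{4}})$
$= X_{1}X_{2}+\overline{X_{1}}\,\overline{X_{2}}+X_{3}X_{4}+\overline{X_{3}}\,\overline{X_{4}}$
$= X_{1}X_{2}+X_{3}X_{4}+\overline{(X_{1}+X_{2})(X_{3}+X_{4})}$
$= X_{1}X_{2}+X_{3}X_{4}+\overline{Carry}$
The XNOR terms are first expanded into AND and OR operations. De Morgan’s law then produces four
complemented product terms, and reapplying De Morgan's law to each term, followed by expansion, gives an expression in the
form of a sum of four products. In $\overline{X_{1}}\overline{X_{2}}+\overline{X_{3}}\overline{X_{4}}$,
if the NOT operations are moved outward, the complement of Carry appears. Therefore,
the Sum signal is a sum of three products that reuses the Carry signal.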
Consequently, the Carry and Sum signals are changed into POS and
SOP forms, respectively, so we can apply compound gates to both signals. Fig. 2 shows the proposed optimized compressor. The Carry signal is composed of two OR gates
and one AND gate. Therefore, the compound gate of the Carry signal is OA22. The Sum
signal consists of two AND gates and one OR gate with three input pins. The two output
bits of AND gates and the inverted Carry signal are used as inputs of the last OR
gate. Therefore, the optimized gate of the Sum signal is AO221.
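The equivalence between the gate-level compressor and its compound-gate form can be checked exhaustively. The sketch below is a behavioral model only: `oa22` and `ao221` are hypothetical Python helpers that mimic the logic functions of the OA22 and AO221 cells, and the gate-level form assumes the structure described in Section 3.1.

```python
from itertools import product


def oa22(a1, a2, b1, b2):
    """OR-AND compound gate: (a1 + a2)(b1 + b2)."""
    return (a1 | a2) & (b1 | b2)


def ao221(a1, a2, b1, b2, c):
    """AND-OR compound gate: a1*a2 + b1*b2 + c."""
    return (a1 & a2) | (b1 & b2) | c


def original(x4, x3, x2, x1):
    """Gate-level form before optimization; returns (carry, sum)."""
    carry = 1 - ((1 - (x1 | x2)) | (1 - (x3 | x4)))
    total = (1 - (x1 ^ x2)) | (1 - (x3 ^ x4))
    return carry, total


def optimized(x4, x3, x2, x1):
    """Compound-gate form; the inverted Carry feeds the AO221 gate."""
    carry = oa22(x1, x2, x3, x4)
    total = ao221(x1, x2, x3, x4, 1 - carry)
    return carry, total


def equivalent():
    """True if both forms agree on all 16 input combinations."""
    return all(original(*bits) == optimized(*bits)
               for bits in product((0, 1), repeat=4))
```

Since both forms agree on every input, the optimization changes only the hardware cost and leaves the error behavior of the compressor untouched.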
Fig. 2. Diagram of the proposed optimized compressor.
4. Experimental Result
In this section, we evaluate the hardware performance of the proposed and existing
compressors. To evaluate and compare the designs, we implemented them in Verilog HDL
and synthesized them with 32-nm CMOS technology using Synopsys Design Compiler.
4.1 Performance of 4-2 Compressors
Table 2 shows the implementation results of the compressors in terms of area, power, delay, and
energy. The proposed design’s area is reduced from 18.30 µm$^{2}$ to 6.86 µm$^{2}$,
which is about 2.7 times smaller than the original design. The power
consumption is likewise about 2.9 times lower than the original (1.32 µW versus 3.85 µW), while the delay is unchanged at 0.14 ns. The proposed
design achieves an energy reduction of 63.5% compared to the original one. This significant
hardware cost reduction is the result of the optimization using compound gates.
Table 2. Hardware performance summary of compressors.
| | Area (µm$^{2}$) | Power (µW) | Delay (ns) | Energy (fJ) |
| Original | 18.30 | 3.85 | 0.14 | 0.52 |
| Optimized | 6.86 | 1.32 | 0.14 | 0.19 |
| Improvement | 62.5% | 65.7% | - | 63.5% |
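The improvement figures follow directly from the raw values in Table 2. A small Python sketch of the arithmetic is given below; the dictionary keys and function names are hypothetical, and the numbers are copied from the table.

```python
# Values copied from Table 2: (original, optimized).
TABLE2 = {
    "area_um2": (18.30, 6.86),
    "power_uW": (3.85, 1.32),
    "energy_fJ": (0.52, 0.19),
}


def reduction_percent(original: float, optimized: float) -> float:
    """Relative saving, in percent, with respect to the original value."""
    return (original - optimized) / original * 100.0


def ratio(original: float, optimized: float) -> float:
    """How many times smaller the optimized value is."""
    return original / optimized


if __name__ == "__main__":
    for name, (orig, opt) in TABLE2.items():
        print(f"{name}: {reduction_percent(orig, opt):.1f}% lower, "
              f"{ratio(orig, opt):.1f}x smaller")
```

Rounded to one decimal place, this gives the 62.5%, 65.7%, and 63.5% reductions listed in the table, along with the ratios of about 2.7 and 2.9 for area and power.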
4.2 Performance of Multipliers using 4-2 Compressors
We also evaluated the hardware performance of approximate multipliers built with these
compressors. We used both the C-N and C-FULL multiplier configurations [10]. In the C-N configuration,
the approximate compressors are used only for the less significant half of the columns of
the partial product matrix, whereas in the C-FULL configuration, approximate compressors
are used for every column of the partial product matrix. Fig. 3 shows the applied reduction scheme for the unsigned $8\times 8$ multiplier using
the C-N configuration. Tables 3 and 4 show the hardware costs of the $8\times 8$ multipliers for the C-N configuration and
the C-FULL configuration, respectively.
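The difference between the two configurations can be made concrete with a small Python sketch of the partial product matrix. It assumes that "the less significant half" means the n lowest-weight columns of the 2n-1 columns of an unsigned n x n multiplier; that reading, and all function names, are assumptions for illustration.

```python
def column_heights(n: int) -> list[int]:
    """Number of partial-product bits in each column, LSB column first."""
    return [min(k, 2 * n - 2 - k) + 1 for k in range(2 * n - 1)]


def approximate_columns(n: int, config: str) -> list[int]:
    """Indices of columns reduced with approximate compressors."""
    if config == "C-N":
        return list(range(n))            # assumed: the n lowest-weight columns
    if config == "C-FULL":
        return list(range(2 * n - 1))    # every column
    raise ValueError(f"unknown configuration: {config}")


def approximate_bit_share(n: int, config: str) -> float:
    """Fraction of partial-product bits that sit in approximated columns."""
    heights = column_heights(n)
    covered = sum(heights[k] for k in approximate_columns(n, config))
    return covered / sum(heights)
```

For n = 8 and under this assumption, 36 of the 64 partial-product bits fall in approximated columns for C-N, against all 64 for C-FULL, so a cheaper approximate compressor has more opportunity to save hardware in C-FULL.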
In the C-N configuration, the area and power of the multiplier with the
optimized compressor decreased by 13.7% and 13.9%, respectively, compared to the original one,
and the energy decreased by 13.8%. In the C-FULL configuration,
the area, power, and energy of the multiplier with the optimized compressor decreased by 29.3%,
34.0%, and 34.0%, respectively. For both multiplier configurations, the multiplier’s delay
is unchanged because the optimized compressor has the same delay as
the original one, as shown in Table 2. The hardware cost improves more in the C-FULL configuration, where
approximate compressors are used in every column, than in the C-N configuration.
Fig. 3. C-N approximate multiplier configuration.
Table 3. Hardware performance summary of C-N multipliers.
| | Area (µm$^{2}$) | Power (µW) | Delay (ns) | Energy (fJ) |
| Original | 752.01 | 203.99 | 1.16 | 237.27 |
| Optimized | 649.08 | 175.73 | 1.16 | 204.42 |
| Improvement | 13.7% | 13.9% | - | 13.8% |
Table 4. Hardware performance summary of C-FULL multipliers.
| | Area (µm$^{2}$) | Power (µW) | Delay (ns) | Energy (fJ) |
| Original | 662.55 | 160.42 | 1.02 | 163.18 |
| Optimized | 468.13 | 105.88 | 1.02 | 107.62 |
| Improvement | 29.3% | 34.0% | - | 34.0% |
4.3 Comparison with Other Compressors
We compare the proposed compressor with eight other compressors. We divide them into
two groups according to their characteristics. The first group includes the four compressors
mentioned in Section 2 (Akbar1, Akbar2, Venka, and Ahma), and the second one contains
four other compressors that exploit compound gates (Yang1, Yang2, Yang3 [13], and Ha [14]). The Yang1 compressor utilizes OAI212, the Yang2 and Yang3 compressors use AO222,
and the Ha compressor uses AO22.
Fig. 4 summarizes the implementation results of the compressors in terms of area, power, delay,
and energy. Note that the blue and yellow bars correspond to the first and second
groups, respectively, and the red bars indicate the proposed design. When we compare the
original compressor to the designs in the first group, all the other compressors except
for Venka have better hardware performance than the original Momeni compressor. After
optimization, the proposed compressor is approximately 43% smaller in area and dissipates
approximately 41% less power than the Akbar1 and Akbar2 compressors. Furthermore,
the proposed compressor requires more than 60% less area, power, and energy than the
Venka compressor, which consumes the most resources in the group. Additionally, our
design is about 23% lower in both area and power consumption than the Ahma compressor.
Although the delay of the proposed design is not improved, its energy is
0.19 fJ, which is lower than that of the other compressors because the power is greatly reduced.
Next, we compare the proposed compressor with the second group. The area and power
of the original Momeni compressor are worse than those of the Ha compressor, which
has the best hardware performance among the compressors in the second group. However,
after the optimization, the proposed compressor outperforms the second group in all
hardware aspects. Specifically, the proposed compressor improves area, power,
delay, and energy by 48%, 58%, 22%, and 66% over the Ha compressor, respectively.
Compared to the Yang1 compressor, which requires the most hardware resources, the
proposed compressor's area, power, and delay are approximately 70%, 73%, and 41% lower,
respectively. In particular, the proposed design consumes about 84% less energy
than the Yang1 compressor. Overall, our design optimization allows the proposed
compressor to outperform the others in hardware cost.
Fig. 4. Comparison with other approximate 4-2 compressors in terms of area, power, delay, and energy.
5. Conclusion
In this paper, we have presented an optimized Momeni 4-2 approximate compressor that
reduces hardware resource consumption considerably. When implemented using 32-nm CMOS
technology, the area and power of the proposed compressor are improved by 62.5% and
65.7%, respectively. In addition, our design reduces the area by 13.7%
and 29.3% in the C-N and C-FULL multiplier configurations, respectively. In particular,
the optimized design reduces energy consumption by 34.0% in the C-FULL multiplier configuration.
Also, the proposed compressor outperforms the other compressors considered here in
terms of area, power, and energy.
ACKNOWLEDGMENTS
This work was supported in part by the Basic Science Research Program through
National Research Foundation of Korea (NRF) funded by the Korean Government (MSIT)
(NRF-2020R1A4A1019628) and the Ministry of Education (NRF-2019R1I1A3A01061266) and
in part by the BK21 FOUR project (AI-driven Convergence Software Education Research
Program) funded by the Ministry of Education, School of Computer Science and Engineering,
Kyungpook National University, Korea (4199990214394).
REFERENCES
[1] Moreau T., Sampson A., Ceze L., 2015, Approximate Computing: Making Mobile Systems
More Efficient, IEEE Pervasive Computing, Vol. 14, No. 2, pp. 9-13
[2] Chippa V. K., Chakradhar S. T., Roy K., Raghunathan A., 2013, Analysis and characterization
of inherent application resilience for approximate computing, ACM/EDAC/IEEE Design
Automation Conference (DAC), Article 113, pp. 1-9
[3] Gupta V., Mohapatra D., Raghunathan A., Roy K., 2013, Low-Power Digital Signal Processing
Using Approximate Adders, IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, Vol. 32, No. 1, pp. 124-137
[4] Wallace C. S., 1964, A Suggestion for a Fast Multiplier, IEEE Transactions on Electronic
Computers, Vol. EC-13, No. 1, pp. 14-17
[5] Momeni A., Han J., Montuschi P., Lombardi F., 2015, Design and Analysis of Approximate
Compressors for Multiplication, IEEE Transactions on Computers, Vol. 64, No. 4, pp.
984-994
[6] Akbari O., Kamal M., Afzali-Kusha A., Pedram M., 2017, Dual-Quality 4:2 Compressors
for Utilizing in Dynamic Accuracy Configurable Multipliers, IEEE Transactions on Very
Large Scale Integration (VLSI) Systems, Vol. 25, No. 4, pp. 1352-1361
[7] Venkatachalam S., Ko S., 2017, Design of Power and Area Efficient Approximate Multipliers,
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 5,
pp. 1782-1786
[8] Ahmadinejad M., Moaiyeri M. H., Sabetzadeh F., 2019, Energy and area efficient imprecise
compressors for approximate multiplication at nanoscale, AEU - International Journal
of Electronics and Communications, Vol. 110
[9] Kong T., Li S., 2021, Design and Analysis of Approximate 4-2 Compressors for High-Accuracy
Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol.
29, No. 10, pp. 1771-1781
[10] Strollo A. G. M., Napoli E., De Caro D., Petra N., Meo G. D., 2020, Comparison and
Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers, IEEE
Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 9, pp. 3021-3034
[11] Chang C.-H., Gu J., Zhang M., 2004, Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors
for fast arithmetic circuits, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 51, No.
10, pp. 1985-1997
[12] Saha A., Pal R., Naik A. G., Pal D., 2018, Novel CMOS multi-bit counter for speed-power
optimization in multiplier design, AEU - International Journal of Electronics and
Communications, Vol. 95, pp. 189-198
[13] Yang Z., Han J., Lombardi F., 2015, Approximate compressors for error-resilient multiplier
design, IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology
Systems (DFTS), pp. 183-186
[14] Ha M., Lee S., 2018, Multipliers With Approximate 4-2 Compressors and Error Recovery
Modules, IEEE Embedded Systems Letters, Vol. 10, No. 1, pp. 6-9
[15] Lin C.-H., Lin I.-C., 2013, High accuracy approximate multiplier with error correction,
IEEE International Conference on Computer Design (ICCD), pp. 33-38
[16] Kim Y., 2019, An Accuracy Enhanced Error Tolerant Adder with Carry Prediction for
Approximate Computing, IEIE Transactions on Smart Computing and Processing, Vol. 8,
No. 4, pp. 324-330
[17] Kim Y., 2019, A Novel Approximate Adder with Enhanced Low-Cost Carry Prediction for
Error Tolerant Computing, IEIE Transactions on Smart Computing and Processing, Vol.
8, No. 4, pp. 506-510
[18] Seo H., Yang Y. S., Kim Y., 2020, Design and Analysis of an Approximate Adder with
Hybrid Error Reduction, Electronics, Vol. 9, No. 3, pp. 471:1-13
[19] Seo H., Lee J., Lee D., Kim B., Kim Y., 2021, Design and Analysis of a Low-Cost
Approximate Adder with OR and Zero Truncation, IEIE Transactions on Smart Computing
and Processing, Vol. 10, No. 4, pp. 309-314
[20] Lee J., Seo H., Seok H., Kim Y., 2021, A Novel Approximate Adder Design using Error
Reduced Carry Prediction and Constant Truncation, IEEE Access, Vol. 9, pp. 119939-119953
[21] Seok H., Seo H., Lee J., Kim Y., 2021, COREA: Delay- and Energy-Efficient Approximate
Adder Using Effective Carry Speculation, Electronics, Vol. 10, No. 18, pp. 2234:1-12
[22] Lee J., Seo H., Kim Y., Kim Y., 2020, Approximate adder design with simplified lower-part
approximation, IEICE Electronics Express, Vol. 17, No. 15, pp. 20200218
[23] Choi W., Shim M., Seok H., Kim Y., 2021, DCPA: approximate adder design exploiting
dual carry prediction, IEICE Electronics Express, Vol. 18, No. 23, pp. 20210431
Author
Hyelin Seok received a B.S. degree from the School of Computer Science and Engineering,
Kyungpook National University, Daegu, Republic of Korea in 2022, where she is pursuing
an M.S. degree. Her research interests include computer architecture, approximate
arithmetic, and new computing systems.
Hyoju Seo received B.S. and M.S. degrees from the School of Computer Science and
Engineering, Kyungpook National University, Daegu, Republic of Korea, in 2020
and 2022, respectively, where she is currently pursuing a Ph.D. Her research interests
include approximate computing, neuromorphic computing, deep learning accelerators,
and image processing.
Jungwon Lee received a B.S. degree from the School of Computer Science and Engineering,
Kyungpook National University, Daegu, Republic of Korea in 2021, where she is pursuing
an M.S. degree. Her research interests include deep learning, approximate arithmetic,
and approximate DRAM.
Yongtae Kim received B.S. and M.S. degrees in electrical engineering from Korea
University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a Ph.D.
degree from the Department of Electrical and Computer Engineering at Texas A&M
University, College Station, TX, in 2013. From 2013 to 2018, he was a software engineer
with the Intel Corporation, Santa Clara, CA. Since 2018, he has been with the School
of Computer Science and Engineering at Kyungpook National University, Daegu, Republic
of Korea, where he is currently an assistant professor. His research interests are
energy-efficient integrated circuits and systems, particularly neuromorphic computing
and approximate computing, and new memory devices and architectures.