Zhang Yongqiang$^{1}$, He Cong$^{1}$, Chen Xiaoyue$^{1}$, Xie Guangjun$^{1}$*
($^{1}$School of Microelectronics, Hefei University of Technology, Hefei, China
ahzhangyq@hfut.edu.cn, 2191158315@qq.com, 1617090911@qq.com, gjxie8005@hfut.edu.cn)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Approximate computing, Multiplier, Compressor, Energy consumption, Image multiplication
1. Introduction
Approximate computing is an attractive paradigm in circuit design: it relaxes the requirement
for fully accurate operations and reduces power, delay, and area at the expense of computing
accuracy. This trade-off between hardware cost and computing accuracy is especially relevant
to error-resilient applications such as machine learning and multimedia processing.
Multipliers are basic blocks of digital systems and usually involve three steps: 1) generating
the partial products, 2) reducing the partial products, and 3) summing the final result. Among
them, the second step accounts for the dominant hardware cost. Using efficient compressors can
significantly reduce the complexity of this step and thus improve the performance of multipliers
[1]; in particular, 4-2 compressors are widely applied in multipliers to accelerate the reduction
of partial products. In [2], a compressor ignored the input signal cin and the output signal cout
to improve the power and delay of multipliers. The multiplier utilizing that compressor shows a
great reduction in hardware requirements and transistor count compared to existing designs.
Three 4-2 compressors were proposed in [3] by modifying the truth table of an exact compressor;
however, the multipliers using these compressors were inferior in overall performance. In [4],
a partial-product-altering method was applied to a 4-2 compressor, realizing a balance between
hardware cost and multiplier accuracy. A compressor using a majority gate was designed in [5]
by ignoring the input signals x$_{2}$ and cin and the output signal cout to achieve excellent
power and delay performance. The stacking circuit technique was adopted in [6] to design
approximate multipliers with high computing accuracy, albeit at high hardware cost. In [7],
a new compressor was designed using only simple AND-OR gates, and the multiplier utilizing this
compressor provided a good trade-off between error and electrical performance. The dual-quality
4-2 compressors introduced in [8] can be switched flexibly between precise and approximate
operating modes, so multipliers using these compressors can change their accuracy dynamically
at runtime.
To improve the trade-off between hardware cost and computing accuracy in approximate
circuits, this paper proposes a set of approximate 8${\times}$8 Dadda multipliers.
To that end, an imprecise 4-2 compressor using only OR and XNOR gates is designed
by introducing symmetrical errors into the truth table of the exact compressor; these
errors can counteract each other within a multiplier. This method reduces the area,
power, and delay of the multipliers while still producing acceptable results.
The main contributions of this paper are summarized as follows.
1) An approximate 4-2 compressor is proposed to simplify the design complexity of
the partial product reduction step in multipliers.
2) A set of approximate Dadda multipliers is built from the compressors to find a
better structure with a lower hardware cost and higher computing accuracy.
3) The image multiplication operation is realized through these multipliers to evaluate
computing accuracy in real applications.
4) The trade-off between hardware cost and accuracy in the multipliers is comprehensively
analyzed through various evaluation criteria as an example in approximate computing.
This paper proceeds as follows. In Section 2, the previous approximate 4-2 compressors
are reviewed. Section 3 presents the proposed approximate compressor and multipliers.
The synthesis results and their application to image processing are presented in Section
4. Section 5 concludes this paper.
2. Related Work
In this paper, we look to 4-2 compressors to build 8${\times}$8 Dadda multipliers
owing to their simplified structure and high efficiency in transistor-level implementations.
In recent years, several methods have been proposed to design imprecise 4-2 compressors,
and they were utilized to design approximate multipliers. Some previous approximate
designs that ignored cin and cout are summarized and compared in this section.
In the approximate 4-2 compressor presented in [2], the delay of the critical path is less than
that of the previous design, and the number of gates is further reduced. Three approximate 4-2
compressors were proposed in [3]; they use Karnaugh maps to obtain simplified logical expressions
that reduce errors while providing a significant performance improvement over previous 4-2
compressors. The first and second designs in [3] have only four gates, which greatly simplifies
the structural complexity. The third design is the most accurate but has a more complex structure
than the other designs. In [4], to simplify the circuit of the 4-2 compressor, an OR gate replaces
an XOR gate in computing the sum, thus introducing additional errors. An ultra-efficient compressor
proposed in [5] consists of one majority gate, which is different from conventional designs. Since
input x$_{2}$ is omitted and the sum output is fixed at 1, this approximate compressor achieves a
simpler logic implementation. The compressors in [6] achieve high accuracy by using the stacking
circuit technique. A hardware-efficient approximate
compressor proposed in [9] was obtained by modifying the truth table of the exact compressor, and consists of
only three NOR gates and one NAND gate. In [10], an ultra-compact 4-2 compressor was proposed based on simple AND-OR logic, which
leads to a trade-off between hardware cost and precision. In [11], the proposed compressor was obtained by modifying an approximate compressor, and
the performance of the applied multiplier improved. Three approximate compressors
were presented in [12], and they all innovatively reduced the number of outputs to one, thus significantly
reducing the hardware cost.
3. The Proposed Compressor and Multipliers
3.1 The Compressor
As shown in Fig. 1, an exact 4-2 compressor generally consists of two cascaded full adders with five inputs (x$_{1}$,
x$_{2}$, x$_{3}$, x$_{4}$, and cin) and three outputs (sum, carry, and cout) [13]. The outputs encode the number of logic 1s among the five inputs according to (1), (2), and (3):
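A common formulation of these relations, consistent with the two-full-adder structure in Fig. 1 [13], is

$$sum = x_{1} \oplus x_{2} \oplus x_{3} \oplus x_{4} \oplus cin \quad (1)$$

$$carry = (x_{1} \oplus x_{2} \oplus x_{3} \oplus x_{4})\,cin + \overline{(x_{1} \oplus x_{2} \oplus x_{3} \oplus x_{4})}\,x_{4} \quad (2)$$

$$cout = (x_{1} \oplus x_{2})\,x_{3} + \overline{(x_{1} \oplus x_{2})}\,x_{1} \quad (3)$$

so that $x_{1}+x_{2}+x_{3}+x_{4}+cin = sum + 2\,(carry + cout)$.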
The four inputs, x$_{1}$, x$_{2}$, x$_{3}$, and x$_{4}$, and the output sum have the
same weight, whereas the weights of cout and carry are one binary bit order higher
[12,14]. Therefore, cout and carry are delivered to the next module of higher significance.
In this work, the proposed 4-2 compressor (Fig. 2) is derived by modifying the truth table of the exact compressor to obtain simpler
logic expressions, as seen in (4) and (5), along with ignoring signals cin and cout for design efficiency, as seen in previous
work [2]. Inputs x$_{1}$ and x$_{2}$ are also omitted to further simplify the compressor and
reduce its energy and critical-path delay. Thus, it consists of only an OR gate and an XNOR gate.
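Consistent with the truth table in Table 1, the resulting expressions are

$$sum = \overline{x_{3} \oplus x_{4}} \quad (4)$$

$$carry = x_{3} + x_{4} \quad (5)$$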
Although omitting x$_{1}$ and x$_{2}$ introduces certain errors, the proposed compressors
are used only in the approximate part of the multipliers, so they have little impact on
computing accuracy. Accordingly, the focus is on the hardware/accuracy trade-off of the
multipliers rather than on any single metric.
As seen in the truth table in Table 1, the proposed design produces erroneous outputs for eight
of the 16 input combinations. Error is defined as the arithmetic distance between the exact and
approximate values [15]. For example, when all inputs are 1, the exact output is 4, and the
proposed compressor produces a 1 for both sum and carry; the decimal output is then 3, so the
error distance is 1. The maximum error magnitude generated by this design is 1 (+1 or -1), which
avoids unacceptable results when the compressor is applied to approximate multipliers. Moreover,
within the structure of a multiplier, error distances with opposite signs (-1 and +1) counteract
each other [5].
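As a minimal behavioral sketch of this error pattern, the following Python snippet enumerates all 16 combinations of x$_{1}$-x$_{4}$ using (4) and (5) and reproduces the error column of Table 1:

```python
from itertools import product

def approx_compressor(x3, x4):
    # Proposed approximate 4-2 compressor: x1, x2, cin, and cout are ignored;
    # sum = XNOR(x3, x4) and carry = OR(x3, x4), as in (4) and (5).
    s = 1 - (x3 ^ x4)   # XNOR
    carry = x3 | x4     # OR
    return carry, s

errors = []
for x4, x3, x2, x1 in product((0, 1), repeat=4):
    carry, s = approx_compressor(x3, x4)
    approx = 2 * carry + s          # decimal value of the outputs
    exact = x1 + x2 + x3 + x4       # number of ones among the four inputs
    errors.append(exact - approx)   # error distance, as in Table 1

print(errors)                        # every entry is -1, 0, or +1
print(sum(e != 0 for e in errors))   # 8 erroneous combinations
print(sum(errors))                   # 0: the +1 and -1 errors cancel out
```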
Fig. 1. The conventional 4-2 compressor.
Fig. 2. The proposed 4-2 compressor.
Table 1. Truth table of the proposed 4-2 compressor.
| x4 | x3 | x2 | x1 | exact | carry | sum | approximate | error |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | -1 |
| 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 2 | -1 |
| 0 | 1 | 0 | 1 | 2 | 1 | 0 | 2 | 0 |
| 0 | 1 | 1 | 0 | 2 | 1 | 0 | 2 | 0 |
| 0 | 1 | 1 | 1 | 3 | 1 | 0 | 2 | 1 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | -1 |
| 1 | 0 | 0 | 1 | 2 | 1 | 0 | 2 | 0 |
| 1 | 0 | 1 | 0 | 2 | 1 | 0 | 2 | 0 |
| 1 | 0 | 1 | 1 | 3 | 1 | 0 | 2 | 1 |
| 1 | 1 | 0 | 0 | 2 | 1 | 1 | 3 | -1 |
| 1 | 1 | 0 | 1 | 3 | 1 | 1 | 3 | 0 |
| 1 | 1 | 1 | 0 | 3 | 1 | 1 | 3 | 0 |
| 1 | 1 | 1 | 1 | 4 | 1 | 1 | 3 | 1 |
3.2 The Approximate Multipliers
To investigate the impact of the proposed compressor on multiplication, 8${\times}$8
Dadda multipliers with various levels of accuracy are designed. The basic structure
of the approximate Dadda multiplier was described in [2], where AND gates generate all partial
products in the first step, approximate compressors then compress them into at most two rows,
and an exact ripple-carry adder computes the final result in the last step.
In designing multipliers, the second step plays a critical role in terms of delay,
power consumption, and area. The proposed multipliers are denoted M${\alpha}$${\beta}$${\gamma}$,
where ${\alpha}$, ${\beta}$, and ${\gamma}$, respectively, represent the number of
columns using exact compressors, approximate compressors, and truncation to compress
partial products. To find an effective way to improve the performance of multipliers,
the least significant bits of the partial products are truncated. In some applications,
such as image processing, accuracy beyond a certain level is unnecessary.
Furthermore, the corresponding exact operations consume relatively high amounts
of energy. Therefore, exact compressors are utilized for the most significant bits
to make up for the lack of computing accuracy, while the proposed approximate compressors
are applied to the middle of the partial products to reduce the hardware cost. To
investigate the trade-off between hardware cost and accuracy, a set of multipliers
was designed. Obviously, M7${\beta}$${\gamma}$ and M6${\beta}$${\gamma}$ aim at improving
computing accuracy, while M5${\beta}$${\gamma}$ is used to reduce the hardware cost.
For example, the partial product reduction step of the proposed M654 is shown in Fig. 3, where each dot represents a partial product bit. In the first two stages, three
half adders, three full adders, 10 of the proposed imprecise 4-2 compressors, and
six exact 4-2 compressors are utilized. In the last stage, a half adder and nine full
adders are applied to compute the results.
Fig. 3. Partial product reduction of the proposed M654.
4. Simulation Results and Application
In this section, all designs were described in Verilog HDL and synthesized with
Synopsys Design Compiler NXT using a TSMC 65 nm standard-cell library at 100 MHz
to evaluate performance. Note that the standard CMOS cell library does not include
dedicated cells for these modules, so all circuits were synthesized using the compile\_ultra
command to provide a fair comparison, and the logic functions of the existing designs were
optimized under the same conditions. The reported power data were obtained from the Synopsys
PrimePower tool using vector-free power analysis. In addition, the error metrics and
the image-processing application of the multipliers were programmed in Matlab.
4.1 The Approximate Compressor
A comparison of the proposed compressor and the existing exact and approximate compressors
in terms of area, power, and delay is shown in Table 2. For clarity, the three designs proposed in [3] are denoted [3]1, [3]2, and [3]3, and the three designs in [6] are denoted [6]1, [6]2, and [6]3. To comprehensively evaluate the efficiency of the proposed design, the power-delay product
(PDP) and energy-delay product (EDP) are also listed [9,16].
As can be seen from Table 2, the proposed approximate compressor has a 74% reduction in area, a 27% reduction
in delay, and a 91% reduction in PDP, compared to the exact 4-2 compressor. Besides,
it is noteworthy that the proposed compressor has the lowest area and power, compared
to state-of-the-art 4-2 compressors. Although its PDP is slightly higher than that of [5], their EDP values are equal. In summary, the proposed approximate 4-2 compressor has an advantage
in hardware overhead, owing to the optimized structure using only one OR gate and
one XNOR gate. Although the compressor in [5] has better delay and power than the one proposed here, the approximate multiplier
in [5] is inferior to the multipliers proposed here, as is explained later.
Table 2. Hardware comparison of 4-2 compressors.
| Design | Area (${\mu}$m$^{2}$) | Power (mW) | Delay (ns) | PDP (fJ) | EDP (fJ∙ns) |
|---|---|---|---|---|---|
| Proposed | 4.68 | 4.93×10$^{-4}$ | 0.30 | 0.15 | 0.04 |
| [2] | 6.84 | 1.26×10$^{-3}$ | 0.46 | 0.58 | 0.27 |
| [3]1 | 14.04 | 1.36×10$^{-3}$ | 0.35 | 0.48 | 0.17 |
| [3]2 | 13.32 | 1.66×10$^{-3}$ | 0.34 | 0.56 | 0.19 |
| [3]3 | 14.40 | 1.40×10$^{-3}$ | 0.32 | 0.45 | 0.14 |
| [4] | 11.52 | 1.27×10$^{-3}$ | 0.36 | 0.46 | 0.17 |
| [5] | 5.04 | 5.46×10$^{-4}$ | 0.25 | 0.14 | 0.04 |
| [6]1 | 11.16 | 2.00×10$^{-3}$ | 0.33 | 0.66 | 0.22 |
| [6]2 | 15.84 | 2.29×10$^{-3}$ | 0.43 | 0.98 | 0.42 |
| [6]3 | 17.28 | 2.42×10$^{-3}$ | 0.45 | 1.09 | 0.49 |
| Exact | 18.00 | 3.95×10$^{-3}$ | 0.41 | 1.62 | 0.66 |
4.2 The Approximate Multipliers
4.2.1 Hardware Cost
The area, power, delay, PDP, and EDP of the approximate and exact multipliers are
listed in Table 3. The proposed multipliers are divided into three types (M7${\beta}$${\gamma}$, M6${\beta}$${\gamma}$,
and M5${\beta}$${\gamma}$) to get the trade-off between hardware cost and computing
accuracy.
Table 3. Hardware comparison of 8${\times}$8 multipliers.
| Design | Area (${\mu}$m$^{2}$) | Power (mW) | Delay (ns) | PDP (fJ) | EDP (fJ∙ns) |
|---|---|---|---|---|---|
| M753 | 360.00 | 4.76×10$^{-2}$ | 1.55 | 73.78 | 114.36 |
| M744 | 342.36 | 4.51×10$^{-2}$ | 1.56 | 70.36 | 109.76 |
| M735 | 331.92 | 4.36×10$^{-2}$ | 1.54 | 67.14 | 103.40 |
| M726 | 329.76 | 4.13×10$^{-2}$ | 1.56 | 64.43 | 100.51 |
| M717 | 292.68 | 3.69×10$^{-2}$ | 1.63 | 60.15 | 98.04 |
| M663 | 314.64 | 4.03×10$^{-2}$ | 1.46 | 58.84 | 85.90 |
| M654 | 298.80 | 3.83×10$^{-2}$ | 1.44 | 55.15 | 79.42 |
| M645 | 285.84 | 3.61×10$^{-2}$ | 1.42 | 51.26 | 72.79 |
| M636 | 267.84 | 3.42×10$^{-2}$ | 1.42 | 48.56 | 68.96 |
| M627 | 246.24 | 3.04×10$^{-2}$ | 1.32 | 40.13 | 52.97 |
| M618 | 227.16 | 2.71×10$^{-2}$ | 1.35 | 36.59 | 49.39 |
| M573 | 275.40 | 3.38×10$^{-2}$ | 1.38 | 46.64 | 64.37 |
| M564 | 258.84 | 3.16×10$^{-2}$ | 1.26 | 39.82 | 50.17 |
| M555 | 245.88 | 2.99×10$^{-2}$ | 1.29 | 38.57 | 49.76 |
| M546 | 226.08 | 2.78×10$^{-2}$ | 1.27 | 35.31 | 44.84 |
| M537 | 207.36 | 2.47×10$^{-2}$ | 1.27 | 31.37 | 39.84 |
| M528 | 185.40 | 2.18×10$^{-2}$ | 1.21 | 26.38 | 31.92 |
| M519 | 160.56 | 1.83×10$^{-2}$ | 1.26 | 23.06 | 29.05 |
| [2] | 389.52 | 3.73×10$^{-2}$ | 1.71 | 63.78 | 109.07 |
| [3]1 | 398.52 | 3.52×10$^{-2}$ | 1.58 | 55.62 | 87.87 |
| [3]2 | 423.36 | 3.72×10$^{-2}$ | 1.85 | 68.82 | 127.32 |
| [3]3 | 420.12 | 3.36×10$^{-2}$ | 1.89 | 63.50 | 120.02 |
| [4] | 325.44 | 3.13×10$^{-2}$ | 1.52 | 47.58 | 72.32 |
| [5] | 264.24 | 2.76×10$^{-2}$ | 1.35 | 37.26 | 50.30 |
| [6]1 | 498.96 | 6.4×10$^{-2}$ | 1.66 | 106.24 | 176.36 |
| [6]2 | 510.84 | 6.9×10$^{-2}$ | 1.73 | 119.37 | 206.51 |
| [6]3 | 567.72 | 7.35×10$^{-2}$ | 1.77 | 130.10 | 230.27 |
| Exact | 577.80 | 7.81×10$^{-2}$ | 1.81 | 141.36 | 255.86 |
As seen from the results in Table 3, M5${\beta}$${\gamma}$ has the smallest area, power, and delay of the three types
of multipliers, whereas M7${\beta}$${\gamma}$ has the highest, and M6${\beta}$${\gamma}$
is in the middle, reflecting the influence of ${\alpha}$. For each type of multiplier
(e.g., M7${\beta}$${\gamma}$), as ${\gamma}$ increases, ${\beta}$ decreases and the hardware
cost drops accordingly. PDP and EDP are reported to further assess the performance of these
multipliers, and they follow the same trend.
Note that the proposed multipliers greatly outperform the exact design, reducing area, delay,
and power by 38%-72%, 14%-33%, and 39%-77%, respectively. Besides, most of the
M5${\beta}$${\gamma}$ multipliers achieve significant hardware improvements over previous
designs; in particular, M519 has the best hardware performance of all designs, reducing PDP
and EDP on average by 67% and 75%, respectively.
4.2.2 Computing Accuracy
To evaluate the output quality from approximate multipliers, error rate (ER), mean
error distance (MED), and normalized mean error distance (NMED) were computed by applying
all 65,536 possible input combinations [16]. ER is the probability of producing an erroneous result, and MED is calculated with
(6):
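$$MED = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} \left| ED_{i} \right| \quad (6)$$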
where N is the bit width of a multiplier, and ED$_{i}$ represents the arithmetic difference
between the approximate and exact results. NMED, which normalizes MED by the maximum output
of the exact multiplier, is expressed in (7):
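$$NMED = \frac{MED}{\left(2^{N}-1\right)^{2}} \quad (7)$$

As a minimal sketch of this evaluation (the original metrics were computed in Matlab), the Python snippet below computes ER, MED, and NMED by exhaustive enumeration; approx_mult is a placeholder for any behavioral model of the multiplier under test:

```python
def error_metrics(approx_mult, n_bits=8):
    """Exhaustively evaluate ER, MED, and NMED of an n_bits x n_bits multiplier.

    approx_mult(a, b) is a behavioral model returning the approximate product.
    """
    total = 1 << (2 * n_bits)              # 65,536 input pairs for n_bits = 8
    max_exact = ((1 << n_bits) - 1) ** 2   # maximum exact output (255^2)
    erroneous = 0
    ed_sum = 0
    for a in range(1 << n_bits):
        for b in range(1 << n_bits):
            ed = abs(approx_mult(a, b) - a * b)   # arithmetic error distance
            erroneous += ed != 0
            ed_sum += ed
    er = erroneous / total
    med = ed_sum / total
    nmed = med / max_exact
    return er, med, nmed

# Usage with a crude stand-in model (truncating the four least significant
# product bits); this only shows the interface, not one of the proposed designs.
print(error_metrics(lambda a, b: (a * b) & ~0xF))
```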
The accuracy metrics of the proposed multipliers are listed in Table 4. In the three types of multiplier, M7${\beta}$${\gamma}$ has a relatively small ER,
MED, and NMED. Besides, all the multipliers have a high ER, mainly due to the truncated
structure. ER decreases as the number of truncated columns increases. As for MED and
NMED, they decrease as ${\gamma}$ increases, and drop to a minimum when ${\beta}$
is 2, then increase again. When the number of truncated columns reached the highest
level, the multipliers had the worst computing accuracy, but the accuracy of M717
was higher than M663, and M618 was better than M573 due to the exact part of the most
significant bits.
Table 4. ER, MED, and NMED of approximate 8${\times}$8 multipliers.
| Design | ER (%) | MED | NMED |
|---|---|---|---|
| M753 | 99.77 | 1.96×10$^{2}$ | 3.01×10$^{-3}$ |
| M744 | 99.83 | 1.88×10$^{2}$ | 2.89×10$^{-3}$ |
| M735 | 99.80 | 1.68×10$^{2}$ | 2.58×10$^{-3}$ |
| M726 | 99.51 | 1.31×10$^{2}$ | 2.01×10$^{-3}$ |
| M717 | 99.22 | 1.72×10$^{2}$ | 2.65×10$^{-3}$ |
| M663 | 99.89 | 3.49×10$^{2}$ | 5.36×10$^{-3}$ |
| M654 | 99.91 | 3.41×10$^{2}$ | 5.25×10$^{-3}$ |
| M645 | 99.91 | 3.22×10$^{2}$ | 4.95×10$^{-3}$ |
| M636 | 99.83 | 2.81×10$^{2}$ | 4.33×10$^{-3}$ |
| M627 | 99.66 | 2.63×10$^{2}$ | 4.04×10$^{-3}$ |
| M618 | 99.51 | 4.29×10$^{2}$ | 6.60×10$^{-3}$ |
| M573 | 99.95 | 6.78×10$^{2}$ | 10.42×10$^{-3}$ |
| M564 | 99.95 | 6.71×10$^{2}$ | 10.32×10$^{-3}$ |
| M555 | 99.95 | 6.55×10$^{2}$ | 10.08×10$^{-3}$ |
| M546 | 99.92 | 6.11×10$^{2}$ | 9.40×10$^{-3}$ |
| M537 | 99.85 | 5.64×10$^{2}$ | 8.67×10$^{-3}$ |
| M528 | 99.83 | 4.79×10$^{2}$ | 7.36×10$^{-3}$ |
| M519 | 99.80 | 8.01×10$^{2}$ | 12.33×10$^{-3}$ |
| [2] | 99.10 | 3.15×10$^{3}$ | 48.46×10$^{-3}$ |
| [3]1 | 87.19 | 3.62×10$^{3}$ | 55.73×10$^{-3}$ |
| [3]2 | 87.19 | 4.17×10$^{3}$ | 64.2×10$^{-3}$ |
| [3]3 | 97.26 | 5.91×10$^{3}$ | 90.92×10$^{-3}$ |
| [4] | 85.73 | 2.24×10$^{3}$ | 34.41×10$^{-3}$ |
| [5] | 99.82 | 4.94×10$^{2}$ | 7.60×10$^{-3}$ |
| [6]1 | 55.34 | 0.70×10$^{2}$ | 1.07×10$^{-3}$ |
| [6]2 | 17.96 | 0.17×10$^{2}$ | 0.26×10$^{-3}$ |
| [6]3 | 3.59 | 0.03×10$^{2}$ | 0.04×10$^{-3}$ |
Compared to previous work, the NMED of the proposed multipliers is not the lowest; however,
it is acceptable for most image-processing applications [17]. M528 is more accurate than all
previous designs except those in [6]. Although the multipliers in [6] have an advantage in the
accuracy metrics, they carry the highest hardware cost, as shown in Table 3. Therefore, all
performance evaluation metrics should be taken into account.
The error distributions of the proposed multipliers, including M7${\beta}$${\gamma}$,
M6${\beta}$${\gamma}$, and M5${\beta}$${\gamma}$, are shown in Fig. 4; the errors lie mainly
in the ranges [-600, 600], [-1000, 1000], and [-2000, 1000], respectively, accounting on
average for about 83%, 84%, and 84% of all cases. Thus, reserving an appropriate number of
the most significant bits preserves the accuracy of a multiplier.
Fig. 4. Error distance from the multipliers: (a) M5${\beta}$${\gamma}$; (b) M6${\beta}$${\gamma}$; (c) M7${\beta}$${\gamma}$.
As seen from the results above, M5${\beta}$${\gamma}$ offers better hardware metrics but a
worse NMED, while M7${\beta}$${\gamma}$ offers a better NMED at a higher hardware cost. Thus,
to reconcile the trade-off between accuracy and hardware cost, a figure of merit (FOM) was
suggested in [8]. Because the proposed multipliers have relatively small delays, the delay term
is removed for a fair comparison, and the modified metric FOM1 is given in (8) [5]:
Fig. 5 shows FOM1 for the proposed and existing approximate 8${\times}$8 multipliers. The
smaller the value of FOM1, the better the trade-off between accuracy and hardware.
Thus, M627, M618, M564, M555, M546, M537, M528, and M519 have a lower FOM1 compared
with other designs, indicating that most of the proposed multipliers offer a better
trade-off than previous designs.
Fig. 5. FOM of approximate 8${\times}$8 multipliers.
4.3 Image Multiplication
To assess the practicality of the approximate multipliers in real applications, they were
applied to image multiplication, a widely used operation in image processing. The discussed
multipliers multiply two images pixel by pixel, thereby blending them into a single image [18-21].
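As a minimal sketch of this operation (the exact scaling convention is not specified in the text; dividing the 16-bit product by 255 is assumed here), the pixel-wise blending can be modeled in Python as follows, with approx_mult standing in for any of the discussed 8${\times}$8 multipliers:

```python
import numpy as np

def multiply_images(img_a, img_b, approx_mult):
    """Blend two 8-bit grayscale images by pixel-wise multiplication.

    approx_mult(a, b) is a behavioral model of an 8x8 multiplier; the 16-bit
    product is rescaled back to 8 bits (division by 255 is assumed).
    """
    h, w = img_a.shape
    out = np.empty((h, w), dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            p = approx_mult(int(img_a[i, j]), int(img_b[i, j]))
            out[i, j] = min(255, p // 255)   # rescale to the 8-bit range
    return out
```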
The peak signal-to-noise ratio (PSNR) and the mean structural similarity index metric
(MSSIM) [22] were computed to evaluate the quality of the processed images. PSNR is expressed
in (9):
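In its standard form,

$$PSNR = 10\log_{10}\!\left(\frac{MAX^{2}}{\frac{1}{w\,r}\sum_{i=1}^{w}\sum_{j=1}^{r}\bigl(S'(i,j)-S(i,j)\bigr)^{2}}\right) \quad (9)$$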
where w and r are the width and height of the image, S'(i, j) and S(i,
j) represent the exact and approximate value of each pixel, respectively, and MAX
is the maximum pixel value. The larger the PSNR, the better the image. MSSIM is expressed
in (10):
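Following [22],

$$MSSIM(X,Y) = \frac{1}{M}\sum_{j=1}^{M} SSIM(x_{j}, y_{j}), \qquad SSIM(x,y) = \frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})} \quad (10)$$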
where X and Y represent two images. Other parameters can be found in detail in [22]. MSSIM reaches 1 when the two processed images are the same.
Table 5 shows PSNR and MSSIM values for five image multiplication examples. All the proposed
multipliers achieve PSNR values higher than 30 dB for the various images, and a PSNR above
30 dB is generally considered good enough [23]. Besides, the MSSIM results for all the
approximate multipliers are very close to that of the exact design (MSSIM = 1). Moreover,
both PSNR and MSSIM increase as the number of exact columns increases.
Table 5. PSNR and MSSIM of multiplied images using the 8${\times}$8 multipliers.
PSNR (dB)

| Design | Lena×LenaRGB | Baboon×BaboonRGB | Goldhill×Goldhill | Goldhill×LenaRGB | Goldhill×BaboonRGB |
|---|---|---|---|---|---|
| M753 | 46.03 | 45.13 | 46.20 | 45.97 | 45.72 |
| M744 | 46.33 | 45.43 | 46.50 | 46.25 | 46.02 |
| M735 | 47.15 | 46.17 | 47.26 | 46.97 | 46.72 |
| M726 | 48.56 | 48.24 | 48.46 | 48.89 | 48.80 |
| M717 | 46.66 | 47.30 | 45.68 | 46.60 | 46.73 |
| M663 | 41.55 | 40.19 | 38.99 | 41.55 | 41.25 |
| M654 | 41.70 | 40.32 | 39.08 | 41.70 | 41.41 |
| M645 | 42.12 | 40.74 | 39.44 | 42.11 | 41.82 |
| M636 | 43.02 | 41.79 | 40.33 | 43.25 | 43.12 |
| M627 | 43.64 | 42.99 | 41.64 | 43.60 | 43.65 |
| M618 | 39.72 | 39.55 | 36.71 | 39.51 | 39.42 |
| M573 | 34.54 | 34.98 | 34.36 | 36.07 | 35.65 |
| M564 | 34.61 | 35.05 | 34.39 | 36.15 | 35.73 |
| M555 | 34.79 | 35.22 | 34.43 | 36.29 | 35.90 |
| M546 | 35.13 | 35.73 | 34.91 | 36.83 | 36.52 |
| M537 | 35.87 | 36.45 | 35.50 | 37.52 | 37.32 |
| M528 | 38.76 | 37.94 | 35.27 | 38.48 | 38.48 |
| M519 | 33.77 | 33.86 | 31.04 | 33.98 | 34.07 |
| [2] | 22.77 | 23.44 | 21.61 | 24.03 | 23.68 |
| [3]1 | 13.72 | 13.85 | 12.48 | 13.84 | 13.67 |
| [3]2 | 13.71 | 13.85 | 12.48 | 13.86 | 13.68 |
| [3]3 | 14.09 | 14.19 | 12.72 | 14.35 | 14.16 |
| [4] | 28.17 | 27.83 | 25.35 | 28.59 | 28.94 |
| [5] | 38.73 | 39.09 | 36.70 | 38.73 | 38.61 |
| [6]1 | 51.35 | 52.64 | 49.11 | 51.78 | 51.99 |
| [6]2 | 59.41 | 59.47 | 54.20 | 58.56 | 58.80 |
| [6]3 | 68.77 | 68.78 | 62.52 | 67.65 | 67.70 |

MSSIM

| Design | Lena×LenaRGB | Baboon×BaboonRGB | Goldhill×Goldhill | Goldhill×LenaRGB | Goldhill×BaboonRGB |
|---|---|---|---|---|---|
| M753 | 0.9985 | 0.9989 | 0.9966 | 0.9984 | 0.9980 |
| M744 | 0.9985 | 0.9990 | 0.9965 | 0.9984 | 0.9980 |
| M735 | 0.9986 | 0.9990 | 0.9966 | 0.9984 | 0.9980 |
| M726 | 0.9988 | 0.9992 | 0.9960 | 0.9987 | 0.9983 |
| M717 | 0.9987 | 0.9990 | 0.9943 | 0.9984 | 0.9980 |
| M663 | 0.9957 | 0.9967 | 0.9855 | 0.9953 | 0.9943 |
| M654 | 0.9957 | 0.9968 | 0.9851 | 0.9953 | 0.9943 |
| M645 | 0.9958 | 0.9968 | 0.9855 | 0.9953 | 0.9943 |
| M636 | 0.9960 | 0.9971 | 0.9858 | 0.9957 | 0.9947 |
| M627 | 0.9962 | 0.9972 | 0.9846 | 0.9956 | 0.9944 |
| M618 | 0.9955 | 0.9964 | 0.9742 | 0.9950 | 0.9929 |
| M573 | 0.9813 | 0.9896 | 0.9631 | 0.9847 | 0.9827 |
| M564 | 0.9814 | 0.9896 | 0.9629 | 0.9848 | 0.9827 |
| M555 | 0.9814 | 0.9896 | 0.9614 | 0.9844 | 0.9822 |
| M546 | 0.9815 | 0.9897 | 0.9616 | 0.9848 | 0.9826 |
| M537 | 0.9825 | 0.9900 | 0.9577 | 0.9849 | 0.9823 |
| M528 | 0.9902 | 0.9913 | 0.9444 | 0.9884 | 0.9848 |
| M519 | 0.9846 | 0.9872 | 0.9226 | 0.9827 | 0.9778 |
| [2] | 0.8630 | 0.8600 | 0.7214 | 0.7864 | 0.7994 |
| [3]1 | 0.6534 | 0.7018 | 0.5411 | 0.6542 | 0.6626 |
| [3]2 | 0.6550 | 0.7015 | 0.5416 | 0.6342 | 0.6507 |
| [3]3 | 0.6239 | 0.6753 | 0.4938 | 0.6049 | 0.6035 |
| [4] | 0.9367 | 0.9534 | 0.9464 | 0.9533 | 0.9478 |
| [5] | 0.9897 | 0.9916 | 0.9645 | 0.9873 | 0.9827 |
| [6]1 | 0.9995 | 0.9997 | 0.9982 | 0.9995 | 0.9994 |
| [6]2 | 0.9999 | 0.9999 | 0.9990 | 0.9999 | 0.9998 |
| [6]3 | 1.0000 | 1.0000 | 0.9998 | 1.0000 | 1.0000 |
To visualize the effect of approximate multiplication on image quality, multiplied
images LenaRGB and Lena (using the considered multipliers) are shown in Fig. 6. The results indicate no obvious differences between the proposed designs and the
exact design.
To comprehensively evaluate the efficiency of the discussed approximate designs in image
processing, hardware cost and image quality should be considered simultaneously rather than
assessed in isolation. To quantify this compromise, FOM2 is expressed in (11) [24]:
Fig. 6. The multiplied images for LenaRGB and Lena using 8${\times}$8 multipliers.
A smaller FOM2 value indicates a better compromise between hardware efficiency and accuracy.
To save space, FOM2 for the discussed multipliers is also shown in Fig. 5, and the results
indicate a decreasing trend. Among them, M627, M618, M537, M528, and M519 provide a better
FOM2 than the other designs. Specifically, M528 takes first place in this regard, with a 63%
reduction on average compared to the existing designs, followed by M519 and M537.
5. Conclusion
In this work, an ultra-efficient approximate 4-2 compressor was proposed by introducing
symmetrical errors into the truth table of the exact compressor. A set of Dadda multipliers,
denoted as M${\alpha}$${\beta}$${\gamma}$, was designed to investigate the hardware/accuracy
trade-off. Image multiplication was considered as an example to evaluate computing
accuracy. Experimental results showed that the accuracy of a multiplier is mainly
dominated by the exact part, while the hardware cost is affected by the approximate
and truncated parts. Furthermore, the two figures of merit show that a comprehensive indicator
should be considered to reach a compromise between hardware and accuracy, because
a multiplier with high accuracy consumes correspondingly more energy. In addition,
several proposed multipliers surpassed their counterparts under the considered criteria.
ACKNOWLEDGMENTS
This work was supported by the Fundamental Research Funds for the Central Universities
of China (Grant No. JZ2020HGQA0162, Grant No. JZ2020HGTA0085).
REFERENCES
[1] Angizi S., Jiang H., DeMara R. F., Han J., Fan D., 2018, Majority-Based Spin-CMOS Primitives for Approximate Computing, IEEE Transactions on Nanotechnology, Vol. 17, No. 4, pp. 795-806
[2] Momeni A., Han J., Montuschi P., Lombardi F., 2015, Design and Analysis of Approximate Compressors for Multiplication, IEEE Transactions on Computers, Vol. 64, No. 4, pp. 984-994
[3] Gorantla A., P. D., 2017, Design of Approximate Compressors for Multiplication, ACM J. Emerg. Technol. Comput. Syst., Vol. 13, No. 3, Article 44
[4] Venkatachalam S., Ko S., 2017, Design of Power and Area Efficient Approximate Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 5, pp. 1782-1786
[5] Sabetzadeh F., Moaiyeri M., Ahmadinejad M., 2019, A Majority-Based Imprecise Multiplier for Ultra-Efficient Approximate Image Multiplication, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 66, No. 11, pp. 4200-4208
[6] Strollo A., Napoli E., Caro D., Petra N., Meo G., 2020, Comparison and Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 9, pp. 3021-3034
[7] Esposito D., Strollo A. G. M., Napoli E., Caro D. D., Petra N., 2018, Approximate Multipliers Based on New Approximate Compressors, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 65, No. 12, pp. 4169-4182
[8] Akbari O., Kamal M., Afzali-Kusha A., Pedram M., 2017, Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 4, pp. 1352-1361
[9] Ahmadinejad M., Moaiyeri M. H., Sabetzadeh F., 2019, Energy and Area Efficient Imprecise Compressors for Approximate Multiplication at Nanoscale, AEU - International Journal of Electronics and Communications, Vol. 110
[10] Salmanpour F., Moaiyeri M. H., Sabetzadeh F., 2021, Ultra-Compact Imprecise 4:2 Compressor and Multiplier Circuits for Approximate Computing in Deep Nanoscale, Circuits, Systems, and Signal Processing
[11] Ha M., Lee S., 2018, Multipliers With Approximate 4-2 Compressors and Error Recovery Modules, IEEE Embedded Systems Letters, Vol. 10, No. 1, pp. 6-9
[12] Pei H., Yi X., Zhou H., He Y., 2021, Design of Ultra-Low Power Consumption Approximate 4-2 Compressors Based on the Compensation Characteristic, IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 68, No. 1, pp. 461-465
[13] Chang C.-H., Gu J., Zhang M., 2004, Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2 Compressors for Fast Arithmetic Circuits, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 51, No. 10, pp. 1985-1997
[14] Yi X., Pei H., Zhang Z., Zhou H., He Y., 2019, Design of an Energy-Efficient Approximate Compressor for Error-Resilient Multiplications, in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-5
[15] Liang J., Han J., Lombardi F., 2013, New Metrics for the Reliability of Approximate and Probabilistic Adders, IEEE Transactions on Computers, Vol. 62, No. 9, pp. 1760-1771
[16] Guo W., Li S., 2021, Fast Binary Counters and Compressors Generated by Sorting Network, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 29, No. 6, pp. 1220-1230
[17] Jiang H., Santiago F. J. H., Mo H., Liu L., Han J., 2020, Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications, Proceedings of the IEEE, Vol. 108, No. 12, pp. 2108-2135
[18] Strollo A. G. M., Caro D. D., Napoli E., Petra N., Meo G. D., 2020, Low-Power Approximate Multiplier with Error Recovery Using a New Approximate 4-2 Compressor, in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-4
[19] Toan N. V., Lee J., 2019, Energy-Area-Efficient Approximate Multipliers for Error-Tolerant Applications on FPGAs, in 2019 32nd IEEE International System-on-Chip Conference (SOCC), pp. 336-341
[20] Savithaa N., Poornima A., 2019, A High Speed Area Efficient Compression Technique of Dadda Multiplier for Image Blending Application, in 2019 Third International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pp. 426-430
[21] Savio M. M. D., Deepa T., 2020, Design of Higher Order Multiplier with Approximate Compressor, in 2020 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), pp. 1-6
[22] Wang Z., Bovik A. C., Sheikh H. R., Simoncelli E. P., 2004, Image Quality Assessment: From Error Visibility to Structural Similarity, IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 600-612
[23] Ansari M. S., Jiang H., Cockburn B. F., Han J., 2018, Low-Power Approximate Multipliers Using Encoded Partial Products and Approximate Compressors, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol. 8, No. 3, pp. 404-416
[24] Ahmadinejad M., Moaiyeri M. H., 2021, Energy- and Quality-Efficient Approximate Multipliers for Neural Network and Image Processing Applications, IEEE Transactions on Emerging Topics in Computing, pp. 1-1
Author
Yongqiang Zhang received the B.S. degree in electronic science and technology from
Anhui Jianzhu University, Hefei, China, in 2013, and the Ph.D. degree in integrated
circuits and systems from the Hefei University of Technology, Hefei, in 2018. He was
a Visiting Student with the Department of Electrical and Computer Engineering, University
of Alberta, for one year. He is currently with the School of Microelectronics, Hefei
University of Technology. His research interests include approximate computing, stochastic
computing, VLSI design, and nanoelectronics circuits and systems.
Cong He received her B.S. degree in Electronic Information and Engineering from
Anhui Jianzhu University, Hefei, China, in 2019. She is currently pursuing the M.S.
degree in Microelectronics with the Hefei University of Technology. Her research
interests include approximate computing and emerging technologies in computing systems.
Xiaoyue Chen received her B.S. degree in Electronic and Information Engineering
from the Liaoning University of Engineering and Technology, Huludao, China, in 2021.
She is currently pursuing the M.S. degree in Microelectronics with the Hefei University
of Technology. Her research interests include approximate computing and stochastic
computing.
Guangjun Xie received the B.S. and M.S. degrees in microelectronics from
the Hefei University of Technology, Hefei, China, in 1992 and 1995, respectively,
and the Ph.D. degree in signal and information processing from the University of Science
and Technology of China, Hefei, in 2002. He worked as a Post-Doctoral Researcher in
optics with the University of Science and Technology of China from 2003 to 2005. He
was a Senior Visitor with IMEC in 2007 and ASIC in 2011. He is currently a Professor
with the School of Microelectronics, Hefei University of Technology. His research
interests include integrated circuit design and nanoelectronics. Dr. Xie is a Senior
Member of the Chinese Institute of Electronics.