Zhang Yongqiang$^{1}$, He Cong$^{1}$, Chen Xiaoyue$^{1}$, Xie Guangjun$^{1}$*
($^{1}$School of Microelectronics, Hefei University of Technology, Hefei, China
ahzhangyq@hfut.edu.cn, 2191158315@qq.com, 1617090911@qq.com, gjxie8005@hfut.edu.cn)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Approximate computing, Multiplier, Compressor, Energy consumption, Image multiplication
1. Introduction
Approximate computing is an attractive paradigm in circuit design: it relaxes the requirement
for fully accurate operations and reduces power, delay, and area at the expense of computing
accuracy. This trade-off between hardware cost and computing accuracy is especially relevant
to error-resilient applications such as machine learning and multimedia processing.
Multipliers are basic blocks of digital systems and usually involve three steps: 1) generating
the partial products, 2) reducing the partial products, and 3) summing the final result. Among
them, the second step accounts for the dominant hardware cost. Using efficient compressors can
significantly reduce the complexity of this step and thus improve the performance of multipliers
[1]; in particular, 4-2 compressors are widely applied in multipliers to accelerate the reduction
of partial products. In [2], a compressor ignored the input signal cin and the output signal cout
to improve the power and delay of multipliers. The multiplier utilizing that compressor shows a
great reduction in hardware requirements and transistor count compared to existing designs.
Three 4-2 compressors were proposed in [3] by modifying the truth table of an exact compressor;
however, the multipliers using these compressors were inferior in overall performance. In [4],
a partial-product-altering method was applied to a 4-2 compressor, realizing a balance between
hardware cost and multiplier accuracy. A compressor using a majority gate was designed in [5]
by ignoring the input signals x$_{2}$ and cin and the output signal cout to achieve excellent
power and delay performance. The stacking circuit technique was adopted in [6] to design
approximate multipliers with high computing accuracy, albeit at high hardware cost. In [7],
a new compressor was designed using only simple AND-OR gates, and the multiplier utilizing this
compressor provided a good trade-off between error and electrical performance. The dual-quality
4-2 compressors introduced in [8] can be switched flexibly between precise and approximate
operating modes, so multipliers using these compressors can change their accuracy dynamically
at runtime.
To improve the trade-off between hardware cost and computing accuracy in approximate
circuits, this paper proposes a set of approximate 8${\times}$8 Dadda multipliers.
To that end, an imprecise 4-2 compressor using only OR and XNOR gates is designed
by introducing symmetrical errors into the truth table of the exact compressor; these
errors can counteract each other within a multiplier. This method reduces the area,
power, and delay of the multipliers while still producing acceptable results.
The main contributions of this paper are summarized as follows.
1) An approximate 4-2 compressor is proposed to simplify the design complexity of
the partial product reduction step in multipliers.
2) A set of approximate Dadda multipliers is built from the compressors to find a
better structure with a lower hardware cost and higher computing accuracy.
3) The image multiplication operation is realized through these multipliers to evaluate
computing accuracy in real applications.
4) The trade-off between hardware cost and accuracy in the multipliers is comprehensively
analyzed through various evaluation criteria as an example in approximate computing.
This paper proceeds as follows. In Section 2, the previous approximate 4-2 compressors
are reviewed. Section 3 presents the proposed approximate compressor and multipliers.
The synthesis results and their application to image processing are presented in Section
4. Section 5 concludes this paper.
2. Related Work
In this paper, we look to 4-2 compressors to build 8${\times}$8 Dadda multipliers
owing to their simplified structure and high efficiency in transistor-level implementations.
In recent years, several methods have been proposed to design imprecise 4-2 compressors,
and they were utilized to design approximate multipliers. Some previous approximate
designs that ignored cin and cout are summarized and compared in this section.
In the approximate 4-2 compressor presented in [2], the delay of the critical path is less than
that of the previous design, and the number of gates is further reduced. Three approximate 4-2
compressors were proposed in [3]; they use Karnaugh maps to obtain simplified logical expressions
that reduce errors while providing a significant performance improvement over previous 4-2
compressors. The first and second designs in [3] have only four gates, which greatly simplifies
the structural complexity. The third design is the most accurate but has a more complex structure
than the other designs. In [4], to simplify the circuit of the 4-2 compressor, an OR gate replaces
an XOR gate in computing the sum, thus introducing additional errors. An ultra-efficient compressor
proposed in [5] consists of one majority gate, which is different from conventional designs. Since
input x$_{2}$ is omitted and the sum output is fixed at 1, this approximate compressor achieves a
simpler logic implementation. The compressors in [6] achieve high accuracy by using the stacking
circuit technique. A hardware-efficient approximate
compressor proposed in [9] was obtained by modifying the truth table of the exact compressor, and consists of
only three NOR gates and one NAND gate. In [10], an ultra-compact 4-2 compressor was proposed based on simple AND-OR logic, which
leads to a trade-off between hardware cost and precision. In [11], the proposed compressor was obtained by modifying an approximate compressor, and
the performance of the applied multiplier improved. Three approximate compressors
were presented in [12], and they all innovatively reduced the number of outputs to one, thus significantly
reducing the hardware cost.
3. The Proposed Compressor and Multipliers
3.1 The Compressor
As shown in Fig. 1, an exact 4-2 compressor generally consists of two cascaded full adders with five inputs (x$_{1}$,
x$_{2}$, x$_{3}$, x$_{4}$, and cin) and three outputs (sum, carry, and cout) [13]. The outputs encode the number of logic 1s among the five inputs according to (1), (2), and (3):
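A common formulation of these relations, consistent with the two-full-adder structure in Fig. 1 [13], is

$$sum = x_{1} \oplus x_{2} \oplus x_{3} \oplus x_{4} \oplus cin \quad (1)$$

$$carry = (x_{1} \oplus x_{2} \oplus x_{3} \oplus x_{4})\,cin + \overline{(x_{1} \oplus x_{2} \oplus x_{3} \oplus x_{4})}\,x_{4} \quad (2)$$

$$cout = (x_{1} \oplus x_{2})\,x_{3} + \overline{(x_{1} \oplus x_{2})}\,x_{1} \quad (3)$$

so that $x_{1}+x_{2}+x_{3}+x_{4}+cin = sum + 2\,(carry + cout)$.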
The four inputs, x$_{1}$, x$_{2}$, x$_{3}$, and x$_{4}$, and the output sum have the
same weight, whereas the weights of cout and carry are one binary bit order higher
[12,14]. Therefore, cout and carry are delivered to the next module of higher significance.
In this work, the proposed 4-2 compressor (Fig. 2) is derived by modifying the truth table of the exact compressor to obtain simpler
logic expressions, as seen in (4) and (5), along with ignoring signals cin and cout for design efficiency, as seen in previous
work [2]. Inputs x$_{1}$ and x$_{2}$ are also omitted to further simplify the compressor and
reduce its energy and critical-path delay. Thus, it consists of only an OR gate and an XNOR gate.
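Consistent with the truth table in Table 1, the resulting expressions are

$$sum = \overline{x_{3} \oplus x_{4}} \quad (4)$$

$$carry = x_{3} + x_{4} \quad (5)$$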
Although omitting x$_{1}$ and x$_{2}$ introduces certain errors, the proposed compressors
are used only in the approximate part of the multipliers, so they have little impact on
computing accuracy. Accordingly, the focus is on the hardware/accuracy trade-off of the
multipliers rather than on any single metric.
As seen in the truth table in Table 1, the proposed design produces erroneous outputs for eight
of the 16 input combinations. Error is defined as the arithmetic distance between the exact and
approximate values [15]. For example, when all inputs are 1, the exact output is 4, and the
proposed compressor produces a 1 for both sum and carry; the decimal output is then 3, so the
error distance is 1. The maximum error magnitude generated by this design is 1 (+1 or -1), which
avoids unacceptable results when the compressor is applied to approximate multipliers. Moreover,
within the structure of a multiplier, error distances with opposite signs (-1 and +1) counteract
each other [5].
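As a minimal behavioral sketch of this error pattern, the following Python snippet enumerates all 16 combinations of x$_{1}$-x$_{4}$ using (4) and (5) and reproduces the error column of Table 1:

```python
from itertools import product

def approx_compressor(x3, x4):
    # Proposed approximate 4-2 compressor: x1, x2, cin, and cout are ignored;
    # sum = XNOR(x3, x4) and carry = OR(x3, x4), as in (4) and (5).
    s = 1 - (x3 ^ x4)   # XNOR
    carry = x3 | x4     # OR
    return carry, s

errors = []
for x4, x3, x2, x1 in product((0, 1), repeat=4):
    carry, s = approx_compressor(x3, x4)
    approx = 2 * carry + s          # decimal value of the outputs
    exact = x1 + x2 + x3 + x4       # number of ones among the four inputs
    errors.append(exact - approx)   # error distance, as in Table 1

print(errors)                        # every entry is -1, 0, or +1
print(sum(e != 0 for e in errors))   # 8 erroneous combinations
print(sum(errors))                   # 0: the +1 and -1 errors cancel out
```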
Fig. 1. The conventional 4-2 compressor.
Fig. 2. The proposed 4-2 compressor.
Table 1. Truth table of the proposed 4-2 compressor.
| x4 | x3 | x2 | x1 | exact | carry | sum | approximate | error |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | -1 |
| 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 |
| 0 | 0 | 1 | 1 | 2 | 0 | 1 | 1 | 1 |
| 0 | 1 | 0 | 0 | 1 | 1 | 0 | 2 | -1 |
| 0 | 1 | 0 | 1 | 2 | 1 | 0 | 2 | 0 |
| 0 | 1 | 1 | 0 | 2 | 1 | 0 | 2 | 0 |
| 0 | 1 | 1 | 1 | 3 | 1 | 0 | 2 | 1 |
| 1 | 0 | 0 | 0 | 1 | 1 | 0 | 2 | -1 |
| 1 | 0 | 0 | 1 | 2 | 1 | 0 | 2 | 0 |
| 1 | 0 | 1 | 0 | 2 | 1 | 0 | 2 | 0 |
| 1 | 0 | 1 | 1 | 3 | 1 | 0 | 2 | 1 |
| 1 | 1 | 0 | 0 | 2 | 1 | 1 | 3 | -1 |
| 1 | 1 | 0 | 1 | 3 | 1 | 1 | 3 | 0 |
| 1 | 1 | 1 | 0 | 3 | 1 | 1 | 3 | 0 |
| 1 | 1 | 1 | 1 | 4 | 1 | 1 | 3 | 1 |
3.2 The Approximate Multipliers
To investigate the impact of the proposed compressor on multiplication, 8${\times}$8
Dadda multipliers with various levels of accuracy are designed. The basic structure
of the approximate Dadda multiplier was described in [2], where AND gates generate all partial
products in the first step, approximate compressors then compress them into at most two rows,
and an exact ripple-carry adder computes the final result in the last step.
In designing multipliers, the second step plays a critical role in terms of delay,
power consumption, and area. The proposed multipliers are denoted M${\alpha}$${\beta}$${\gamma}$,
where ${\alpha}$, ${\beta}$, and ${\gamma}$, respectively, represent the number of
columns using exact compressors, approximate compressors, and truncation to compress
partial products. To find an effective way to improve the performance of multipliers,
the least significant bits of the partial products are truncated. In some applications,
such as image processing, accuracy beyond a certain level is unnecessary.
Furthermore, the corresponding exact operations consume relatively high amounts
of energy. Therefore, exact compressors are utilized for the most significant bits
to make up for the lack of computing accuracy, while the proposed approximate compressors
are applied to the middle of the partial products to reduce the hardware cost. To
investigate the trade-off between hardware cost and accuracy, a set of multipliers
was designed. Obviously, M7${\beta}$${\gamma}$ and M6${\beta}$${\gamma}$ aim at improving
computing accuracy, while M5${\beta}$${\gamma}$ is used to reduce the hardware cost.
For example, the partial product reduction step of the proposed M654 is shown in Fig. 3, where each dot represents a partial product bit. In the first two stages, three
half adders, three full adders, 10 of the proposed imprecise 4-2 compressors, and
six exact 4-2 compressors are utilized. In the last stage, a half adder and nine full
adders are applied to compute the results.
Fig. 3. Partial product reduction of the proposed M654.
4. Simulation Results and Application
In this section, all designs were described in Verilog HDL and synthesized with
Synopsys Design Compiler NXT using a TSMC 65 nm standard-cell library at 100 MHz
to evaluate performance. Note that the standard CMOS cell library does not include
dedicated cells for these modules, so all circuits were synthesized using the compile\_ultra
command to provide a fair comparison, and the logic functions of the existing designs were
optimized under the same conditions. The reported power data were obtained from the Synopsys
PrimePower tool using vector-free power analysis. In addition, the error metrics and
the image-processing application of the multipliers were programmed in Matlab.
4.1 The Approximate Compressor
A comparison of the proposed compressor and the existing exact and approximate compressors
in terms of area, power, and delay is shown in Table 2. For clarity, the three designs proposed in [3] are denoted [3]1, [3]2, and [3]3, and the three designs in [6] are denoted [6]1, [6]2, and [6]3. To comprehensively evaluate the efficiency of the proposed design, the power-delay product
(PDP) and energy-delay product (EDP) are also listed [9,16].
As can be seen from Table 2, the proposed approximate compressor has a 74% reduction in area, a 27% reduction
in delay, and a 91% reduction in PDP, compared to the exact 4-2 compressor. Besides,
it is noteworthy that the proposed compressor has the lowest area and power, compared
to state-of-the-art 4-2 compressors. Although its PDP is slightly higher than that of [5], their EDP values are equal. In summary, the proposed approximate 4-2 compressor has an advantage
in hardware overhead, owing to the optimized structure using only one OR gate and
one XNOR gate. Although the compressor in [5] has better delay and power than the one proposed here, the approximate multiplier
in [5] is inferior to the multipliers proposed here, as is explained later.
Table 2. Hardware comparison of 4-2 compressors.
| Design | Area (${\mu}$m$^{2}$) | Power (mW) | Delay (ns) | PDP (fJ) | EDP (fJ∙ns) |
|---|---|---|---|---|---|
| Proposed | 4.68 | 4.93×10$^{-4}$ | 0.30 | 0.15 | 0.04 |
| [2] | 6.84 | 1.26×10$^{-3}$ | 0.46 | 0.58 | 0.27 |
| [3]1 | 14.04 | 1.36×10$^{-3}$ | 0.35 | 0.48 | 0.17 |
| [3]2 | 13.32 | 1.66×10$^{-3}$ | 0.34 | 0.56 | 0.19 |
| [3]3 | 14.40 | 1.40×10$^{-3}$ | 0.32 | 0.45 | 0.14 |
| [4] | 11.52 | 1.27×10$^{-3}$ | 0.36 | 0.46 | 0.17 |
| [5] | 5.04 | 5.46×10$^{-4}$ | 0.25 | 0.14 | 0.04 |
| [6]1 | 11.16 | 2.00×10$^{-3}$ | 0.33 | 0.66 | 0.22 |
| [6]2 | 15.84 | 2.29×10$^{-3}$ | 0.43 | 0.98 | 0.42 |
| [6]3 | 17.28 | 2.42×10$^{-3}$ | 0.45 | 1.09 | 0.49 |
| Exact | 18.00 | 3.95×10$^{-3}$ | 0.41 | 1.62 | 0.66 |
4.2 The Approximate Multipliers
4.2.1 Hardware Cost
The area, power, delay, PDP, and EDP of the approximate and exact multipliers are
listed in Table 3. The proposed multipliers are divided into three types (M7${\beta}$${\gamma}$, M6${\beta}$${\gamma}$,
and M5${\beta}$${\gamma}$) to get the trade-off between hardware cost and computing
accuracy.
Table 3. Hardware comparison of 8${\times}$8 multipliers.
| Design | Area (${\mu}$m$^{2}$) | Power (mW) | Delay (ns) | PDP (fJ) | EDP (fJ∙ns) |
|---|---|---|---|---|---|
| M753 | 360.00 | 4.76×10$^{-2}$ | 1.55 | 73.78 | 114.36 |
| M744 | 342.36 | 4.51×10$^{-2}$ | 1.56 | 70.36 | 109.76 |
| M735 | 331.92 | 4.36×10$^{-2}$ | 1.54 | 67.14 | 103.40 |
| M726 | 329.76 | 4.13×10$^{-2}$ | 1.56 | 64.43 | 100.51 |
| M717 | 292.68 | 3.69×10$^{-2}$ | 1.63 | 60.15 | 98.04 |
| M663 | 314.64 | 4.03×10$^{-2}$ | 1.46 | 58.84 | 85.90 |
| M654 | 298.80 | 3.83×10$^{-2}$ | 1.44 | 55.15 | 79.42 |
| M645 | 285.84 | 3.61×10$^{-2}$ | 1.42 | 51.26 | 72.79 |
| M636 | 267.84 | 3.42×10$^{-2}$ | 1.42 | 48.56 | 68.96 |
| M627 | 246.24 | 3.04×10$^{-2}$ | 1.32 | 40.13 | 52.97 |
| M618 | 227.16 | 2.71×10$^{-2}$ | 1.35 | 36.59 | 49.39 |
| M573 | 275.40 | 3.38×10$^{-2}$ | 1.38 | 46.64 | 64.37 |
| M564 | 258.84 | 3.16×10$^{-2}$ | 1.26 | 39.82 | 50.17 |
| M555 | 245.88 | 2.99×10$^{-2}$ | 1.29 | 38.57 | 49.76 |
| M546 | 226.08 | 2.78×10$^{-2}$ | 1.27 | 35.31 | 44.84 |
| M537 | 207.36 | 2.47×10$^{-2}$ | 1.27 | 31.37 | 39.84 |
| M528 | 185.40 | 2.18×10$^{-2}$ | 1.21 | 26.38 | 31.92 |
| M519 | 160.56 | 1.83×10$^{-2}$ | 1.26 | 23.06 | 29.05 |
| [2] | 389.52 | 3.73×10$^{-2}$ | 1.71 | 63.78 | 109.07 |
| [3]1 | 398.52 | 3.52×10$^{-2}$ | 1.58 | 55.62 | 87.87 |
| [3]2 | 423.36 | 3.72×10$^{-2}$ | 1.85 | 68.82 | 127.32 |
| [3]3 | 420.12 | 3.36×10$^{-2}$ | 1.89 | 63.50 | 120.02 |
| [4] | 325.44 | 3.13×10$^{-2}$ | 1.52 | 47.58 | 72.32 |
| [5] | 264.24 | 2.76×10$^{-2}$ | 1.35 | 37.26 | 50.30 |
| [6]1 | 498.96 | 6.4×10$^{-2}$ | 1.66 | 106.24 | 176.36 |
| [6]2 | 510.84 | 6.9×10$^{-2}$ | 1.73 | 119.37 | 206.51 |
| [6]3 | 567.72 | 7.35×10$^{-2}$ | 1.77 | 130.10 | 230.27 |
| Exact | 577.80 | 7.81×10$^{-2}$ | 1.81 | 141.36 | 255.86 |
As seen from the results in Table 3, M5${\beta}$${\gamma}$ has the smallest area, power, and delay of the three types
of multipliers, whereas M7${\beta}$${\gamma}$ has the highest, and M6${\beta}$${\gamma}$
is in the middle, reflecting the influence of ${\alpha}$. For each type of multiplier
(e.g., M7${\beta}$${\gamma}$), as ${\gamma}$ increases, ${\beta}$ decreases and the hardware
cost drops accordingly. PDP and EDP are reported to further assess the performance of these
multipliers, and they follow the same trend.
Note that the proposed multipliers greatly outperform the exact design, reducing area, delay,
and power by 38%-72%, 14%-33%, and 39%-77%, respectively. Besides, most of the
M5${\beta}$${\gamma}$ multipliers achieve significant hardware improvements over previous
designs; in particular, M519 has the best hardware performance of all designs, reducing PDP
and EDP on average by 67% and 75%, respectively.
4.2.2 Computing Accuracy
To evaluate the output quality from approximate multipliers, error rate (ER), mean
error distance (MED), and normalized mean error distance (NMED) were computed by applying
all 65,536 possible input combinations [16]. ER is the probability of producing an erroneous result, and MED is calculated with
(6):
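$$MED = \frac{1}{2^{2N}} \sum_{i=1}^{2^{2N}} \left| ED_{i} \right| \quad (6)$$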
where N is the bit width of a multiplier, and ED$_{i}$ represents the arithmetic difference
between the approximate and exact results. NMED, which normalizes MED by the maximum output
of the exact multiplier, is expressed in (7):
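$$NMED = \frac{MED}{\left(2^{N}-1\right)^{2}} \quad (7)$$

As a minimal sketch of this evaluation (the original metrics were computed in Matlab), the Python snippet below computes ER, MED, and NMED by exhaustive enumeration; approx_mult is a placeholder for any behavioral model of the multiplier under test:

```python
def error_metrics(approx_mult, n_bits=8):
    """Exhaustively evaluate ER, MED, and NMED of an n_bits x n_bits multiplier.

    approx_mult(a, b) is a behavioral model returning the approximate product.
    """
    total = 1 << (2 * n_bits)              # 65,536 input pairs for n_bits = 8
    max_exact = ((1 << n_bits) - 1) ** 2   # maximum exact output (255^2)
    erroneous = 0
    ed_sum = 0
    for a in range(1 << n_bits):
        for b in range(1 << n_bits):
            ed = abs(approx_mult(a, b) - a * b)   # arithmetic error distance
            erroneous += ed != 0
            ed_sum += ed
    er = erroneous / total
    med = ed_sum / total
    nmed = med / max_exact
    return er, med, nmed

# Usage with a crude stand-in model (truncating the four least significant
# product bits); this only shows the interface, not one of the proposed designs.
print(error_metrics(lambda a, b: (a * b) & ~0xF))
```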
The accuracy metrics of the proposed multipliers are listed in Table 4. In the three types of multiplier, M7${\beta}$${\gamma}$ has a relatively small ER,
MED, and NMED. Besides, all the multipliers have a high ER, mainly due to the truncated
structure. ER decreases as the number of truncated columns increases. As for MED and
NMED, they decrease as ${\gamma}$ increases, and drop to a minimum when ${\beta}$
is 2, then increase again. When the number of truncated columns reached the highest
level, the multipliers had the worst computing accuracy, but the accuracy of M717
was higher than M663, and M618 was better than M573 due to the exact part of the most
significant bits.
Table 4. ER, MED, and NMED of approximate 8${\times}$8 multipliers.
| Design | ER (%) | MED | NMED |
|---|---|---|---|
| M753 | 99.77 | 1.96×10$^{2}$ | 3.01×10$^{-3}$ |
| M744 | 99.83 | 1.88×10$^{2}$ | 2.89×10$^{-3}$ |
| M735 | 99.80 | 1.68×10$^{2}$ | 2.58×10$^{-3}$ |
| M726 | 99.51 | 1.31×10$^{2}$ | 2.01×10$^{-3}$ |
| M717 | 99.22 | 1.72×10$^{2}$ | 2.65×10$^{-3}$ |
| M663 | 99.89 | 3.49×10$^{2}$ | 5.36×10$^{-3}$ |
| M654 | 99.91 | 3.41×10$^{2}$ | 5.25×10$^{-3}$ |
| M645 | 99.91 | 3.22×10$^{2}$ | 4.95×10$^{-3}$ |
| M636 | 99.83 | 2.81×10$^{2}$ | 4.33×10$^{-3}$ |
| M627 | 99.66 | 2.63×10$^{2}$ | 4.04×10$^{-3}$ |
| M618 | 99.51 | 4.29×10$^{2}$ | 6.60×10$^{-3}$ |
| M573 | 99.95 | 6.78×10$^{2}$ | 10.42×10$^{-3}$ |
| M564 | 99.95 | 6.71×10$^{2}$ | 10.32×10$^{-3}$ |
| M555 | 99.95 | 6.55×10$^{2}$ | 10.08×10$^{-3}$ |
| M546 | 99.92 | 6.11×10$^{2}$ | 9.40×10$^{-3}$ |
| M537 | 99.85 | 5.64×10$^{2}$ | 8.67×10$^{-3}$ |
| M528 | 99.83 | 4.79×10$^{2}$ | 7.36×10$^{-3}$ |
| M519 | 99.80 | 8.01×10$^{2}$ | 12.33×10$^{-3}$ |
| [2] | 99.10 | 3.15×10$^{3}$ | 48.46×10$^{-3}$ |
| [3]1 | 87.19 | 3.62×10$^{3}$ | 55.73×10$^{-3}$ |
| [3]2 | 87.19 | 4.17×10$^{3}$ | 64.2×10$^{-3}$ |
| [3]3 | 97.26 | 5.91×10$^{3}$ | 90.92×10$^{-3}$ |
| [4] | 85.73 | 2.24×10$^{3}$ | 34.41×10$^{-3}$ |
| [5] | 99.82 | 4.94×10$^{2}$ | 7.60×10$^{-3}$ |
| [6]1 | 55.34 | 0.70×10$^{2}$ | 1.07×10$^{-3}$ |
| [6]2 | 17.96 | 0.17×10$^{2}$ | 0.26×10$^{-3}$ |
| [6]3 | 3.59 | 0.03×10$^{2}$ | 0.04×10$^{-3}$ |
Compared to previous work, the NMED of the proposed multipliers is not the lowest; however,
it is acceptable for most image-processing applications [17]. M528 is more accurate than all
previous designs except those in [6]. Although the multipliers in [6] have an advantage in the
accuracy metrics, they carry the highest hardware cost, as shown in Table 3. Therefore, all
performance evaluation metrics should be taken into account.
The error distributions of the proposed multipliers, including M7${\beta}$${\gamma}$,
M6${\beta}$${\gamma}$, and M5${\beta}$${\gamma}$, are shown in Fig. 4; the errors lie mainly
in the ranges [-600, 600], [-1000, 1000], and [-2000, 1000], respectively, accounting on
average for about 83%, 84%, and 84% of all cases. Thus, reserving an appropriate number of
the most significant bits preserves the accuracy of a multiplier.
Fig. 4. Error distance from the multipliers: (a) M5${\beta}$${\gamma}$; (b) M6${\beta}$${\gamma}$; (c) M7${\beta}$${\gamma}$.
As seen from the results above, M5${\beta}$${\gamma}$ offers better hardware metrics but a
worse NMED, while M7${\beta}$${\gamma}$ offers a better NMED at a higher hardware cost. Thus,
to reconcile the trade-off between accuracy and hardware cost, a figure of merit (FOM) was
suggested in [8]. Because the proposed multipliers have relatively small delays, the delay term
is removed for a fair comparison, and the modified metric FOM1 is given in (8) [5]:
Fig. 5 shows FOM1 for the proposed and existing approximate 8${\times}$8 multipliers. The
smaller the value of FOM1, the better the trade-off between accuracy and hardware.
Thus, M627, M618, M564, M555, M546, M537, M528, and M519 have a lower FOM1 compared
with other designs, indicating that most of the proposed multipliers offer a better
trade-off than previous designs.
Fig. 5. FOM of approximate 8${\times}$8 multipliers.
4.3 Image Multiplication
To assess the practicality of the approximate multipliers in real applications, they were
applied to image multiplication, a widely used operation in image processing. The discussed
multipliers multiply two images pixel by pixel, thereby blending them into a single image [18-21].
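As a minimal sketch of this operation (the exact scaling convention is not specified in the text; dividing the 16-bit product by 255 is assumed here), the pixel-wise blending can be modeled in Python as follows, with approx_mult standing in for any of the discussed 8${\times}$8 multipliers:

```python
import numpy as np

def multiply_images(img_a, img_b, approx_mult):
    """Blend two 8-bit grayscale images by pixel-wise multiplication.

    approx_mult(a, b) is a behavioral model of an 8x8 multiplier; the 16-bit
    product is rescaled back to 8 bits (division by 255 is assumed).
    """
    h, w = img_a.shape
    out = np.empty((h, w), dtype=np.uint8)
    for i in range(h):
        for j in range(w):
            p = approx_mult(int(img_a[i, j]), int(img_b[i, j]))
            out[i, j] = min(255, p // 255)   # rescale to the 8-bit range
    return out
```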
The peak signal-to-noise ratio (PSNR) and the mean structural similarity index metric
(MSSIM) [22] were computed to evaluate the quality of the processed images. PSNR is expressed
in (9):
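In its standard form,

$$PSNR = 10\log_{10}\!\left(\frac{MAX^{2}}{\frac{1}{w\,r}\sum_{i=1}^{w}\sum_{j=1}^{r}\bigl(S'(i,j)-S(i,j)\bigr)^{2}}\right) \quad (9)$$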
where w and r are the width and height of the image, S'(i, j) and S(i,
j) represent the exact and approximate value of each pixel, respectively, and MAX
is the maximum pixel value. The larger the PSNR, the better the image. MSSIM is expressed
in (10):
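Following [22],

$$MSSIM(X,Y) = \frac{1}{M}\sum_{j=1}^{M} SSIM(x_{j}, y_{j}), \qquad SSIM(x,y) = \frac{(2\mu_{x}\mu_{y}+C_{1})(2\sigma_{xy}+C_{2})}{(\mu_{x}^{2}+\mu_{y}^{2}+C_{1})(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2})} \quad (10)$$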
where X and Y represent two images. Other parameters can be found in detail in [22]. MSSIM reaches 1 when the two processed images are the same.
Table 5 shows PSNR and MSSIM values for five image multiplication examples. All the proposed
multipliers achieve PSNR values higher than 30 dB for the various images, and a PSNR above
30 dB is generally considered good enough [23]. Besides, the MSSIM results for all the
approximate multipliers are very close to that of the exact design (MSSIM = 1). Moreover,
both PSNR and MSSIM increase as the number of exact columns increases.
Table 5. PSNR and MSSIM of multiplied images using the 8${\times}$8 multipliers.
PSNR (dB)

| Design | Lena×LenaRGB | Baboon×BaboonRGB | Goldhill×Goldhill | Goldhill×LenaRGB | Goldhill×BaboonRGB |
|---|---|---|---|---|---|
| M753 | 46.03 | 45.13 | 46.20 | 45.97 | 45.72 |
| M744 | 46.33 | 45.43 | 46.50 | 46.25 | 46.02 |
| M735 | 47.15 | 46.17 | 47.26 | 46.97 | 46.72 |
| M726 | 48.56 | 48.24 | 48.46 | 48.89 | 48.80 |
| M717 | 46.66 | 47.30 | 45.68 | 46.60 | 46.73 |
| M663 | 41.55 | 40.19 | 38.99 | 41.55 | 41.25 |
| M654 | 41.70 | 40.32 | 39.08 | 41.70 | 41.41 |
| M645 | 42.12 | 40.74 | 39.44 | 42.11 | 41.82 |
| M636 | 43.02 | 41.79 | 40.33 | 43.25 | 43.12 |
| M627 | 43.64 | 42.99 | 41.64 | 43.60 | 43.65 |
| M618 | 39.72 | 39.55 | 36.71 | 39.51 | 39.42 |
| M573 | 34.54 | 34.98 | 34.36 | 36.07 | 35.65 |
| M564 | 34.61 | 35.05 | 34.39 | 36.15 | 35.73 |
| M555 | 34.79 | 35.22 | 34.43 | 36.29 | 35.90 |
| M546 | 35.13 | 35.73 | 34.91 | 36.83 | 36.52 |
| M537 | 35.87 | 36.45 | 35.50 | 37.52 | 37.32 |
| M528 | 38.76 | 37.94 | 35.27 | 38.48 | 38.48 |
| M519 | 33.77 | 33.86 | 31.04 | 33.98 | 34.07 |
| [2] | 22.77 | 23.44 | 21.61 | 24.03 | 23.68 |
| [3]1 | 13.72 | 13.85 | 12.48 | 13.84 | 13.67 |
| [3]2 | 13.71 | 13.85 | 12.48 | 13.86 | 13.68 |
| [3]3 | 14.09 | 14.19 | 12.72 | 14.35 | 14.16 |
| [4] | 28.17 | 27.83 | 25.35 | 28.59 | 28.94 |
| [5] | 38.73 | 39.09 | 36.70 | 38.73 | 38.61 |
| [6]1 | 51.35 | 52.64 | 49.11 | 51.78 | 51.99 |
| [6]2 | 59.41 | 59.47 | 54.20 | 58.56 | 58.80 |
| [6]3 | 68.77 | 68.78 | 62.52 | 67.65 | 67.70 |

MSSIM

| Design | Lena×LenaRGB | Baboon×BaboonRGB | Goldhill×Goldhill | Goldhill×LenaRGB | Goldhill×BaboonRGB |
|---|---|---|---|---|---|
| M753 | 0.9985 | 0.9989 | 0.9966 | 0.9984 | 0.9980 |
| M744 | 0.9985 | 0.9990 | 0.9965 | 0.9984 | 0.9980 |
| M735 | 0.9986 | 0.9990 | 0.9966 | 0.9984 | 0.9980 |
| M726 | 0.9988 | 0.9992 | 0.9960 | 0.9987 | 0.9983 |
| M717 | 0.9987 | 0.9990 | 0.9943 | 0.9984 | 0.9980 |
| M663 | 0.9957 | 0.9967 | 0.9855 | 0.9953 | 0.9943 |
| M654 | 0.9957 | 0.9968 | 0.9851 | 0.9953 | 0.9943 |
| M645 | 0.9958 | 0.9968 | 0.9855 | 0.9953 | 0.9943 |
| M636 | 0.9960 | 0.9971 | 0.9858 | 0.9957 | 0.9947 |
| M627 | 0.9962 | 0.9972 | 0.9846 | 0.9956 | 0.9944 |
| M618 | 0.9955 | 0.9964 | 0.9742 | 0.9950 | 0.9929 |
| M573 | 0.9813 | 0.9896 | 0.9631 | 0.9847 | 0.9827 |
| M564 | 0.9814 | 0.9896 | 0.9629 | 0.9848 | 0.9827 |
| M555 | 0.9814 | 0.9896 | 0.9614 | 0.9844 | 0.9822 |
| M546 | 0.9815 | 0.9897 | 0.9616 | 0.9848 | 0.9826 |
| M537 | 0.9825 | 0.9900 | 0.9577 | 0.9849 | 0.9823 |
| M528 | 0.9902 | 0.9913 | 0.9444 | 0.9884 | 0.9848 |
| M519 | 0.9846 | 0.9872 | 0.9226 | 0.9827 | 0.9778 |
| [2] | 0.8630 | 0.8600 | 0.7214 | 0.7864 | 0.7994 |
| [3]1 | 0.6534 | 0.7018 | 0.5411 | 0.6542 | 0.6626 |
| [3]2 | 0.6550 | 0.7015 | 0.5416 | 0.6342 | 0.6507 |
| [3]3 | 0.6239 | 0.6753 | 0.4938 | 0.6049 | 0.6035 |
| [4] | 0.9367 | 0.9534 | 0.9464 | 0.9533 | 0.9478 |
| [5] | 0.9897 | 0.9916 | 0.9645 | 0.9873 | 0.9827 |
| [6]1 | 0.9995 | 0.9997 | 0.9982 | 0.9995 | 0.9994 |
| [6]2 | 0.9999 | 0.9999 | 0.9990 | 0.9999 | 0.9998 |
| [6]3 | 1.0000 | 1.0000 | 0.9998 | 1.0000 | 1.0000 |
To visualize the effect of approximate multiplication on image quality, multiplied
images LenaRGB and Lena (using the considered multipliers) are shown in Fig. 6. The results indicate no obvious differences between the proposed designs and the
exact design.
To comprehensively evaluate the efficiency of the discussed approximate designs in image
processing, hardware cost and image quality should be considered simultaneously rather than
assessed in isolation. To quantify this compromise, FOM2 is expressed in (11) [24]:
Fig. 6. The multiplied images for LenaRGB and Lena using 8${\times}$8 multipliers.
A smaller FOM2 value indicates a better compromise between hardware efficiency and accuracy.
To save space, FOM2 for the discussed multipliers is also shown in Fig. 5, and the results
indicate a decreasing trend. Among them, M627, M618, M537, M528, and M519 provide a better
FOM2 than the other designs. Specifically, M528 takes first place in this regard, with a 63%
reduction on average compared to the existing designs, followed by M519 and M537.
5. Conclusion
In this work, an ultra-efficient approximate 4-2 compressor was proposed by introducing
symmetrical errors into the truth table of the exact compressor. A set of Dadda multipliers,
denoted as M${\alpha}$${\beta}$${\gamma}$, was designed to investigate the hardware/accuracy
trade-off. Image multiplication was considered as an example to evaluate computing
accuracy. Experimental results showed that the accuracy of a multiplier is mainly
dominated by the exact part, while the hardware cost is affected by the approximate
and truncated parts. Furthermore, the two figures of merit show that a comprehensive indicator
should be considered to reach a compromise between hardware and accuracy, because
a multiplier with high accuracy consumes correspondingly more energy. In addition,
several proposed multipliers surpassed their counterparts under the considered criteria.
ACKNOWLEDGMENTS
This work was supported by the Fundamental Research Funds for the Central Universities
of China (Grant No. JZ2020HGQA0162, Grant No. JZ2020HGTA0085).
REFERENCES
[1] Angizi S., Jiang H., DeMara R. F., Han J., Fan D., 2018, Majority-Based Spin-CMOS Primitives for Approximate Computing, IEEE Transactions on Nanotechnology, Vol. 17, No. 4, pp. 795-806
[2] Momeni A., Han J., Montuschi P., Lombardi F., 2015, Design and Analysis of Approximate Compressors for Multiplication, IEEE Transactions on Computers, Vol. 64, No. 4, pp. 984-994
[3] Gorantla A., P. D., 2017, Design of Approximate Compressors for Multiplication, ACM J. Emerg. Technol. Comput. Syst., Vol. 13, No. 3, Article 44
[4] Venkatachalam S., Ko S., 2017, Design of Power and Area Efficient Approximate Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 5, pp. 1782-1786
[5] Sabetzadeh F., Moaiyeri M., Ahmadinejad M., 2019, A Majority-Based Imprecise Multiplier for Ultra-Efficient Approximate Image Multiplication, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 66, No. 11, pp. 4200-4208
[6] Strollo A., Napoli E., Caro D., Petra N., Meo G., 2020, Comparison and Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 9, pp. 3021-3034
[7] Esposito D., Strollo A. G. M., Napoli E., Caro D. D., Petra N., 2018, Approximate Multipliers Based on New Approximate Compressors, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 65, No. 12, pp. 4169-4182
[8] Akbari O., Kamal M., Afzali-Kusha A., Pedram M., 2017, Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 25, No. 4, pp. 1352-1361
[9] Ahmadinejad M., Moaiyeri M. H., Sabetzadeh F., 2019, Energy and Area Efficient Imprecise Compressors for Approximate Multiplication at Nanoscale, AEU - International Journal of Electronics and Communications, Vol. 110
[10] Salmanpour F., Moaiyeri M. H., Sabetzadeh F., 2021, Ultra-Compact Imprecise 4:2 Compressor and Multiplier Circuits for Approximate Computing in Deep Nanoscale, Circuits, Systems, and Signal Processing
[11] Ha M., Lee S., 2018, Multipliers With Approximate 4-2 Compressors and Error Recovery Modules, IEEE Embedded Systems Letters, Vol. 10, No. 1, pp. 6-9
[12] Pei H., Yi X., Zhou H., He Y., 2021, Design of Ultra-Low Power Consumption Approximate 4-2 Compressors Based on the Compensation Characteristic, IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 68, No. 1, pp. 461-465
[13] Chang C.-H., Gu J., Zhang M., 2004, Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2 Compressors for Fast Arithmetic Circuits, IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 51, No. 10, pp. 1985-1997
[14] Yi X., Pei H., Zhang Z., Zhou H., He Y., 2019, Design of an Energy-Efficient Approximate Compressor for Error-Resilient Multiplications, in 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-5
[15] Liang J., Han J., Lombardi F., 2013, New Metrics for the Reliability of Approximate and Probabilistic Adders, IEEE Transactions on Computers, Vol. 62, No. 9, pp. 1760-1771
[16] Guo W., Li S., 2021, Fast Binary Counters and Compressors Generated by Sorting Network, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 29, No. 6, pp. 1220-1230
[17] Jiang H., Santiago F. J. H., Mo H., Liu L., Han J., 2020, Approximate Arithmetic Circuits: A Survey, Characterization, and Recent Applications, Proceedings of the IEEE, Vol. 108, No. 12, pp. 2108-2135
[18] Strollo A. G. M., Caro D. D., Napoli E., Petra N., Meo G. D., 2020, Low-Power Approximate Multiplier with Error Recovery Using a New Approximate 4-2 Compressor, in 2020 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1-4
[19] Toan N. V., Lee J., 2019, Energy-Area-Efficient Approximate Multipliers for Error-Tolerant Applications on FPGAs, in 2019 32nd IEEE International System-on-Chip Conference (SOCC), pp. 336-341
[20] Savithaa N., Poornima A., 2019, A High Speed Area Efficient Compression Technique of Dadda Multiplier for Image Blending Application, in 2019 Third International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), pp. 426-430
[21] Savio M. M. D., Deepa T., 2020, Design of Higher Order Multiplier with Approximate Compressor, in 2020 IEEE International Conference on Electronics, Computing and Communication Technologies (CONECCT), pp. 1-6
[22] Wang Z., Bovik A. C., Sheikh H. R., Simoncelli E. P., 2004, Image Quality Assessment: From Error Visibility to Structural Similarity, IEEE Transactions on Image Processing, Vol. 13, No. 4, pp. 600-612
[23] Ansari M. S., Jiang H., Cockburn B. F., Han J., 2018, Low-Power Approximate Multipliers Using Encoded Partial Products and Approximate Compressors, IEEE Journal on Emerging and Selected Topics in Circuits and Systems, Vol. 8, No. 3, pp. 404-416
[24] Ahmadinejad M., Moaiyeri M. H., 2021, Energy- and Quality-Efficient Approximate Multipliers for Neural Network and Image Processing Applications, IEEE Transactions on Emerging Topics in Computing, pp. 1-1
Author
Yongqiang Zhang received the B.S. degree in electronic science and technology from
Anhui Jianzhu University, Hefei, China, in 2013, and the Ph.D. degree in integrated
circuits and systems from the Hefei University of Technology, Hefei, in 2018. He was
a Visiting Student with the Department of Electrical and Computer Engineering, University
of Alberta, for one year. He is currently with the School of Microelectronics, Hefei
University of Technology. His research interests include approximate computing, stochastic
computing, VLSI design, and nanoelectronics circuits and systems.
Cong He received her B.S. degree in Electronic Information and Engineering from
Anhui Jianzhu University, Hefei, China, in 2019. She is currently pursuing the M.S.
degree in Microelectronics with the Hefei University of Technology. Her research
interests include approximate computing and emerging technologies in computing systems.
Xiaoyue Chen received her B.S. degree in Electronic and Information Engineering
from the Liaoning University of Engineering and Technology, Huludao, China, in 2021.
She is currently pursuing the M.S. degree in Microelectronics with the Hefei University
of Technology. Her research interests include approximate computing and stochastic
computing.
Guangjun Xie received the B.S. and M.S. degrees in microelectronics from
the Hefei University of Technology, Hefei, China, in 1992 and 1995, respectively,
and the Ph.D. degree in signal and information processing from the University of Science
and Technology of China, Hefei, in 2002. He worked as a Post-Doctoral Researcher in
optics with the University of Science and Technology of China from 2003 to 2005. He
was a Senior Visitor with IMEC in 2007 and ASIC in 2011. He is currently a Professor
with the School of Microelectronics, Hefei University of Technology. His research
interests include integrated circuit design and nanoelectronics. Dr. Xie is a Senior
Member of the Chinese Institute of Electronics.