The Journal of Semiconductor Technology and Science (JSTS) is an international, peer-reviewed, and open-access journal that is published bimonthly.
- Scope: semiconductor processes, devices, circuits, and MEMS.
- Editor-in-Chief: Prof. Woo Young Choi (ECE, Seoul National University)
- Indexed within Science Citation Index Expanded (SCIE), SCOPUS, Korea Citation Index (KCI), and other databases.

[Research article]

JSTS(Journal of Semiconductor Technology and Science)

IEIE Vol. 24, No. 04, p.305-315

ISSN (print): 1598-1657

ISSN (online): 2233-4866

Received: 24 Nov. 2023; Revised: 11 Apr. 2024; Accepted: 8 May 2024

DOI: https://doi.org/10.5573/JSTS.2024.24.4.305

Design of an Approximate 4-2 Compressor with Error Recovery for Efficient Approximate Multiplication

Sungyoun Hwang¹, Hyelin Seok¹, Yongtae Kim^†

(The School of Computer Science and Engineering, Kyungpook National University, Daegu 41566, Korea)

^*Sungyoun Hwang and Hyelin Seok contributed equally to this work. E-mail: yongtae@knu.ac.kr (Corresponding author: Yongtae Kim)

License :

This is an Open-Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.(www.theieie.org).

Abstract

This paper introduces a novel and efficient approximate 4-2 compressor and multipliers that significantly improve overall computation accuracy with marginal hardware overhead. The proposed compressor incorporates an error recovery logic to rectify output errors under specific input conditions. As a result, the proposed multipliers, featuring this error recovery compressor, exhibit substantial improvements in normalized mean error distance (NMED) and mean relative error distance (MRED) by up to 89.8% and 97.1%, respectively, compared to existing approximate multipliers considered in this paper. Furthermore, when implemented in a 32-nm CMOS technology, the proposed designs enable noteworthy reductions of up to 25.2%, 22.9%, and 23.4% in area, power, and energy, respectively, in comparison to the alternative designs. The effectiveness of the proposed design is further validated through its application in a digital image processing algorithm.

Index Terms

Approximate computing, approximate multiplier, approximate compressor, error recovery, energy efficiency

I. INTRODUCTION

The trend toward device miniaturization and the prevalence of battery-powered devices are increasingly evident ^[1]. Within these devices, applications that demand complex data processing are executed ever more frequently. This has led to a surge in power and energy consumption and has prompted researchers to explore ways to reduce hardware resource consumption. In such contexts, deliberately sacrificing accuracy to reduce energy consumption is an appealing technique, and approximate computing emerges as a promising solution for balancing energy efficiency and precision ^[2]. In particular, multimedia processing, which includes tasks such as image and video processing, audio compression, and speech recognition, is a representative application where approximate computing can be readily applied ^[3-11]. These tasks primarily process visual and auditory information, which makes them less sensitive to computational errors because human sensory perception often fails to detect minor discrepancies or distortions ^[12]. Hence, by tactically sacrificing precision, the power and energy efficiency of multimedia devices can be significantly enhanced without noticeable loss in processing quality.

While arithmetic operations serve as the foundation for numerous computational tasks, multiplication stands out as one of the most fundamental and frequently utilized arithmetic operations ^[13]. Multipliers in digital systems often demand extensive computational resources, leading to substantial power consumption and latency. To address these challenges, researchers have explored the design of energy-efficient approximate multipliers tailored for error-tolerant applications ^[14-17]. By intentionally sacrificing computational accuracy, these multipliers aim to improve hardware efficiency. Allowing acceptable levels of error in their output, approximate multipliers offer the potential to reduce power consumption and enhance throughput. One of the highly effective design methodologies for constructing an approximate multiplier involves the utilization of approximate compressors ^[18-23]. The multiplication process based on the compressors encompasses three stages: 1) partial product generation; 2) partial product reduction; and 3) summation of the final two partial product terms. In this approach, the utilization of approximate compressors in the second stage plays a crucial role, significantly affecting the overall performance of the multiplier. In this regard, to achieve a balance between the accuracy and hardware resource consumption of approximate multipliers, a considerable number of approximate 4-2 compressors have been introduced in the literature ^[24-27].

Momeni et al. introduced a pair of approximate 4-2 compressors, one of which eliminated the input bit C$_{in}$ and output bit C$_{out}$ of an exact 4-2 compressor, resulting in a significant reduction in hardware resource consumption ^[24]. This design approach, characterized by its simplicity and excellent performance, had a notable impact on subsequent compressors. Akbari et al. proposed a set of four dual-quality approximate 4-2 compressor designs ^[25]. These compressors were integrated into their proposed multiplier architecture, which facilitates switching between accurate and approximate operational modes to effectively adjust approximate multiplication accuracy. Additionally, Venkatachalam et al. presented a novel approach featuring approximate versions of a half adder, a full adder, and a 4-2 compressor to realize an approximate multiplier considering area, power, and accuracy ^[26]. Furthermore, Pei et al. introduced approximate multipliers by proposing three new cost-effective approximate compressors and an error compensation module aimed at reducing the overall error distance in its output under specific input conditions ^[27].

In this paper, we present novel approximate multiplier designs based on an efficient 4-2 compressor, demonstrating substantial improvements in computational accuracy with minimal hardware overhead. Our design approach involves a systematic analysis of the conventional approximate compressor, aiming to significantly enhance its performance through the introduction of an error recovery logic. When implemented in a 32-nm CMOS technology, the proposed multiplier achieves considerable reductions in area, power, and energy by up to 25.2%, 22.9%, and 23.4%, respectively, compared to the approximate multipliers considered in this paper. The main contributions of this paper are summarized as follows:

• We systematically analyze an existing 4-2 compressor to mitigate its error distance and propose a novel error recovery logic that markedly enhances overall accuracy.

• Two distinct approximate multiplier designs are presented, leveraging different configurations of the proposed compressors, and their performance is shown.

• To validate the proposed multiplier’s efficacy, we integrate it into an image processing application, conducting a comparative analysis with outcomes of the others.

II. PROPOSED APPROXIMATE COMPRESSOR AND MULTIPLIER DESIGNS

In this section, we provide a brief overview of the conventional 4-2 Momeni compressor, delineating its characteristics and limitations. Then, we propose a novel approximate compressor designed to overcome the shortcomings of this traditional counterpart. Our compressor integrates a cost-effective error recovery logic to rectify errors that occur under a specific input condition, thereby enhancing overall accuracy. Furthermore, we present architectures for approximate multipliers based on the proposed compressor.

1. Proposed Approximate 4-2 Compressor with Error Recovery Logic

The Momeni compressor is an approximate 4-2 compressor that simplifies its design by eliminating the input signal $C_{in}$ and the output signal $C_{out}$ of an exact 4-2 compressor ^[24]. This simplification reduces logic complexity and hardware cost: it requires only three NOR gates for Carry generation and two XNOR gates plus one OR gate for Sum computation. Seok et al. further optimized the Momeni compressor, improving hardware efficiency by using compound gates, which are more cost-effective than equivalent combinations of regular gates ^[28]. To this end, De Morgan's law was applied to minimize the expressions of the Carry and Sum signals, so that the Seok compressor needs only two compound gates and a NOT gate (i.e., an inverter). Although the Momeni and Seok compressors provide a hardware-efficient implementation of approximate multipliers, they suffer from poor accuracy. Table 1 presents the truth table of the Momeni compressor for all possible input combinations. The compressor produces output errors when the input $X_{4:1}$ is 0000, 0011, 1100, or 1111. The compressor's inputs are partial products of the two multiplier operands, and because each partial product is the AND of two operand bits, each input bit is 0 with probability 3/4 and 1 with probability 1/4. Consequently, under uniformly distributed multiplier inputs, the likelihood of each compressor input pattern depends on its count of zeros and ones. For instance, the input pattern $X_{4:1}$ = 0101 occurs with probability 3/4 $\times$ 1/4 $\times$ 3/4 $\times$ 1/4 = 9/256, or about 3.5%, as shown in Table 1. Our primary observation is that the Momeni compressor introduces an error in its Sum signal for the most likely input pattern, $X_{4:1}$ = 0000, which occurs with probability 81/256 (31.6%), thereby substantially degrading the overall accuracy of approximate multiplications.

Table 1. The truth table of the Momeni compressor

X₄	X₃	X₂	X₁	Carry	Sum	Error Distance	Probability
0	0	0	0	0	1	+1	81/256
0	0	0	1	0	1	0	27/256
0	0	1	0	0	1	0	27/256
0	0	1	1	0	1	-1	9/256
0	1	0	0	0	1	0	27/256
0	1	0	1	1	0	0	9/256
0	1	1	0	1	0	0	9/256
0	1	1	1	1	1	0	3/256
1	0	0	0	0	1	0	27/256
1	0	0	1	1	0	0	9/256
1	0	1	0	1	0	0	9/256
1	0	1	1	1	1	0	3/256
1	1	0	0	0	1	-1	9/256
1	1	0	1	1	1	0	3/256
1	1	1	0	1	1	0	3/256
1	1	1	1	1	1	-1	1/256
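The entries of Table 1 can be reproduced with a short script. The following is a minimal sketch, assuming the gate-level description given above (Carry from three NOR gates, Sum as the OR of two XNORs) and the 3/4 versus 1/4 bit probabilities; the function names `momeni` and `pattern_probability` are illustrative and do not come from the original work.

```python
from fractions import Fraction
from itertools import product


def momeni(x4, x3, x2, x1):
    """Behavioural model of the approximate 4-2 compressor in Table 1."""
    carry = (x1 | x2) & (x3 | x4)            # NOR(NOR(x1,x2), NOR(x3,x4))
    s = (1 - (x1 ^ x2)) | (1 - (x3 ^ x4))    # XNOR(x1,x2) OR XNOR(x3,x4)
    return carry, s


def pattern_probability(bits):
    """Probability of an input pattern when each bit is 1 with probability 1/4."""
    p = Fraction(1)
    for b in bits:
        p *= Fraction(1, 4) if b else Fraction(3, 4)
    return p


rows = []
for bits in product((0, 1), repeat=4):       # bits = (x4, x3, x2, x1)
    carry, s = momeni(*bits)
    error = (2 * carry + s) - sum(bits)      # approximate value minus exact value
    rows.append((bits, carry, s, error, pattern_probability(bits)))

# Probability that the compressor output is wrong, and its mean |error distance|.
error_rate = sum(p for *_, e, p in rows if e != 0)
mean_abs_error = sum(abs(e) * p for *_, e, p in rows)

for bits, carry, s, error, p in rows:
    print(*bits, carry, s, f"{error:+d}", p)
print("error rate:", error_rate, "mean |ED|:", mean_abs_error)
```

Under these assumptions the four erroneous patterns together carry a probability of 100/256, of which 81/256 comes from the all-zero input alone.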

To improve accuracy by rectifying the output for the input pattern $X_{4:1}$ = 0000, we introduce an error recovery logic designed for this compressor. The recovery logic changes the Sum output from 1 to 0 when $X_{4:1}$ equals 0000, while leaving the Carry signal unchanged. Therefore, the truth table of the proposed compressor is identical to that of the Momeni compressor except for the input $X_{4:1}$ = 0000. The proposed compressor is characterized by the following equations:

(1)

$ Carry=(X_{1}+X_{2})\cdot (X_{3}+X_{4}), $

(2)

$ Sum=(X_{1}X_{2}+X_{3}X_{4}+\overline{Carry})\cdot (X_{1}+X_{2}+X_{3}+X_{4}). $

Realizing these equations in digital logic requires a few additional gates compared to the conventional counterpart. Fig. 1 illustrates the implementation of the proposed compressor. In contrast to the traditional compressor, which employs two compound gates (AO221 and OA22) and an inverter, the proposed design integrates an additional compound gate (OA21) and a three-input OR gate for error recovery. The error recovery logic can alternatively be realized using a four-input OR gate and a two-input AND gate; in the following sections, we use the former of the two configurations. Consequently, the proposed compressor yields the same Sum and Carry outputs as the conventional design for all input patterns listed in Table 1 except $X_{4:1}$ = 0000, for which it generates the accurate Sum and Carry values, both 0, thereby enhancing the overall accuracy.
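Equations (1) and (2) can be checked exhaustively against the Momeni behaviour. The sketch below is a behavioural model only, not the gate-level netlist of Fig. 1; the names `momeni`, `proposed`, `proposed_alt`, and `mean_abs_error` are illustrative, and the input probabilities follow the 3/4 versus 1/4 assumption used for Table 1.

```python
from fractions import Fraction
from itertools import product


def momeni(x4, x3, x2, x1):
    """Conventional approximate 4-2 compressor (behaviour of Table 1)."""
    carry = (x1 | x2) & (x3 | x4)
    s = (1 - (x1 ^ x2)) | (1 - (x3 ^ x4))
    return carry, s


def proposed(x4, x3, x2, x1):
    """Compressor with error recovery, written directly from Eqs. (1) and (2)."""
    carry = (x1 | x2) & (x3 | x4)                                     # Eq. (1)
    s = ((x1 & x2) | (x3 & x4) | (1 - carry)) & (x1 | x2 | x3 | x4)   # Eq. (2)
    return carry, s


def proposed_alt(x4, x3, x2, x1):
    """Alternative recovery: four-input OR followed by a two-input AND on Sum."""
    carry, s = momeni(x4, x3, x2, x1)
    return carry, s & (x1 | x2 | x3 | x4)


def mean_abs_error(compressor):
    """Probability-weighted |error distance|, each input bit being 1 w.p. 1/4."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=4):
        ones = sum(bits)
        prob = Fraction(1, 4) ** ones * Fraction(3, 4) ** (4 - ones)
        carry, s = compressor(*bits)
        total += abs(2 * carry + s - ones) * prob
    return total


patterns = list(product((0, 1), repeat=4))
differing = [b for b in patterns if momeni(*b) != proposed(*b)]
print("patterns changed by error recovery:", differing)
print("mean |ED| before:", mean_abs_error(momeni), "after:", mean_abs_error(proposed))
```

In this model the two compressors differ only at the all-zero input, and the probability-weighted error distance falls from 100/256 to 19/256, since only the 0011, 1100, and 1111 errors remain.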

Fig. 1. Proposed approximate 4-2 compressor.

2. Proposed Approximate Multiplier Architecture

To build an N${\times}$N approximate multiplier using compressors, two primary schemes for partial product reduction, namely the C-N and C-FULL configurations, are commonly employed. These reduction schemes aim to reduce the partial product matrix height to two rows, which are then summed up by an adder to produce the final multiplication output. The C-N configuration incorporates approximate 4-2 compressors solely in the N least significant columns of the partial product matrix to minimize errors. On the other hand, the C-FULL configuration utilizes compressors across all columns in the matrix to optimize hardware efficiency at the expense of reduced accuracy.
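To make the column-wise view of the two schemes concrete, the following sketch builds the partial product matrix of an unsigned N×N multiplication and lists the columns in which each scheme permits approximate compressors. It models only partial product generation and the column selection described above; the actual compressor placement of Fig. 2 is not reproduced, and the function names are hypothetical.

```python
def partial_product_columns(a, b, n=8):
    """Return columns[k]: the partial-product bits of weight 2**k for a * b."""
    columns = [[] for _ in range(2 * n - 1)]
    for i in range(n):              # bit index of operand a
        for j in range(n):          # bit index of operand b
            columns[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
    return columns


def approximate_columns(n, scheme):
    """Column indices where approximate compressors are allowed (assumed model)."""
    if scheme == "C-N":
        return list(range(n))             # only the N least significant columns
    if scheme == "C-FULL":
        return list(range(2 * n - 1))     # every column of the matrix
    raise ValueError(f"unknown scheme: {scheme}")


cols = partial_product_columns(173, 94)
heights = [len(c) for c in cols]
print("column heights:", heights)
print("C-N approximate columns:", approximate_columns(8, "C-N"))

# Summing all partial-product bits with their weights recovers the exact product.
assert sum(sum(c) << k for k, c in enumerate(cols)) == 173 * 94
```

For an 8×8 multiplier the column heights rise from 1 to 8 and fall back to 1, which is why the reduction stage must bring the tallest columns down to two rows before the final addition.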

Fig. 2 shows the proposed architecture of approximate multipliers designed for 8${\times}$8 multiplications, employing two distinctive partial product reduction schemes. In the C-N configuration, illustrated in Fig. 2(a), the higher-order N-1 columns employ exact 4-2 compressors, full adders, and half adders to yield accurate N most significant bit (MSB) multiplication outputs. Simultaneously, the lower-order N columns utilize approximate alternatives along with half adders to produce N least significant bit (LSB) approximate outcomes. Importantly, the proposed compressors are exclusively deployed in the most significant column of the N least significant columns of the partial product matrix while the remaining columns adhere to the conventional compressor designs. This proposed configuration enhances overall accuracy while maintaining hardware efficiency. In the C-FULL configuration, as depicted in Fig. 2(b), we replace the traditional compressors with the proposed alternatives in the N-1 most significant columns to enhance accuracy while the traditional compressors remain unchanged in the N least significant columns. Consequently, in the proposed multiplier architectures, three and eight conventional compressors are replaced with the proposed compressors equipped with error recovery logic in the C-N and C-FULL configurations, respectively. It is noteworthy that our multiplier architecture is rather scalable and can be readily applied to wider multiplications beyond the 8${\times}$8 scale.

Fig. 2. Proposed approximate multiplier architecture using two partial product reduction schemes for 8×8 multiplication: (a) C-N configuration using the proposed compressor only in the most significant columns of the N least significant columns in the partial product matrix; (b) C-FULL configuration using the proposed compressor in the N-1 most significant columns.

III. EXPERIMENTAL RESULTS

In this section, we assess the performance of our proposed multipliers, which incorporate the proposed compressor with error recovery logic, with a focus on both hardware efficiency and computation accuracy. We compare the proposed multiplier against existing counterparts to illustrate the benefits of our design. The counterparts comprise five other cost-effective approximate 4-2 compressor-based multipliers, namely Momeni ^[24], Akbar1 and Akbar2 ^[25], Venka ^[26], and Pei ^[27].

To assess and compare hardware performance, our 8${\times}$8 multiplier, along with the other five same-sized multiplier designs, was designed in Verilog HDL and synthesized using Synopsys Design Compiler in a 32-nm CMOS technology. Furthermore, the accuracy performance was evaluated through software-based simulations for all the multiplier designs.

1. Computation Accuracy Performance Analysis

The normalized mean error distance (NMED) and mean relative error distance (MRED) serve as standard error metrics for evaluating the accuracy performance of approximate multipliers. These metrics were calculated for all considered multipliers across the entire range of possible input combinations, as defined by the following equations:

(3)

$ NMED=\frac{1}{2^{2N}(2^{N}-1)^{2}}\sum _{i=1}^{2^{2N}}\left| ED_{i}\right| , $

(4)

$ MRED=\frac{1}{2^{2N}}\sum _{i=1}^{2^{2N}}\frac{\left| ED_{i}\right| }{M_{i}}, $

where $N$ represents the number of bits of the multiplier and the error distance is defined as $ED_{i}=\left| M_{i}-M'_{i}\right|$, in which $M_{i}$ and $M'_{i}$ correspond to the exact output and the approximate output for the $i$-th input data, respectively ^[23].
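Equations (3) and (4) translate directly into an exhaustive loop over all operand pairs. The sketch below is one possible software evaluation, not the authors' simulator. Two assumptions are made explicit: input pairs whose exact product is zero are skipped in the MRED sum (the ratio is undefined there) while still being counted in the denominator, and `truncating_mul` is a hypothetical stand-in for an approximate multiplier, unrelated to any design in this paper.

```python
def error_metrics(approx_mul, n=8):
    """Return (NMED, MRED) of approx_mul over all 2**(2n) unsigned operand pairs."""
    count = 2 ** (2 * n)
    max_product = (2 ** n - 1) ** 2
    total_ed = 0
    total_red = 0.0
    for a in range(2 ** n):
        for b in range(2 ** n):
            exact = a * b
            ed = abs(exact - approx_mul(a, b))   # error distance |M - M'|
            total_ed += ed
            if exact != 0:                       # assumption: skip undefined ratios
                total_red += ed / exact
    nmed = total_ed / (count * max_product)      # Eq. (3)
    mred = total_red / count                     # Eq. (4)
    return nmed, mred


def truncating_mul(a, b):
    """Hypothetical approximate multiplier: clears the two LSBs of the product."""
    return (a * b) & ~0b11


nmed, mred = error_metrics(truncating_mul, n=8)
print(f"NMED = {nmed:.3e}, MRED = {mred:.3e}")
```

Any behavioural model of a compressor-based multiplier with the signature `f(a, b) -> int` can be passed in place of `truncating_mul`.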

Fig. 3 shows the NMED and MRED values of various multipliers under both the C-N and C-FULL configurations. Under the C-N configuration, our proposed multiplier demonstrates superior performance in NMED compared to the other five approximate alternatives while the Pei multiplier exhibits the worst NMED performance.

Fig. 3. Accuracy performance comparison: (a) NMED of multipliers with C-N configuration; (b) MRED of multipliers with C-N configuration; (c) NMED of multipliers with C-FULL configuration; (d) MRED of multipliers with C-FULL configuration.

Moreover, the MRED of the proposed multiplier is comparable to that of the Akbar1 multiplier, falling between the values for Momeni/Pei and Akbar2/Venka multipliers. Specifically, when compared to the Pei multiplier, which exhibits the worst NMED and the second-worst MRED performance, our proposed multiplier achieves a remarkable reduction of 79.94% and 49.27% in NMED and MRED, respectively. Moreover, it achieves reductions of 23.84% in NMED and 49.94% in MRED when compared with the Momeni multiplier. A similar trend is found in the multipliers under the C-FULL configuration, with our proposed multiplier surpassing the other counterparts in NMED performance, while the Pei multiplier exhibits the least favorable performance. Additionally, our multiplier aligns with those of the Akbar2 and Venka multipliers, demonstrating the best MRED performance, while the Momeni multiplier exhibits the least favorable MRED value. Remarkably, the proposed multiplier achieves notable reductions of up to 89.81% and 97.10% in NMED and MRED, respectively, compared to the other five counterparts considered in this paper.

Overall, the observed trend indicates that the advantage of the proposed multiplier is notably larger under the C-FULL configuration than under the C-N alternative. As depicted in Fig. 3, while the NMED of our multiplier is comparable to that of the Akbar2 and Venka multipliers under the C-N configuration, it is 51.40% and 48.26% lower than theirs, respectively, under the C-FULL configuration. In summary, the proposed error recovery logic effectively enhances the accuracy performance of the multiplier by rectifying output errors from the compressor.

2. Hardware Performance Analysis

Table 2 shows the hardware performance of the exact and approximate compressors in terms of area, power, critical path delay, and power-delay product (PDP). The proposed compressor improves on the exact compressor by 59.50%, 73.13%, and 52.50% in area, power, and delay, respectively, and reduces the PDP by 87.39%. Compared to the Momeni compressor, the compound-gate implementation decreases area and power by 34.75% and 39.22%, respectively, even with the error recovery logic incorporated.

Table 2. Hardware performance of compressors

Compressor	area (µm²)	power (µW)	delay (ns)	PDP (fJ)
Exact	29.48	8.72	0.40	3.49
Momeni [24]	18.30	3.85	0.14	0.54
Akbar1 [25]	12.20	2.58	0.12	0.31
Akbar2 [25]	14.74	2.27	0.12	0.27
Venka [26]	19.31	3.88	0.15	0.58
Pei [27]	14.23	3.19	0.18	0.57
Proposed	11.94	2.34	0.19	0.44
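The relative savings discussed above follow from the entries of Table 2. The short script below recomputes them as a sanity check; the helper `reduction` is illustrative. Because the table entries are rounded to two decimals, recomputed percentages can differ from the quoted ones in the last digit.

```python
# (area in um^2, power in uW, delay in ns, PDP in fJ), copied from Table 2.
compressors = {
    "Exact":    (29.48, 8.72, 0.40, 3.49),
    "Momeni":   (18.30, 3.85, 0.14, 0.54),
    "Akbar1":   (12.20, 2.58, 0.12, 0.31),
    "Akbar2":   (14.74, 2.27, 0.12, 0.27),
    "Venka":    (19.31, 3.88, 0.15, 0.58),
    "Pei":      (14.23, 3.19, 0.18, 0.57),
    "Proposed": (11.94, 2.34, 0.19, 0.44),
}


def reduction(baseline, value):
    """Relative reduction of value with respect to baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


metrics = ("area", "power", "delay", "PDP")
for reference in ("Exact", "Momeni"):
    ref, ours = compressors[reference], compressors["Proposed"]
    savings = {m: round(reduction(r, o), 2) for m, r, o in zip(metrics, ref, ours)}
    print(f"Proposed vs {reference}: {savings}")

# PDP is the product of power and delay (uW x ns = fJ), up to table rounding.
pdp_consistent = all(abs(p * d - pdp) < 0.01 for _, p, d, pdp in compressors.values())
print("PDP consistent with power x delay:", pdp_consistent)
```

A negative value for a metric indicates that the proposed compressor is worse than the reference in that metric, as is the case for delay relative to the Momeni compressor.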

Table 3 provides an overview of the hardware performance of the approximate multipliers in terms of the aforementioned four hardware metrics. Note that the proposed multiplier adopts the proposed C-N and C-FULL configurations shown in Fig. 2, while the existing multipliers employ the conventional C-N and C-FULL alternatives in ^[17]. In the C-N configuration, our proposed multiplier achieves the smallest area and the second-best power and PDP. In contrast, the Momeni multiplier shows the highest power and PDP, and the Venka multiplier occupies the largest area. This result indicates that compound gates are more hardware-efficient than regular logic gate combinations ^[28]. Specifically, our multiplier achieves reductions of up to 12.72%, 10.78%, and 10.78% in area, power, and PDP, respectively, compared to the other five designs. Under the C-FULL configuration, a similar trend persists, with our multiplier occupying the smallest area and showing the second-best performance in power and PDP. The Venka multiplier exhibits the least favorable area and PDP, while the Momeni multiplier consumes the highest power. Notably, the Pei multiplier achieves the lowest power and PDP.

Table 3. Hardware performance of approximate multipliers under C-N and C-FULL configurations

	C-N configuration					C-FULL configuration
Multiplier	area (µm²)	power (µW)	delay (ns)	PDP (fJ)	FOM	area (µm²)	power (µW)	delay (ns)	PDP (fJ)	FOM
Momeni ^[24]	752.01	203.99	1.16	236.63	337.37	662.55	160.42	1.02	163.63	5348.44
Akbar1 ^[25]	697.12	192.97	1.16	223.85	531.81	558.86	138.37	1.01	139.75	3107.43
Akbar2 ^[25]	719.99	191.00	1.16	221.56	239.41	602.07	138.93	1.02	141.71	1867.74
Venka ^[26]	761.16	202.37	1.16	234.75	259.78	679.84	160.01	1.03	164.81	2330.63
Pei ^[27]	715.42	176.27	1.15	202.71	1034.52	593.43	112.04	1.07	119.88	7769.28
Proposed	664.33	182.00	1.16	211.12	202.29	508.80	123.75	1.02	126.23	678.88

Furthermore, for a comprehensive evaluation of the accuracy and energy efficiency tradeoff in the approximate multipliers, we utilize a figure of merit (FOM) that takes into account energy, area, delay, and NMED, as defined in ^[29]. The FOM is defined by:

(5)

$ FOM=PDP\times Delay\times Area\times NMED. $

The FOM values are presented in Table 3, where a lower FOM value indicates a more favorable tradeoff performance. Notably, upon analysis, our proposed multiplier stands out as the most efficient design when considering both accuracy and hardware aspects. In contrast, the Pei multiplier demonstrates the least favorable tradeoff performance under both configurations. Specifically, the proposed multiplier exhibits an 80.52% and 91.26% reduction in FOM factor under the C-N and C-FULL configurations, respectively, compared to the Pei alternative. Overall, these results underscore the well-balanced nature of the proposed multiplier in achieving accuracy while maintaining hardware efficiency.
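As an illustration of Eq. (5), the sketch below defines the FOM as a plain product and ranks the designs using the FOM columns of Table 3. The NMED values are not tabulated in this section, so the `fom` function is shown only with its signature; the scaling applied to NMED in Table 3 is not stated here and is not assumed.

```python
def fom(pdp, delay, area, nmed):
    """Figure of merit of Eq. (5); lower values indicate a better tradeoff."""
    return pdp * delay * area * nmed


# FOM values copied from Table 3: (C-N configuration, C-FULL configuration).
fom_table = {
    "Momeni":   (337.37, 5348.44),
    "Akbar1":   (531.81, 3107.43),
    "Akbar2":   (239.41, 1867.74),
    "Venka":    (259.78, 2330.63),
    "Pei":      (1034.52, 7769.28),
    "Proposed": (202.29, 678.88),
}

rankings = {}
for index, scheme in enumerate(("C-N", "C-FULL")):
    # Sort designs from best (lowest FOM) to worst (highest FOM).
    rankings[scheme] = sorted(fom_table, key=lambda name: fom_table[name][index])
    print(scheme, "ranking:", " < ".join(rankings[scheme]))
```

Because the FOM is a product, halving any single factor, such as NMED, halves the FOM, which is how an accuracy gain can outweigh a modest hardware overhead.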

IV. CASE STUDY: IMAGE PROCESSING

To evaluate the practical performance of the proposed multiplier, we applied it alongside several existing approximate multipliers to a digital image processing task, specifically image blending, in which multiplication plays a crucial role ^[23]. Using the 8-bit unsigned approximate multipliers with both the C-N and C-FULL configurations, we evaluated image quality with the peak signal-to-noise ratio (PSNR) metric. We selected nine distinct images, and each blending operation combines two of them through pixel-wise multiplication.
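The evaluation flow can be sketched as follows: blend two 8-bit images pixel by pixel with a given multiplier, then compute the PSNR of the approximate result against the exactly blended reference. This is a minimal sketch under stated assumptions rather than the authors' setup: images are plain nested lists, the peak value is taken as the largest exact 8-bit product (255 × 255), and `toy_mul` is a hypothetical approximate multiplier used only to exercise the code.

```python
import math


def blend(img_a, img_b, mul):
    """Pixel-wise multiplicative blending of two equally sized images."""
    return [[mul(p, q) for p, q in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


def psnr(reference, test, peak=255 * 255):
    """Peak signal-to-noise ratio in dB; the peak value is an assumption."""
    squared_error = 0
    pixels = 0
    for row_r, row_t in zip(reference, test):
        for r, t in zip(row_r, row_t):
            squared_error += (r - t) ** 2
            pixels += 1
    mse = squared_error / pixels
    if mse == 0:
        return math.inf                       # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


def toy_mul(p, q):
    """Hypothetical approximate multiplier: clears the four LSBs of the product."""
    return (p * q) & ~0xF


image_a = [[10, 200], [37, 255]]              # tiny stand-ins for real images
image_b = [[3, 128], [64, 255]]
exact = blend(image_a, image_b, lambda p, q: p * q)
approx = blend(image_a, image_b, toy_mul)
print(f"PSNR = {psnr(exact, approx):.2f} dB")
```

Replacing `toy_mul` with a behavioural model of a compressor-based multiplier, and the small lists with real benchmark images, gives the kind of comparison reported in this section.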

Fig. 4 shows the blended images of two well-known benchmark images, House and Jetplane, utilizing various approximate multipliers with the C-N configuration. The corresponding PSNR values for each blended image are also added. Notably, the proposed multiplier outperforms the other approximate multipliers, achieving exceptional image processing quality with a PSNR of 53.03 dB for the blended image output. In contrast, the Akbar1 and Pei multipliers yield output images with PSNRs below 50 dB.

Fig. 4. The original images and the output images with PSNR values for image blending.

Table 4 provides the outcomes of the image blending for four different benchmark image pairs together with their average values. The proposed multiplier consistently yields the highest PSNR across all four cases under both the C-N and C-FULL configurations. In the C-N configuration, the proposed multiplier improves the average PSNR by 6.90% on average over the other designs, and by up to 19.29% over the Pei alternative, which shows the lowest PSNR. The proposed design also demonstrates superior processing quality in the C-FULL configuration: it is the only design whose blended images all exceed a PSNR of 36 dB, while a majority of the other counterparts produce images with PSNR values below 30 dB. In particular, compared to the Pei multiplier, our design improves the PSNR by up to 101.68%, and its average PSNR improvement over the other designs is 44.53%.

Table 4. The PSNRs for the image blending using various approximate multipliers under the C-N and C-FULL configurations

	C-N configuration					C-FULL configuration
Multiplier	Moon Cameraman	Einstein Baboon	Barbara Lake	Barbara Lena	Average	Moon Cameraman	Einstein Baboon	Barbara Lake	Barbara Lena	Average
Momeni ^[24]	52.28	52.10	52.08	52.21	52.17	24.42	26.76	25.97	25.85	25.75
Akbar1 ^[25]	48.28	48.36	48.55	48.11	48.33	29.13	25.03	27.12	26.90	27.04
Akbar2 ^[25]	52.05	52.33	52.17	52.10	52.16	32.49	29.39	30.37	32.60	31.20
Venka ^[26]	52.17	52.47	52.35	52.25	52.31	32.62	29.91	30.41	32.63	31.39
Pei ^[27]	45.11	45.05	44.43	44.73	44.83	19.62	18.53	19.23	19.40	19.19
Proposed	53.61	53.12	53.00	53.13	53.21	39.57	36.35	37.84	36.89	37.67
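The improvement figures quoted for Table 4 can be recomputed from the table itself. In the sketch below, the mean improvement is taken over the five competing designs using the Average columns, and the largest gain over the Pei design is taken over the four image pairs; the helper names are illustrative.

```python
# PSNR (dB) from Table 4: four image pairs followed by the tabulated average.
psnr_cn = {
    "Momeni":   [52.28, 52.10, 52.08, 52.21, 52.17],
    "Akbar1":   [48.28, 48.36, 48.55, 48.11, 48.33],
    "Akbar2":   [52.05, 52.33, 52.17, 52.10, 52.16],
    "Venka":    [52.17, 52.47, 52.35, 52.25, 52.31],
    "Pei":      [45.11, 45.05, 44.43, 44.73, 44.83],
    "Proposed": [53.61, 53.12, 53.00, 53.13, 53.21],
}
psnr_cfull = {
    "Momeni":   [24.42, 26.76, 25.97, 25.85, 25.75],
    "Akbar1":   [29.13, 25.03, 27.12, 26.90, 27.04],
    "Akbar2":   [32.49, 29.39, 30.37, 32.60, 31.20],
    "Venka":    [32.62, 29.91, 30.41, 32.63, 31.39],
    "Pei":      [19.62, 18.53, 19.23, 19.40, 19.19],
    "Proposed": [39.57, 36.35, 37.84, 36.89, 37.67],
}


def gain(ours, theirs):
    """Relative PSNR improvement in percent."""
    return 100.0 * (ours / theirs - 1.0)


def mean_gain(table):
    """Mean gain of the proposed design over the others, from the Average column."""
    ours = table["Proposed"][-1]
    others = [row[-1] for name, row in table.items() if name != "Proposed"]
    return sum(gain(ours, o) for o in others) / len(others)


def max_gain_over(table, name):
    """Largest per-image-pair gain of the proposed design over one competitor."""
    return max(gain(o, t) for o, t in zip(table["Proposed"][:4], table[name][:4]))


for label, table in (("C-N", psnr_cn), ("C-FULL", psnr_cfull)):
    print(label, f"mean gain {mean_gain(table):.2f}%,",
          f"max gain over Pei {max_gain_over(table, 'Pei'):.2f}%")
```

With the tabulated values, the mean gains come out at about 6.90% and 44.53%, and the largest gains over the Pei design at about 19.29% and 101.68%, for the C-N and C-FULL configurations, respectively.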

To holistically evaluate both image processing quality and hardware efficiency, prior works introduced two metrics, namely QUPD and QUAP ^[18, 30]. QUPD combines PSNR, power savings, and delay savings, while QUAP combines PSNR, area savings, and power savings. They are defined as follows:

(6)

$ QUPD=PSNR^{2}\times PowerSavings\times DelaySavings, $

(7)

$ QUAP=PSNR^{2}\times AreaSavings\times PowerSavings. $

In both metrics, the PSNR is squared to underscore its importance in representing image quality. Fig. 5 illustrates the QUPD and QUAP values for the different multipliers under both the C-N and C-FULL configurations, where higher values indicate superior performance. The proposed multiplier consistently emerges as the best-performing design in both QUPD and QUAP. Specifically, compared to the other counterparts, our design exhibits average improvements of 34.35% and 158.99% in QUPD under the C-N and C-FULL configurations, respectively, and average improvements of 105.96% and 265.24% in QUAP under the same configurations. The largest improvements reach 46.25% and 168.2% in QUPD and QUAP, respectively, under the C-N configuration, and 310.14% and 413.23%, respectively, under the C-FULL configuration. In summary, the proposed multiplier design not only effectively reduces hardware resource consumption but also provides outstanding image processing quality.
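Equations (6) and (7) are simple products and can be written as two small functions. In the sketch below, each savings term is assumed to be a fraction relative to a reference design, computed as (reference − value) / reference; the reference design is defined in the cited works and is not restated in this section, so the numbers used in the example are purely hypothetical.

```python
def savings(reference, value):
    """Fractional saving relative to a reference design (assumed definition)."""
    return (reference - value) / reference


def qupd(psnr_db, power_savings, delay_savings):
    """Quality-power-delay metric of Eq. (6); higher is better."""
    return psnr_db ** 2 * power_savings * delay_savings


def quap(psnr_db, area_savings, power_savings):
    """Quality-area-power metric of Eq. (7); higher is better."""
    return psnr_db ** 2 * area_savings * power_savings


# Hypothetical reference and design figures, used only to exercise the formulas.
reference = {"area": 1000.0, "power": 250.0, "delay": 1.25}
design = {"area": 800.0, "power": 200.0, "delay": 1.00, "psnr": 50.0}

s = {key: savings(reference[key], design[key]) for key in reference}
print("QUPD =", qupd(design["psnr"], s["power"], s["delay"]))
print("QUAP =", quap(design["psnr"], s["area"], s["power"]))
```

Since the PSNR enters quadratically, a design with slightly smaller hardware savings can still score higher if its image quality is better.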

Fig. 5. Performance comparison with various multipliers in QUPD and QUAP: (a) C-N configuration; (b) C-FULL configuration.

V. CONCLUSIONS

In this paper, we introduced a cost-effective approximate multiplier by proposing a novel approximate 4-2 compressor integrated with an error recovery logic. The proposed error recovery mechanism plays a crucial role in diminishing error distances and enhancing the overall accuracy of approximate multiplication. Consequently, the multipliers utilizing our proposed compressor achieved substantial reductions of up to 89.8% and 97.1% in NMED and MRED, respectively, compared to the other approximate multipliers examined in this paper. Additionally, our multiplier design demonstrated reductions of up to 25.2%, 22.9%, and 23.4% in area, power, and PDP, respectively, relative to the alternative designs. Moreover, our proposed design exhibited superior processing quality in a digital image blending application compared to the alternatives, while efficiently reducing hardware resource consumption.

ACKNOWLEDGMENTS

This work was supported in part by the BK21 FOUR project (AI-driven Convergence Software Education Research Program) funded by the Ministry of Education, School of Computer Science and Engineering, Kyungpook National University, Korea (41202420214871) and in part by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (RS-2023-00279770).

References

J. H. Kim, C. Kim, K. Kim, J. Lee. H.-J. Yoo, and J.-Y. Kim, “An Ultra-Low-Power Mixed-Mode Face Recognition Processor for Always-on User Authentication in Mobile Devices,” IEIE Journal of Semiconductor Technology and Science, Vol. 20, No. 6, pp. 499-509, Dec. 2020.

T. Moreau, A. Sampson, and L. Ceze, “Approximate Computing: Making Mobile Systems More Efficient,” IEEE Pervasive Computing, Vol. 14, No. 2, pp. 9-13, Apr.-Jun. 2015.

H. Seo, H. Seok, J. Lee, Y. Han, and Y. Kim, “Design of an Approximate Adder based on Modified Full Adder and Nonzero Truncation for Machine Learning,” IEIE Journal of Semiconductor Technology and Science, Vol. 23, No. 2, pp. 138-148, Apr. 2023.

S. Kim and Y. Kim, "Novel XNOR-Based Approximate Computing for Energy-Efficient Image Processors," IEIE Journal of Semiconductor Technology and Science, vol. 18, no. 5, pp. 602-608, Oct. 2018.

J. Lee, H. Seo, H. Seok, and Y. Kim, “A Novel Approximate Adder Design using Error Reduced Carry Prediction and Constant Truncation,” IEEE Access, Vol. 9, pp. 119939-119953, Aug. 2021.

G. Park, J. Kung, and Y. Lee, “Design and Analysis of Approximate Compressors for Balanced Error Accumulation in MAC Operator,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 68, no. 7, pp. 2950-2961, July 2021.

H. Seok, H. Seo, J. Lee, and Y. Kim, “COREA: Delay- and Energy-Efficient Approximate Adder Using Effective Carry Speculation,” Electronics, Vol. 10, No. 18, pp. 2234:1-2234:12, Sept. 2023.

W. Choi, M. Shim, H. Seok, and Y. Kim, “DCPA: Approximate Adder Design Exploiting Dual Carry Prediction,” IEICE Electronics Express, Vol. 18, No. 23, pp. 1-4, Dec. 2021.

H. Seo, Y. S. Yang, and Y. Kim, “Design and Analysis of an Approximate Adder with Hybrid Error Reduction,” Electronics, Vol. 9, No. 3, pp. 471:1-471:13, Mar. 2020.

J. Lee, H. Seo, Y. Kim, and Y. Kim, “Approximate Adder Design with Simplified Lower-part Approximation,” IEICE Electronics Express, Vol. 17, No. 15, pp. 1-3, Aug. 2020.

H. Seo and Y. Kim, “A Low Latency Approximate Adder Design based on Dual Sub-Adders with Error Recovery,” IEEE Transactions on Emerging Topics in Computing, Vol. 11, No. 3, pp. 811-816, Jul.-Sep. 2023.

V. K. Chippa, S. T. Chakradhar, and A. Raghunathan, “Analysis and Characterization of Inherent Application Resilience for Approximate Computing,” ACM/IEEE Design Automation Conference (DAC), pp. 1-9, May 2013.

H. Kim, E. Ham, S. Park, H. Kim, and J.-H Kim, “A DRAM Bandwidth-Scalable Sparse Matrix-Vector Multiplication Accelerator with 89% Bandwidth Utilization Efficiency for Large Sparse Matrix,” IEEE Asian Solid-State Circuits Conference (A-SSCC), Nov. 2023.

C.-H. Lin and I.-C. Lin, “High Accuracy Approximate Multiplier with Error Correction,” IEEE International Conference on Computer Design (ICCD), pp. 33-38, Oct. 2013.

C.-H. Chang, J. Gu, and M. Zhang, “Ultra Low-Voltage Low-Power CMOS 4-2 and 5-2 Compressors for Fast Arithmetic Circuits,” IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 51, No. 10, Oct. 2004.

A. Saha, R. Pal, A. G. Naik, and D. Pal, “Novel CMOS Multi-bit Counter for Speed-Power Optimization in Multiplier Design,” AEU-International Journal of Electronics and Communications, Vol. 95, pp. 189-198, Oct. 2018.

A. G. M. Strollo, E. Napoli, D. De Caro, N. Petra, and G. D. Meo, “Comparison and Extension of Approximate 4-2 Compressors for Low-Power Approximate Multipliers,” IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 67, No. 9, pp. 3021-3034, Sep. 2020.

Z. Yang, J. Han, and F. Lombardi, “Approximate Compressors for Error-Resilient Multiplier Design,” IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFTS), pp. 183-186, Oct. 2015.

M. Ha and S. Lee, “Multipliers with Approximate 4-2 Compressors and Error Recovery Modules,” IEEE Embedded Systems Letters, Vol. 10, No. 1, pp. 6-9, Mar. 2018.

M. Ahmadinejad, M. H. Moaiyeri, and F. Sabetzadeh, “Energy and Area Efficient Imprecise Compressors for Approximate Multiplication at Nanoscale,” AEU-International Journal of Electronics and Communications, Vol. 110, No. 152859, Oct. 2019.

T. Kong and S. Li, “Design and Analysis of Approximate 4-2 Compressors for High-Accuracy Multipliers,” IEEE Transactions on Very Large-Scale Integration (VLSI) Systems, Vol. 29, No. 10, pp. 1771-1781, Oct. 2021.

M. Zhang, S. Nishizawa, and S. Kimura, “Area Efficient Approximate 4-2 Compressor and Probability-Based Error Adjustment for Approximate Multiplier,” IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 70, No. 5, May 2023.

F. Sabetzadeh, M. H. Moaiyeri, and M. Ahmadinejad, “A Majority-Based Imprecise Multiplier for Ultra-Efficient Approximate Image Multiplication,” IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 66, No. 11, pp. 4200-4208, Nov. 2019.

A. Momeni, J. Han, P. Montuschi, and F. Lombardi, “Design and Analysis of Approximate Compressors for Multiplication,” IEEE Transactions on Computers, Vol. 64, No. 4, pp. 984-994, Apr. 2015.

O. Akbari, M. Kamal, A. Afzali-Kusha, and M. Pedram, “Dual-Quality 4:2 Compressors for Utilizing in Dynamic Accuracy Configurable Multipliers,” IEEE Transactions on Very Large-Scale Integration (VLSI) Systems, Vol. 25, No. 4, pp. 1352-1361, Apr. 2017.

S. Venkatachalam and S. Ko, “Design of Power and Area Efficient Approximate Multipliers,” IEEE Transactions on Very Large-Scale Integration (VLSI) Systems, Vol. 25, No. 5, pp. 1782-1786, May 2017.

H. Pei, X. Yi, H. Zhou, and Y. He, “Design of Ultra-Low Power Consumption Approximate 4-2 Compressors Based on the Compensation Characteristic,” IEEE Transactions on Circuits and Systems II: Express Briefs, Vol. 68, No. 1, pp. 461-465, Jan. 2021.

H. Seok, H. Seo, J. Lee, and Y. Kim, “Design Optimization of a 4-2 Compressor for Low-Cost Approximate Multipliers,” IEIE Transactions on Smart Processing and Computing, Vol. 11, No. 6, pp. 455-461, Dec. 2022.

J. Lee, H. Seo, H. Seok, and Y. Kim, “A Novel Approximate Adder Design Using Error Reduced Carry Prediction and Constant Truncation,” IEEE Access, Vol. 9, pp. 119939-119953, Aug. 2021.

V. Gupta, D. Mohapatra, S. P. Park, A. Raghunathan, and K. Roy, “IMPACT: IMPrecise adders for low-power approximate computing,” IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED), pp. 409-414, Aug. 2011.

Sungyoun Hwang

Sungyoun Hwang received her B.S. degree from the School of Computer Science and Engineering at Kyungpook National University, Daegu, Republic of Korea in 2024, where she is currently pursuing an M.S. degree. Her research interests include approximate multipliers, computer arithmetic, and quantum computing.

Hyelin Seok

Hyelin Seok received her B.S. and M.S. degrees from the School of Computer Science and Engineering, Kyungpook National University, Daegu, Republic of Korea in 2022 and 2024, respectively. Her research interests include computer architecture, approximate arithmetic, and new computing systems.

Yongtae Kim

Yongtae Kim received B.S. and M.S. degrees in electrical engineering from Korea University, Seoul, Republic of Korea, in 2007 and 2009, respectively, and a Ph.D. degree from the Department of Electrical and Computer Engineering at Texas A&M University, College Station, TX, in 2013. From 2013 to 2018, he was a software engineer with Intel Corporation, Santa Clara, CA. Since 2018, he has been with the School of Computer Science and Engineering at Kyungpook National University, Daegu, South Korea, where he is currently an Associate Professor. His research interests are in energy-efficient integrated circuits and systems, particularly neuromorphic computing, approximate computing, quantum computing, and new memory devices and architecture.