
  1. (TLi, 12th 405 beon-gil Yanghyeonro Joongwongu Sungnam-si Kyungido, Korea)
  2. (School of Electronic and Electrical Engineering, Hong Ik University, 72-1, Sangsu-Dong, Mapo-gu, Seoul, 121-791, Korea)



LUT, image enhancement, tone curve, LUT ROM lossless compression, dynamic range reduction, straight line-based compression

I. INTRODUCTION

To meet consumer needs, the complexity of image processing systems, which include both conventional image processing and perceptual image quality enhancement, has increased. Lookup table (LUT) based processing can be an efficient approach for low-power, real-time image processing, especially for portable devices, because a conventional processing unit requires a large amount of hardware that consumes a lot of power. LUTs have a wide range of applications. In [1], a 3D RGB LUT is used for CNN-based, image-adaptive enhancement of the hue, saturation, exposure, color, and tone of photos, where many LUT entries are required. In [2], gamma correction LUT-based dehazing methods are used for deweathering image enhancement with low-computation real-time processing. In [3], a LUT is used for DPCM-based image compression.

However, conventional LUT-based approaches need a large amount of memory and a high-bandwidth data bus for real-time high-resolution image processing. Studies on improving image quality using LUTs [4] require a large amount of memory. In [5], only part of the data is stored selectively in the LUT according to the image, which loses accuracy. In [6], a LUT with reduced input bit widths is used for the retinex operation, which introduces errors even when optimized. As display systems require higher resolution and greater color depth, and as processing speeds increase, the amount of LUT storage can grow greatly. Hence, methods to reduce the LUT size are necessary. Conventional LUT compression methods store data at large intervals and fill the gaps between them by interpolation. The methods in [7-9] compress the LUT by dividing it into quantized data and an error ROM that holds the differences from the quantized values [10]. In [11], approximation is also used, which loses accuracy. Another LUT compression method is the sine/cosine LUT of the direct digital frequency synthesizer [12], which targets a special application. However, these methods have limitations, such as deterioration of accuracy due to interpolation or increased hardware complexity for partitioning into quantization and error ROMs. In a system-on-panel (SOP) application, the hardware complexity of the LUT must be minimized to overcome the low processing yield. Conventional LUT-based image processing also has many application-specific limitations. For computer-generated holograms, which require a large number of LUT entries [13,14], shifted modulation factors are computed from the basic modulation factor LUT, which works only for that processing method and is limited in other applications. In [15], radial interpolation must be used, which loses accuracy. In [16], lossless LUT compression methods are presented, but they are limited to transcendental functions. In [17], LUT compression methods are presented, but they apply only to polynomial approximations with input interval segmentation, lose precision, and need multipliers that occupy a large chip area and consume considerable power.

Therefore, this work discusses more general LUT compression methodologies that utilize the shape features of the LUT curve and suit a wide range of applications with small chip area and power consumption. This paper describes lossless compression methodologies for the data stored in LUT ROM, without any loss of accuracy, using low-complexity hardware. Various tone curve characteristics are analyzed in Section II, and the method of reducing the dynamic range is described in Section III. The implementation of the straight line required for the compression is described in Section IV, the LUT architecture is presented in Section V, and MATLAB simulations and implementation methodologies are discussed in Section VI. Finally, the evaluation results and performance comparisons with the conventional LUT, such as ROM count, are discussed in Section VII.

II. ANALYSIS OF VARIOUS TONE CURVE CHARACTERISTICS

To reduce the LUT ROM count for the tone curves used in image signal processing, various tone curves are analyzed and presented in Fig. 1. Fig. 1(a) shows the tone curve [4] for contrast processing according to the characteristics of the image, and Fig. 1(b) shows the tone curve of the cumulative density function (CDF) [18] for histogram equalization. Fig. 1(c) shows the tone curve representing the reduction of the dynamic range of the dark and bright areas [4,19,20]. Fig. 1(d) shows the tone curve for different gamma values [21]. Fig. 1(e) shows the interpolation curve [22] for gamma adjustment by dividing the pixel values into multiple gray levels. Fig. 1(f) shows the tone curve for maintaining the luminance of the original image with the sampled image [23]. Tone curves are nonlinear, but all of them increase or decrease in a manner similar to a straight line, as shown in Fig. 1. In particular, the curve in Fig. 1(d) does not start from the origin and the curve in Fig. 1(e) has a negative slope, yet both retain the shape of a straight line. Furthermore, both curves in Fig. 1(a) and (c) are symmetrical about the mid-pixel point. In addition, most gamma curves in [24] have this feature. Many other LUTs also have straight-line shapes, as in [25]. In this work, a method for reducing the size of the LUT ROM is discussed based on the above analysis of tone curve and LUT characteristics, for a wide range of image signal processing.

Fig. 1. LUT tone curves: (a) Contrast control according to a histogram; (b) The cumulative density function of the histogram; (c) Contrast sensitivity-based enhancement curve; (d) Dynamic gamma; (e) Interpolated pixel inversion curve; (f) Luminance vote curve.

../../Resources/ieie/JSTS.2023.23.3.162/fig1-1.png

../../Resources/ieie/JSTS.2023.23.3.162/fig1-2.png

III. LUT ROM COMPRESSION

1. Quantization Noise

In a LUT, the values to be displayed are stored as quantized values for each input pixel value. Therefore, to reduce the size of the LUT ROM, it is important to analyze the factors that determine the LUT ROM count: the dynamic range (DR), quantization error, quantization level, and characteristics of the LUT tone curve. If F(x) is the output of a LUT tone curve, the dynamic range is the difference between the maximum and minimum values of F(x). If DR$_{\mathrm{Origin}}$ is the DR of F(x), it can be expressed as:

(1)
DR$_{\mathrm{Origin}}$ = F(x)$_{\mathrm{MAX}}$ – F(x)$_{\mathrm{MIN}}$

If L$_{\mathrm{Origin}}$ is the number of quantization levels within the dynamic range, the quantization error e is defined as

(2)
e = DR$_{\mathrm{Origin}}$ / L$_{\mathrm{Origin}}$

The output bit width (OBW) of the LUT can be represented as:

(3)
OBW = log$_{2}$L

LUT count (LUTC), as shown in (4), can be calculated using the bit width of the LUT address, n, and OBW.

(4)
LUTC=2$^{\mathrm{n}}$•OBW=2$^{\mathrm{n}}$•log$_{2}$L=2$^{\mathrm{n}}$•log$_{2}$(DR/e)

LUTC can be reduced by reducing the input bit width and OBW; a method to reduce the input bit width is described in a later section. To reduce OBW, it is necessary to reduce DR while keeping e below the required level, as shown in (5). That is, if a reduced dynamic range (DR$_{\mathrm{Reduced}}$) can be obtained that minimizes LUTC while keeping e below the required level, L$_{\mathrm{Reduced}}$ becomes the reduced quantization level with respect to DR$_{\mathrm{Reduced}}$. Methods to reduce the dynamic range are described in a later section.

(5)
e = DR$_{\mathrm{Origin}}$ / L$_{\mathrm{Origin}}$ = DR$_{\mathrm{Reduced}}$ / L$_{\mathrm{Reduced}}$

Let OBW$_{\mathrm{Origin}}$ (OBW$_{\mathrm{Reduced}}$) be the original (reduced) LUT output bit width. If DR$_{\mathrm{Origin}}$ is reduced to DR$_{\mathrm{Reduced}}$ according to Eq. (5), the reduced OBW can be represented as:

(6)
OBW$_{\mathrm{Reduced}}$ = OBW$_{\mathrm{Origin}}$ - log$_{2}$(DR$_{\mathrm{Origin}}$ / DR$_{\mathrm{Reduced}}$) = OBW$_{\mathrm{Origin}}$ - log$_{2}$(L$_{\mathrm{Origin}}$ / L$_{\mathrm{Reduced}}$)

Let LUTC$_{\mathrm{Origin}}$ (LUTC$_{\mathrm{Reduced}}$) be the original (reduced) LUT count. The LUTCs for DR$_{\mathrm{Origin}}$ and DR$_{\mathrm{Reduced}}$ are represented in Eqs. (4) and (7), respectively.

(7)
LUTC$_{\mathrm{Reduced}}$=2$^{\mathrm{n}}$•OBW$_{\mathrm{Reduced}}$=2$^{\mathrm{n}}$•log$_{2}$(DR$_{\mathrm{Reduced}}$/e)

Therefore, the ratio between LUTC$_{\mathrm{Origin}}$ and LUTC$_{\mathrm{Reduced}}$ can be represented as:

(8)
LUTC$_{\mathrm{ratio}}$ = LUTC$_{\mathrm{Reduced}}$ / LUTC$_{\mathrm{Origin}}$ = log$_{2}$(DR$_{\mathrm{Reduced}}$ / e) / log$_{2}$(DR$_{\mathrm{Origin}}$ / e)

With the same e, the LUTC depends on the log$_{2}$ of the dynamic range. For example, let e = 0.1, n = 8, DR$_{\mathrm{Origin}}$ = 25.6, and OBW$_{\mathrm{Origin}}$ = 8. If the dynamic range is reduced to DR$_{\mathrm{Reduced}}$ = 1.6, OBW$_{\mathrm{Reduced}}$ becomes 4: LUTC$_{\mathrm{Origin}}$ is 2048 by Eq. (4), and LUTC$_{\mathrm{Reduced}}$ is 1024 by Eq. (7), so the LUTC is halved. In particular, as the pixel bit width n of the image increases, which is a current trend, LUTC increases exponentially with n. Hence, for a given dynamic range reduction, the number of LUT bits saved also grows exponentially with n.
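The arithmetic of Eqs. (2)-(4) and (7) for this example can be checked with a short Python sketch. The function names are hypothetical, and two details are assumptions not stated in the text: the number of levels L = DR/e is rounded to the nearest integer, and OBW is rounded up to a whole number of bits.

```python
def output_bit_width(dynamic_range: float, e: float) -> int:
    """OBW = log2(L) with L = DR / e (Eqs. (2)-(3)), rounded up to whole bits."""
    levels = round(dynamic_range / e)   # assumed: L is rounded to an integer
    return (levels - 1).bit_length()    # exact ceil(log2(levels)) for integers


def lut_count(n: int, obw: int) -> int:
    """LUTC = 2^n * OBW (Eq. (4)): total ROM bits for an n-bit address."""
    return (1 << n) * obw


n, e = 8, 0.1
obw_origin = output_bit_width(25.6, e)     # DR_Origin = 25.6
obw_reduced = output_bit_width(1.6, e)     # DR_Reduced = 1.6
lutc_origin = lut_count(n, obw_origin)
lutc_reduced = lut_count(n, obw_reduced)
print(obw_origin, obw_reduced, lutc_origin, lutc_reduced)  # 8 4 2048 1024
```

Because the address width n only enters through the factor 2^n, the same functions can be reused to see how the saving scales for wider pixels.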

2. Straight Line Approximation of LUT Curve

Based on the analysis of the LUT tone curves described in Section II, the LUT tone curves mostly have the form of a straight line that passes through the origin and increases or decreases after that. Thus, a method for decreasing the LUTC by reducing the dynamic range, as illustrated in Eq. (8), is described. Also, the procedure for reducing the dynamic range is described when the straight line does not start from the origin or when it has the shape of a negative slope straight line in Section III.3.

Fig. 2 shows the method of representing the LUT tone curve F(x) with reference to a straight line. The LUT can be represented with the reduced dynamic range, as described in Section III.1. Such a process is presented as a straight-line-based representation. The straight line can be represented as:

(9)
y = ax

Various implementation methods of the straight line y = ax are described in Section IV. DR$_{\mathrm{Reduced}}$ can be obtained from the difference between F(x) and the straight line-based representation.

(10)
$ \mathrm{DR}_{\mathrm{Reduced}}=\left\{\begin{array}{ll} \mathrm{MAX}\left(\left|\mathrm{F}\left(\mathrm{x}\right)-\mathrm{ax}\right|\right) & \left(\text{when F(x) lies on one side of the straight line, above or below}\right)\\ \mathrm{DR}_{\mathrm{Reduced1}}+\mathrm{DR}_{\mathrm{Reduced2}}=\mathrm{MAX}\left(\mathrm{F}\left(\mathrm{x}\right)-\mathrm{ax}\right)+\mathrm{MAX}\left(\mathrm{ax}-\mathrm{F}\left(\mathrm{x}\right)\right) & \left(\text{when F(x) intersects the straight line}\right) \end{array}\right. $

If DR$_{\mathrm{Origin}}$ is reduced to DR$_{\mathrm{Reduced}}$, as shown in Fig. 2, LUTC is reduced according to Eq. (8). The offset between the tone curve and the straight line is defined as:

(11)
F$_{\mathrm{Offset}}$(x) = F(x) – ax

As shown in Eq. (11), F$_{\mathrm{Offset}}$(x) is the difference between the tone curve and the straight line, so its dynamic range is the reduced DR$_{\mathrm{Reduced}}$. The straight line itself is easily realized in hardware; only F$_{\mathrm{Offset}}$(x), which has a small dynamic range, needs to be stored as the offset LUT. The hardware complexity of the LUT ROM is therefore reduced by the straight line-based representation.

Fig. 2. Representation of the LUT curve with a straight line as a reference.
../../Resources/ieie/JSTS.2023.23.3.162/fig2.png
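The straight line-based representation of Eqs. (9)-(11) can be illustrated with a minimal Python sketch. The gamma curve, the slope a = 1, and all function names are hypothetical choices for illustration only; they are not taken from the paper. The dynamic range is computed as max minus min of the stored values, which agrees with Eq. (10) when the offset touches or crosses zero.

```python
def build_tone_curve(n: int = 8, gamma: float = 0.8) -> list[int]:
    """Hypothetical n-bit gamma tone curve F(x), rounded to integer codes."""
    top = (1 << n) - 1
    return [round(top * (x / top) ** gamma) for x in range(top + 1)]


def split_offset(curve: list[int], a: int = 1) -> list[int]:
    """F_Offset(x) = F(x) - a*x (Eq. (11)); this is what the offset ROM stores."""
    return [f - a * x for x, f in enumerate(curve)]


def dynamic_range(values: list[int]) -> int:
    """Difference between the largest and smallest stored value (Eq. (1))."""
    return max(values) - min(values)


curve = build_tone_curve()
offset = split_offset(curve, a=1)

# Reconstruction: straight line plus offset LUT gives back F(x) exactly.
restored = [1 * x + off for x, off in enumerate(offset)]

print(dynamic_range(curve), dynamic_range(offset))
print(restored == curve)
```

Because the offset is an exact integer difference, adding it back to the line reproduces every entry of the original table, which is the sense in which the representation is lossless.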

3. LUT Compression of Straight Line-based Representation Without Origin Crossing or With Symmetry

In this section, the method of reducing the LUTC further is described when the LUT tone curve has the shape of a straight line that does not cross the origin or when the tone curve has symmetry.

3.1 Shifting

A LUT tone curve that does not start from the origin, that is, a curve with a shape as shown in Fig. 1(d), can be represented with reference to y = ax + b for the straight-line-based representation, as shown in Fig. 3. This process can reduce the dynamic range further.

In Fig. 3, if y = ax is used as the reference for the LUT curve F(x), the dynamic range is DR$_{\mathrm{Reduced}}$. When the straight line shifted upward by b is used as the reference instead, the dynamic range becomes DR$_{\mathrm{Reduced\_Shift}}$ = DR$_{\mathrm{Reduced\_Shift1}}$ + DR$_{\mathrm{Reduced\_Shift2}}$. As DR$_{\mathrm{Reduced\_Shift}}$ < DR$_{\mathrm{Reduced}}$, the LUTC can be reduced considerably. Additionally, if the LUT curve is vertically symmetrical about the shifted straight line, as shown in Fig. 3, the LUTC can be reduced further by the folding method described in the next section. The shift by b requires no additional complex hardware: b is added to the straight reference line using the simple adder that adds F$_{\mathrm{Offset}}$(x) to the straight line, as explained in Section IV.

Fig. 3. Shifted straight line-based representation without origin crossing.
../../Resources/ieie/JSTS.2023.23.3.162/fig3.png
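The effect of shifting the reference line can be sketched in Python as follows. The tone curve (which starts at code 60 rather than at the origin), the slope 2^-1 + 2^-2, and the choice of b as the smallest offset are all assumptions made for this illustration; the paper does not prescribe how b is chosen.

```python
def build_curve(n: int = 8, floor_level: int = 60, gamma: float = 0.9) -> list[int]:
    """Hypothetical tone curve that starts at floor_level instead of the origin."""
    top = (1 << n) - 1
    span = top - floor_level
    return [floor_level + round(span * (x / top) ** gamma) for x in range(top + 1)]


def reference_line(x: int) -> int:
    """y = (2^-1 + 2^-2) * x with truncation: two right shifts and one add."""
    return (x >> 1) + (x >> 2)


curve = build_curve()

# Reference y = ax: the offsets sit far from zero, so the range to store is large.
offset_ax = [f - reference_line(x) for x, f in enumerate(curve)]
dr_unshifted = max(abs(o) for o in offset_ax)      # Eq. (10), one-sided case

# Reference y = ax + b: the same offsets measured from the shifted line.
b = min(offset_ax)                                 # assumed choice of the shift
offset_shift = [o - b for o in offset_ax]
dr_shifted = max(offset_shift) - min(offset_shift)

# The curve is still recovered exactly from line + b + offset.
restored = [reference_line(x) + b + o for x, o in enumerate(offset_shift)]
print(dr_unshifted, dr_shifted, restored == curve)
```

Only the constant b and the smaller offsets need to be kept; the extra cost is one addition of b, which matches the text's remark that no additional complex hardware is needed.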

3.2 Vertical Folding

As shown in Fig. 4, when the LUT tone curve is vertically symmetrical about the straight line (Fig. 1(a) and (c)), the input and output dynamic ranges are almost halved, which reduces the LUTC further. In Fig. 4, the dynamic range DR$_{\mathrm{Reduced}}$ (= DR$_{\mathrm{Reduced1}}$ + DR$_{\mathrm{Reduced2}}$) is obtained through the y = ax straight line-based representation, and F(x) has symmetric up and down offsets with DR$_{\mathrm{Reduced1}}$ = DR$_{\mathrm{Reduced2}}$. In this case, vertical folding can be performed about the intermediate pixel value, v, giving the halved dynamic range DR$_{\mathrm{Reduced\_VF}}$ shown in Eq. (12).

(12)
DR$_{\mathrm{Reduced\_VF}}$ = DR$_{\mathrm{Reduced}}$ / 2

The reference pixel value of the folding, v, is approximately 2$^{\mathrm{n}}$/2; for an n-bit input ranging from 0 to 2$^{\mathrm{n}}$ ${-}$ 1, the exact midpoint is (2$^{\mathrm{n}}$ ${-}$ 1)/2, for example 127.5 for n = 8. When vertical folding is possible, F(x) can be obtained by adding F$_{\mathrm{Offset}}$(x) to, or subtracting it from, the reference straight line value, as shown in Eq. (13).

(13)
$ \mathrm{F}\left(\mathrm{x}\right)=\left\{\begin{array}{ll} \mathrm{ax}+\mathrm{F}_{\text{Offset}}\left(\mathrm{x}\right) & \left(\mathrm{x}\leq \mathrm{v}\right)\\ \mathrm{ax}+\mathrm{F}_{\text{Offset}}\left(2\mathrm{v}-\mathrm{x}\right) & \left(\mathrm{x}>\mathrm{v}\right) \end{array}\right. $

Therefore, the offset LUT takes x as its input when x ${\leq}$ v and 2v - x when x > v. No additional hardware is required when x ${\leq}$ v; the address conversion for x > v is described in Section III.4. When vertical folding can be applied, both the OBW and the input bit width are reduced: the OBW is reduced to OBW$_{\mathrm{Reduced\_VF}}$ according to Eq. (6), and the input bit width is reduced to n ${-}$ 1. The LUTC is then obtained as shown in Eq. (14).

(14)
LUTC$_{\mathrm{Reduced\_VF}}$ = 2$^{\mathrm{n-1}}$• OBW$_{\mathrm{Reduced\_VF}}$
Fig. 4. Vertical Folding.
../../Resources/ieie/JSTS.2023.23.3.162/fig4.png

For example, if n = 8, DR$_{\mathrm{Reduced}}$ = 1.6, e = 0.1, and vertical folding can be applied, DR$_{\mathrm{Reduced\_VF}}$ = 0.8 from Eq. (12), OBW$_{\mathrm{Reduced\_VF}}$ = 3, and LUTC$_{\mathrm{Reduced\_VF}}$ is 2$^{7}$ ${\times}$ 3 = 384 from Eq. (14). Therefore, LUTC$_{\mathrm{Reduced\_VF}}$ is 62.5% smaller than LUTC$_{\mathrm{Reduced}}$ = 1024.
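The counts in this example follow directly from Eqs. (12) and (14), as the Python sketch below shows. The helper name `obw` is hypothetical, e = 0.1 is carried over from the example in Section III.1, and rounding the output bit width up to whole bits is an assumption.

```python
def obw(dynamic_range: float, e: float) -> int:
    """Output bit width: ceil(log2(DR / e)), with DR / e rounded to an integer."""
    levels = round(dynamic_range / e)
    return (levels - 1).bit_length()


n, e = 8, 0.1
dr_reduced = 1.6
dr_vf = dr_reduced / 2                              # Eq. (12)

lutc_reduced = (1 << n) * obw(dr_reduced, e)        # Eq. (7): 2^n * OBW_Reduced
lutc_vf = (1 << (n - 1)) * obw(dr_vf, e)            # Eq. (14): 2^(n-1) * OBW_Reduced_VF

print(lutc_reduced, lutc_vf, lutc_vf / lutc_reduced)  # 1024 384 0.375
```

Folding saves ROM bits twice: one address bit is dropped (half as many entries) and one output bit is dropped (half the dynamic range).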

3.3 Mirroring

As shown in Fig. 1(e), a LUT tone curve that decreases without crossing the origin can be mirrored about a mirroring point, M, to obtain F$_{\mathrm{Mirrored}}$(x), which starts from the origin and increases, through horizontal symmetry, as shown in Fig. 5.

By replacing the original input value x$_{\mathrm{Origin}}$ with the mirrored input value x$_{\mathrm{Mirrored}}$, F(x) is mirrored so that the straight line-based representation with a straight line passing through the origin can be used. Here, x$_{\mathrm{Mirrored}}$ is obtained by subtracting x$_{\mathrm{Origin}}$ from 2M. This is processed in the same way as the input address for x > v in Eq. (13) in the case of vertical folding.

Fig. 5. Mirroring.
../../Resources/ieie/JSTS.2023.23.3.162/fig5.png

3.4 Input Address Calculation in Cases of Vertical Folding and Mirroring

The input address of the LUT must be converted: for the vertical folding of Fig. 4, x is converted to 2v - x, and for the mirroring of Fig. 5, x is converted to 2m - x.

Fig. 6 shows a method of converting the address of the LUT based on the vertical folding (mirroring) reference point v (m). For example, if x is 203 when n = 8 and v = 127.5, the input for the LUT is 2v $-$ x = 52 (Eq. (13)). In binary, x = 203 is 11001011 and the LUT input 52 is 00110100, which is the 1’s complement of 203. The address can therefore be calculated simply by inverting the bits of x.

Fig. 6. Input address calculation of vertical folding and mirroring.
../../Resources/ieie/JSTS.2023.23.3.162/fig6.png
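The equivalence between 2v - x and the bitwise inverse of x can be verified exhaustively for n = 8. The sketch below is in Python rather than hardware description code, and the name `folded_address` is hypothetical; it only models the address conversion, not the rest of the datapath.

```python
N_BITS = 8
MASK = (1 << N_BITS) - 1          # 255, which equals 2v for v = 127.5


def folded_address(x: int) -> int:
    """Offset-LUT address: x itself for x <= v, bitwise inverse (2v - x) above v."""
    if x <= MASK // 2:            # x <= 127, i.e. x <= v = 127.5
        return x
    return x ^ MASK               # 1's complement within N_BITS


# The inverter output equals the arithmetic value 2v - x for every 8-bit input.
inverter_matches = all((x ^ MASK) == MASK - x for x in range(MASK + 1))

# Every folded address fits in n - 1 = 7 bits, so the offset LUT has 2^(n-1) entries.
fits_in_7_bits = all(folded_address(x) < (1 << (N_BITS - 1)) for x in range(MASK + 1))

print(folded_address(203), format(folded_address(203), "08b"))  # 52 00110100
print(inverter_matches, fits_in_7_bits)                         # True True
```

The same conversion applies to mirroring with m in place of v, since both replace x by a reflection about the midpoint of the input range.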

IV. STRAIGHT LINE IMPLEMENTATION METHOD

1. The Straight Line With the Shape of y = ${\sum}$ 2$^{\mathrm{k}}$• x

This section describes how to process the straight line used as the reference for representing the LUT tone curve F(x). If the slope a of y = ax in Eq. (9) can be expressed as ${\sum}$ 2$^{\mathrm{k}}$, the line can be implemented without additional complicated hardware: x is shifted by k bits for each value of k, and the shifted terms are added, as shown in Eq. (15). The case where a ${\neq}$ ${\sum}$ 2$^{\mathrm{k}}$ is described in Section IV.2.

(15)
y = ${\sum}$ 2$^{\mathrm{k}}$• x

For the straight line-based representation, one adder is required to add the offset LUT for F$_{\mathrm{Offset}}$ to the straight line, as shown in Eq. (11). The adder count, Adder$_{\mathrm{2^k\_Line}}$, covers both the adders that sum the shifted terms of the straight line in Eq. (15) and the adder for the offset LUT, as shown in Eq. (16). If the count of nonzero k is the number of all k’s ${\neq}$ 0,

(16)
Adder$_{\mathrm{2^k\_Line}}$ = Count of nonzero k + 1

The shifter count, Shifter$_{\mathrm{2^k\_Line}}$, is given by Eq. (17).

(17)
Shifter$_{\mathrm{2^k\_Line}}$ = Count of nonzero k

If the dynamic range of the LUT reduced by the straight line approach of Eq. (15) is DR$_{\mathrm{Reduced\_2^k}}$ and the reduced output bit width is OBW$_{\mathrm{Reduced\_2^k}}$, LUTC$_{\mathrm{Reduced\_2^k}}$ can be expressed as shown in Eq. (18).

(18)
LUTC$_{\mathrm{Reduced\_2^k}}$ = 2$^{\mathrm{n}}$•OBW$_{\mathrm{Reduced\_2^k}}$

For example, the straight line y = 6x can be represented as y = 2$^{2}$x + 2$^{1}$x, with Adder$_{\mathrm{2^k\_Line}}$ = 2 and Shifter$_{\mathrm{2^k\_Line}}$ = 2. As a result, the implementation is feasible with a small amount of hardware.
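The y = 6x example can be sketched in Python to show that the shift-and-add form of Eq. (15) reproduces the multiplication exactly. The function name and the list-of-shifts representation are hypothetical; only non-negative k (left shifts) are modeled here.

```python
def shift_add_line(x: int, shifts: list[int]) -> int:
    """y = sum(2^k * x) over the given k values (Eq. (15)), using shifts only."""
    return sum(x << k for k in shifts)


shifts_for_6 = [2, 1]                                   # 6x = 2^2 * x + 2^1 * x
shifter_count = sum(1 for k in shifts_for_6 if k != 0)  # Eq. (17)

# The shift-add form equals 6 * x for every 8-bit input, with no multiplier.
exact = all(shift_add_line(x, shifts_for_6) == 6 * x for x in range(256))
print(shifter_count, exact)  # 2 True
```

A k = 0 term would pass x through without a shifter, which is why Eq. (17) counts only the nonzero k values.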

2. Straight Line Implementation using LUT

This section presents a method for efficiently implementing the straight line closest to the LUT curve, which reduces the dynamic range further when the reduction achieved with the reference line y = ${\sum}$ 2$^{\mathrm{k}}$• x of Section IV.1 is insufficient.

Fig. 7 shows two straight line-based LUT representations: one with the straight line closest to F(x), which is discussed in this section, and one with the straight line y = ${\sum}$ 2$^{\mathrm{k}}$• x of Eq. (15). When the difference between the straight line and F(x) is large, the dynamic range increases, which increases LUTC. To implement the reference straight line with low hardware complexity, a LUT-based straight-line method is presented that builds the line from several smaller, separate LUT lookups. The n-bit input address, which makes LUTC grow exponentially as shown in Eq. (4), is partitioned into p-bit blocks whose LUT outputs are added, as shown in Eq. (19). The straight line processed with p bits instead of n bits is defined as G(x), where p is the partitioned input bit width.

Fig. 7. Two types of straight-line approximation.
../../Resources/ieie/JSTS.2023.23.3.162/fig7.png
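(19)
$ \mathrm{y}=\sum _{\mathrm{k}=0}^{\mathrm{n}/\mathrm{p}-1}\mathrm{G}\left(\mathrm{x}\left[\left(\mathrm{k}+1\right)\cdot \mathrm{p}-1:\mathrm{k}\cdot \mathrm{p}\right]\cdot 2^{\mathrm{k}\cdot \mathrm{p}}\right)=\sum _{\mathrm{k}=0}^{\mathrm{n}/\mathrm{p}-1}2^{\mathrm{k}\cdot \mathrm{p}}\cdot \mathrm{G}\left(\mathrm{x}\left[\left(\mathrm{k}+1\right)\cdot \mathrm{p}-1:\mathrm{k}\cdot \mathrm{p}\right]\right) $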

The input address is partitioned into n/p blocks of p bits each, and the LUT values for the blocks are added. Because each lookup uses a much smaller input bit width, the LUTC is reduced significantly. Since the reference line is linear, the decomposition in Eq. (19) is exact, and the reference straight line is processed without any error. Adders and shifters are needed because each LUT output must be shifted to the left by a multiple of p bits before it is added. A total of n/p ${-}$ 1 adders are used to calculate the expression in Eq. (19), and one more adder is required to add the offset LUT to the straight line. The total adder count, Adder$_{\mathrm{Line\_LUT}}$, is given by Eq. (20).

(20)
Adder$_{\mathrm{Line\_LUT}}$ = n/p

The total shifter count, Shifter$_{\mathrm{Line\_LUT}}$, is given by Eq. (21).

(21)
Shifter$_{\mathrm{Line\_LUT}}$ = n/p – 1

The required LUT is divided into the offset LUT and the LUT that implements the straight reference line. OBW$_{\mathrm{Line\_LUT}}$ is the output bit width of the line LUT, which is addressed by the partitioned p input bits instead of n bits; it is the same as the OBW of the original F(x). LUTC$_{\mathrm{Line\_LUT}}$ is the LUTC for implementing the straight line based on Eq. (19), which can be represented as

(22)
LUTC$_{\mathrm{Line\_LUT}}$ = 2$^{\mathrm{p}}$ • OBW$_{\mathrm{Line\_LUT}}$

The dynamic range reduced by the straight line is DR$_{\mathrm{Reduced\_Line}}$, and the corresponding output bit width is OBW$_{\mathrm{Reduced\_Line}}$. LUTC$_{\mathrm{Reduced\_Line}}$ is the LUTC of the offset LUT when the dynamic range is reduced by the LUT-based straight line of Eq. (22). LUTC$_{\mathrm{Reduced\_Line\_LUT}}$, which is the sum of the LUT needed for implementing the straight line and the offset LUT, can be represented as

(23)
LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ = 2$^{\mathrm{n}}$•OBW$_{\mathrm{Reduced\_Line}}$ + 2$^{\mathrm{p}}$•OBW$_{\mathrm{Line\_LUT}}$

For example, let OBW$_{\mathrm{Line\_LUT}}$ = 8, OBW$_{\mathrm{Reduced\_Line}}$ = 4, and let the n = 8 input bits be divided into two blocks of p = 4 bits. Then y can be calculated as y = a(x[7:4]•2$^{4}$ + x[3:0]) = a•x[7:4]•2$^{4}$ + a•x[3:0]. Adder$_{\mathrm{Line\_LUT}}$ is 2, and Shifter$_{\mathrm{Line\_LUT}}$ is 1. LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ is 2$^{8}$•4 + 2$^{4}$•8 = 1152, which is a reduction of about 43.75% compared to LUTC$_{\mathrm{Origin}}$ = 2048.
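The partitioned line LUT of Eq. (19) and the count of Eq. (23) can be sketched in Python for this n = 8, p = 4 example. The integer slope A = 3 and the names `G` and `line_from_lut` are hypothetical; a fractional slope would need a fixed-point representation, which is not modeled here.

```python
N, P = 8, 4                          # address width and partition width
A = 3                                # hypothetical integer slope of the line
G = [A * i for i in range(1 << P)]   # 2^p-entry line LUT: G(i) = a * i


def line_from_lut(x: int) -> int:
    """y = sum_k 2^(k*p) * G(x[(k+1)p-1 : kp]) as in Eq. (19)."""
    y = 0
    for k in range(N // P):
        block = (x >> (k * P)) & ((1 << P) - 1)   # k-th p-bit slice of x
        y += G[block] << (k * P)                  # shift left by k*p, then add
    return y


# Linearity makes the decomposition exact for every 8-bit input.
exact = all(line_from_lut(x) == A * x for x in range(1 << N))

obw_line_lut, obw_reduced_line = 8, 4
lutc = (1 << N) * obw_reduced_line + (1 << P) * obw_line_lut   # Eq. (23)
print(exact, lutc, 1 - lutc / 2048)  # True 1152 0.4375
```

The same 16-entry table serves both 4-bit slices; only the shift applied to its output differs, which is why a single shifter and a recursive adder suffice.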

V. ARCHITECTURE

The architectures for implementing the proposed LUT ROM compression are as follows. Fig. 8(a) shows a line calculation unit for the straight line of Eqs. (15) and (19), an adder, and an offset LUT for F$_{\mathrm{Offset}}$(x), which has a reduced dynamic range, as shown in Eq. (11). The adder can also be used to add the shifting value b. In Fig. 8(b), an address calculation unit is added to Fig. 8(a) for vertical folding and mirroring. The address calculation unit generates the input address of the offset LUT simply by using an inverter with respect to v (m), as shown in Eq. (13), for vertical folding (mirroring). The offset LUT stores the F$_{\mathrm{Offset}}$(x) values, which require a smaller dynamic range and a shorter input address, as shown in Eq. (13). The line calculation unit and adder are the same as in Fig. 8(a). Synthesis with Synopsys Design Compiler using UMC 40 nm standard cell libraries shows that the chip area (power consumption) of Fig. 8(a) and (b) is 10867 µm$^{2}$ (225 µW) and 10899 µm$^{2}$ (262 µW), respectively, at an operating frequency of 200 MHz. The proposed LUT architecture can be implemented with a small chip area and low power consumption, with the high processing reliability of ROM. The detailed structures of the address calculation unit and the line calculation units are shown in Figs. 9-11.

Fig. 9 illustrates the detailed structure of the address calculation unit, which processes the address for the LUT in Fig. 8(b) for the vertical folding (mirroring) shown in Figs. 4 and 5. With the vertical folding (mirroring) point v (m) as the reference, a comparator selects the inverted value when x > v (m) and the original value when x ${\leq}$ v (m).

Fig. 10 shows the line calculation unit for implementing the straight line represented in Eq. (15). Because ${\sum}$ 2$^{\mathrm{k}}$• x must be calculated for the input data, x is shifted by k bits with a shifter before being fed to the adder. The offset LUT is the one discussed for Fig. 8. The number of shifters equals the count of nonzero k values, as in Eq. (17), and the count of nonzero k values + 1 adders are required, as in Eq. (16). A total of 3 adders are used to add the straight line, F$_{\mathrm{Offset}}$(x), and the shift value. If an image of 1024 x 768 resolution is processed at 60 Hz, the frame period is 16.6 ms and the time available for 1 pixel is 16.6 ms/(1024•768), which is approximately 21 ns. Therefore, with current process technologies, a single 8-bit adder can perform the three additions recursively within one pixel time, which makes a single-adder implementation sufficient.

Fig. 11 shows the line calculation unit that processes the straight line using the LUT as in Eq. (19). To reduce LUTC, only the LUT data for G(x), addressed by the p (${\leq}$ n) input bits of Eq. (19), are needed. The shifter shifts the LUT output by p bits and feeds it into the adder. Here too, a single recursive adder is sufficient. The proposed LUT ROM compression method is easily implemented with an adder, MUX, inverter, and shifter, all of which are simple hardware.

Fig. 8. LUT architecture: (a) LUT with y = ax straight line approximation and shifting; (b) Vertical folding and mirroring.

../../Resources/ieie/JSTS.2023.23.3.162/fig8-1.png

../../Resources/ieie/JSTS.2023.23.3.162/fig8-2.png

Fig. 9. Address calculation unit.
../../Resources/ieie/JSTS.2023.23.3.162/fig9.png
Fig. 10. y = ${\sum}$ 2$^{\mathrm{k}}$• x line calculation unit for a straight line.
../../Resources/ieie/JSTS.2023.23.3.162/fig10.png
Fig. 11. Line calculation unit using LUT.
../../Resources/ieie/JSTS.2023.23.3.162/fig11.png

VI. VERIFICATION

1. MATLAB Simulation

Fig. 12 shows the results of the MATLAB simulation on an actual image using the existing LUT and the proposed LUT. The gamma is adjusted in three gray level ranges according to the histogram, and it has been verified that there is no difference between the resulting images.

The MSE between the image produced by the conventional LUT and the image produced by the proposed compressed gamma LUT is 0, which shows that the proposed LUT compression reduces the required LUT count without any loss.

Fig. 12. MATLAB simulation of gamma adjustment according to a histogram: (a) Original image; (b) conventional LUT result; (c) Proposed LUT result.
../../Resources/ieie/JSTS.2023.23.3.162/fig12.png

2. VLSI Implementation

Fig. 13 shows the LUT ROM compression architecture for the mobile display system platform. It is the block diagram used for functional verification of the proposed LUT in a conventional display system on an FPGA development kit composed of a graphic controller, frame memory, and graphic coprocessor. The graphic controller divides the image data from the line buffer into 8-bit R, G, and B pixel data using a pixel driver. For verification, the proposed LUT ROM is emulated on the FPGA at the output of the pixel driver in the graphic controller. As in the MATLAB simulation in the previous section, the LUT applies the gamma-adjusted curve to R, G, and B according to the histogram of each pixel data. Fig. 14 shows the synthesis results of the adder, address calculation unit, line calculation unit, and lookup table in Fig. 8. The presented LUT architecture can be implemented with simple extra hardware in addition to the compressed LUT. Fig. 15 shows the results of the Verilog HDL simulation in which the output F(x) is obtained by adding F$_{\mathrm{Offset}}$(x), whose dynamic range is reduced as shown in Eq. (11), to the value of the straight line implemented with the LUT as shown in Eq. (19).

Vga_clk is the clock sent to the LCD panel, vsync (hsync) is the vertical (horizontal) sync signal for the panel display, de is the display-enable signal of the panel, and r, g, and b are the red, green, and blue pixel data, respectively. A delay of 1 clock occurs when the outputs of the offset LUT and the line calculation unit are added through the recursive adder. However, if the image data are taken from the line buffer 1 clock earlier to match the timing of the VGA driver, no additional delay is incurred. Therefore, the proposed LUT can be easily realized in existing systems.

Fig. 13. Display system architecture with proposed LUT.
../../Resources/ieie/JSTS.2023.23.3.162/fig13.png
Fig. 14. Synthesis results: (a) Adder; (b) Address calculation; (c) Line calculation; (d) Look up table.
../../Resources/ieie/JSTS.2023.23.3.162/fig14.png
Fig. 15. Simulation Waveform.
../../Resources/ieie/JSTS.2023.23.3.162/fig15.png

VII. PERFORMANCE COMPARISONS AND ADVANTAGES

The proposed LUT compression methods are evaluated and compared with the conventional LUT implementation methods.

Fig. 16 illustrates LUTC vs. DR$_{\mathrm{Origin}}$/DR$_{\mathrm{Reduced}}$ when n = 8. As the dynamic range is reduced, the LUTC reduces drastically. In addition, Fig. 16 shows the LUTC for p = 2, 4, and 8 when the LUT input bit width is partitioned using Eq. (19). As DR$_{\mathrm{Reduced}}$ is reduced, LUTC$_{\mathrm{Reduced}}$, LUTC$_{\mathrm{Reduced\_VF}}$, and LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ all decrease. LUTC$_{\mathrm{Reduced\_VF}}$ is the smallest because its dynamic range is halved compared to that of LUTC$_{\mathrm{Reduced}}$, whereas LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ is larger than LUTC$_{\mathrm{Reduced}}$ because the LUT for the straight line is added. Because of the straight-line-based representation implemented in the form of y = ${\sum}$ 2$^{\mathrm{k}}$• x in Eq. (15), LUTC$_{\mathrm{Reduced}}$ with a reduced dynamic range shows a 65% higher reduction rate compared to the case of DR$_{\mathrm{Origin}}$/DR$_{\mathrm{Reduced}}$ = 128. LUTC$_{\mathrm{Reduced\_VF}}$, which is reduced by vertical folding using Eq. (14), is further reduced by 53% compared to LUTC$_{\mathrm{Reduced}}$, resulting in a total reduction rate of up to 87.5%.

Fig. 16. LUTC according to the change in DR$_{\mathrm{Reduced}}$ (n = 8).
../../Resources/ieie/JSTS.2023.23.3.162/fig16.png
Fig. 17. LUTC according to change of n (DR$_{\mathrm{Origin}}$/DR$_{\mathrm{Reduced}}$ = 128).
../../Resources/ieie/JSTS.2023.23.3.162/fig17.png
Fig. 18. LUTC$_{\mathrm{Reduced\_2^k}}$ and LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ (n = 16).
../../Resources/ieie/JSTS.2023.23.3.162/fig18.png

Fig. 17 illustrates the LUTC vs. pixel width, n, which is the input address width of the LUT. As n increases, which is a current trend, the LUTC increases dramatically, as described in Eq. (4). The LUTC is reduced more as the dynamic range is reduced with the various reference straight lines. For the same reduced dynamic range, the LUTCs of Eqs. (18) and (23) show similar reduction rates even when the input bit width is increased, and the vertical folding of Eq. (14) reduces LUTC further.

For the two types of straight lines in Fig. 7, with n = 16 and various values of DR$_{\mathrm{Reduced\_2^k}}$/DR$_{\mathrm{Reduced\_Line}}$ and p, Fig. 18 compares LUTC$_{\mathrm{Reduced\_2^k}}$, obtained with y = ${\sum}$ 2$^{\mathrm{k}}$${\cdot}$x in Eq. (15), and LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ = LUTC$_{\mathrm{Reduced\_Line}}$ + LUTC$_{\mathrm{Line\_LUT}}$ in Eq. (23), obtained with the LUT-based straight line of Eq. (19).

Based on this, the efficiencies of the two methods of implementing the straight line shown in Fig. 7 are compared. Three cases are evaluated. Case (I): LUTC$_{\mathrm{Reduced\_2^k}}$ = LUTC$_{\mathrm{Reduced\_Line\_LUT}}$; Case (II): LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ < LUTC$_{\mathrm{Reduced\_2^k}}$; and Case (III): LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ > LUTC$_{\mathrm{Reduced\_2^k}}$. When DR$_{\mathrm{Reduced\_2^k}}$ = DR$_{\mathrm{Reduced\_Line}}$, the shifter-based straight line (LUTC$_{\mathrm{Reduced\_2^k}}$) is more effective, because LUTC$_{\mathrm{Line\_LUT}}$ for the straight line is added to LUTC$_{\mathrm{Reduced\_Line}}$ in Eq. (23). As DR$_{\mathrm{Origin}}$ increases and the value of p decreases, more partitions are required. In Case (I), DR$_{\mathrm{Reduced\_2^k}}$/DR$_{\mathrm{Reduced\_Line}}$ can be represented as:

(24)
$\frac{\mathrm{DR}_{\mathrm{Reduced\_2^{k}}}}{\mathrm{DR}_{\mathrm{Reduced\_Line}}}=2^{2^{\mathrm{p}-\mathrm{n}}\cdot \log _{2}\left(\frac{\mathrm{DR}_{\mathrm{Origin}}}{\mathrm{e}}\right)}$

When DR$_{\mathrm{Reduced\_2^k}}$ / DR$_{\mathrm{Reduced\_Line}}$ is larger than the right side of Eq. (24), LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ is reduced as in Case (II), and implementing the straight line using the LUT becomes more effective. In Case (I), p can be represented as shown in Eq. (25).

(25)
$\mathrm{p}=\log _{2}\left(\frac{2^{\mathrm{n}}\cdot \log _{2}\left(\frac{\mathrm{DR}_{\mathrm{Reduced\_2^{k}}}}{\mathrm{DR}_{\mathrm{Reduced\_Line}}}\right)}{\log _{2}\left(\frac{\mathrm{DR}_{\mathrm{Origin}}}{\mathrm{e}}\right)}\right)$

When p is smaller than the right side of Eq. (25), LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ decreases as in Case (II), and implementing the straight line using the LUT is more effective. Therefore, the LUT implementation can be optimized by controlling p according to DR$_{\mathrm{Reduced\_2^k}}$ / DR$_{\mathrm{Reduced\_Line}}$.
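
The break-even relations in Eqs. (24) and (25) are inverses of each other, which can be checked numerically. The following is a minimal sketch, not the authors' tool: the function names are hypothetical, the quantity e is passed in as a plain parameter exactly as it appears in Eq. (24), and the numeric values are illustrative only.

```python
import math


def break_even_ratio(n, p, dr_origin, e):
    """Right side of Eq. (24): the DR ratio at which Case (I) holds."""
    return 2.0 ** (2.0 ** (p - n) * math.log2(dr_origin / e))


def break_even_p(n, ratio, dr_origin, e):
    """Right side of Eq. (25): the p at which Case (I) holds for a given DR ratio."""
    return math.log2(2.0 ** n * math.log2(ratio) / math.log2(dr_origin / e))


# Illustrative values only (assumptions, not taken from the paper).
n, p, dr_origin, e = 16, 8, 2 ** 16, 2.0

ratio = break_even_ratio(n, p, dr_origin, e)
p_back = break_even_p(n, ratio, dr_origin, e)

print(f"break-even DR ratio for p = {p}: {ratio:.4f}")
print(f"p recovered from that ratio: {p_back:.4f}")
```

A measured DR ratio above `ratio` would correspond to Case (II), where the LUT-based line is the cheaper option.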

Table 1 compares the LUT counts of the LUT compression methods presented in Figs. 17 and 18. For n = 16, LUTC$_{\mathrm{Reduced\_2^k}}$ and LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ achieve compression rates of approximately 43%, and LUTC$_{\mathrm{Reduced\_VF}}$ achieves a compression rate of 87.5%.
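
The quoted rates can be reproduced from the Table 1 entries. The sketch below assumes that "compression rate" means the fraction of the original LUT count that is removed, which is the reading consistent with the 43% and 87.5% figures; the dictionary and function names are hypothetical.

```python
# LUT counts copied from Table 1, keyed by input bit width n.
LUTC_ORIGIN = {8: 2_048, 12: 49_152, 16: 1_048_576}
LUTC_REDUCED = {
    "Reduced_2^k": {8: 256, 12: 20_480, 16: 589_824},
    "Reduced_VF": {8: 64, 12: 4_096, 16: 131_072},
    "Reduced_Line_LUT (p = n/2)": {8: 384, 12: 21_248, 16: 593_920},
    "Reduced_Line_LUT (p = n/4)": {8: 288, 12: 20_576, 16: 590_080},
}


def compression_rate(method, n):
    """Percentage of the original LUT count removed (assumed definition)."""
    return 100.0 * (1.0 - LUTC_REDUCED[method][n] / LUTC_ORIGIN[n])


for method in LUTC_REDUCED:
    print(f"{method:28s} n = 16: {compression_rate(method, 16):6.2f} %")
```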

Fig. 19 compares the adder count and the shifter count of Eqs. (16) and (17), respectively, for the implementation of the straight line y = ${\sum}$ 2$^{\mathrm{k}}$${\cdot}$x over various slopes. Both counts change with the slope but always stay below a certain value. For the implementation of the straight line using the LUT, the counts do not vary with the slope, as described by Eqs. (20) and (21). As p decreases, the adder count increases. However, since the pixel processing speed is low, a single recursive adder is sufficient, which minimizes the hardware complexity.
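
To make the shift-and-add form y = ${\sum}$ 2$^{\mathrm{k}}$${\cdot}$x concrete, the following sketch evaluates a line whose slope is given as a fixed-point integer and counts the terms involved. It is only one plausible reading: Eqs. (16) and (17) are not reproduced here and may count shifters and adders differently, and the names `line_by_shifts`, `shifter_and_adder_count`, `slope_fixed`, and `frac_bits` are hypothetical.

```python
def line_by_shifts(x, slope_fixed, frac_bits):
    """Evaluate y = slope * x using only shifts and additions.

    slope_fixed is the slope scaled by 2**frac_bits, so every set bit of
    slope_fixed contributes one shifted copy of x (one 2**k * x term).
    """
    acc = 0
    for bit in range(slope_fixed.bit_length()):
        if (slope_fixed >> bit) & 1:
            acc += x << bit
    return acc >> frac_bits  # drop the fractional bits once, at the end


def shifter_and_adder_count(slope_fixed):
    """Assumed cost model: one shifter per term, one adder per extra term."""
    terms = bin(slope_fixed).count("1")
    return terms, max(terms - 1, 0)


# Slope 2.75 = 2 + 0.5 + 0.25, stored with 2 fractional bits as 0b1011.
slope_fixed, frac_bits = 0b1011, 2
print(line_by_shifts(100, slope_fixed, frac_bits))
print(shifter_and_adder_count(slope_fixed))
```

Under this cost model the counts depend on the number of set bits in the slope, so they vary with the slope but are bounded by its bit width.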

Table 2 compares the chip areas and power consumptions of the LUT architecture shown in Fig. 8(b), obtained using Synopsys Design Compiler with UMC 40 nm LP (Low Power) RVT (Regular Voltage Threshold) standard cells, for the 8-bit input and 11-bit output currently used in display panels. Against the uncompressed case, a very conservative configuration is evaluated in which the input bits are partitioned into 2 and the offset LUT dynamic range is 4 bits, with vertical folding. As shown in Table 2, even with the additional line calculation unit, address calculation unit, and adder, the chip area is reduced by more than 70% and the power consumption is roughly halved when vertical folding is applied.

Tables 3 and 4 show the chip areas of all the units in the architectures shown in Fig. 8(a) and (b) with an offset LUT dynamic range of 4 bits. For both Fig. 8(a) and (b), the additional units required in the presented architectures, such as the adder, address calculation, and line calculation units, need only a small chip area compared to the offset LUT units, which are in turn much smaller than the uncompressed LUT, as shown in Table 1.
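
The share of the added logic can be worked out directly from the hierarchical areas in Tables 3 and 4. The sketch below only redoes that arithmetic; the variable and function names are hypothetical.

```python
# Hierarchical instance areas (um^2) copied from Tables 3 and 4.
FIG_8A = {
    "Adder unit": 144.3582,
    "Line calculation unit": 602.3304,
    "Offset LUT unit": 10_120.6351,
}
FIG_8B = {
    "Adder unit": 144.3582,
    "Address calculation unit": 13.6458,
    "Line calculation unit": 606.4002,
    "Offset LUT unit": 10_120.6351,
}


def overhead_share(units):
    """Percentage of the total area taken by everything except the offset LUT."""
    total = sum(units.values())
    extra = total - units["Offset LUT unit"]
    return 100.0 * extra / total


for name, units in (("Fig. 8(a)", FIG_8A), ("Fig. 8(b)", FIG_8B)):
    print(f"{name}: total {sum(units.values()):.4f} um^2, "
          f"non-LUT logic {overhead_share(units):.2f} %")
```

The unit areas add up to the top-level figures reported in the tables, and the non-LUT logic comes to roughly 7% of each total.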

The LUT tone curve, even for a unique curve shape such as the one in Fig. 1(f), has a monotonic, nearly straight-line shape. Therefore, most LUT ROMs can be compressed by reducing the dynamic range with the proposed straight reference line-based representation. The conventional method of LUT compression, which adds an error to the quantized value, achieves a compression rate of approximately 70% [10], but only for a regular error pattern. The error pattern is generally irregular, however, and a large precision loss is unavoidable if the compression rate is to be increased. Precision loss becomes even more problematic as the pixel width increases, which is required in recent and future applications. In addition, the LUT ROMs for the quantized values and the errors need to be divided into several ROMs, resulting in increased hardware complexity. If the proposed methods are applied to a typical LUT tone curve such as Fig. 1(c), the ratio of DR$_{\mathrm{Reduced}}$ to DR$_{\mathrm{Origin}}$ can be reduced, resulting in a compression rate of 65%. Unlike [10], there is no precision loss, because a regular error pattern is not required. Additionally, if vertical folding can be applied to such a LUT tone curve, a compression rate as high as 87.5% can be achieved without any loss. Furthermore, the proposed method can be easily implemented with simple hardware.
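
The lossless nature of the reference-line representation can be illustrated with a small sketch. It is written under stated assumptions rather than as the paper's implementation: the tone curve is a synthetic gamma-like curve (8-bit input, 11-bit output), the reference line simply joins the end points, a constant `base` stands in for a line that does not pass through the origin, and all function names are hypothetical.

```python
def build_offset_lut(lut, slope_num, slope_den):
    """Split a LUT into a reference line plus small non-negative offsets."""
    offsets = [y - (x * slope_num) // slope_den for x, y in enumerate(lut)]
    base = min(offsets)                       # intercept of the shifted line
    stored = [o - base for o in offsets]      # what the offset ROM would hold
    bits = max(stored).bit_length()           # reduced dynamic range in bits
    return stored, base, bits


def read_lut(x, stored, base, slope_num, slope_den):
    """Reconstruct the original entry: line value + intercept + stored offset."""
    return (x * slope_num) // slope_den + base + stored[x]


# Synthetic tone curve (assumption): 256 entries, values in 0..2047.
lut = [round(2047 * (x / 255) ** (1 / 1.2)) for x in range(256)]
out_bits = max(lut).bit_length()

stored, base, bits = build_offset_lut(lut, 2047, 255)
restored = [read_lut(x, stored, base, 2047, 255) for x in range(256)]

print("lossless:", restored == lut)
print(f"bits per entry: {out_bits} -> {bits}")
print(f"ROM bits: {len(lut) * out_bits} -> {len(stored) * bits}")
```

Because the offsets are exact integer differences, every entry is recovered bit for bit; the saving comes only from the narrower word stored per entry.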

Fig. 19. Adder and shifter count of the straight line implementation according to straight line slope.
../../Resources/ieie/JSTS.2023.23.3.162/fig19.png
Table 1. LUT count comparisons for the various LUT compressions presented in this work

| n (Input bit width) | 8 | 12 | 16 |
|---|---|---|---|
| LUTC$_{\mathrm{Origin}}$ | 2,048 | 49,152 | 1,048,576 |
| LUTC$_{\mathrm{Reduced\_2^k}}$ | 256 | 20,480 | 589,824 |
| LUTC$_{\mathrm{Reduced\_VF}}$ | 64 | 4,096 | 131,072 |
| LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ (p = n/2) | 384 | 21,248 | 593,920 |
| LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ (p = n/4) | 288 | 20,576 | 590,080 |

Table 2. Comparisons of chip areas and powers

| | Uncompressed | Fig. 8(b) | Fig. 8(b) with vertical folding |
|---|---|---|---|
| Partitioned input count | 0 | 2 | 2 |
| Vertical folding | No | No | Yes |
| Offset LUT dynamic range | - | 4 bits | 4 bits |
| Area ($\mu$m$^{2}$) | 17,893 | 9,688 | 5,310 |
| Power ($\mu$W) | 269 | 194 | 125 |
| Area comparison (%) | 100% | 54.2% | 29.7% |
| Power comparison (%) | 100% | 72.1% | 46.5% |

Table 3. Chip areas for Fig. 8(a)

| Fig. 8(a) hierarchical instance | Area ($\mu$m$^{2}$) |
|---|---|
| Lookup table (top level module) | 10,867.3237 |
| Adder unit | 144.3582 |
| Line calculation unit | 602.3304 |
| Offset LUT unit | 10,120.6351 |

Table 4. Chip areas for Fig. 8(b)

| Fig. 8(b) hierarchical instance | Area ($\mu$m$^{2}$) |
|---|---|
| Lookup table (top level module) | 10,885.0393 |
| Adder unit | 144.3582 |
| Address calculation unit | 13.6458 |
| Line calculation unit | 606.4002 |
| Offset LUT unit | 10,120.6351 |

VIII. CONCLUSION

Methodologies to reduce the ROM count are proposed that reduce the dynamic range by using a straight line as a reference, for a wide range of image signal processing applications. Approaches to implement the straight line using the shifter and the LUT are presented, where the line can be represented as y = ${\sum}$ 2$^{\mathrm{k}}$${\cdot}$x. Various methods to compress the LUT are proposed: shifting with a reference straight line that does not pass through the origin, vertical folding, and mirroring when the LUT curve is horizontally or vertically symmetrical. The proposed LUT compression method is verified on a mobile display platform (FPGA), and it is confirmed that the image quality does not deteriorate. When vertical folding is applied to a vertically symmetrical LUT curve and the ratio of the reduced dynamic range to the original dynamic range is small, the LUT count can be reduced by 87.5% without any precision loss. The proposed method can be applied to all LUTs, and it is suitable for implementation on a system on panel, where low hardware complexity is required.

References

[1] Zeng, Hui, et al. "Learning image-adaptive 3d lookup tables for high-performance photo enhancement in real-time." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.4 (2020): 2058-2073.
[2] Kumari, Apurva, and Subhendu Kumar Sahoo. "Fast single image and video deweathering using the look-up-table approach." AEU-International Journal of Electronics and Communications 69.12 (2015): 1773-1782.
[3] Alias, Binu, Anu Mehra, and P. Harsha. "Hardware implementation and testing of effective DPCM image compression technique using multiple-LUT." 2014 International Conference on Advances in Electronics Computers and Communications. IEEE, 2014.
[4] T.-L. Wu, et al., “Adaptive Color Image Enhancement Applied to Display Based on Hardware Design,” Display Workshops, IDW ’06, 13th International, 6-8, pp. 499-502, Dec., 2006.
[5] M.-S. Huang, “Gamma Correction only Gain/Offset Control System and Method for Display Controller,” US Patent No. 7068283, Jun., 2006.
[6] J. W. Park et al., "A Low-Cost and High-Throughput FPGA Implementation of the Retinex Algorithm for Real-Time Video Enhancement," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 28, no. 1, pp. 101-114, Jan. 2020, doi: 10.1109/TVLSI.2019.2936260.
[7] D. Han, “Real-Time Color Gamut Mapping Method for Digital TV Display Quality Enhancement,” Consumer Electronics, IEEE Transactions on, Vol. 50, No. 2, pp. 691-698, May, 2004.
[8] V. Monga, R. Bala and X. Mo, "Design and Optimization of Color Lookup Tables on a Simplex Topology," in IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1981-1996, April 2012, doi: 10.1109/TIP.2011.2177848.
[9] I. R. Khan, S. Rahardja, M. M. Khan, M. M. Movania and F. Abed, "A Tone-Mapping Technique Based on Histogram Using a Sensitivity Model of the Human Visual System," in IEEE Transactions on Industrial Electronics, vol. 65, no. 4, pp. 3469-3479, April 2018, doi: 10.1109/TIE.2017.2760247.
[10] B.-D. Yang and L.-S. Kim, “An Error Pattern ROM Compression Method for Continuous Data,” Circuits and Systems, IEEE International Symposium on, Vol. 2, pp. II-845, May, 2004.
[11] S. Kim and H.-J. Lee, "Optimized Interpolation and Cached Data Access in LUT-Based RGB-to-RGBW Conversion," in IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 65, no. 7, pp. 943-947, July 2018, doi: 10.1109/TCSII.2017.2740358.
[12] M. Ghosh, et al., “A Novel DDS Architecture Using Nonlinear ROM Addressing with Improved Compression Ratio and Quantization Noise,” Circuits and Systems, IEEE International Symposium on, Vol. 2, pp. II-705, May 2004.
[13] Pi, Dapu, et al. "Reducing the memory usage of computer-generated hologram calculation using accurate high-compressed look-up-table method in color 3D holographic display." Optics Express 27.20 (2019): 28410-28422.
[14] Pi, Dapu, et al. "Simple and effective calculation method for computer-generated hologram based on non-uniform sampling using look-up-table." Optics Express 27.26 (2019): 37337-37348.
[15] Jiao, Shuming, Zhaoyong Zhuang, and Wenbin Zou. "Fast computer generated hologram calculation with a mini look-up table incorporated with radial symmetric interpolation." Optics Express 25.1 (2017): 112-123.
[16] Gener, Y. Serhan, Sezer Gören, and H. Fatih Ugurdag. "Lossless look-up table compression for hardware implementation of transcendental functions." 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC). IEEE, 2019.
[17] Hsiao, Shen-Fu, Chia-Sheng Wen, and Po-Han Wu. "Compression of lookup table for piecewise polynomial function evaluation." 2014 17th Euromicro Conference on Digital System Design. IEEE, 2014.
[18] S.-Y. Kim, et al., “Image Contrast Enhancement Based on the Piecewise-Linear Approximation of CDF,” Consumer Electronics, IEEE Transactions on, Vol. 45, No. 3, pp. 823-834, Aug., 1999.
[19] K. C. Soon and K. K. Sup, “An Improved Contrast Control Method for LCD Monitor,” Korea Multimedia Society, Journal of, Vol. 5, No. 6, pp. 609-615, Dec., 2002.
[20] S6D0114, 132 RGB X 176 DOT 1-Chip Driver IC with Internal GRAM for 262,144 Colors TFT-LCD, SAMSUNG Electronics, 2002.
[21] X.-F. Feng, H. Pan, and S. Daly, “Dynamic Gamma: Applications to Hold Type Motion Blur Reduction Using Synchronized Backlight Flashing,” Display Workshops in conjunction with Asia Display, IDW/AD ’05, 12th International, 6-9, pp. 807-810, Dec., 2005.
[22] T. Kono, “Gamma Correction Method, Gamma Correction Apparatus, and Image Reading System,” US Patent No. 7271939 B2, Sep., 2007.
[23] J. Jia and C.-K. Tang, “Image Registration with Global and Local Luminance Alignment,” Computer Vision, ICCV ’03, Ninth IEEE International Conference on, 13-6, pp. 156-163, Oct., 2003.
[24] Rahman, Shanto, et al. “An adaptive gamma correction for image enhancement.” EURASIP Journal on Image and Video Processing 2016.1 (2016): 1-13.
[25] Li, Shih‐An, and Chi‐Yi Tsai. "Low‐cost and high‐speed hardware implementation of contrast‐preserving image dynamic range compression for full‐HD video enhancement." IET Image Processing 9.8 (2015): 605-614.

Author

Sun Myung Kim
../../Resources/ieie/JSTS.2023.23.3.162/au1.png

Sun Myung Kim received his B.S. and M.S. degrees from the Department of Electronic and Electrical Engineering, Hongik University, Seoul, Korea, in 2007 and 2009, respectively. In 2010, he joined TLi Inc., where he has been working in the area of image processing and high-speed interface design. His current interests include high-speed interfaces, low-power panel driving algorithms, and high dynamic range image processing.

Jaehee You
../../Resources/ieie/JSTS.2023.23.3.162/au2.png

Jaehee You received his B.S. degree in Electronics Engineering from Seoul National University, Seoul, Korea, in 1985. He received his M.S. and Ph.D. degrees in Electrical Engineering from Cornell University, Ithaca, NY, in 1987 and 1990, respectively. In 1990, he joined Texas Instruments, Dallas, TX, as a Member of the Technical Staff. In 1991, he joined the faculty of the School of Electrical Engineering, Hongik University in Seoul, Korea, where he is now supervising the Semiconductor Integrated System Laboratory. He is currently Vice President of the Institute of Semiconductor Engineers and has served as an Executive Director of the Drive technology and System research group of the Korean Information Display Society. He was a recipient of the Korean Ministry of Strategy and Finance, KEIT Chairman Award for Excellence, in 2011. He has worked as a technical consultant for many companies, such as Samsung Semiconductor, SK Hynix, Global Communication Technologies, P&K, Penta Micro, Nexia device, and Primenet. His current research interests include integrated system design for display image signal processing and perceptual image quality enhancements.