I. INTRODUCTION
To meet consumer needs, image processing systems, which include both conventional image
processing and perceptual image quality enhancement, have become increasingly complex.
Lookup table (LUT) based processing can be an efficient approach for low-power, real-time
image processing, especially in portable devices, because a conventional processing unit
requires a large amount of hardware that consumes considerable power. LUTs have a wide
range of applications. In [1], a 3D RGB LUT is used for CNN-based image-adaptive enhancement
of hue, saturation, exposure, color, and tone in photos, which requires many LUT entries.
In [2], gamma correction LUT-based dehazing methods are used for deweathering image
enhancement with low-computation real-time processing. In [3], a LUT is used for DPCM-based image compression.
However, conventional LUT-based approaches need a large amount of memory and a
high-bandwidth data bus for real-time high-resolution image processing. Studies have been
conducted on improving image quality using LUTs [4], but they require a large amount of memory.
In [5], only part of the data is stored in the LUT, selected according to the image, at the cost of
accuracy. In [6], a LUT with reduced input bit width is used for the retinex operation, which
introduces errors even when optimized. As display systems demand higher resolution and greater
color depth, and as processing speed increases, the amount of LUT storage can increase greatly.
Hence, methods to reduce the amount of LUT storage are necessary. Conventional LUT compression
methods store data at large intervals and fill the gaps between stored values by interpolation.
In [7-9], the LUT is compressed by dividing it into quantized data and an error ROM holding the
differences from the quantized data values [10]. In [11], approximation is also used, with a loss of
accuracy. Another LUT compression method is the sine/cosine LUT of the direct digital frequency
synthesizer [12], which targets a special application. However, the methods mentioned above have
limitations, such as the deterioration of accuracy due to interpolation or the increased hardware
complexity of partitioning into quantization and error ROMs. In a system-on-panel (SOP)
application, the hardware complexity of the LUT must be minimized to overcome the low processing
yield. Conventional LUT-based image processing also has many application-specific limitations.
For computer-generated holograms, which require a large number of LUT entries [13,14], shifted
modulation factors are computed from the basic modulation factor LUT; this works only for that
processing method and does not carry over to other applications. In [15], radial interpolation is
required, with a loss of accuracy. In [16], lossless LUT compression methods are presented, but they
are limited to transcendental functions. In [17], LUT compression methods are presented, but they
apply only to polynomial approximations with input interval segmentation, lose precision, and need
multipliers that consume large chip area and power.
Therefore, this work discusses more general LUT compression methodologies that exploit
the shape features of the LUT curve, cover a wide range of applications, and require
small chip area and power consumption. This paper describes lossless compression
methodologies, which preserve the accuracy of the data stored in the LUT ROM, using
low-complexity hardware. Various tone curve characteristics are analyzed in Section II,
and the method of reducing the dynamic range is described in Section III. The implementation
of the straight line required for the compression is described in Section IV,
the LUT architecture is presented in Section V, and MATLAB simulations and implementation
methodologies are discussed in Section VI. Finally, the evaluation results and performance
comparisons with the conventional LUT, such as ROM count, are discussed in Section VII.
II. ANALYSIS OF VARIOUS TONE CURVE CHARACTERISTICS
To reduce the LUT ROM count for the tone curves used in image signal processing, various
tone curves are analyzed and presented in Fig. 1. Fig. 1(a) shows the tone curve [4] for contrast processing adapted to the characteristics of the image,
and Fig. 1(b) shows the tone curve of the cumulative density function (CDF) [18] for histogram equalization. Fig. 1(c) shows the tone curve that reduces the dynamic range of the dark
and bright areas [4,19,20]. Fig. 1(d) shows the tone curves for different gamma values [21]. Fig. 1(e) shows the interpolation curve [22] for gamma adjustment, which divides the pixel values into multiple gray levels. Fig. 1(f) shows the tone curve for maintaining the luminance of the original image in the
sampled image [23]. Tone curves are nonlinear, but all of them increase or decrease in a manner
similar to a straight line, as shown in Fig. 1. In particular, the curve in Fig. 1(d) does not start
from the origin and the curve in Fig. 1(e) has a negative slope, yet both retain the
shape of a straight line. Furthermore, the curves in Fig. 1(a) and (c) are symmetrical about the mid-pixel point. Most
gamma curves in [24] also have this feature, and many other LUTs have straight-line shapes, as in [25]. In this work, a method for reducing the size of the LUT ROM is discussed based on
this analysis of tone curve and LUT characteristics, for a wide range
of image signal processing.
Fig. 1. LUT tone curves: (a) Contrast control according to a histogram; (b) The cumulative density function of the histogram; (c) Contrast sensitivity-based enhancement curve; (d) Dynamic gamma; (e) Interpolated pixel inversion curve; (f) Luminance vote curve.
III. LUT ROM COMPRESSION
1. Quantization Noise
In a LUT, the value to be displayed is stored as a quantized value for each input
pixel value. Therefore, to reduce the size of the LUT ROM, it is important to analyze
the factors that determine the LUT ROM count: dynamic range (DR), quantization
error, quantization level, and the characteristics of the LUT tone curve. If F(x) is the
output of a LUT tone curve, the dynamic range is the difference between the maximum
and minimum values of F(x). If DR$_{\mathrm{Origin}}$ is the DR of F(x), then it
can be expressed as:
If L$_{\mathrm{Origin}}$ is the number of quantization levels within the dynamic range,
the quantization error e is defined as
The output bit width (OBW) of the LUT can be represented as:
The LUT count (LUTC), as shown in Eq. (4), can be calculated using the bit width of the LUT
address, n, and OBW.
LUTC can be reduced by reducing the input bit width and OBW; a method to reduce
the input bit width is described in a later section. To reduce OBW, it is necessary
to reduce DR while maintaining e below the required level, as shown in Eq. (5). That is,
if a reduced dynamic range (DR$_{\mathrm{Reduced}}$) can be obtained to minimize LUTC
while keeping e below the required level, L$_{\mathrm{Reduced}}$ becomes the reduced
quantization level with respect to DR$_{\mathrm{Reduced}}$. Methods to reduce the
dynamic range are described in a later section.
Let OBW$_{\mathrm{Origin}}$ (OBW$_{\mathrm{Reduced}}$) be the original (reduced) LUT
output bit width. If DR$_{\mathrm{Origin}}$ is reduced to DR$_{\mathrm{Reduced}}$ according
to Eq. (5), the reduced OBW can be represented as:
Let LUTC$_{\mathrm{Origin}}$ (LUTC$_{\mathrm{Reduced}}$) be the original (reduced) LUT count.
The LUTCs for DR$_{\mathrm{Origin}}$ and DR$_{\mathrm{Reduced}}$ are represented in
Eqs. (4) and (7), respectively.
Therefore, the ratio between LUTC$_{\mathrm{Origin}}$ and LUTC$_{\mathrm{Reduced}}$
can be represented as:
With the same e, the LUTC depends on the log$_{2}$ of the dynamic range. For example,
when e = 0.1, n = 8, DR$_{\mathrm{Origin}}$ is 25.6, and OBW$_{\mathrm{Origin}}$ is 8,
reducing the dynamic range to DR$_{\mathrm{Reduced}}$ = 1.6 makes OBW$_{\mathrm{Reduced}}$ 4, and
LUTC halves: LUTC$_{\mathrm{Origin}}$ is 2048 by Eq. (4), and LUTC$_{\mathrm{Reduced}}$ is 1024 by Eq. (7). In particular, as the pixel bit width (n) of the image increases, which is a current
trend, LUTC increases exponentially with n. Hence, the LUTC saved by a given dynamic
range reduction also grows exponentially with n.
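The bookkeeping of this subsection can be sketched in a few lines. The snippet assumes that the output bit width is the ceiling of log$_{2}$(DR/e) and that LUTC is 2$^{n}$ times OBW, which is the reading that reproduces the numbers of the example above; the function names `output_bit_width` and `lut_count` are illustrative and do not come from the original work.

```python
import math


def output_bit_width(dynamic_range: float, quant_error: float) -> int:
    """Bits needed to cover `dynamic_range` in steps of `quant_error`.

    Assumed form: OBW = ceil(log2(DR / e)).
    """
    levels = dynamic_range / quant_error
    # Small tolerance so an exact power of two is not rounded up
    # because of floating-point noise in the division.
    return max(1, math.ceil(math.log2(levels) - 1e-9))


def lut_count(address_bits: int, obw: int) -> int:
    """Assumed form: LUTC = 2**n * OBW, i.e. total ROM bits."""
    return (1 << address_bits) * obw


n, e = 8, 0.1
obw_origin = output_bit_width(25.6, e)    # DR_Origin = 25.6
obw_reduced = output_bit_width(1.6, e)    # DR_Reduced = 1.6
print(obw_origin, lut_count(n, obw_origin))
print(obw_reduced, lut_count(n, obw_reduced))
```

Under these assumptions the two printed LUT counts correspond to LUTC$_{\mathrm{Origin}}$ and LUTC$_{\mathrm{Reduced}}$ of the example.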
2. Straight Line Approximation of LUT Curve
Based on the analysis of the LUT tone curves described in Section II, the LUT tone
curves mostly have the form of a straight line that passes through the origin and
then increases or decreases. Thus, a method for decreasing the LUTC by reducing
the dynamic range, as illustrated in Eq. (8), is described here. The procedure for reducing the dynamic range when
the straight line does not start from the origin, or when it has a negative slope,
is described in Section III.3.
Fig. 2 shows the method of representing the LUT tone curve F(x) with reference to a straight
line. The LUT can be represented with the reduced dynamic range, as described in Section
III.1. Such a process is presented as a straight-line-based representation. The straight
line can be represented as:
Various implementation methods of the straight line y = ax are described in Section
IV. DR$_{\mathrm{Reduced}}$ can be obtained from the difference between F(x) and
the straight line-based representation.
If DR$_{\mathrm{Origin}}$ is reduced to DR$_{\mathrm{Reduced}}$, as shown in Fig. 2, LUTC is reduced according to Eq. (8).
As shown in Eq. (11), since F$_{\mathrm{Offset}}$(x) has a value corresponding to the difference between
the tone curve and the straight line, it has a reduced DR$_{\mathrm{Reduced}}$. Therefore,
the hardware complexity of the LUT ROM can be reduced based on the straight line-based
representation, which can be easily realized in the hardware, and only the value of
F$_{\mathrm{Offset}}$(x) that has a small dynamic range corresponding to the difference
needs to be realized as the offset LUT.
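The straight line-based representation can be illustrated with a minimal sketch. The gamma-like curve, the slope a = 1, and the helper names below are assumptions chosen for illustration only; the point is that storing F$_{\mathrm{Offset}}$(x) = F(x) - ax instead of F(x) shrinks the stored dynamic range while the reconstruction stays exact.

```python
def build_offset_lut(curve, slope):
    """Return F_Offset(x) = F(x) - slope * x for every input code x."""
    return [f - slope * x for x, f in enumerate(curve)]


def dynamic_range(values):
    """Difference between the maximum and minimum stored values."""
    return max(values) - min(values)


# Hypothetical 8-bit tone curve: a gamma-like curve rounded to integers.
curve = [round(255 * (x / 255) ** 0.8) for x in range(256)]

slope = 1                                     # reference line y = x
offset_lut = build_offset_lut(curve, slope)

# Reconstruction: straight line plus the stored offset.
rebuilt = [slope * x + off for x, off in enumerate(offset_lut)]

print(dynamic_range(curve), dynamic_range(offset_lut))
print(rebuilt == curve)                       # lossless by construction
```

Because the offset is an exact integer difference, the reconstruction introduces no error regardless of how well the line fits; a better-fitting line only makes the offset LUT narrower.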
Fig. 2. Representation of the LUT curve with a straight line as a reference.
3. LUT Compression of Straight Line-based Representation Without Origin Crossing or With Symmetry
In this section, the method of reducing the LUTC further is described when the LUT
tone curve has the shape of a straight line that does not cross the origin or when
the tone curve has symmetry.
3.1 Shifting
A LUT tone curve that does not start from the origin, that is, a curve with a shape
as shown in Fig. 1(d), can be represented with reference to y = ax + b for the straight-line-based representation,
as shown in Fig. 3. This process can reduce the dynamic range further.
In Fig. 3, if y = ax is used as the reference for the LUT curve F(x), the dynamic range is DR$_{\mathrm{Reduced}}$.
When the straight line shifted upward by b is used as the reference, the dynamic range becomes
DR$_{\mathrm{Reduced\_Shift}}$ = DR$_{\mathrm{Reduced\_Shift1}}$
+ DR$_{\mathrm{Reduced\_Shift2}}$. As DR$_{\mathrm{Reduced\_Shift}}$
< DR$_{\mathrm{Reduced}}$, the LUTC can be reduced considerably. Additionally,
if the LUT curve is vertically symmetrical about the shifted straight line,
as shown in Fig. 3, the LUTC can be reduced further by the folding method
described in the next section. The shift by b needs no additional complex hardware:
b is added to the straight reference line using the simple adder that adds
F$_{\mathrm{Offset}}$(x) to the straight line,
as explained in Section IV.
Fig. 3. Shifted straight line-based representation without origin crossing.
3.2 Vertical Folding
As shown in Fig. 4, when the LUT tone curve is vertically symmetrical about the straight
line (Fig. 1(a) and (c)), the input and output dynamic ranges are almost halved, which reduces the LUTC
further. In Fig. 4, the dynamic range DR$_{\mathrm{Reduced}}$ (= DR$_{\mathrm{Reduced1}}$ + DR$_{\mathrm{Reduced2}}$)
is obtained through the y = ax straight line-based representation,
and F(x) has symmetric offsets above and below the line, with DR$_{\mathrm{Reduced1}}$ = DR$_{\mathrm{Reduced2}}$.
In this case, vertical folding can be performed
about the intermediate pixel value, v, giving the halved dynamic range
DR$_{\mathrm{Reduced\_VF}}$, as shown in Eq. (12).
The reference pixel value of the folding, v, can be expressed as 2$^{\mathrm{n}}$/2.
When vertical folding is possible, F(x) can be processed by adding to or subtracting
from the reference straight line value F$_{\mathrm{Offset}}$(x), which can be shown
in Eq. (13).
Therefore, F(x) is processed with x as the input when x is smaller than v, or with 2v - x
as the input when x is larger than v. No additional hardware is required when x is
smaller than v; the implementation for x greater than v is described in Section III.4.
When vertical folding can be applied, both OBW and the input
bit width are reduced, as shown in Eq. (6). That is, the OBW and input bit width are reduced to OBW$_{\mathrm{Reduced\_VF}}$ and n ${-}$ 1, respectively. Subsequently, the LUTC can be obtained, as shown in
Eq. (14).
Fig. 4. Vertical Folding.
For example, if n = 8, DR$_{\mathrm{Reduced}}$ = 1.6, and vertical folding can be
applied, DR$_{\mathrm{Reduced\_VF}}$ = 0.8 from Eq. (12), and LUTC$_{\mathrm{Reduced\_VF}}$ is 2$^{7}$ ${\times}$ 3 = 384 from Eq. (14). Therefore, LUTC$_{\mathrm{Reduced\_VF}}$ is reduced by 57.1% by Eq. (8) compared to LUTC$_{\mathrm{Reduced}}$.
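Vertical folding can be sketched as follows. The S-shaped `tone_curve` is a hypothetical curve built to be point-symmetric about (v, v) with v = 127.5, and the reference line is taken as y = x; neither is from the original work. Only the lower half of the offset is stored, and the upper half is recovered by reading address 2v - x and subtracting instead of adding.

```python
N_BITS = 8
FULL = 1 << N_BITS            # 256 input codes
HALF = FULL >> 1              # 128 entries kept after folding
TOP = FULL - 1                # 255 = 2v for v = 127.5


def tone_curve(x: int) -> int:
    """Hypothetical S-shaped curve, point-symmetric about (v, v)."""
    bend = x * (TOP - x) * (TOP - 2 * x)      # antisymmetric about v
    return x - round(bend / 65536)


# Only the lower half of F_Offset(x) = F(x) - x is stored.
offset_lut = [tone_curve(x) - x for x in range(HALF)]


def folded_lookup(x: int) -> int:
    if x < HALF:                              # x < v: line + offset
        return x + offset_lut[x]
    return x - offset_lut[TOP - x]            # x > v: line - offset at 2v - x


full_curve = [tone_curve(x) for x in range(FULL)]
rebuilt = [folded_lookup(x) for x in range(FULL)]
print(rebuilt == full_curve, len(offset_lut))
```

The stored table has 2$^{n-1}$ entries of a single sign, which is where the one-bit saving on both the address and the output comes from in this sketch.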
3.3 Mirroring
As shown in Fig. 1(e), a LUT tone curve that decreases without crossing the origin can be mirrored about
a mirroring point, m, to obtain F$_{\mathrm{Mirrored}}$(x), which starts from the origin
and increases, through horizontal symmetry as shown in Fig. 5.
By replacing the original input x$_{\mathrm{Origin}}$ with the mirrored input x$_{\mathrm{Mirrored}}$,
F(x) is mirrored so that it follows a straight line passing through the origin, and the
straight line-based representation can be used. Here, x$_{\mathrm{Mirrored}}$
is obtained by subtracting x$_{\mathrm{Origin}}$ from 2m. This is processed
in the same way as the input address for x > v in Eq. (13) in the case of vertical folding.
3.4 Input Address Calculation in Cases of Vertical Folding and Mirroring
The input address of the LUT must be converted in both cases: for the vertical folding of
Fig. 4, x is converted to 2v - x, and for the mirroring shown in Fig. 5, x is converted to 2m - x.
Fig. 6 shows a method of converting the address of the LUT based on the vertical folding
(mirroring) reference point v(m). If x is 203 when n = 8 and v = 127.5, the input
for the LUT is 2v $-$ x, which is 52 (Eq. (13)). In binary, x = 203 is 11001011 and the LUT input 52 is 00110100, which is identical
to the 1’s complement of 203. The address can therefore be calculated simply by inverting the bits of x.
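The address conversion can be checked with a short sketch. It assumes n = 8 and v = 127.5, so that 2v - x equals 255 - x, and it uses the most significant bit of x as the test for x > v, which is equivalent to a comparator for this particular v; the name `lut_address` is illustrative.

```python
N_BITS = 8
MASK = (1 << N_BITS) - 1      # 0xFF, which equals 2v when v = 127.5


def lut_address(x: int) -> int:
    """Pass x through below the midpoint, otherwise return 2v - x,
    realized as a bitwise inversion (1's complement) of x."""
    msb_set = (x >> (N_BITS - 1)) & 1         # x > v  <=>  MSB is 1
    return (~x & MASK) if msb_set else x


addr = lut_address(203)
print(addr, format(203, "08b"), format(addr, "08b"))
```

For any other reference point the inversion no longer equals 2v - x, so a subtractor would be needed; the inverter shortcut relies on 2v being the all-ones code.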
Fig. 6. Input address calculation of vertical folding and mirroring.
IV. STRAIGHT LINE IMPLEMENTATION METHOD
1. The Straight Line With the Shape of y = ${\sum}$ 2$^{\mathrm{k}}$• x
This section describes how the straight line used as the reference for representing the
LUT tone curve F(x) is processed. If the slope a of y = ax in Eq. (9) can be expressed as ${\sum}$
2$^{\mathrm{k}}$, the line can be implemented without additional complicated hardware:
x is shifted by k bits for each k, and the shifted values are added,
as shown in Eq. (15). The case where a ${\neq}$ ${\sum}$ 2$^{\mathrm{k}}$ is described in Section IV.2.
For the straight line-based representation, one adder is required to add the offset
LUT output F$_{\mathrm{Offset}}$ to the straight line, as shown in Eq. (11). Adder$_{\mathrm{2^k\_Line}}$, the number of adders needed to implement ${\sum}$ 2$^{\mathrm{k}}$, counts
both the adder for the offset LUT and the adders for the straight line in Eq. (15), as shown in Eq. (16). Shifter$_{\mathrm{2^k\_Line}}$, the shifter count,
equals the number of k values with k ${\neq}$ 0, as in Eq. (17).
If the dynamic range of the LUT reduced by the straight line approach of
Eq. (15) is DR$_{\mathrm{Reduced\_2^k}}$ and the reduced output bit width is OBW$_{\mathrm{Reduced\_2^k}}$, LUTC$_{\mathrm{Reduced\_2^k}}$ can be expressed as shown in Eq. (18).
For example, the straight line y = 6x can be represented as y = 2$^{2}$x + 2$^{1}$x,
with Adder$_{\mathrm{2^k\_Line}}$ = 2 and Shifter$_{\mathrm{2^k\_Line}}$ = 2. As a result, the implementation is feasible with a small amount
of hardware.
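The shift-and-add line can be sketched as below. The sketch is restricted to positive integer slopes, which is an assumption made for simplicity, and the adder count is written in the form that matches the y = 6x example (additions that sum the shifted terms, plus one for the offset LUT); the function names are illustrative.

```python
def power_of_two_terms(slope: int) -> list:
    """Exponents k with slope == sum(2**k), i.e. the set bits of slope."""
    return [k for k in range(slope.bit_length()) if (slope >> k) & 1]


def line_by_shift_add(x: int, exponents: list) -> int:
    """y = sum(x << k): one shift per nonzero k, then additions."""
    return sum(x << k for k in exponents)


ks = power_of_two_terms(6)                    # 6 = 2**2 + 2**1
shifters = sum(1 for k in ks if k != 0)       # k = 0 needs no shifter
adders = (len(ks) - 1) + 1                    # sum the terms, then add F_Offset(x)

ok = all(line_by_shift_add(x, ks) == 6 * x for x in range(256))
print(ks, shifters, adders, ok)
```

In hardware the shifts are fixed wiring, so the cost of the line is dominated by the additions.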
2. Straight Line Implementation using LUT
This section presents a method for efficiently implementing the straight line closest to the
LUT curve, which reduces the dynamic range further when the reduction obtained with
the reference line y = ${\sum}$ 2$^{\mathrm{k}}$• x of
Section IV.1 is insufficient.
Fig. 7 shows two straight line-based LUT representations: one with the straight line
closest to F(x), which is discussed in this section, and one with the straight line
y = ${\sum}$ 2$^{\mathrm{k}}$• x of Eq. (15). When the difference between the straight line and F(x) is large, the dynamic
range increases, which increases LUTC. To implement the reference straight line
with low hardware complexity, a LUT-based straight-line method is presented that builds
the line from several smaller LUTs. The n-bit input address,
which makes LUTC grow exponentially as shown in Eq. (4), is partitioned into p-bit segments whose LUT outputs are added, as shown in Eq. (19) below. The straight line processed with p bits instead of n bits is defined as G(x),
where p is the partitioned input bit width.
Fig. 7. Two types of straight-line approximation.
The input address is partitioned into a total of n/p blocks of p bits each,
and the LUT values for the blocks are added. Because each lookup uses a much smaller input bit width,
the LUTC is reduced significantly. Since the method in Eq. (19) is applied to a linear function, the reference straight line is processed
without any error. An adder and a shifter are needed because the LUT values must be
shifted left by p bits before they are added to form the straight line from
the partitioned LUT. A total of n/p ${-}$ 1 adders are used to calculate the expression
in Eq. (19), and one more adder is required to add the offset LUT to the straight line. The total
adder count, Adder$_{\mathrm{Line\_LUT}}$, is given by Eq. (20).
The total shifter count, Shifter$_{\mathrm{Line\_LUT}}$, is given by Eq. (21).
The required LUT is divided into the offset LUT and the LUT that implements the straight reference
line. OBW$_{\mathrm{Line\_LUT}}$ is the output bit width of the LUT that uses the partitioned input bits instead of n bits,
that is, the OBW of the original F(x). LUTC$_{\mathrm{Line\_LUT}}$ is the LUTC for implementing the straight line based on Eq. (19), which can be represented as
The dynamic range reduced by the straight line is DR$_{\mathrm{Reduced\_Line}}$ and the output bit width is OBW$_{\mathrm{Reduced\_Line}}$. LUTC$_{\mathrm{Reduced\_Line}}$ is the LUTC when the dynamic range is reduced by the straight line-based representation
using the LUT of Eq. (22). LUTC$_{\mathrm{Reduced\_Line\_LUT}}$, which is the sum of the LUT needed for implementing the straight line and the offset
LUT, can be represented as
For example, suppose OBW$_{\mathrm{Line\_LUT}}$ = 8, OBW$_{\mathrm{Reduced\_Line}}$ = 4, and the n = 8 input bits are divided by p = 4 bits into two blocks. Then y can be
calculated as y = a(x[7:4]•2$^{4}$ + x[3:0]) = a•x[7:4]•2$^{4}$ + a•x[3:0]. Adder$_{\mathrm{Line\_LUT}}$ is 2, and Shifter$_{\mathrm{Line\_LUT}}$ is 1. LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ is 2$^{8}$•4 + 2$^{4}$•8 = 1152, which is reduced by 43.7% compared to LUTC$_{\mathrm{Origin}}$
= 2048.
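The partitioned line LUT can be sketched as follows. The sketch assumes an integer slope, an n that is a multiple of p, and a single 2$^{p}$-entry table G shared by all blocks, which is the reading consistent with the 2$^{4}$•8 term of the example; the slope value 3 and the function names are arbitrary illustrations.

```python
def build_line_lut(slope: int, p: int) -> list:
    """G(i) = slope * i for every p-bit block value i (2**p entries)."""
    return [slope * i for i in range(1 << p)]


def line_by_partitioned_lut(x: int, n: int, p: int, g: list) -> int:
    """Split the n-bit input into n/p blocks, look each block up in the
    same small table, and recombine with left shifts by multiples of p."""
    mask = (1 << p) - 1
    y = 0
    for shift in range(0, n, p):
        block = (x >> shift) & mask
        y += g[block] << shift
    return y


n, p, slope = 8, 4, 3          # slope = 3 is an arbitrary illustrative value
g = build_line_lut(slope, p)
exact = all(line_by_partitioned_lut(x, n, p, g) == slope * x
            for x in range(1 << n))

# ROM bits of the example: 4-bit offset LUT plus one 8-bit, 2**p-entry line LUT.
lutc_reduced_line_lut = (1 << n) * 4 + (1 << p) * 8
print(exact, lutc_reduced_line_lut)
```

The recombination is exact because the line is linear in x, so a(x$_{hi}$•2$^{p}$ + x$_{lo}$) splits into two independent lookups.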
V. ARCHITECTURE
The architectures for implementing the proposed LUT ROM compression are as follows.
Fig. 8(a) shows a line calculation unit for the straight line of Eqs. (15) and (19),
an adder, and an offset LUT for F$_{\mathrm{Offset}}$(x), which has a reduced dynamic
range, as shown in Eq. (11). The adder can also be used to add the shifting value b. In Fig. 8(b), an address calculation unit is added to Fig. 8(a) for vertical folding and mirroring. The address calculation unit generates the input
address of the offset LUT simply by using an inverter with respect to v(m), as shown in Eq. (13), for vertical folding (mirroring). The offset LUT stores the F$_{\mathrm{Offset}}$(x)
values, which require less dynamic range and a smaller input address, as shown
in Eq. (13). The line calculation unit and adder are the same as in Fig. 8(a). Synthesis with Synopsys Design Compiler using UMC 40 nm standard cell libraries
shows that the chip area (power consumption) of Fig. 8(a) and (b) is 10867 um$^{2}$ (225 uW) and 10899 um$^{2}$ (262 uW), respectively, at
an operating frequency of 200 MHz. The proposed LUT architecture can be implemented
with a small chip area and power consumption, with high processing reliability of the ROM.
The detailed structures of the address calculation unit and line calculation unit
are shown in Figs. 9-11.
Fig. 9 illustrates the detailed structure of the address calculation unit that processes the
LUT address in Fig. 8(b) for the vertical folding (mirroring) shown in Figs. 4 and 5. With the vertical folding (mirroring) point v(m) as the reference,
a comparator selects the inverted value when x > v(m) and the original value when x ${\leq}$ v(m).
Fig. 10 shows the line calculation unit for implementing the straight line represented in
Eq. (15). Since ${\sum}$ 2$^{\mathrm{k}}$• x must be calculated for the input data, x is shifted
by k bits with a shifter and fed to the adder input. The offset LUT is the one discussed
for Fig. 8. The number of shifters is the same as the count of non-zero k values in Eq. (17), and the count of non-zero k values + 1 adders are required as in Eq. (16). A total of 3 adders are used to add the straight line, F$_{\mathrm{Offset}}$(x), and
the shift value. If an image of 1024 x 768 resolution is processed at 60 Hz, the frame
period is 16.6 ms and the time available for 1 pixel is 16.6 ms/(1024•768), which is
about 21 ns. Therefore, with current process technologies, three additions can be
performed recursively on a single 8-bit adder, which makes a single-adder implementation
sufficient.
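The per-pixel time budget can be checked with a short calculation. The comparison with a clock period assumes that the recursive adder runs at the 200 MHz frequency reported for the synthesis earlier in this section; the text does not state the adder clock, so this is only a rough plausibility check.

```latex
T_{\mathrm{pixel}} = \frac{1/60\ \mathrm{s}}{1024 \times 768}
                   = \frac{16.67\ \mathrm{ms}}{786{,}432}
                   \approx 21.2\ \mathrm{ns},
\qquad
\frac{T_{\mathrm{pixel}}}{T_{\mathrm{clk}}}
                   = \frac{21.2\ \mathrm{ns}}{5\ \mathrm{ns}}
                   \approx 4.2
```

Under that assumption roughly four clock cycles are available per pixel, which leaves room for three sequential additions on one adder.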
Fig. 11 shows the line calculation unit that processes the straight line using the LUT as
in Eq. (19). To reduce LUTC, only the LUT data for G(x), which corresponds to the p-bit
(p ${\leq}$ n) input address of Eq. (19), are needed. The shifter shifts the LUT output by p bits and feeds it into the adder. Here too,
one recursive adder is sufficient. The proposed
LUT ROM compression method is thus easily implemented with an adder, MUX, inverter, and shifter,
all of which are simple hardware.
Fig. 8. LUT architecture: (a) LUT with y = ax straight line approximation and shifting; (b) Vertical folding and mirroring.
Fig. 9. Address calculation unit.
Fig. 10. y = ${\sum}$ 2$^{\mathrm{k}}$• x line calculation unit for a straight line.
Fig. 11. Line calculation unit using LUT.
VI. VERIFICATION
1. MATLAB Simulation
Fig. 12 shows the results of the MATLAB simulation based on the actual image using the existing
LUT and proposed LUT. The gamma is adjusted by three gray level ranges according to
the histogram, and it has been verified that there is no difference between the image
results.
The MSE between the images by the conventional LUT and the proposed gamma LUT after
compression is 0, which shows that the proposed LUT compression can reduce the required
LUT count without any loss.
Fig. 12. MATLAB simulation of gamma adjustment according to a histogram: (a) Original image; (b) conventional LUT result; (c) Proposed LUT result.
2. VLSI Implementation
Fig. 13 shows the LUT ROM compression architecture for the mobile display system platform.
It is the block diagram used for functional verification of the proposed LUT in a
conventional display system on an FPGA development kit composed of a graphic controller,
frame memory, and graphic coprocessor. The graphic controller divides the image data
from the line buffer into 8-bit R, G, and B pixel data using a pixel driver. The proposed
LUT ROM is emulated in the FPGA at the output of the pixel driver in the graphic controller
for verification. As in the MATLAB simulation in the previous section,
the LUT applies the gamma-adjusted curve to R, G, and B according
to the histogram of each pixel data. Fig. 14 shows the synthesis results of the adder, address calculation, line calculation, and
lookup table in Fig. 8. The presented LUT architecture can be implemented with simple extra hardware in addition
to the compressed LUT. Fig. 15 shows the results of the Verilog HDL simulation, in which the output F(x) is produced
by adding F$_{\mathrm{Offset}}$(x), whose dynamic range is reduced as shown
in Eq. (11), to the value of the straight line-based representation using the LUT of
Eq. (19).
Vga_clk is the clock sent to the LCD panel, vsync (hsync) is the vertical
(horizontal) sync signal for the panel display, de is the display-enable signal on
the panel, and r, g, and b are the red, green, and blue pixel data, respectively.
A 1-clock delay occurs after the offset LUT, since the line calculation unit output is added
through the recursive adder. However, if the image data are taken from the line buffer
1 clock earlier to match the timing of the VGA driver, no additional delay
is incurred. Therefore, the proposed LUT can be easily realized in existing systems.
Fig. 13. Display system architecture with proposed LUT.
Fig. 14. Synthesis results: (a) Adder; (b) Address calculation; (c) Line calculation; (d) Look up table.
Fig. 15. Simulation Waveform.
VII. PERFORMANCE COMPARISONS AND ADVANTAGES
The proposed LUT compression methods are evaluated and compared with the conventional
LUT implementation methods.
Fig. 16 illustrates LUTC vs. DR$_{\mathrm{Origin}}$/DR$_{\mathrm{Reduced}}$ when n = 8. As
the dynamic range is reduced, the LUTC reduces drastically. In addition, Fig. 16 shows the LUTC for p = 2, 4, and 8 when the LUT input bit width is partitioned using
Eq. (19). If DR$_{\mathrm{Reduced}}$ is reduced, LUTC$_{\mathrm{Reduced}}$, LUTC$_{\mathrm{Reduced\_VF}}$, and LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ all decrease. LUTC$_{\mathrm{Reduced\_VF}}$ is the smallest because its dynamic range is halved compared to that of LUTC$_{\mathrm{Reduced}}$,
whereas LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ is larger than LUTC$_{\mathrm{Reduced}}$ because the LUT for the straight line is added.
Because of the straight-line-based representation implemented in the form of y = ${\sum}$
2$^{\mathrm{k}}$• x in Eq. (15), LUTC$_{\mathrm{Reduced}}$ with a reduced dynamic range shows a 65% higher reduction
rate compared to the case of DR$_{\mathrm{Origin}}$/DR$_{\mathrm{Reduced}}$ = 128.
LUTC$_{\mathrm{Reduced\_VF}}$, which is reduced by vertical folding using Eq. (14), is further reduced by 53% compared to LUTC$_{\mathrm{Reduced}}$, resulting in a
total reduction rate of up to 87.5%.
Fig. 16. LUTC according to the change in DR$_{\mathrm{Reduced}}$ (n = 8).
Fig. 17. LUTC according to change of n (DR$_{\mathrm{Origin}}$/DR$_{\mathrm{Reduced}}$ = 128).
Fig. 18. LUTC$_{\mathrm{Reduced\_2^k}}$ and LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ (n = 16).
Fig. 17 illustrates the LUTC vs. pixel width, n, which is the input address width of the LUT. As
n increases, which is a current trend, the LUTC increases dramatically, as described
in Eq. (6). It can be observed that the LUTC reduces more as the dynamic range is reduced with
various reference straight lines. For the same reduced dynamic range, the LUTC
reductions by Eq. (18) and Eq. (23) show a similar reduction rate even when the input bit width is increased, and the
method using the vertical folding of Eq. (14) reduces LUTC further.
For the two types of straight lines in Fig. 7, with n = 16 and various values of DR$_{\mathrm{Reduced\_2^k}}$/DR$_{\mathrm{Reduced\_Line}}$ and p, Fig. 18 compares LUTC$_{\mathrm{Reduced\_2^k}}$, obtained with y = ${\sum}$ 2$^{\mathrm{k}}$${\cdot}$x
in Eq. (15), and LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ = LUTC$_{\mathrm{Reduced\_Line}}$ + LUTC$_{\mathrm{Line\_LUT}}$ in Eq. (23), obtained with the LUT-based straight line of Eq. (19).
On this basis, the efficiencies of the two methods of implementing the straight
line shown in Fig. 7 are compared. Three cases are evaluated. Case (I): LUTC$_{\mathrm{Reduced\_2^k}}$ = LUTC$_{\mathrm{Reduced\_Line\_LUT}}$; Case (II): LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ < LUTC$_{\mathrm{Reduced\_2^k}}$; and Case (III): LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ > LUTC$_{\mathrm{Reduced\_2^k}}$. When DR$_{\mathrm{Reduced\_2^k}}$ = DR$_{\mathrm{Reduced\_Line}}$, LUTC$_{\mathrm{Line\_LUT}}$ is added for the straight line to LUTC$_{\mathrm{Reduced\_Line}}$ according to Eq. (23), so the shifter-based straight line with LUTC$_{\mathrm{Reduced\_2^k}}$ is more effective.
As DR$_{\mathrm{Origin}}$ is increased and the value of p is reduced, more partitions
are required. In Case (I), DR$_{\mathrm{Reduced\_2^k}}$/DR$_{\mathrm{Reduced\_Line}}$ can be represented as:
When DR$_{\mathrm{Reduced\_2^k}}$/DR$_{\mathrm{Reduced\_Line}}$ is larger than the right side of Eq. (24), Case (II) applies: LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ is smaller, and implementing the straight line using the LUT becomes
more effective. In Case (I), p can be represented as shown in Eq. (25).
When p is smaller than the right side of Eq. (25), LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ of Case (II) decreases, and the implementation of the straight line using the LUT is more
effective. Therefore, the LUT implementation can be optimized by controlling p according to DR$_{\mathrm{Reduced\_2^k}}$/DR$_{\mathrm{Reduced\_Line}}$.
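The break-even condition of Case (I) can be sketched as a short derivation. The forms assumed here, a LUTC of 2$^{n}$ times the output bit width for an offset LUT and an additional 2$^{p}$ times OBW$_{\mathrm{Line\_LUT}}$ for the line LUT, are inferred from the worked example in Section IV.2 rather than quoted from the original equations, and the rounding of bit widths to integers is ignored.

```latex
\begin{aligned}
2^{n}\,\mathrm{OBW}_{\mathrm{Reduced\_2^k}}
  &= 2^{n}\,\mathrm{OBW}_{\mathrm{Reduced\_Line}}
   + 2^{p}\,\mathrm{OBW}_{\mathrm{Line\_LUT}} \\
\log_{2}\frac{\mathrm{DR}_{\mathrm{Reduced\_2^k}}}{\mathrm{DR}_{\mathrm{Reduced\_Line}}}
  &= 2^{\,p-n}\,\mathrm{OBW}_{\mathrm{Line\_LUT}} \\
p &= n + \log_{2}\!\left(\frac{1}{\mathrm{OBW}_{\mathrm{Line\_LUT}}}
      \log_{2}\frac{\mathrm{DR}_{\mathrm{Reduced\_2^k}}}{\mathrm{DR}_{\mathrm{Reduced\_Line}}}\right)
\end{aligned}
```

The second line uses the fact that, for the same e, the difference of the two output bit widths equals the log$_{2}$ of the dynamic range ratio. A ratio larger than the one in the second line, or a p smaller than the one in the third, places the design in Case (II).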
Table 1 shows the LUT count comparisons of the LUT compression methods presented in Figs. 17 and 18. LUTC$_{\mathrm{Reduced\_2^k}}$ and LUTC$_{\mathrm{Reduced\_Line\_LUT}}$ have compression
rates of approximately 43%. For LUTC$_{\mathrm{Reduced\_VF}}$, a compression rate of 87.5% can be achieved.
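Three of the rows of Table 1 can be reproduced with a few lines under stated assumptions: the original output bit width equals n, a dynamic range ratio of 128 removes log$_{2}$(128) = 7 output bits, and the line LUT is a single 2$^{p}$-entry table of width n. These assumptions are inferred from the table values and the worked examples, not quoted from the original equations, and the function names are illustrative. The vertical folding row does not follow from this simple model and is left out.

```python
def lutc_origin(n: int) -> int:
    return (1 << n) * n                       # assumes OBW_Origin = n


def lutc_reduced(n: int, saved_bits: int = 7) -> int:
    # DR_Origin / DR_Reduced = 128 removes log2(128) = 7 output bits.
    return (1 << n) * (n - saved_bits)


def lutc_reduced_line_lut(n: int, p: int, saved_bits: int = 7) -> int:
    # Offset LUT plus one 2**p-entry line LUT of width OBW_Line_LUT = n.
    return lutc_reduced(n, saved_bits) + (1 << p) * n


rows = {
    n: (lutc_origin(n), lutc_reduced(n),
        lutc_reduced_line_lut(n, n // 2), lutc_reduced_line_lut(n, n // 4))
    for n in (8, 12, 16)
}
for n, values in rows.items():
    print(n, *values)
```

For n = 16 the ratio of the second to the first value is 0.5625, which corresponds to the roughly 43% compression quoted above.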
Fig. 19 compares the adder count and shifter count of Eqs. (16) and (17), respectively,
for the implementation of the straight line y = ${\sum}$ 2$^{\mathrm{k}}$• x with various
slopes. Both counts change with the slope but always stay below
a certain value. For the implementation of the straight line using the LUT,
the counts do not change with the slope, as described by Eqs. (20) and (21). As p decreases,
the adder count increases. However, since the pixel processing speed is low,
a single recursive adder is enough, which minimizes the hardware complexity.
Table 2 compares the chip areas and power consumptions of the LUT architecture
shown in Fig. 8(b), synthesized using Synopsys Design Compiler with UMC 40 nm LP (Low Power) RVT (Regular Voltage
Threshold) standard cells, for the 8-bit input and 11-bit output currently used in display
panels. Against the uncompressed case, a very conservative case is evaluated in which
the input bits are partitioned by 2 and the LUT dynamic range is processed with 4 bits,
with vertical folding. As shown in Table 2, even with the additional line calculation unit, address calculation unit,
and adder, the chip area is reduced by more than 70% and the power consumption is more than halved.
Tables 3 and 4 show the chip areas of all the units in the architectures shown in Fig. 8(a) and (b) with an offset LUT dynamic range of 4 bits. For both Fig. 8(a) and (b), the additional units required in the presented architectures, such as the adder,
address calculation unit, and line calculation unit, need little chip area compared
to the offset LUT unit, which itself is much smaller than the uncompressed LUT as shown in
Table 1.
The LUT tone curve, even for a unique curve shape such as that in Fig. 1(f), is monotonic and close to a straight line. Therefore, most of the LUT ROM can
be compressed by reducing the dynamic range with the proposed straight reference line-based
representation. The conventional method of LUT compression, which adds an error to a
quantized value, achieves a compression rate of approximately 70% [10], but only for a regular error pattern. The error pattern is generally irregular, however,
and a large precision loss is then unavoidable if the compression rate is to be increased. Precision
loss becomes even more problematic as the pixel bit width increases, which is required
in recent and future applications. In addition, the LUT ROMs for quantization and error
must be divided into several ROMs, which increases the hardware complexity.
If the proposed methods are applied to a typical LUT tone curve such as that in Fig. 1(c), the ratio of DR$_{\mathrm{Reduced}}$ to DR$_{\mathrm{Origin}}$ can be reduced,
resulting in a compression rate of 65%. Unlike [10], there is no precision loss
because no regular error pattern is required. Additionally, if vertical
folding can be applied to such a LUT tone curve, a compression rate as high as 87.5%
can be achieved without any loss. Furthermore, the proposed method can be easily implemented
with simple hardware.
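The idea of a lossless, reference line-based representation can be sketched as follows. This is a simplified illustration and not the architecture of Fig. 8: it omits input partitioning and vertical folding, the tone curve is a hypothetical mild gamma curve, and the slope of 8 is chosen by hand. It only shows that storing offsets from a straight line narrows the stored word while the original LUT remains exactly recoverable.

```python
def reference_line(n_entries: int, slope_num: int, frac_bits: int) -> list[int]:
    """Integer straight line floor(slope * x), slope = slope_num / 2**frac_bits."""
    return [(slope_num * x) >> frac_bits for x in range(n_entries)]


def split_lut(lut: list[int], slope_num: int, frac_bits: int):
    """Return (base, stored) with stored[x] = lut[x] - line[x] - base >= 0."""
    line = reference_line(len(lut), slope_num, frac_bits)
    offsets = [y - l for y, l in zip(lut, line)]
    base = min(offsets)                      # constant shift so offsets are unsigned
    return base, [o - base for o in offsets]


def rebuild_lut(base: int, stored: list[int], slope_num: int, frac_bits: int) -> list[int]:
    """Add the reference line and the base back to the stored offsets."""
    line = reference_line(len(stored), slope_num, frac_bits)
    return [l + base + s for l, s in zip(line, stored)]


def bit_width(values: list[int]) -> int:
    """Bits needed to hold the largest non-negative value."""
    return max(max(values).bit_length(), 1)


# Hypothetical 8-bit-in, 11-bit-out tone curve, for illustration only.
tone_curve = [round(2047 * (x / 255) ** 1.1) for x in range(256)]
base, stored = split_lut(tone_curve, slope_num=8, frac_bits=0)

if __name__ == "__main__":
    print("original width:", bit_width(tone_curve), "bits")
    print("offset width:  ", bit_width(stored), "bits")
    print("lossless:", rebuild_lut(base, stored, 8, 0) == tone_curve)
```

The closer the curve lies to the reference line, the fewer bits the offsets need, which is the role of the ratio of DR$_{\mathrm{Reduced}}$ to DR$_{\mathrm{Origin}}$ in the text.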
Fig. 19. Adder and shifter count of the straight line implementation according to straight line slope.
Table 1. LUT count comparisons for the various LUT compressions presented in this work
| n (Input bit width) | 8 | 12 | 16 |
|---|---|---|---|
| LUTCorigin | 2,048 | 49,152 | 1,048,576 |
| LUTCReduced_2^k | 256 | 20,480 | 589,824 |
| LUTCReduced_VF | 64 | 4,096 | 131,072 |
| LUTCReduced_Line_LUT (p = n/2) | 384 | 21,248 | 593,920 |
| LUTCReduced_Line_LUT (p = n/4) | 288 | 20,576 | 590,080 |
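The LUT counts in Table 1 can be converted into compression rates with a few lines of arithmetic. The sketch below assumes the rate is defined as 1 minus the ratio of the reduced count to the original count; the dictionary keys are hypothetical labels for the table rows. It also checks the observation that the uncompressed entries in the table equal n times 2^n.

```python
# LUT counts copied from Table 1, indexed by input bit width n.
LUT_COUNT = {
    "origin":        {8: 2_048, 12: 49_152, 16: 1_048_576},
    "reduced_2k":    {8: 256,   12: 20_480, 16: 589_824},
    "reduced_vf":    {8: 64,    12: 4_096,  16: 131_072},
    "line_lut_p_n2": {8: 384,   12: 21_248, 16: 593_920},
    "line_lut_p_n4": {8: 288,   12: 20_576, 16: 590_080},
}


def compression_rate(method: str, n: int) -> float:
    """Percentage of the original LUT count that is saved (assumed definition)."""
    return 100.0 * (1.0 - LUT_COUNT[method][n] / LUT_COUNT["origin"][n])


if __name__ == "__main__":
    # The uncompressed rows of Table 1 match n * 2**n.
    assert all(LUT_COUNT["origin"][n] == n * 2 ** n for n in (8, 12, 16))
    for method in LUT_COUNT:
        if method != "origin":
            rates = ", ".join(f"n={n}: {compression_rate(method, n):.1f}%"
                              for n in (8, 12, 16))
            print(f"{method:15s} {rates}")
```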
Table 2. Comparison of chip area and power consumption
| | Uncompressed | Fig. 8(b) | Fig. 8(b) with vertical folding |
|---|---|---|---|
| Partitioned input count | 0 | 2 | 2 |
| Vertical folding | No | No | Yes |
| Offset LUT dynamic range | - | 4 bits | 4 bits |
| Area (µm²) | 17,893 | 9,688 | 5,310 |
| Power (µW) | 269 | 194 | 125 |
| Area comparison (%) | 100% | 54.2% | 29.7% |
| Power comparison (%) | 100% | 72.1% | 46.5% |
|
Table 3. Chip areas for Fig. 8(a)
| Fig. 8(a) hierarchical instance | Area (µm²) |
|---|---|
| Lookup table (top-level module) | 10,867.3237 |
| Adder unit | 144.3582 |
| Line calculation unit | 602.3304 |
| Offset LUT unit | 10,120.6351 |
Table 4. Chip areas for Fig. 8(b)
| Fig. 8(b) hierarchical instance | Area (µm²) |
|---|---|
| Lookup table (top-level module) | 10,885.0393 |
| Adder unit | 144.3582 |
| Address calculation unit | 13.6458 |
| Line calculation unit | 606.4002 |
| Offset LUT unit | 10,120.6351 |
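The share of each unit in the total area can be read off Tables 3 and 4 with a short calculation. The sketch below uses the reported figures as-is; the dictionary names are hypothetical labels.

```python
# Unit areas (um^2) copied from Tables 3 and 4.
FIG_8A = {"adder": 144.3582, "line_calc": 602.3304, "offset_lut": 10_120.6351}
FIG_8B = {"adder": 144.3582, "address_calc": 13.6458,
          "line_calc": 606.4002, "offset_lut": 10_120.6351}
TOP_LEVEL = {"fig8a": 10_867.3237, "fig8b": 10_885.0393}


def unit_shares(units: dict) -> dict:
    """Percentage of the summed area taken by each unit."""
    total = sum(units.values())
    return {name: 100.0 * area / total for name, area in units.items()}


if __name__ == "__main__":
    for label, units in (("fig8a", FIG_8A), ("fig8b", FIG_8B)):
        shares = unit_shares(units)
        print(label, {name: round(pct, 2) for name, pct in shares.items()})
```

In both tables the unit areas sum to the reported top-level area, and the offset LUT unit accounts for more than 90% of it.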
|
VIII. CONCLUSION
Methodologies are proposed to reduce the LUT ROM count by reducing the dynamic range using
a straight line as a reference, for a wide range of image signal processing applications.
Using shifters and a LUT, the straight line can be implemented in the form
$y = \sum 2^{k} \cdot x$. Various methods to compress the LUT are proposed:
shifting with a straight line that does not pass through the origin as the reference,
vertical folding, and mirroring when the LUT curve is horizontally or vertically symmetrical.
The proposed LUT compression method is verified on a mobile display platform using an FPGA,
and the image quality is confirmed not to deteriorate. When vertical
folding is applied to a vertically symmetrical LUT curve and the ratio of
the reduced dynamic range to the original dynamic range is small, the LUT count can
be reduced by 87.5% without any precision loss. The proposed method can be applied
to all LUTs and is suitable for implementation on a system on a panel, where
low hardware complexity is required.
References
[1] Zeng, Hui, et al. "Learning image-adaptive 3d lookup tables for high-performance photo
enhancement in real-time." IEEE Transactions on Pattern Analysis and Machine Intelligence
44.4 (2020): 2058-2073.
[2] Kumari, Apurva, and Subhendu Kumar Sahoo. "Fast single image and video deweathering
using the look-up-table approach." AEU-International Journal of Electronics and Communications
69.12 (2015): 1773-1782.
[3] Alias, Binu, Anu Mehra, and P. Harsha. "Hardware implementation and testing of effective
DPCM image compression technique using multiple-LUT." 2014 International Conference
on Advances in Electronics Computers and Communications. IEEE, 2014.
[4] T.-L. Wu, et al., "Adaptive Color Image Enhancement Applied to Display Based on Hardware
Design," Display Workshops, IDW '06, 13th International, 6-8, pp. 499-502, Dec., 2006.
[5] M.-S. Huang, "Gamma Correction only Gain/Offset Control System and Method for Display
Controller," US Patent No. 7068283, Jun., 2006.
[6] J. W. Park et al., "A Low-Cost and High-Throughput FPGA Implementation of the Retinex
Algorithm for Real-Time Video Enhancement," in IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 28, no. 1, pp. 101-114, Jan. 2020, doi: 10.1109/TVLSI.2019.2936260.
[7] D. Han, "Real-Time Color Gamut Mapping Method for Digital TV Display Quality Enhancement,"
Consumer Electronics, IEEE Transactions on, Vol. 50, No. 2, pp. 691-698, May, 2004.
[8] V. Monga, R. Bala and X. Mo, "Design and Optimization of Color Lookup Tables on a
Simplex Topology," in IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 1981-1996,
April 2012, doi: 10.1109/TIP.2011.2177848.
[9] I. R. Khan, S. Rahardja, M. M. Khan, M. M. Movania and F. Abed, "A Tone-Mapping Technique
Based on Histogram Using a Sensitivity Model of the Human Visual System," in IEEE
Transactions on Industrial Electronics, vol. 65, no. 4, pp. 3469-3479, April 2018,
doi: 10.1109/TIE.2017.2760247.
[10] B.-D. Yang and L.-S. Kim, "An Error Pattern ROM Compression Method for Continuous
Data," Circuits and Systems, IEEE International Symposium on, Vol. 2, pp. II-845,
May, 2004.
[11] S. Kim and H.-J. Lee, "Optimized Interpolation and Cached Data Access in LUT-Based
RGB-to-RGBW Conversion," in IEEE Transactions on Circuits and Systems II: Express
Briefs, vol. 65, no. 7, pp. 943-947, July 2018, doi: 10.1109/TCSII.2017.2740358.
[12] M. Ghosh, et al., "A Novel DDS Architecture Using Nonlinear ROM Addressing with Improved
Compression Ratio and Quantization Noise," Circuits and Systems, IEEE International
Symposium on, Vol. 2, pp. II-705, May 2004.
[13] Pi, Dapu, et al. "Reducing the memory usage of computer-generated hologram calculation
using accurate high-compressed look-up-table method in color 3D holographic display."
Optics Express 27.20 (2019): 28410-28422.
[14] Pi, Dapu, et al. "Simple and effective calculation method for computer-generated hologram
based on non-uniform sampling using look-up-table." Optics Express 27.26 (2019): 37337-37348.
[15] Jiao, Shuming, Zhaoyong Zhuang, and Wenbin Zou. "Fast computer generated hologram
calculation with a mini look-up table incorporated with radial symmetric interpolation."
Optics Express 25.1 (2017): 112-123.
[16] Gener, Y. Serhan, Sezer Gören, and H. Fatih Ugurdag. "Lossless look-up table compression
for hardware implementation of transcendental functions." 2019 IFIP/IEEE 27th International
Conference on Very Large Scale Integration (VLSI-SoC). IEEE, 2019.
[17] Hsiao, Shen-Fu, Chia-Sheng Wen, and Po-Han Wu. "Compression of lookup table for piecewise
polynomial function evaluation." 2014 17th Euromicro Conference on Digital System
Design. IEEE, 2014.
[18] S.-Y. Kim, et al., "Image Contrast Enhancement Based on the Piecewise-Linear Approximation
of CDF," Consumer Electronics, IEEE Transactions on, Vol. 45, No. 3, pp. 823-834,
Aug., 1999.
[19] K. C. Soon and K. K. Sup, "An Improved Contrast Control Method for LCD Monitor," Korea
Multimedia Society, Journal of, Vol. 5, No. 6, pp. 609-615, Dec., 2002.
[20] S6D0114, 132 RGB X 176 DOT 1-Chip Driver IC with Internal GRAM for 262,144 Colors
TFT-LCD, SAMSUNG Electronics, 2002.
[21] X.-F. Feng, H. Pan, and S. Daly, "Dynamic Gamma: Applications to Hold Type Motion
Blur Reduction Using Synchronized Backlight Flashing," Display Workshops in conjunction
with Asia Display, IDW/AD '05, 12th International, 6-9, pp. 807-810, Dec., 2005.
[22] T. Kono, "Gamma Correction Method, Gamma Correction Apparatus, and Image Reading System,"
US Patent No. 7271939 B2, Sep., 2007.
[23] J. Jia and C.-K. Tang, "Image Registration with Global and Local Luminance Alignment,"
Computer Vision, ICCV '03, Ninth IEEE International Conference on, 13-6, pp. 156-163,
Oct., 2003.
[24] Rahman, Shanto, et al. "An adaptive gamma correction for image enhancement." EURASIP
Journal on Image and Video Processing 2016.1 (2016): 1-13.
[25] Li, Shih-An, and Chi-Yi Tsai. "Low-cost and high-speed hardware implementation of
contrast-preserving image dynamic range compression for full-HD video enhancement."
IET Image Processing 9.8 (2015): 605-614.
Authors
Sun Myung Kim received the B.S. and M.S. degrees from the Department of Electronic and
Electrical Engineering, Hongik University, Seoul, Korea, in 2007 and 2009, respectively.
In 2010, he joined TLi Inc., where he has been working in the area of image processing
and high-speed interface design. His current interests include high-speed interfaces,
low-power panel driving algorithms, and high dynamic range image processing.
Jaehee You received his B.S. degree in Electronics Engineering from Seoul National
University, Seoul, Korea, in 1985. He received his M.S. and Ph.D. degrees in Electrical
Engineering from Cornell University, Ithaca, NY, in 1987 and 1990, respectively. In
1990, he joined Texas Instruments, Dallas, TX, as a Member of the Technical Staff.
In 1991, he joined the faculty of the School of Electrical Engineering, Hongik University
in Seoul, Korea, where he is now supervising the Semiconductor Integrated System Laboratory.
He is currently Vice President of the Institute of Semiconductor Engineers and has
served as an Executive Director of the Drive technology and System research group
of the Korean Information Display Society. He was a recipient of the Korean Ministry
of Strategy and Finance, KEIT Chairman Award for Excellence, in 2011. He has worked
as a technical consultant for many companies, such as Samsung Semiconductor, SK Hynix,
Global Communication Technologies, P&K, Penta Micro, Nexia device, and Primenet. His
current research interests include integrated system design for display image signal
processing and perceptual image quality enhancements.