I. INTRODUCTION
Recently, virtual reality (VR) and mixed reality (MR) have become important applications
on smart mobile devices such as smartphones and head-mounted displays (HMDs). Conventionally,
these devices relied on a touchscreen or a hand-held controller to interact with virtual 3D
objects. However, because a touchscreen supports only 2D interactions in a 2D plane,
it is awkward to use in VR/MR environments, where 2D inputs must be mapped onto 3D
interactions. A hand-held controller, on the other hand, can support 3D interactions, but
the additional control device makes it inconvenient for VR/MR applications. Therefore, a 3D
hand gesture interface (HGI) that supports intuitive 3D interactions without any additional
controller has drawn active attention as a replacement for conventional UIs on smart mobile
devices.
Fig. 1 describes the 3D HGI on an HMD system. First, it generates 3D depth maps of the human
hands, and the host processor in the HMD system calculates the location and rotation
of the hands in a virtual 3D space from the extracted depth maps. This information
supports translation, rotation, and manipulation of virtual 3D objects. These
interactions require smart devices to acquire an accurate depth map of the
input scene, because the robustness of the HGI in MR strongly depends on the accuracy
of the depth information.
Fig. 1. 3D hand gesture interface system.
There are three general approaches to extracting accurate depth maps: a time-of-flight (ToF)
camera, a structured light system, and a stereo vision system
(1). The ToF camera calculates distance by measuring the travel time of light emitted from a
VCSEL between the camera and the objects. However, it suffers from large power consumption
(> 2.1 W) for infrared light emission
(2). For example, the state-of-the-art HMD system (HoloLens)
(3) integrates a 16.5 Wh battery, yet it requires over 4.1 W to perform
the 3D HGI, including the ToF sensor and the mobile processor (Intel Atom x5-Z8500).
This power consumption limits the lifetime of the HMD system to only 2~3 hours, which
is not sufficient to provide an always-on 3D HGI. Although today's ToF sensors targeting
mobile applications consume less power, around 200~300 mW
(4-6), depth sensing must dissipate even less power because it must run as an always-on interface
for the 3D HGI. Therefore, ToF is not feasible for low-power 3D HGI in mobile
systems considering their limited power budgets. The structured light system projects
patterned light and measures distance from the distortion of the projected pattern. This
system also consumes more than 2.25 W
(7), which is likewise infeasible for mobile devices, because of both its light projection
and its depth calculations. To overcome these
limits of active sensor-based approaches, a stereo vision system that estimates a depth
map by triangulation between two cameras, similar to the way human eyes perceive
depth, is used for mobile devices. It extracts the disparity between the left and right images
by sliding-window matching, and the distance of an object is inversely proportional
to the measured disparity of the matched pixels. A stereo vision system without active sensors
thus has an advantage in power consumption. Therefore, a low-power and low-latency depth-estimation
processor is required instead of active depth sensors
(2-5), because low-power and real-time operation is essential in mobile UI applications.
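As a rough illustration of this triangulation principle, the following sketch converts a measured disparity into a metric distance, showing that distance is inversely proportional to disparity. The focal length and baseline values are hypothetical placeholders, not parameters of the system described in this paper.

```python
# Minimal sketch of stereo triangulation: depth is inversely proportional
# to disparity. The focal length and baseline below are illustrative only.
def disparity_to_depth(disparity_px, focal_px=500.0, baseline_m=0.06):
    """Return the distance in meters for a matched pixel pair."""
    if disparity_px <= 0:
        return float("inf")          # zero disparity -> object at infinity
    return focal_px * baseline_m / disparity_px

if __name__ == "__main__":
    for d in (10, 30, 60):           # larger disparity -> closer object
        print(f"disparity {d:2d} px -> depth {disparity_to_depth(d):.2f} m")
```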
Several works have implemented stereo matching processors in ASICs
(8-13) and FPGAs
(14,15), but they still consume too much power to be used as a 3D HGI sensor
because they target wide-range depth estimation for high-end applications such
as unmanned vehicles. Although
(16) consumes less power than the other previous works
(8-13), it cannot provide an accurate 3D HGI because of the poor depth accuracy
of its block matching algorithm, as described in detail in Section II. Thus, stereo
matching with local aggregation is adequate for low-power 3D HGI. However, it causes
massive memory accesses and computation, so real-time operation is almost impossible
on CPU or GPU systems
(17). In terms of latency, depth estimation must finish in < 10 ms, leaving time for the hand pose estimation
(18), since the overall UI latency should be < 40 ms
(19). Meanwhile, it should consume < 50 mW, which is only 5% of the power consumed by a general
application processor, so that the UI can stay always on during the entire operation of
HMD or MR devices.
Fig. 2 shows the overall stereo matching flow, which consists of 4 stages: initial matching,
cost aggregation, winner-takes-all (WTA), and consistency check. First, the initial
matching stage calculates a similarity cost map between small patches (~5x5) of the
left and right images, where the sum of absolute differences, the sum of squared differences,
and the census transform (22) are widely used for the matching. In the initial matching stage, the size of the image template
is a significant factor for depth accuracy. For example, using larger templates generates
more reliable initial matching costs, as shown in Fig. 3. However, it degrades the matching cost of the hand, which is our only region of interest, because of the
large regions of background clutter included in the template. On the other hand, small templates
reject the background clutter, but the initial matching costs become vulnerable
to pixel-level noise such as illumination changes or blurring due to the reduced number of sample points
within the template. To summarize, initial matching alone gives a poor depth
map: the optimal template size is crucial to depth accuracy, yet it varies
with the objects' distance. Therefore, the state-of-the-art algorithms (21-25) essentially exploit cost aggregation, which aggregates the matching costs of neighboring
pixels, to refine the initial depth map. They usually combine small (1x1 to 5x5) template
matching, to reject the background clutter effect, with large (15x15 up to the entire image)
aggregation regions. After that, the WTA stage selects the best-matched depth index
from the aggregated cost map. Finally, the left-right consistency checking stage eliminates
mismatches and occlusions by comparing the left and right depth maps. Among the stages,
initial matching and cost aggregation cause a large amount of computation and
memory access because they are performed for every disparity level (Fig. 2). For example, stereo matching requires over 630.7 Gflops and 18.3 GB/s for 100 fps
with 60 disparity levels at QVGA (320x240) resolution. Moreover, 81% of the computation
and 92% of the memory accesses are concentrated in the cost aggregation stage, making it
the most power-consuming part.
Fig. 2. Dataflow of stereo matching process.
Fig. 3. Effect of template size in the initial matching operation.
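For readers who want to trace the four stages end-to-end, the sketch below is a deliberately simplified software model of the Fig. 2 flow: it uses SAD initial matching and plain box aggregation instead of the census matching and ASW aggregation adopted later in this paper, and it omits the left-right consistency check. It is illustrative only, not the hardware algorithm.

```python
import numpy as np

def box_sum(img, r):
    """Sum over a (2r+1)x(2r+1) window using an integral image (edge padded)."""
    padded = np.pad(img, r, mode="edge")
    s = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1), dtype=np.int64)
    s[1:, 1:] = padded.cumsum(0).cumsum(1)
    h, w = img.shape
    return (s[2*r+1:2*r+1+h, 2*r+1:2*r+1+w] - s[:h, 2*r+1:2*r+1+w]
            - s[2*r+1:2*r+1+h, :w] + s[:h, :w])

def stereo_depth(left, right, max_disp=60, patch=2, agg=7):
    """Toy version of the Fig. 2 flow on grayscale uint8 images
    (image width larger than max_disp is assumed):
    1) initial matching: SAD over small patches, per disparity level,
    2) cost aggregation: plain box aggregation over a larger window,
    3) winner-takes-all: argmin over the disparity axis.
    The left-right consistency check is omitted for brevity."""
    h, w = left.shape
    left, right = left.astype(np.int64), right.astype(np.int64)
    big = 255 * (2 * patch + 1) ** 2             # sentinel for invalid columns
    agg_cost = np.empty((max_disp, h, w), dtype=np.int64)
    for d in range(max_disp):
        diff = np.abs(left[:, d:] - right[:, :w - d])
        cost = np.full((h, w), big, dtype=np.int64)
        cost[:, d:] = box_sum(diff, patch)       # stage 1: initial matching
        agg_cost[d] = box_sum(cost, agg)         # stage 2: cost aggregation
    return np.argmin(agg_cost, axis=0)           # stage 3: WTA disparity map
```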
To meet the above-mentioned computational requirement, massively parallel
designs using more than 60-way processing units have been utilized
(8-15). They used DDR3 DRAM with caches or wide-bandwidth (1612b) SRAM to handle the huge
on-chip and off-chip memory bandwidth. However, both high-speed external memory and wide-bandwidth
SRAM cause large power consumption and area overhead.
In this paper, we propose a low-power and low-latency depth-estimation processor (DEP)
with reduced memory bandwidth through algorithm-hardware co-optimization, with
the following three key features: 1) a shifter-based adaptive support-weight cost aggregation
that replaces complex floating-point operations with integer operations for power and
memory bandwidth reduction; 2) a line-streaming 7-stage pipeline architecture that realizes
high utilization and reduces the additionally required memory; and 3) a shift register-based
pipeline buffer that reduces area. The proposed chip is designed for 320x240 image resolution,
which is sufficient for the 3D HGI because the adopted algorithm (18) requires 60x60 input hand images and the size of the hand region is usually 60x60 ~
128x128 at a 15 cm ~ 30 cm range in typical webcam environments. As a result, the
total normalized power dissipation and required memory are reduced by 74.7% and
54.6%, respectively, compared with the state-of-the-art hardware (9,10), while achieving up to 175 fps at 150 MHz under QVGA resolution.
The rest of this paper is organized as follows. Section II describes the optimal algorithm
selection for the 3D HGI and the proposed shifter-based cost aggregation algorithm
as well as its hardware architecture. In Section III, the overall architecture of
the depth-estimation processor (DEP) with the 7-stage pipeline, the pipeline buffer optimization,
and the resolution-scalable pipeline control is explained with detailed hardware implementations.
Section IV shows the system implementation with the proposed chip and the evaluation results,
followed by the conclusion in Section V.
Fig. 4. Hand depth images of (a) original input image, (b) global aggregation, (c)
local aggregation, (d) block aggregation.
II. SHIFTER-BASED COST AGGREGATION
1. Optimal Aggregation for 3D HGI
The cost aggregation is the most important stage of depth estimation in terms of not only accuracy but
also memory accesses and computation. There are three basic
categories of cost aggregation algorithms: global aggregation (21-23), local aggregation (24,25), and block aggregation (26). Global aggregation was utilized in (8-11,13,15), while (12,14) adopted local aggregation and (16) utilized block aggregation. Fig. 4 shows depth-estimation results for one algorithm of each category: semi-global aggregation (23), adaptive support weight (ASW) (24), and simple block aggregation (SSD + mean filtering) (26).

The global aggregation method aggregates the initial cost maps so as to minimize the overall
matching sum. Its aggregation paths are fully connected, and the final depth points
are selected by comparing all of the cost values along all possible aggregation
paths. Because it explores all possible aggregation paths, global aggregation
automatically interpolates ambiguous depth regions such as textureless regions, occluded
regions, or repeated patterns, as shown in Fig. 4(b). Moreover, it generates a dense depth map without any additional post-processing.
However, its fully connected aggregation paths require large computation and intermediate
data, with complexities of O(WxHxD²) and O(32xWxHxD), respectively, as shown in Table 1.

Next, the local aggregation method aggregates the cost maps over the
same disparity level. It usually utilizes supporting filters generated from
intensity differences (24) or segmentation regions (25) to improve the accuracy of the depth map, since it does not aggregate across different disparity
levels. Its critical limitation is that it cannot interpolate ambiguous
regions because the aggregation is explored on only a single disparity level. However,
local aggregation provides a much sharper depth image and accurate depth information
for close objects. It also provides a depth map as high in quality as global aggregation
within the active regions of the 3D HGI, because the active distance to the hands is 20 ~ 40
cm and the hands are always located closer than other background objects. Compared with
global methods, its computation and intermediate data complexities are reduced
to O(WxHxD) and O(16xWxH) because it explores only one disparity level at a time.

Finally, the block aggregation method aggregates costs within a fixed-size box region. As shown
in Fig. 4(d), this method provides the worst-quality depth map among the three
because it simply sums the initially matched costs without any supporting weights. It does
reduce the computation complexity and the required memory compared with the other two
methods, because INT16 is sufficient for its simple
summation-only aggregation. However, its average pixel error is 14.2%, and its result
is too poor to realize an accurate 3D HGI, as shown in Fig. 4 and Table 2. Therefore, local aggregation is the optimal algorithm for the mobile HGI
to realize both low latency (< 10 ms) and low power (< 50 mW) in terms of both accuracy
and algorithm complexity. In this paper, we utilize and optimize ASW (24,27) among the variants of local aggregation methods for the proposed hardware.
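To make the complexity classes of Table 1 concrete, the short calculation below plugs in the QVGA parameters used throughout this paper (W = 320, H = 240, D = 60). The results are illustrative proportionality counts only, not the measured Gflops and bandwidth figures quoted in Section I.

```python
# Illustrative only: plug the QVGA parameters into the complexity classes of Table 1.
W, H, D = 320, 240, 60

global_ops  = W * H * D * D      # O(WxHxD^2) aggregation operations
local_ops   = W * H * D          # O(WxHxD)
global_bits = 32 * W * H * D     # O(32xWxHxD) intermediate data (bits)
local_bits  = 16 * W * H         # O(16xWxH)

print(f"ops   : global {global_ops/1e6:7.1f} M vs local {local_ops/1e6:5.1f} M")
print(f"buffer: global {global_bits/8/2**20:5.1f} MiB vs local {local_bits/8/2**10:5.1f} KiB")
```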
Table 1. Complexity comparison among aggregations
Fig. 5. Operations of adaptive support weight aggregation
Fig. 5 describes the operations of ASW with 60 disparity levels, where the initial
costs are aggregated level-by-level. For each disparity level, it performs sequential aggregation
along four directions (right, left, top, and bottom) for every pixel, where the horizontal
and vertical aggregations are performed in order for higher accuracy. The cost aggregation
along each direction performs a weighted summation, where the weights are generated by
gestalt grouping
(24) and formulated using a Laplacian kernel of the color difference between a center
pixel and an aggregated pixel. However, it must use a 32-bit floating-point (FP) number
system for the costs and weights since it requires exponent computation, which results in
power-consuming FP ALUs as well as a huge memory bandwidth. Moreover, the weighted
summations are performed for all pixels and disparity levels, requiring large computation
even though local aggregation has a lower computational complexity than global
aggregation. For example, it requires 579.4 Gflops and 14.6 GB/s for 100 fps under
QVGA resolution, implying that careful optimization is required. Therefore,
we introduce a hardware-friendly ASW algorithm in the next section that uses integers
instead of FP for the cost values.
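The following minimal sketch illustrates the floating-point ASW weighting just described, restricted to a single row, a single disparity level, and one aggregation direction; the window handling and normalization of the full algorithm (24,27) are intentionally simplified.

```python
import numpy as np

def asw_aggregate_1d(costs, intensities, sigma=8.0, radius=7):
    """Floating-point ASW along one direction for one image row and one
    disparity level. Weights follow a Laplacian kernel of the intensity
    difference, w = exp(-|I_center - I_i| / sigma), as in gestalt grouping;
    sigma would be one of the supporting parameters {2, 4, 8, 16} used in
    the proposed hardware. Window handling is simplified."""
    n = len(costs)
    out = np.empty(n, dtype=np.float64)
    for x in range(n):
        acc = float(costs[x])                    # cost of the center pixel
        for i in range(max(0, x - radius), min(n, x + radius + 1)):
            if i == x:
                continue
            w = np.exp(-abs(float(intensities[x]) - float(intensities[i])) / sigma)
            acc += w * costs[i]                  # weighted summation
        out[x] = acc
    return out
```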
2. Shifter-based Cost Aggregation Processing
The Laplacian scale factor used in the ASW algorithm (24,27) is generated from the absolute difference of adjacent pixels' intensities:

$$w_i = \frac{|I_{center} - I_i|}{\sigma} \qquad (1)$$

where σ is the supporting parameter, whose values in the proposed hardware are 2, 4, 8 and 16. Then, an ASW cost with these weights is described as

$$C_{ASW} = C_{center} + \sum_{i} e^{-w_i} \cdot C_i \qquad (2)$$

Table 2. Depth error comparison of aggregation methods
In (2), the ASW cost is calculated by the weighted summation of successive costs $C_i$
(24,27). $C_{center}$ indicates the initial matching cost of the center point, and the $C_i$ are the costs
of neighboring pixels. To calculate the exponent operations,
(24,27) utilize a 32-bit FP number system to reduce truncation errors during aggregation.
Due to the large area and power consumption of FP logic,
(12) deployed a 24-bit INT number system to reduce these overheads, with a 6.8% average pixel
error that is comparable with the accuracy of
(24) (6.5%), as shown in
Table 2. However, the 24-bit number system still incurs a large overhead in intermediate
memory size and in the area of the 24-bit multipliers, so the proposed algorithm applies additional
approximations to further reduce the bit-width of the costs.
The stereo matching algorithm finds the best-matched pairs of points with the WTA algorithm,
which searches for the index of the minimum cost along the depth levels. Therefore, the
depth map produced by WTA does not change as long as the inequality between any two costs is
preserved after the approximations. In the first approximation step, the base of the
Laplacian kernel is changed from Euler's number to 2:

$$C_{ASW} = C_{center} + \sum_{i} 2^{-w_i} \cdot C_i \qquad (3)$$

The modification in (3) does not change the inequality condition since $2^{-x} = e^{-x \ln 2}$ decreases monotonically in $x$, just like $e^{-x}$. After
that, the base-2 ASW cost is approximated with shifting operations as

$$C_{ASW} = C_{center} + \sum_{i} \left( C_i \gg w_i \right) \qquad (4)$$

because $C \gg x = \lfloor C \cdot 2^{-x} \rfloor$, and it still preserves the inequality condition without any loss of generality.
As a result, as shown in
Fig. 6 and
Table 2, the accuracy differences between the proposed shifter-based aggregation and the
previous integer-based ASW algorithm
(12) are -3.82%, 3.79%, +1.59%, and +0.65% on the Tsukuba, Venus, Teddy, and Cones cases of the
Middlebury stereo dataset
(28), respectively, while achieving a large bit-width reduction.
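The sketch below replays the two approximation steps of (3) and (4) on random small-integer costs and weights (the bit-widths assumed here follow the hardware in the next subsection) and reports how often the WTA index chosen from the shift-based costs matches the one chosen from the floating-point reference. The agreement is not exact, which is consistent with the small per-dataset differences reported above; the random data is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def aggregate(center, costs, weights, mode):
    """Aggregate neighbor costs with one of three weighting schemes:
    'fp'    : C_center + sum(e^-w  * C_i)   -- Eq. (2)
    'base2' : C_center + sum(2^-w  * C_i)   -- Eq. (3)
    'shift' : C_center + sum(C_i >> w)      -- Eq. (4)"""
    if mode == "fp":
        return center + np.sum(np.exp(-weights.astype(float)) * costs)
    if mode == "base2":
        return center + np.sum(2.0 ** (-weights.astype(float)) * costs)
    return center + np.sum(costs >> weights)          # integer-only shift and add

trials, levels, neighbors, agree = 2000, 60, 8, 0
for _ in range(trials):
    centers = rng.integers(0, 9, levels)               # initial census costs (max 8)
    costs   = rng.integers(0, 9, (levels, neighbors))
    weights = rng.integers(0, 8, (levels, neighbors))  # 3-bit weights
    fp    = [aggregate(centers[d], costs[d], weights[d], "fp")    for d in range(levels)]
    shift = [aggregate(centers[d], costs[d], weights[d], "shift") for d in range(levels)]
    agree += int(np.argmin(fp) == np.argmin(shift))    # compare WTA indices
print(f"WTA index agreement, shift-based vs FP: {100 * agree / trials:.1f}%")
```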
Fig. 6. Results of the proposed shifter-based aggregation (a) Input image, (b) Ground
truth, (c) Adaptive support weight, (d) Shifter-based adaptive support weight
3. Shifter-based Aggregation Unit
Fig. 7(a) describes a hardware implementation of ASW that consists of an exponent unit, a multiplier,
and an adder implemented in 32-bit FP, following (2). It takes weights and costs as input and calculates $C_{acc} + C \cdot e^{-w}$
every cycle. The cost aggregated along one direction is then stored in an accumulation
register. The FP exponent logic and the FP MAC require complicated hardware, using either
lookup tables or piecewise-linear approximation schemes to reduce hardware complexity.
However, both approaches still require a large on-chip memory or complex processing
logic compared with integer-based hardware. In addition, a DEP requires
highly parallel aggregation unit arrays (e.g., > 270-way), so the overheads in area
and power consumption are critical.
Unlike the FP-based unit, the proposed aggregation unit in Fig. 7(b) requires only a barrel shifter and an integer adder. It realizes the multiplication
between an input cost and the exponential of its weight with a single shifting operation. The
proposed shifter-based ASW also enables the use of an integer number system throughout the aggregation.
The initial costs are generated by 8-point selective census matching within a 5x5 template,
so their maximum value is 8. They are then aggregated by the proposed
ASW within a 15x15 aggregation region, and the maximum values of the intermediate and
final aggregated costs are 120 and 1800, respectively. Thus, the bit-widths of the initial,
intermediate, and final costs, formerly 32 bit, are set to 4, 7, and 11 bit,
respectively, without overflow. The maximum value of the weights, which represent
the aggregation strength between neighboring pixels according to their similarity, is determined empirically;
simulation results show that 3 bits are enough for the shifter-based ASW processing
without accuracy degradation. As a result, the proposed unit contains only a 4 (7) bit
barrel shifter with a 3-bit operand and a 6 (11) bit accumulator for the vertical (horizontal)
direction, respectively, reducing power consumption by 92.2% compared with the original
FP-based implementation.
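These bit-widths follow directly from the worst-case cost magnitudes; the snippet below simply reproduces that arithmetic (an 8-point census cost, a 15-tap vertical pass, and a 15-tap horizontal pass).

```python
from math import ceil, log2

def bits(v):
    """Minimum unsigned bit-width needed to hold the values 0..v."""
    return ceil(log2(v + 1))

init_max = 8                # 8-point selective census cost within a 5x5 template
vert_max = 15 * init_max    # worst case of the 15-tap vertical aggregation  = 120
horz_max = 15 * vert_max    # worst case of the following horizontal pass    = 1800

print(bits(init_max), bits(vert_max), bits(horz_max))   # -> 4 7 11
```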
Fig. 7. (a) FP-based aggregation unit, (b) Proposed shifter-based aggregation unit
In addition to the power reduction, the bit-width reduction of the processing data also
drastically reduces the overall intermediate data size, by 69.1%. The reduction of the
required intermediate memory to 31.9 KB makes it possible to integrate all the intermediate
buffers on the chip, removing external memory accesses during stereo matching.
III. PROPOSED DEPTH-ESTIMATION PROCESSOR
1. Overall Architecture
Fig. 8 describes the overall architecture of the proposed DEP, which is composed of a top
controller, an input image loader, an output depth buffer, and a stereo pipeline module
(SPM). The 7-stage pipelined SPM estimates depth line-by-line. It is composed of an
input buffer, a census transformation unit, an initial matching unit, a vertical aggregation
unit, a horizontal aggregation unit, a WTA unit, and a left-right (L-R) consistency
check unit. First, the input image loader fetches 8-bit left and right pixels from
external memory and stores them in the 320x20 input buffer in the SPM every clock
cycle. After 20 lines of input are fetched into the input buffer, the census transformation
unit generates 30 left and right binary patterns and the corresponding aggregation
weights from the 20-line inputs every cycle. Then, the initial matching unit calculates
the Hamming distance between left and right census pairs and extracts 74 initial cost
maps every 60 cycles. Next, the initial cost maps from the previous stage are
aggregated by the vertical and horizontal aggregation units in order, with 248-way
and 240-way parallelism, respectively. After that, the WTA unit searches for the best-matched
index between the left and right images and generates left and right depth maps. Finally,
the L-R consistency check unit compares the left and right depth maps to eliminate falsely
matched depth points, which come from occluded or textureless points, and the 60 final
depth points are stored in the output depth buffer every 60 cycles. The proposed
shifter-based ASW completely eliminates external memory access during SPM operation
by holding all of the intermediate data inside the pipeline buffers. To realize 10 ms stereo
matching latency, the initial matching, vertical aggregation, and horizontal
aggregation units are composed of homogeneous 148-way, 148-way, and 120-way parallelized
PEs, respectively.
Fig. 8. The overall depth-estimation processor architecture
Fig. 9. The timing diagram of the proposed DEP with hierarchical pipelining
Fig. 9 describes a timing diagram of the proposed DEP operations with hierarchical pipelining.
The first is a line-level pipeline with 3 stages: line loading, line processing, and
line storing. The SPM estimates 1 line of the depth map every 480 clock cycles. Each
line processing stage consists of a 7-stage pixel-level pipeline: input pre-fetching,
census transformation, initial matching, vertical aggregation, horizontal aggregation,
WTA, and consistency check. Each stage processes pixel-level operations every 8 clock cycles.
All the pipeline stages are well balanced to achieve 94% utilization.
2. Line Streaming 7-stage Pipeline Architecture
Fig. 10 describes the data processing patterns of sliding-window matching and 4-direction cost
aggregation. In the initial matching stage, a right (reference) patch and a left (target)
patch are compared to generate initial costs. In this operation, the right patch is
reused 60 times while the left patch slides toward the right. In a general implementation,
the target patches are fetched into the left buffer and a wide I/O multiplexer (MUX)
reorders the data to align with those in the right buffer. However, such
a wide MUX causes a large area overhead and routing congestion because it must be connected
to all of the ports of the matching PE. On the other hand, the proposed architecture
with a shifting register (SR)-based buffer for the target patch (marked in red) moves
1 index every pipeline cycle, as indicated in Fig. 10(a). In the meantime, the reference patch stored in the blue RFs is loaded every 60 pipeline
cycles. The 4-direction cost aggregation is obtained by recursively performing the
bi-directional aggregation for top/bottom and right/left, respectively, as shown in
Fig. 10(b). The size of the aggregation window is 15x15, and a maximum of 8 costs are aggregated in
a single aggregation PE. Therefore, the initial costs in a buffer are selected with
cyclic indexing and issued to the forward and backward aggregation units. The bi-directionally
aggregated costs are generated through both units after 8 clock cycles.
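A behavioral sketch of the reuse pattern in Fig. 10(a) is given below: each reference (right) census word is held for 60 matching steps while the target (left) word advances by one index per pipeline cycle, which is what the SR-based buffer implements in hardware. The census words are treated as plain Python integers, and buffer sizing and pipelining details are ignored.

```python
def hamming(a, b):
    """Hamming distance between two equally long bit patterns stored as ints."""
    return bin(a ^ b).count("1")

def match_line(right_census, left_census, max_disp=60):
    """Behavioral model of Fig. 10(a): each right (reference) census word is
    reused for max_disp cycles while the left (target) word advances by one
    index per pipeline cycle, as the SR-based buffer does in hardware."""
    w = len(right_census)
    costs = [[None] * max_disp for _ in range(w)]
    for x in range(w):                    # reference word loaded every 60 cycles
        for d in range(max_disp):         # target word shifted by one index/cycle
            if x + d < w:
                costs[x][d] = hamming(right_census[x], left_census[x + d])
    return costs
```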
Fig. 10. Hardware implementations in stereo matching (a) Matching hardware with shifting
register-based buffer, (b) Aggregation units with 2-direction MUX-based buffer
There are 5 pipeline buffers in the SPM: the input buffer, the left and
right census registers, the initial cost register, the intermediate cost registers
for the vertically aggregated costs, and the final aggregated cost register. The overall
latency of the pipeline is 480 clock cycles, and the buffers latch and fetch data
in synchronized pipeline cycles. First, the 3-banked input buffer issues 3 pixels
of the left and right images to the left and right census transform units per clock cycle.
This operation takes 320 cycles to issue 1 line of the input images, and the remaining
160 cycles are used to fetch the next line into the input buffer from external memory.
Second, the left and right census units simultaneously transform the pixels of the input images into
15 census pixels every clock cycle. They are then stored
in the left and right census registers shown in
Fig. 11. The right census buffers use double buffering and are swapped every
60 pipeline cycles (480 clock cycles), while the left census buffer is composed of the
SR-based buffer architecture shown in
Fig. 10. Third, upper and lower lines are fetched from the active left and right census buffers,
and the initial matching units calculate 2 lines (148 words) of the
initial costs element-by-element every clock cycle, as
Fig. 11 describes. Fourth, bi-directional vertical aggregation is performed with 148-way vertical
aggregation units that generate 74 upper and 74 lower initial costs
every 8 clock cycles. To eliminate a pipeline stall in the vertical aggregation, as
shown in
Fig. 11, initial matching and vertical aggregation process different lines of data with a 1-index
shift. Finally, horizontal aggregation is performed with 120-way horizontal
aggregation units, and the resulting aggregated costs are stored in the final aggregated
cost registers. In addition, the intermediate cost buffer
exploits a double-buffering architecture to reduce pipeline stalls in the horizontal aggregation.
Both the vertical and horizontal aggregation buffers use MUX-based buffers, as shown
in
Fig. 10(b). As a result, the proposed SPM processes 300 depth points, with initial matching and
aggregation over 60 disparity levels, every 2400 clock cycles, and its average
utilization is 94%.
Fig. 11. Cost generation and pipeline buffer architecture of the SPM: 1) Shifting register
and double buffering for the left and right census, 2) 2-path initial matching and initial
cost buffer, 3) 2-path vertical aggregation and horizontal aggregation
Fig. 12. Comparison between multiplexer-based and shifting register-based architecture
(a) Area vs. parallelism, (b) Power vs. parallelism, (c) Area-power product vs. parallelism
The proposed SPM does not use any external memory during stereo matching, so the
size of its pipeline RFs is critical in terms of both logic area and power
consumption. To reduce the RFs, we also change the order of the aggregation directions from X-Y
to Y-X, as in
(12). For example, in our case, X-Y order aggregation needs 960 words of pipeline buffers,
composed of 60x15 and 60x1 RFs. In Y-X order aggregation, however,
only 134 words are needed, composed of 74x1 and 60x1 RFs.
This optimization also reduces the weight buffers, and its effect is doubled because
both the left and right buffers shrink. Therefore, the proposed hardware reduces the memory of
the CA stages by a further 43.9% with only a 0.5% error penalty. As a result, owing to line-level
processing and the changed aggregation order, the proposed SPM requires only a 17.9 KB
buffer without any external memory accesses for QVGA stereo matching.
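For reference, the buffer word counts quoted above can be checked directly. Note that this counts only the listed RFs; the 43.9% figure in the text refers to the total memory of the CA stages.

```python
# Word counts quoted in the text for the cost-aggregation pipeline buffers.
xy_words = 60 * 15 + 60 * 1     # X-Y order: 60x15 and 60x1 RFs -> 960 words
yx_words = 74 * 1 + 60 * 1      # Y-X order: 74x1 and 60x1 RFs  -> 134 words
print(xy_words, yx_words, f"{100 * (1 - yx_words / xy_words):.1f}% fewer RF words")
```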
3. Shifting Register-based Pipeline Buffer
There are two basic pipeline buffer architectures: MUX-based and SR-based
designs. The difference is the way they align the large amount of data to a parallel
PE array, using either a multiplexer (MUX) or an SR. In the MUX-based architecture, input
data from the previous stage are stored into a pipeline buffer through a wide I/O MUX.
The SR-based architecture, on the other hand, orders data by shifting by 1 index for every
input insertion. In general, the MUX-based architecture consumes less dynamic power
and less area than the SR-based one in low-parallelism designs; thus, CPUs
and DSPs do not deploy SR-based architectures. However, at high parallelism,
which requires a large number of connections, its area increases tremendously
and the static power becomes dominant. These area and power overheads make
it inefficient in highly parallel designs such as the proposed DEP.
Simulations were performed to obtain the relationships of area, power, and area-power product,
as a new figure of merit, with respect to parallelism in order to optimize the buffer architectures;
both architectures run at 150 MHz with a 1.0 V supply voltage. A barrel shifter
with O(n) logic complexity is used for the MUX-based architecture to make the comparison fair,
because the buffers in stereo matching move indices along only one direction. The baseline
of normalization is the MUX-based architecture, and designs from 5-way to 100-way are tested.
As shown in Fig. 12(a), the MUX is smaller than the SR below 25-way, but the normalized area of
the MUX becomes larger than that of the SR above 25-way. In terms of normalized power, shown in Fig. 12(b), the MUX always consumes less power because of the dynamic power consumption of the SRs.
However, the gap between the SR and the MUX is only 4% at 100-way parallelism. Since
both area and power are important in hardware design, we analyze the area-power product
to find the optimal designs for the SPM buffers. As shown in Fig. 12(c), the MUX-based design performs better than the SR-based design below 40-way,
while the opposite holds above 40-way. Since both sliding-window matching and aggregation
are performed repeatedly by moving 1 index per processing step, the pipeline buffers
between the 7 stages can be implemented with either MUX-based or SR-based designs, and
the optimization is made by selecting the architecture according to the parallelism
level of each buffer. Therefore, the initial matching buffer (8-way), the vertical aggregation buffer
(8-way), and the horizontal aggregation buffer (1-way) utilize the MUX-based architecture,
and the left and right census buffers (74-way) utilize the SR-based architecture. As a result,
the optimized buffer design improves the critical path timing by 44% and reduces the overall
area by 29.8%.
4. Resolution Scalable Pipeline Control
Due to the line-streaming processing, the depth estimation can support any input image
height. For width scalability, on the other hand, the proposed
SPM architecture supports any resolution whose width is a multiple of 60. Since the
input buffer size of our DEP is 320 (max width) x 21 (max aggregation range), the DEP
supports 60, 120, 180, 240, and 300-pixel-wide images without degrading utilization, using
300 of the 320 input buffer columns, while the remaining 20 are used for aggregation.
If this buffer is enlarged to 640 or 1920 columns, the proposed architecture can also
support VGA or Full-HD images without any change to the PE architecture. To
realize this scalability, the resolution-scalable pipeline control shown in Fig. 13 is proposed
in the SPM so that the controller does not have to be altered even when the number of buffers
is scaled. As shown in Fig. 13(a), the hardware block of a single pipeline stage receives only 4 signals: EN (enable),
RST (reset), Pn (pipeline number), and Ln (loop number). EN controls whether
data is latched into the accumulation registers or pipeline buffers. RST resets both the
accumulation register inside a PE array and the alignment index in a data alignment
unit to zero. These two signals are mandatory. Pn and Ln, on the other hand, are
optional signals for the aggregation, WTA, and L-R consistency check stages. Pn
is used for controlling the current aggregation position and the latching signal for the pipeline
buffers. Ln is used for the WTA and L-R consistency check operations. These 4 control
signals are generated by a signal generator in the SPM, described in Fig. 13(b). In the SPM, there are two counters, a 3-bit counter for Pn and a 6-bit counter for Ln, one variable-width pulse
generator, and configuration registers, and they generate all of the control signals
required in the 7 pipeline stages. The SPM receives SPM_EN (global enable) and SPM_RST
(line reset) from the top DEP controller and performs a 480 (60x8) cycle stereo
operation. After 480 cycles, depending on the configuration setting, it automatically
proceeds to the next 60 depth points or stops until the next line processing. The configuration
registers store the number of pre-fetched lines, the image resolution, and debugging settings
for dumping intermediate data, and the signal generator makes variable-width
EN signals from this information. Due to this line-level automated control, the top
controller only needs to send the SPM_EN and SPM_RST signals to the SPM while processing
the whole stereo matching. Fig. 13(c) describes the timing diagram of the proposed DEP. First, after the top controller in
the DEP asserts SPM_EN and sends a single pulse of SPM_RST, the SPM automatically processes
1 line of depth estimation. SPM_RST resets the loop counter inside the
SPM to zero, and the SPM generates the enable signals for the 7 stages until the entire line is processed.
After SPM_RST is asserted, the variable-width pulse generator in the SPM sends the EN
signal for stage 1, which is successively propagated to stages 2~7. These signal propagations
can be turned on or off through the configuration registers in the SPM. For example,
before estimating the first line of a depth map, the proposed hardware must pre-fetch
20 lines, and stages 2~7 must not process any data because the input buffer does not yet
contain valid images. In this case, the signal generator blocks the propagation of the
enable signal and performs stage 1 only for the remaining 19 lines. In this situation,
all other stages are stalled and clock-gated to remove redundant power consumption.
Thanks to this simple control architecture, the control logic occupies only 0.26% of the
overall DEP area while supporting various input image resolutions.
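The listing below is a behavioral sketch (not RTL) of the enable-propagation scheme just described: during the 20-line pre-fetch only stage 1 runs, and afterwards the EN pulse issued by the signal generator reaches stages 2~7 so that one depth line is produced per line-level step. The signal names follow the text, while the timing granularity and per-stage behavior are placeholders.

```python
def run_lines(total_lines, prefetch_lines=20, stages=7):
    """Behavioral sketch of the SPM enable propagation: while the input buffer
    is still being filled, only stage 1 (input loading) is enabled and stages
    2~7 stay stalled/clock-gated; afterwards the EN pulse propagates through
    all 7 stages each line. Cycle-accurate timing is not modeled."""
    schedule = []
    for line in range(total_lines):
        if line < prefetch_lines - 1:              # pre-fetch: stage 1 only
            en = [True] + [False] * (stages - 1)
        else:                                      # EN propagates to stages 2~7
            en = [True] * stages
        schedule.append(en)
    return schedule

if __name__ == "__main__":
    for i, en in enumerate(run_lines(24)):
        print(f"line {i:2d}: " + " ".join("EN" if e else "--" for e in en))
```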
Fig. 13. Stereo pipeline module control (a) Structure of single pipeline stage hardware,
(b) Control path of stereo pipeline module, (c) Timing diagram of pipeline control
signal
IV. IMPLEMENTATION RESULTS
1. Chip Implementation Results
The proposed 1400x2000 μm2 DEP shown in Fig. 14 is fabricated in a 65 nm 1P8M logic CMOS process, and Table 3 summarizes the chip specification. We redesigned the previous DEP block (29) into a standalone chip with improvements in debugging functionality, resolution
scalability, the external interface, and timing performance. It consumes 47.2 mW at
175 fps (5.71 ms) throughput, its maximum performance, with a 1.2 V supply voltage
and a 166 MHz operating frequency, and only 15.56 mW at 105 fps (9.52 ms) with 1.0
V and 100 MHz. The proposed hardware estimates QVGA-resolution depth images with
a maximum disparity of 60 levels. Its maximum energy efficiency is 34 pJ/level·pixel
at a 1.0 V supply voltage. The required memory is reduced by 54.6% to 17.9 KB compared
with the state-of-the-art result (10), which makes it possible to integrate all intermediate data into on-chip memory thanks
to the algorithm and pipeline buffer optimization. The measured 15.56 mW power
dissipation and 34 pJ/level·pixel energy consumption correspond to a 75.6% reduction
compared with the state-of-the-art (9).
Fig. 14. Chip photograph
Table 3. Specification of the proposed DEP
2. Evaluation System Implementation
Fig. 15 shows the evaluation system of the proposed DEP, which is integrated into the HMD system,
where the DEP communicates with a host processor (Exynos-5422 application processor)
through a USB 3.0 interface. Stereo images are retrieved from the customized stereo camera and
converted to grayscale by the host processor. After that, the host
processor sends the images to the target HMD platform, where they are eventually sent to the DEP.
The overall stereo processing latency is 9.95 ms, including the USB 3.0 communication
latency between the DEP and the host processor, which is hidden behind the depth-estimation
operations thanks to the streaming processing. The host processor performs 3D hand pose
estimation using (18), and the 3D hand poses are used by the customized UI. The final
depth maps extracted by the DEP are visualized on a monitor.
Table 4. Performance comparison table
Table 5. Average depth error on Middlebury dataset (28)
3. Evaluation Results
We evaluate the proposed DEP on both the Middlebury stereo dataset (28) and hand pose estimation errors. To acquire the hand pose estimation errors, (18) is applied to the extracted depth maps. Table 5 shows the average depth error on (28), which includes the Tsukuba, Venus, Teddy, and Cones images. It is evaluated for all regions,
non-occluded regions, and depth-discontinuity regions of the test images, and the average
errors are 10.7%, 7.1%, and 16.7%, respectively. Compared with the original algorithm
(24,27), only 0.1% of accuracy is degraded over the three categories, which is negligible
for the 3D HGI. In addition, we also evaluate the proposed DEP with the hand pose estimation
algorithm (18) and the HMD system shown in Fig. 15. To keep the hand pose estimation within 30 ms latency, we reduced the sample points and iterations
to 128 points and 16 iterations, respectively. Also, our evaluation software pipelines
image retrieval, depth estimation, hand pose estimation, and visualization to
realize an overall 40 ms latency system. Fig. 16 shows the evaluation results of hand pose estimation with the DEP. First, the input images
are sent to the DEP, which generates the depth maps shown in the 2nd and 5th columns. Even
though they show depth errors in background regions due to occlusion by the foreground
hands, they show reasonably accurate depth quality in the hand regions. The 3rd and 6th
columns of Fig. 16 show the final hand pose results. Because (18) performs hand model regression with sampled depth points, which are the 128 most reliable
depth points in the hand regions, the results show accurate hand poses.
Fig. 15. Evaluation system
Fig. 16. The evaluation results of hand pose estimation for the 3D hand gesture interface
Table 6. Hand pose estimation error
Table 6 shows the hand pose estimation errors in the range of 25~35 cm, which is the usual active
distance for the 3D HGI on HMD systems. The maximum errors are 13.64 mm and 12.00
mm for the finger and palm regions, respectively, and the corresponding average errors
are 7.18 mm and 6.28 mm. Since the original algorithm
(18), which utilizes a ToF sensor instead of stereo matching, shows an average hand
tracking error of 5 mm, the accuracy of the hand tracking system with the proposed DEP is adequate
to provide a natural UI for AR/MR systems.