Heungmin Oh, Minjung Lee, Hyungtae Kim, and Joonki Paik*
(Department of Image Engineering, Processing and Intelligent Systems Laboratory, Graduate School of Advanced Imaging Science, Multimedia and Film, Chung-Ang University, Seoul 06974, Korea)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Metadata, Object segmentation, Surveillance system
1. Introduction
Recently, surveillance systems have increased the number of cameras to monitor crime scenes, traffic accidents, and natural disasters more accurately. Surveillance systems operate in two modes: real-time monitoring and searching for specific scenes, also called video summarization. In real-time monitoring, additional surveillance cameras enlarge the monitored region at the cost of requiring more human operators. At the same time, more cameras cause a dramatic rise in both storage cost and computational power, which grow in proportion to the number of cameras because both operation modes are performed on all cameras and videos.
To address these problems, metadata extraction methods have been researched. In a surveillance system, metadata is a descriptor set that represents an object in real time or a scene in non-real time. Conventional metadata extraction methods use color, shape, and texture features. Each feature has different attributes, which play an important role in object detection, tracking, and recognition [1-3].
Metadata generation methods are classified by camera type into static camera-based and non-static camera-based algorithms. The key characteristic of a static camera is an image with a static background. Kim $\textit{et al.}$ proposed an object detection and metadata extraction method based on a Gaussian Mixture Model (GMM) and blob analysis [4]. Kim's method separates the foreground and background through GMM-based background modeling and then detects candidate object regions. Blob analysis removes a candidate region when it is smaller than a threshold. After a region-growing process, metadata is extracted from the remaining blobs. Geronimo $\textit{et al.}$ proposed an Adaptive Gaussian Mixture Model (AGMM)-based method for object detection and for human action and appearance extraction [5]. These methods rely on GMM-based object detection. A GMM-based method can detect an object through background and foreground modeling when a static camera observes a properly moving object. On the other hand, if the object is static or rarely moves, it is classified as background.
Fig. 1. Block diagram of the proposed metadata extraction algorithm.
Yuk $\textit{et al.}$ proposed motion-block-based object detection for metadata extraction [6]. In this method, the similarity of motion blocks is used for region growing by merging. Paek $\textit{et al.}$ proposed Lucas-Kanade optical-flow-based moving object detection and metadata extraction using an Active Shape Model (ASM) and color space transformation [7]. Paek's method assumes that the illumination condition is stable for the color space transformation. However, illumination is unstable and changeable in the real world.
Jung $\textit{et al.}$ proposed a 3D modeling-based metadata extraction method [8]. Jung's method performs camera calibration from object detection and tracking results. After calibration, ellipsoid-model-based metadata extraction is executed. Yun $\textit{et al.}$ proposed selecting a representative patch from detected object patches [9]. The accuracy of Yun's method depends on the patch size of the object; as a result, a small object reduces the detection performance.
As summarized above, most metadata extraction methods for a static camera use background subtraction or frame subtraction. However, subtraction between adjacent frames is unsuitable for a non-static camera because the camera's own movement generates object candidates, which decreases the accuracy of object detection. To overcome this limitation, non-static camera-based algorithms have been studied.
Chavda $\textit{et al.}$ proposed an object detection and metadata extraction method that performs background subtraction on the first frame of a Pan-Tilt-Zoom (PTZ) camera [10]. Chavda's method cannot detect multiple objects at the same time because the PTZ camera has a limited field of view (FoV). Hou $\textit{et al.}$ proposed an object detection and metadata extraction method based on a pre-trained Deformable Part Model (DPM) and Histogram of Oriented Gradients (HoG) for a non-static camera [11]. Although Hou's method can be applied to a non-static camera using pre-trained detectors, its object detection and metadata extraction performance decreases for fast-moving objects and complex backgrounds.
Shandong $\textit{et al.}$ proposed object detection and metadata extraction using whole-frame trajectories based on Lagrangian particle trajectories in a non-static camera [12]. Object detection and metadata extraction are performed by decomposing the trajectories of the non-static camera and of the object motion over the entire sequence. The accuracy of Shandong's method decreases as the difference between the camera and object trajectories decreases.
As described above, metadata extraction methods for a non-static camera use features of both object and camera motion. Non-static camera-based algorithms are more challenging than static camera-based ones because the motion in a non-static camera is generated by the movement of both the object and the camera. Moreover, illumination changes and swaying leaves also appear as motion in a video acquired by a camera. Hence, an alternative object detection method that guarantees extraction of the overall object shape is required for metadata extraction.
A notable approach is segmentation. Seemanthini $\textit{et al.}$ proposed clustering-based segmentation and metadata extraction [13]. Seemanthini's method performs part-level segmentation of the input frame. After clustering, cluster similarity is used for region merging; the merged region becomes an object and is fed into a metadata extraction module. Patel $\textit{et al.}$ generated a saliency map by estimating distances with an average filter and a Gaussian low-pass filter on a video sequence [14]. Metadata is extracted by thresholding the saliency map and applying a morphological operation.
In segmentation-based methods, the object segmentation result directly affects the accuracy of metadata generation. Consequently, when segmenting a wide-angle camera image with a dominant background, inaccurate segmentation leads to incorrect metadata. This paper proposes segmentation-based metadata extraction for static and non-static cameras with various FoVs. The proposed method is designed as a three-stage framework comprising a deep learning-based detector, a segmentation module for background elimination, and metadata extraction. This paper is organized as follows. Section 2 presents the deep learning-based object detection, the improved DeepLab v3+ for background elimination, and the metadata extraction method. Section 3 presents comparative results of the proposed and existing methods for static and non-static cameras. Section 4 concludes the paper and describes future work.
2. Metadata Extraction of Static and Non-static Camera
In this section, we propose a robust metadata extraction method applicable to video captured by both static and non-static cameras. The proposed method consists of three steps: i) YOLO v3-based object detection, ii) improved DeepLab v3+-based background elimination with scale robustness, and iii) comprehensive metadata extraction. Fig. 1 shows the block diagram of the proposed algorithm. This paper adopts a deep learning-based object detector for real-time detection in static and non-static cameras [15] because the proposed method focuses on accurate metadata extraction rather than object detection. Nevertheless, we additionally train the object detection model by gathering and augmenting data to enhance detection accuracy for diverse camera fields of view.
After fine-tuning, for accurate segmentation of the candidate object region given as a bounding box, the proposed method improves the DeepLab v3+ model used to eliminate the background. The proposed method enhances the model by removing a specific feature map in the feature pyramid of the original DeepLab v3+ encoder network. To better represent small objects, the proposed segmentation method generates a new feature map that captures the detail information of small objects. Finally, metadata is extracted from the estimated object information without background. The proposed metadata consists of the representative color, size, aspect ratio, and patch of an object.
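As a reading aid, the following minimal sketch shows how the three stages fit together for a single frame. The helper names `detect_objects`, `segment_foreground`, and `extract_metadata` are hypothetical wrappers around the fine-tuned YOLO v3 detector, the improved DeepLab v3+ model, and the metadata module described in Sections 2.1-2.3; this is an illustration of the framework in Fig. 1, not the authors' implementation.

```python
# Minimal per-frame sketch of the three-stage framework in Fig. 1 (illustration
# only). detect_objects, segment_foreground, and extract_metadata are
# hypothetical callables standing in for the fine-tuned YOLO v3 detector,
# the improved DeepLab v3+ model, and the metadata module.

def process_frame(frame, detect_objects, segment_foreground, extract_metadata):
    """Return metadata records for every object detected in a single frame."""
    records = []
    for (x1, y1, x2, y2) in detect_objects(frame):     # stage 1: bounding boxes
        patch = frame[y1:y2, x1:x2]                    # candidate object region
        mask = segment_foreground(patch)               # stage 2: background elimination
        records.append(extract_metadata(patch, mask))  # stage 3: color/size/ratio/patch
    return records
```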
2.1 Deep Learning-based Object Detection
The object detection module aims at accurate object shape extraction without loss of the object region in static and non-static cameras and across various camera FoVs. In conventional object detection, the environment of surveillance cameras is tightly constrained. For example, when a surveillance system uses both static and non-static cameras, the FoVs of the cameras are assumed to be similar because the system shares one object detection module across all cameras.
Fig. 2 shows the challenge caused by differences in camera FoV. In Figs. 2(a) and (b), the corresponding objects have different representations because the two cameras have very different FoVs, which produces dissimilar sizes and shapes for the same object. In contrast, in Figs. 2(c) and (d), the objects marked with red bounding boxes have similar sizes and shapes. These phenomena are caused by the lens FoV and the viewpoint of the camera.
Selecting a proper object detection algorithm therefore requires considering the above attributes and the surveillance camera environment. In addition, detection speed matters for real-time metadata extraction, so we adopt a suitable detector from among the existing methods. For an objective model selection, we executed an ablation study with deep learning-based object detectors, including YOLO v3, Faster R-CNN, and the Single Shot MultiBox Detector (SSD) [16,17]. We finally adopted the YOLO v3 detector based on the results of the ablation study. The details of the ablation study are presented in the experimental results.
Even after this selection, the basic YOLO v3 is unsuitable as preprocessing for metadata extraction because it is trained on a public dataset that excludes diverse FoVs and viewpoints, whereas the proposed method deals with standard-lens and fisheye-lens cameras as well as mobile and static cameras. Hence, YOLO v3 was additionally trained by collecting and augmenting data to enhance its performance. The dataset for fine-tuning was gathered by considering object scales and angles. We collected images from various locations and additionally trained the YOLO v3 detector. Fig. 3 shows sample images of the fine-tuning dataset.
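The exact augmentation recipe is not specified in the paper; the sketch below shows the kind of simple flip and rescale operations that could mimic scale variation when preparing such a fine-tuning set, using standard OpenCV calls. The function name and scale factors are assumptions made for illustration.

```python
import cv2

def augment_for_finetuning(image, scales=(0.5, 0.75, 1.25)):
    """Generate flipped and rescaled variants of one training image.

    A simple stand-in for scale-aware augmentation; the actual recipe used to
    fine-tune YOLO v3 is not given in the paper.
    """
    variants = [image, cv2.flip(image, 1)]              # original + horizontal flip
    h, w = image.shape[:2]
    for s in scales:                                     # rescale to mimic objects
        variants.append(cv2.resize(image, (int(w * s), int(h * s))))
    return variants
```

Note that, for detector training, the corresponding bounding-box annotations would have to be transformed in the same way as the images.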
Fig. 4 presents experimental results of the basic YOLO v3 and the additionally trained YOLO v3. Fig. 4(a) shows that the original YOLO v3 detector cannot detect smaller objects. On the other hand, YOLO v3 with fine-tuning detects small object regions that the original YOLO v3 fails to detect. For computational efficiency, the proposed algorithm runs detection at frame intervals; we set the interval to 24 frames.
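Scheduling detection at this fixed interval can be sketched as follows, reusing the hypothetical per-frame routine from the sketch at the start of Section 2; this is an illustration of the interval scheme, not the authors' code.

```python
DETECTION_INTERVAL = 24  # detection is run once every 24 frames for efficiency

def process_video(frames, process_frame):
    """Apply the detection/segmentation/metadata pipeline to every 24th frame.

    frames        : iterable of video frames
    process_frame : per-frame routine such as the earlier sketch
    """
    records = []
    for idx, frame in enumerate(frames):
        if idx % DETECTION_INTERVAL == 0:   # keyframes only; other frames skipped
            records.extend(process_frame(frame))
    return records
```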
Fig. 2. Object detection results in images with different viewpoint and camera FoV (a)-(c) static camera, (d) Dash camera. (a) and (b) are recorded at the same place.
Fig. 3. Fine-tuning dataset.
Fig. 4. The comparison result of original YOLO v3 and fine-tuned YOLO v3 (a) result of YOLO v3 without fine-tuning, (b) result of YOLO v3 with fine-tuning.
2.2 Object Extraction without Background
Object detection results are commonly represented as bounding boxes, and every detection result includes background near the object. This background contaminates the color metadata because the background region can be as large as or larger than the foreground; as a result, the background color contributes as much as or more information than the foreground color. To reduce the effect of the background, the proposed method performs an additional segmentation process to extract accurate metadata. In this segmentation process, we consider a robust multi-scale segmentation method to eliminate the background from detected objects of various shapes and scales.
The proposed segmentation method uses the Atrous Spatial Pyramid Pooling (ASPP) model.
The ASPP structure constructs a feature pyramid hierarchy with multiple receptive
fields for extracting detail information of multi-scale objects. A popular model of
ASPP structure is DeepLab v3+ [18]. DeepLab v3+ applies an atrous convolution to extend the receptive field with the
same computational cost as standard convolution. Atrous convolution is defined as:

$$y[i]=\sum_{k}x[i+r\cdot k]\,w[k],$$

where $x$, $w$, and $r$ are the input, the convolution filter, and the atrous rate, respectively. However, the original DeepLab v3+ performs segmentation at the cost of losing detail information of small objects.
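To make the formula concrete, the following is a direct 1-D NumPy transcription of the atrous convolution above; it is a didactic sketch, not the library implementation used inside DeepLab v3+.

```python
import numpy as np

def atrous_conv1d(x, w, r):
    """1-D atrous convolution: y[i] = sum_k x[i + r*k] * w[k].

    x : input signal (1-D array), w : filter taps, r : atrous (dilation) rate.
    With r = 1 this reduces to an ordinary sliding-window convolution; a larger
    r enlarges the receptive field without adding filter weights.
    """
    K = len(w)
    out_len = len(x) - r * (K - 1)          # "valid" output length
    y = np.zeros(out_len)
    for i in range(out_len):
        for k in range(K):
            y[i] += x[i + r * k] * w[k]
    return y

# Example: a 3-tap filter with rate 2 spans 5 input samples.
x = np.arange(10, dtype=float)
print(atrous_conv1d(x, w=np.array([1.0, 0.0, -1.0]), r=2))
```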
To solve this problem, we propose an improved DeepLab v3+ model. In the feature pyramid, wide and narrow receptive fields are suitable for large-scale and small-scale objects, respectively. We exploit this relationship between object scale and receptive field. The improved model is designed by fusing feature maps with small receptive fields extracted from the feature pyramid of the original DeepLab v3+.
Fig. 5 shows the proposed DeepLab v3+ model. The improved DeepLab v3+ deletes the largest feature map in the encoder network of the original model because this feature map, which condenses the semantic information, destroys important details of small objects. In Fig. 5, the green dashed box marks the eliminated feature map. To compensate for the information of large objects and to enhance the detail information of small objects, we generate a novel feature map by concatenating feature maps with different expansion ratios, which preserves the segment information of small objects. After the concatenation, we add a 3 x 3 and a 1 x 1 convolution layer.

In Fig. 5, the yellow dashed box marks the novel feature map. The encoder network of the improved DeepLab v3+ is constructed by replacing the deleted feature map with the novel feature map. The decoder network is the same as in the original DeepLab v3+. The proposed method transfers both the small-object context and the important details to the output feature map used for eliminating the background. Furthermore, the network of the improved DeepLab v3+ is lighter than the original because the proposed network removes a layer.
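A sketch of how this modification could look in PyTorch is given below. Only the overall idea follows the description above (drop the largest-receptive-field atrous branch and replace it with a fused small-rate branch followed by 3 x 3 and 1 x 1 convolutions); the branch rates, channel counts, and omission of the image-pooling branch are assumptions made for illustration, not the authors' exact network.

```python
import torch
import torch.nn as nn

class SmallObjectASPP(nn.Module):
    """Illustrative ASPP head following the modification described in the text.

    The largest-rate atrous branch of the original ASPP is removed; a new branch
    is formed by concatenating the remaining small-rate branches and compressing
    them with 3x3 and 1x1 convolutions. Rates and channel sizes are assumptions.
    """

    def __init__(self, in_ch=2048, out_ch=256, rates=(1, 6, 12)):  # largest rate dropped
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        ])
        # "Novel" branch: concatenate small-rate features, then 3x3 and 1x1 conv.
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),
        )
        # Final projection over all branches plus the fused branch.
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        feats.append(self.fuse(torch.cat(feats, dim=1)))
        return self.project(torch.cat(feats, dim=1))
```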
Fig. 6 shows comparative results of background elimination by the original DeepLab v3+ and the proposed method. The second and third columns of Fig. 6 show the background elimination results for objects detected by a static camera and a non-static camera, respectively. As the yellow boxes in Figs. 6(b) and (e) show, the original DeepLab v3+ misclassifies part of the background as the object. On the other hand, as shown in Figs. 6(c) and (f), the improved DeepLab v3+ correctly eliminates the background in the regions where the original DeepLab v3+ fails.
Fig. 5. The framework of the proposed DeepLab v3+ model.
Fig. 6. Experimental results of comparing the background elimination of the original DeepLab v3+ and the improved DeepLab v3+ in the detected object by static and non-static cameras (a), (d) detected object region, (b), (e) original DeepLab v3+-based background elimination result, (c), (f) improved DeepLab v3+ based background elimination result.
2.3 Metadata Extraction
In this section, we describe the proposed metadata extraction method from the detected
object without background. The proposed metadata consists of the representative color,
size, aspect ratio, and patch of the object. The color metadata is the most effective
object attribute information. Color metadata enhances the accuracy of video search
by distinguishing the colors of the upper body and lower body.
Size and aspect ratio are used for verification when the detector falsely detects a pedestrian. Generally, the height of a pedestrian is greater than the width. Using this characteristic, the size and aspect ratio metadata increase the efficiency of object search by excluding falsely detected objects. The object patch is important metadata in a multi-camera environment.
Most existing object search algorithms use an object patch that includes background information. Unfortunately, in a multi-camera system, each object is captured against a different background. In contrast, the proposed object patch metadata clearly represents the object feature because the suggested patch image is free of background effects.
Color metadata extraction methods are divided into color-chip and model-based methods according to how color names are designated. The color-chip method extracts the representative color based on pre-defined color names, whereas the model-based method learns the color distribution. Although the color-chip method is effective in specific applications, it is limited by illumination conditions. On the other hand, the model-based method learns the color distribution while considering the illumination conditions of a real-world environment.
In consideration of this problem, the proposed method adopts a PLSA-based generative model to extract an object's representative color metadata from images acquired in the real world [19]. The PLSA-based generative model uses a set of images collected from Google to learn representative colors for real-world images. To this end, the method selects 11 representative color names and learns the color distribution from 250 training images collected for each color. The PLSA-based generative model is defined as:
$$M_{C}=\underset{f_{c}}{\arg\max}\;p\left(f_{c}\mid f_{p}\right),\quad f_{p}\in f_{D},$$

where $M_{C}$, $p\left(\cdot\right)$, $f_{p}$, $f_{c}$, and $f_{D}$ are the color metadata, the conditional probability, the object $L^{*}a^{*}b^{*}$ color value, the representative color, and the detected object region, respectively.
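The decision rule can be illustrated as follows, assuming the learned PLSA model has been reduced to a lookup `color_name_probs(lab_pixel)` that returns the posterior over the 11 learned color names; the learning itself follows [19] and is not shown, and the function names are assumptions.

```python
import numpy as np

COLOR_NAMES = ["black", "blue", "brown", "grey", "green", "orange",
               "pink", "purple", "red", "white", "yellow"]   # Table 2

def representative_color(lab_pixels, color_name_probs):
    """Pick the most probable color name for a set of foreground L*a*b* pixels.

    lab_pixels       : (N, 3) array of L*a*b* values inside the object mask.
    color_name_probs : callable mapping one L*a*b* pixel to an 11-vector of
                       posteriors p(color name | pixel), learned offline (PLSA).
    """
    votes = np.zeros(len(COLOR_NAMES))
    for pixel in lab_pixels:
        votes += color_name_probs(pixel)       # accumulate posteriors over pixels
    return COLOR_NAMES[int(np.argmax(votes))]  # representative color M_C
```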
As shown in Fig. 7(c), the PLSA-based generative model extracts a robust representative color regardless of real-world illumination conditions. The size and aspect ratio are estimated from the object edge region, as in Fig. 7(d). Fig. 7(e) shows the object patch metadata, which is extracted by storing the pixels of the object without background. The presented metadata excluding color is defined as:
$$W=f_{Rx}^{S}-f_{Lx}^{S},\qquad H=f_{By}^{S}-f_{Ty}^{S},\qquad M_{P}=f_{D}\setminus f_{B},$$

where $W$, $H$, $f_{Rx}^{S}$, $f_{Lx}^{S}$, $f_{Ty}^{S}$, $f_{By}^{S}$, $M_{P}$, and $f_{B}$ are the width of the object segmentation, the height of the object segmentation, the right, left, top, and bottom bounds of the segmentation region, the object patch, and the background region, respectively; the size and aspect-ratio metadata are then given by $W\times H$ and $W/H$.
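A minimal sketch of the geometric metadata is given below, assuming the background-eliminated object is provided as a binary mask over the detected patch. The function name and the choice of reporting size as the foreground pixel count are assumptions for illustration.

```python
import numpy as np

def shape_metadata(patch, mask):
    """Compute width, height, aspect ratio, size, and a background-free patch.

    patch : (H, W, 3) detected object region.
    mask  : (H, W) boolean foreground mask from the background-elimination step.
    """
    assert mask.any(), "empty segmentation mask"
    ys, xs = np.where(mask)                    # foreground pixel coordinates
    top, bottom = ys.min(), ys.max()           # f_Ty, f_By
    left, right = xs.min(), xs.max()           # f_Lx, f_Rx
    width = right - left + 1                   # W
    height = bottom - top + 1                  # H
    obj_patch = patch.copy()
    obj_patch[~mask] = 0                       # zero out the background region f_B
    return {"size": int(mask.sum()),           # size taken as foreground pixel count
            "aspect_ratio": width / height,    # used to reject non-pedestrians
            "width": int(width), "height": int(height),
            "patch": obj_patch}
```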
The extracted metadata is stored in a database and described in Table 1. The stored colors are the three most frequently extracted colors obtained with the PLSA-based generative model from objects without background. Table 2 lists the 11 color names that the PLSA-based generative model learned.
Table 1. Object metadata configuration.

| Field | Description |
|---|---|
| Camera id | Camera identification number |
| Video id | Video identification number |
| Frame number | Frame number |
| Object metadata | Object identification number, object size, object aspect ratio, object representative color, object patch |
Table 2. Representative color list.

| Color | Index |
|---|---|
| Black | 1 |
| Blue | 2 |
| Brown | 3 |
| Grey | 4 |
| Green | 5 |
| Orange | 6 |
| Pink | 7 |
| Purple | 8 |
| Red | 9 |
| White | 10 |
| Yellow | 11 |
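For clarity, the stored record of Table 1 could be represented as a simple data structure like the following; the field names and types are illustrative, since the actual database schema is not specified in the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ObjectMetadata:
    """One stored record, mirroring the configuration in Table 1."""
    camera_id: int        # camera identification number
    video_id: int         # video identification number
    frame_number: int     # frame in which the object was detected
    object_id: int        # object identification number
    size: int             # object size
    aspect_ratio: float   # object aspect ratio
    colors: List[int]     # three most frequent color indices (Table 2)
    patch: bytes          # background-free object patch (encoded image)
```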
Fig. 7. The presented metadata extraction result (a) detected object region using YOLO v3, (b) object region without background, (c) representative color metadata result, (d) object edge result, (e) object patch metadata result.
3. Experimental Results
In this section, we evaluate the objective performance of the proposed method in diverse environments. To cover diverse environments, public and handcrafted datasets were used in the experiments. The public dataset is DukeMTMC-ReID [20], which provides detected object regions and consists of 16,522 training images and 17,661 test images for object re-identification. To cover harder cases, the handcrafted dataset was acquired using static cameras and dash cameras. The static cameras use a 34° standard lens and a 180° fisheye lens, and the dash camera uses a 170° fisheye lens.
The proposed method evaluates the color extraction performance with background elimination using the public and handcrafted datasets. For the experiments, we additionally trained the YOLO v3 module by gathering and augmenting data from 20,000 pedestrians to enhance the detection performance. The improved DeepLab v3+ for background elimination was trained on the PASCAL VOC 2012 dataset.
3.1 Ablation Study to Adopt Deep Learning-based Object Detector
Table 3 presents the results of the ablation study conducted to select a proper object detector for the proposed method. As shown in Table 3, SSD has the fastest detection speed (54.73 ms on average) but much lower detection performance (0.46 on average) than the other detectors. When objects are small in images captured with a fisheye lens, the detection performance of SSD drops to 0.32 and 0.37.
Faster R-CNN achieves the highest detection performance of 0.84 in the ablation study. However, it also has the slowest detection speed of 2864.28 ms. Although this method shows accurate detection, it is unsuited to the proposed method, which requires real-time object detection. On the other hand, the YOLO v3 detector records a detection performance of 0.78, slightly lower than that of Faster R-CNN, while its average detection speed of 431.15 ms is more than six times faster than Faster R-CNN. This makes it suitable for the proposed method's real-time object detection.
Furthermore, the additionally trained YOLO v3 achieves 0.06 higher detection performance than the original YOLO v3, matching the result of Faster R-CNN. Its average detection speed of 305.27 ms is also faster than both the original YOLO v3 and Faster R-CNN. As a result, the proposed method detects candidate object regions using the additionally trained YOLO v3 detector.
Table 3. Results of the ablation study using existing deep learning detectors (detection performance, with average detection time in parentheses).

| Lens | SSD | Faster R-CNN | YOLO v3 | YOLO v3 with fine-tuning |
|---|---|---|---|---|
| 34° standard lens 1 | 0.50 (57.89 ms) | 0.78 (2690.85 ms) | 0.72 (458.96 ms) | 0.77 (301.48 ms) |
| 34° standard lens 2 | 0.64 (54.15 ms) | 0.94 (2593.20 ms) | 0.89 (299.42 ms) | 0.93 (302.07 ms) |
| 180° fisheye lens 1 | 0.50 (57.52 ms) | 0.88 (2808.78 ms) | 0.79 (432.54 ms) | 0.89 (320.28 ms) |
| 180° fisheye lens 2 | 0.32 (52.46 ms) | 0.75 (3194.20 ms) | 0.74 (510.15 ms) | 0.76 (301.16 ms) |
| 170° fisheye lens | 0.37 (51.68 ms) | 0.84 (3034.37 ms) | 0.75 (454.69 ms) | 0.84 (301.40 ms) |
| Average | 0.46 (54.73 ms) | 0.84 (2864.28 ms) | 0.78 (431.15 ms) | 0.84 (305.27 ms) |
3.2 Metadata Extraction without Background Information
The background elimination methods for metadata extraction were compared using GMM, the original DeepLab v3+, and the enhanced model. Fig. 8 presents the GMM-based metadata extraction results for a static camera with a 34° standard lens. As shown in Figs. 8(d) and (e), the GMM-based background elimination method loses the object region and metadata for small and slow-moving objects because their movement features are insufficient.
Fig. 9 shows the original DeepLab v3+-based background elimination and metadata extraction results for a static camera with a 34° standard lens. The original DeepLab v3+ can eliminate the background around a slow-moving object because the deep learning-based segmentation method does not rely on movement features. However, as shown in Fig. 9(d), the original DeepLab v3+-based background elimination loses and misclassifies part of the object region. The metadata is then lost, as shown in Fig. 9(e), because the feature maps of the original DeepLab v3+ lose semantic information of small objects.
On the other hand, Fig. 10 shows the improved DeepLab v3+-based background elimination and metadata extraction results for a static camera with a 34° standard lens. As shown in Fig. 10(d), the enhanced model accurately eliminates the background in the yellow-box region, where the GMM and the original DeepLab v3+ fail. As shown in Fig. 10(e), this accurate background elimination enhances the metadata because the enhanced model does not use movement features and reinforces the semantic information of small objects.
Fig. 11 shows the background elimination and metadata extraction results for a non-static camera with a 170° fisheye lens. In a non-static camera environment, the GMM-based background elimination method cannot detect an object region. Therefore, we compared the original DeepLab v3+ and the enhanced model to evaluate the background elimination and metadata extraction performance for non-static cameras.
Figs. 11(d) and (e) show the original DeepLab v3+-based background elimination and color metadata extraction results. The original DeepLab v3+ shows low background elimination performance because part of the background is classified as the object, as shown in the yellow box in Fig. 11(d); as a result, it extracts inaccurate metadata, as shown in Fig. 11(e). On the other hand, the improved DeepLab v3+-based method accurately eliminates the background in the region where the original DeepLab v3+ fails, as shown in Fig. 11(f), and consequently extracts accurate metadata, as shown in Fig. 11(g).
Fig. 12 shows the performance of the original and improved DeepLab v3+-based background elimination and color metadata extraction for a static camera using a 180° zoom lens. As shown in Fig. 12(a), this camera produces object distortion. For this reason, the original DeepLab v3+-based method fails to eliminate part of the background, as shown in the yellow box in Fig. 12(d), and fails to extract part of the metadata, as shown in Fig. 12(e). On the other hand, the improved DeepLab v3+ method accurately eliminates the background in the region where the original DeepLab v3+ fails, as shown in the yellow box in Fig. 12(f), and consequently extracts accurate metadata, as shown in Fig. 12(g).
The representative color metadata extraction accuracy of the proposed method is defined as:

$$Accuracy=\frac{1}{f_{t}}\sum\delta\left(f_{gt},f_{c}\right)\times 100\,(\%),$$

where $f_{gt}$, $f_{c}$, and $f_{t}$ respectively represent the color metadata ground truth, the extracted color metadata, and the number of images, and $\delta\left(\cdot,\cdot\right)$ equals 1 when the two colors match and 0 otherwise. Table 4 shows the results of evaluating the effect of the background in the metadata extraction
using the DukeMTMC-ReID dataset. The comparison in Table 4 covers objects with background, original DeepLab v3+-based background elimination, and improved DeepLab v3+-based background elimination. The representative color extraction performance is reported at two levels (upper- and lower-body clothing) because, for objects with background, it is difficult to classify the upper body and lower body.
As shown in Table 4, the object with background yields very low performance: 67.54% and 52.89% on the training dataset and 66.47% and 50.62% on the test dataset. Next, as shown in Table 4, the proposed method's upper-body clothing color extraction accuracy is 3.4% higher than that of the original DeepLab v3+, and its lower-body clothing color extraction accuracy is 1.5% higher on the training dataset. On the test dataset, it is 3.4% better for upper-body clothing and 3.0% better for lower-body clothing.
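As a concrete illustration of the accuracy measure defined above, the sketch below computes it for one clothing region on toy labels; the function name and example values are assumptions, not the authors' evaluation script.

```python
def color_accuracy(ground_truth, extracted):
    """Percentage of images whose extracted representative color matches the
    ground-truth color label (the measure defined in Section 3.2).
    """
    assert len(ground_truth) == len(extracted)
    matches = sum(gt == c for gt, c in zip(ground_truth, extracted))
    return 100.0 * matches / len(ground_truth)   # f_t = number of images

# e.g. upper-body accuracy on a toy set of five images:
print(color_accuracy(["red", "blue", "grey", "black", "white"],
                     ["red", "blue", "grey", "brown", "white"]))  # 80.0
```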
Fig. 13 shows the metadata extraction results for the DukeMTMC-ReID dataset. As shown in Fig. 13(b), the metadata of an object with background includes more background information than object information, so the background dominates the object metadata. The original DeepLab v3+-based method in Fig. 13(c) still contains unnecessary information from part of the background. On the other hand, the improved DeepLab v3+-based method in Fig. 13(d) accurately eliminates the background and then extracts enhanced metadata and shape information.
Table 4. Performance comparison between the original DeepLab v3+ method and our method using DukeMTMC-ReID.

| Method | DukeMTMC-ReID dataset | Upper-body clothing | Lower-body clothing |
|---|---|---|---|
| Object with background | Training dataset | 67.54% | 52.89% |
| | Test dataset | 66.47% | 50.62% |
| Original DeepLab v3+ method | Training dataset | 83.8% | 83.5% |
| | Test dataset | 82.2% | 81.5% |
| Our method | Training dataset | 87.2% | 85.0% |
| | Test dataset | 85.6% | 84.5% |
Fig. 8. GMM-based background elimination and metadata extraction result on a static camera using a 34° zoom lens (a) input frame, (b) GMM frame, (c) ground-truth object, (d) object detection result using GMM, (e) color metadata extraction result.
Fig. 9. Original DeepLab v3+-based background elimination and metadata extraction result on a static camera using a 34° zoom lens (a) input frame, (b) YOLO v3 result, (c) detected object result, (d) original DeepLab v3+-based background elimination result, (e) color metadata extraction result.
Fig. 10. Improved DeepLab v3+-based background elimination and metadata extraction result on a static camera using a 34° zoom lens (a) input frame, (b) YOLO v3 result, (c) detected object result, (d) improved DeepLab v3+-based background elimination result, (e) color metadata extraction result.
Fig. 11. Comparative experimental results of DeepLab v3+ and improved DeepLab v3+ on a non-static camera using a 170° zoom lens (a) input frame, (b) YOLO v3 result, (c) detected object region result, (d) original DeepLab v3+-based background elimination result, (e) original DeepLab v3+-based color metadata extraction result, (f) improved DeepLab v3+-based background elimination result, (g) improved DeepLab v3+-based color metadata extraction result.
Fig. 12. Comparative experimental results of DeepLab v3+ and improved DeepLab v3+ on a static camera using a 180° zoom lens (a) input frame, (b) YOLO v3 result, (c) detected object region result, (d) original DeepLab v3+-based background elimination result, (e) original DeepLab v3+-based color metadata extraction result, (f) improved DeepLab v3+-based background elimination result, (g) color metadata extraction result.
Fig. 13. Metadata extraction result of the DukeMTMC-ReID dataset (a) input object, (b) metadata of object with background, (c) original DeepLab v3+-based metadata result, (d) improved DeepLab v3+-based metadata result.
4. Conclusion
This paper proposed a metadata extraction method to reduce the human effort and storage requirements of real-time and non-real-time monitoring in an intelligent surveillance system. The proposed method adopts the YOLO v3 detector to detect objects of interest. We also proposed an improved DeepLab v3+ that is robust to multiple scales, addressing the original DeepLab v3+'s poor background elimination performance for small objects.
The improved DeepLab v3+ is used to extract an accurate object region without background. Finally, the metadata, consisting of an object's representative color, size, aspect ratio, and patch, is extracted from the object without background. The performance of the proposed method was validated through experiments using public and handcrafted datasets. Consequently, the proposed metadata extraction method can be applied to a wide range of surveillance tasks, such as object search and large public space monitoring in multi-camera and mobile camera-based surveillance systems.
ACKNOWLEDGMENTS
This work was partly supported by a grant from the Institute for Information & communications
Technology Promotion (IITP) funded by the Korea government (MSIT) (2017-0-00250, Intelligent
Defense Boundary Surveillance Technology Using Collaborative Reinforced Learning of
Embedded Edge Camera and Image Analysis) and by the ICT R&D program of MSIP/IITP [2014-0-00077,
development of global multi-target tracking and event prediction techniques based
on real-time large-scale video analysis].
REFERENCES
Garcia-Lamont F., Cervantes J., Lopez A., Rodriguez L., 2018, Segmentation of images
by color features: A survey, Neurocomputing, Vol. 292, pp. 1-27
Yang M., Kpalma K., Ronsin J., 2008, A survey of shape feature extraction techniques,
Pattern Recognition Techniques
Humeau-Heurtier A., 2019, Texture feature extraction methods: A survey, IEEE Access,
Vol. 7, pp. 8975-9000
Kim T., Kim D., Kim P., Kim P., Dec. 2016, The Design of Object-of-Interest Extraction
System Utilizing Metadata Filtering from Moving Object, Journal of KIISE, Vol. 43,
No. 12, pp. 1351-1355
Geronimo D., Kjellstrom H., Aug. 2014, Unsupervised surveillance video retrieval based
on human action and appearance, in Proc. IEEE Int. Conf. Pattern Recognit, pp. 4630-4635
Yuk J.S-C., Wong K-Y.K., Chung R.H-Y., Chow K.P., Chin F. Y-L., Tsang K. S-H., 2007,
Object based surveillance video retrieval system with realtime indexing methodology,
The Proceedings of the International Conference on Image Analysis and Recognition,
pp. 626-637
Paek I., Park C., Ki M., Park K., Paik J., November 2007, Multiple-view object tracking
using metadata, Proc. Int. Conf. Wavelet Analysis and Pattern Recognition, Vol. 1,
No. 1, pp. 12-17
Jung J., Yoon I., Lee S., Paik J., June 2016, Normalized Metadata Generation for Human
Retrieval Using Multiple Video Surveillance Cameras, Sensors, Vol. 16, No. 7, pp.
1-9
Yun S., Yun K., Kim S.W., Yoo Y., Jeong J., 26-29 Aug. 2014, Visual surveillance briefing
system: Event-based video retrieval and summarization, In Proceedings of the 2014
11th IEEE International Conference on Advanced Video and Signal Based Surveillance
(AVSS), pp. 204-209
Chavda H. K., Dhamecha M., 2017, Moving object tracking using PTZ camera in video
surveillance system, 2017 International Conference on Energy, Communication, Data
Analytics and Soft Computing
Hou L., Wan W., Lee K-H., Hwang J-N., Okopal G., Pitton J., 2017, Robust Human Tracking
Based on DPM Constrained Multiple-Kernel from a Moving Camera, Journal of Signal Processing
Systems, Vol. 86, No. 1, pp. 27-39
Wu S., Oreifej O., Shah M., Nov. 2011, Action recognition in videos acquired by a
moving camera using motion decomposition of Lagrangian particle trajectories, IEEE
International Conference on Computer Vision(ICCV), pp. 1419-1426
Seemanthini K., Manjunath S. S., Jan. 2018, Human detection and tracking using HOG
for action recognition, Procedia Computer Science, Vol. 132, pp. 1317-1326
Patel C.I., Garg S., Zaveri T., Banerjee A., Aug. 2018, Human action recognition using
fusion of features for unconstrained video sequences, Computers and Electrical Engineering,
Vol. 70, pp. 284-301
Redmon J., Farhadi A., April 2018, YOLOv3: An Incremental Improvement, arXiv preprint
arXiv:1804.02767
Ren S., He K., Girshick R., Sun J., 2015, Faster R-CNN: Toward Real-Time Object Detection
with Region Proposal Networks, Advances in Neural Information Processing Systems 28
(NIPS)
Liu W., Anguelov D., Erhan D., Szegedy C., Reed S., Fu C. Y., Berg A. C., September
2016, SSD: Single Shot MultiBox Detector, European Conference on Computer Vision(ECCV),
pp. 21-37
Chen L. C., Zhu Y., Papandreou G., Schroff F., Adam H., 2018, Encoder-decoder with
atrous separable convolution for semantic image segmentation, European Conference
on Computer Vision(ECCV)
Van de Weijer J., Schmid C., Larlus D., 2009, Learning color names for real-world
applications, IEEE Transactions on Image Processing, Vol. 18, No. 7, pp. 1512-1523
Zheng Z., Zheng L., Yang Y., 2017, Unlabeled Samples Generated by GAN Improve the
Person Re-Identification Baseline in Vitro, arXiv preprint arXiv:1701.07717
Author
Heungmin Oh was born in Busan, Korea, in 1994. He received a B.S. in computer engineering
from Silla University, Korea, in 2020. Currently, he is pursuing an M.S. in digital
imaging engineering at Chung-Ang University. His research interests include object
segmentation and artificial intelligence.
Minjung Lee was born in Busan, Korea, in 1994. She received a B.S. degree in electronics
engineering from Silla University in 2017 and an M.S. degree in image engineering
in 2019 from Chung-Ang University. She is currently working toward a Ph.D. in image
engineering at Chung-Ang University, Seoul. Her research interest includes geometric
distortion correction, object parsing, and feature extraction.
Hyungtae Kim was born in Seoul, Korea, in 1986. He received a B.S. degree from
the Department of Electrical Engineering of Suwon University in 2012 and an M.S. degree
in image engineering in 2015 from Chung-Ang University. He is currently pursuing a
Ph.D. in image engineering at Chung-Ang University. His research interests include
multi-camera calibration based on large-scale video analysis.
Joonki Paik was born in Seoul, Korea, in 1960. He received a BSc in Control and
Instrumentation Engineering from Seoul National University in 1984. He received an
MSc and a PhD in Electrical Engineering and Computer Science from Northwestern University
in 1987 and 1990, respectively. From 1990 to 1993, he worked at Samsung Electronics,
where he designed image stabilization chip sets for consumer camcorders. Since 1993,
he has been on the faculty at Chung-Ang University, Seoul, Korea, where he is currently
a Professor in the Graduate School of Advanced Imaging Science, Multimedia, and Film.
From 1999 to 2002, he was a Visiting Professor in the Department of Electrical and
Computer Engineering at the University of Tennessee, Knoxville. Dr. Paik was a recipient
of the Chester Sall Award from the IEEE Consumer Electronics Society, the Academic
Award from the Institute of Electronic Engineers of Korea, and the Best Research Professor
Award from Chung-Ang University. He has served the IEEE Consumer Electronics Society
as a member of the editorial board. Since 2005, he has been the head of the National
Research Laboratory in the field of image processing and intelligent systems. In 2008,
he worked as a full-time technical consultant for the System LSI Division at Samsung
Electronics, where he developed various computational photographic techniques, including
an extended depth-of-field (EDoF) system. From 2005 to 2007, he served as Dean of
the Graduate School of Advanced Imaging Science, Multimedia, and Film. From 2005 to
2007, he was Director of the Seoul Future Contents Convergence (SFCC) Cluster established
by the Seoul Research and Business Development (R&BD) Program. Dr. Paik is currently
serving as a member of the Presidential Advisory Board for Scientific/Technical Policy
of the Korean government and is a technical consultant for the Korean Supreme Prosecutor’s
Office for computational forensics.