Seunghyun Lee1, Sungwook Lee1, Byung Cheol Song1*
(Department of Electronic Engineering, Inha University, Incheon, Korea; lsh910703@gmail.com, leesw0623@naver.com, bcsong@inha.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Object detection, Deep learning
1. Introduction
The convolutional neural network (CNN) has proved outstanding in the field of computer
vision and has become an essential technology in various fields, such as image classification
[1,2], object detection [3-5], and image segmentation [6,7]. In addition, the CNN has dramatically improved the performance of object detection,
which is highly useful, to a level that can be applied to actual applications. Accordingly,
recent research on object detection algorithms has extended beyond improving performance to reducing costs through efficient structures [8-11] or new learning methods [12,13] for real-time operation on edge devices.
Most object detection studies focus on improving performance with benchmark datasets
such as PASCAL VOC [14] and MS COCO [15]. In this case, a common approach is to use the same configuration on all datasets
to make the comparison as fair as possible. However, the problem with that approach is that it does not consider the characteristics of the objects in each dataset. In addition, analysis for real-world environments is rarely conducted, which may make such results seem impractical to researchers and developers who apply object detectors to their applications.
This paper analyzes performance change according to the configuration and learning
strategy in a basic object detection algorithm. The analysis confirms that a configuration
considering data distribution is a variable that significantly influences performance
improvement, rather than the complexity of the algorithm, e.g., knowledge distillation
[12]. For example, it was shown that performance can be improved by up to 2.6% AP on the KITTI dataset [16] simply by modifying the configuration. We expect that the results of this study will
provide meaningful insights for researchers who introduce object detection into applications.
2. Related Work
Object detection algorithms are classified into the two-stage object detector (e.g.,
Faster-RCNN [17], Mask-RCNN [18]) and the one-stage object detector (e.g., SSD [3] and Bi-FPN [8]). In this paper, we analyze the one-stage object detector, which is used more frequently in practice due to its lightness.
The first one-stage object detector, the SSD, senses several feature maps in a backbone
CNN architecture. Then, classification and localization are performed by feeding them
to the detection heads. Here, for efficient detection, the SSD utilizes several default
boxes. After that, follow-up techniques to the SSD improve performance by fusing and enhancing the feature maps while maintaining this framework. Representatively, the FPN [4] fuses feature maps of different sizes by adding top-down and bottom-up pathways (so-called
necks) before feeding them to the detection heads. This framework improves semantic
information and location information in large feature maps and small feature maps,
respectively. The Bi-FPN [8] goes further and adds a residual connection to the neck to enable aggregation of
higher-level feature information. Recently, a method of directly injecting semantic
information into an image was proposed [19]. The necks proposed so far can successfully improve the performance of SSDs. However,
because they require complex feature aggregation, they incur burdensome memory costs.
Therefore, when applying the object detection algorithm to a real-time application,
it is necessary to select an appropriate neck considering the cost.
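To make the neck trade-off concrete, the top-down pathway of the FPN can be sketched in a few lines. The following is a minimal illustrative NumPy version written for this explanation, not the implementation used in our experiments: it assumes the channel counts of all pyramid levels already match and omits the 1x1 lateral and 3x3 smoothing convolutions of the real FPN.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features):
    """Fuse a feature pyramid FPN-style along the top-down pathway.

    `features` is a list of (C, H, W) maps ordered fine-to-coarse, each
    level half the spatial size of the previous one. Each fused level is
    the backbone map plus the upsampled, already-fused coarser level.
    """
    fused = [features[-1]]                       # start from the coarsest map
    for f in reversed(features[:-1]):
        fused.append(f + upsample2x(fused[-1]))  # upsample and add (top-down)
    return list(reversed(fused))                 # return in fine-to-coarse order

# Toy 3-level pyramid with matching channel counts
p3, p4, p5 = np.ones((8, 16, 16)), np.ones((8, 8, 8)), np.ones((8, 4, 4))
outs = fpn_top_down([p3, p4, p5])  # outs[0] now mixes all three levels
```

Because every level participates in fusion, all pyramid maps must be held in memory at once, which is the memory cost discussed above.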
Another way to improve the performance of the object detector is to utilize external
information. For example, there are ways to improve the backbone network to provide
a better representation [20,21] via self-supervision. One way is to train on large datasets, such as ImageNet, through self-supervision tasks to obtain better visual representations. Another example is
knowledge distillation [12]. Knowledge distillation is a technique to improve performance by transferring information
from a more extensive network to a smaller target network. Since they do not change
the architecture, they have the advantage of no additional cost for inference. However,
knowledge distillation utilizes two networks during training, which incurs a high cost. In addition,
if there is a great difference between the source domain and the target domain of
the external information, the expected improvement may not be achieved.
3. Method
In this section, we introduce various methods to improve the performance of the object
detectors analyzed experimentally in this paper.
The neck type is one of the essential factors determining the performance of one-stage
object detectors. Therefore, it is necessary to decide which neck to apply, carefully
staying within the available cost budget. We used the most basic structures of the SSD and the FPN for our experiments; their brief structures are shown in Fig. 1. Since performance differences have already been discussed in previous papers, we
analyze the differences in feature maps with two necks through knowledge distillation.
Knowledge distillation is a technique to improve performance by transferring information
from one network to another. We selected and analyzed feature transfer techniques,
i.e., AT [22], FT [23], SP [24], and SKD [25] to apply knowledge distillation to the object detector. The feature sensing points
and structure diagram for knowledge distillation are shown in Fig. 2. To distill knowledge, we sense feature maps F for input to the detection heads in
the SSD and the FPN and input them to each detection module, D. Then, the distilled
knowledge from the teacher and student network is trained through objective function
O. This process is expressed as follows:
$\mathcal{L}=O\left(D\left(F_{T}\right), D\left(F_{S}\right)\right)$
where $F_{T}$ and $F_{S}$ denote the sensed feature maps of the teacher and student networks, respectively.
For example, AT normalizes the teacher's and the student's feature maps and minimizes their L2 distance. Therefore, each component can be expressed as $D(x)=x /\|x\|$ and $O(x, y)=\|x-y\|_{2}^{2}$. For other techniques, please refer to the previous papers.
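As a concrete illustration, AT can be sketched as follows. This is a minimal NumPy version written for this explanation: `at_map` plays the role of D (a channel-wise squared sum followed by L2 normalization, following the AT paper), and `at_loss` plays the role of O.

```python
import numpy as np

def at_map(feature):
    """D(x) = x / |x|: spatial attention map, L2-normalized.

    `feature` is a (C, H, W) map; channels are collapsed by a squared
    sum, so teacher and student may have different channel counts.
    """
    a = (feature ** 2).sum(axis=0).ravel()   # (H*W,) spatial attention
    return a / (np.linalg.norm(a) + 1e-12)   # normalize to unit length

def at_loss(f_teacher, f_student):
    """O(x, y) = ||x - y||_2^2 between normalized attention maps."""
    d = at_map(f_teacher) - at_map(f_student)
    return float((d ** 2).sum())

rng = np.random.default_rng(0)
f_t = rng.standard_normal((64, 8, 8))   # teacher feature map (more channels)
f_s = rng.standard_normal((16, 8, 8))   # student feature map
loss = at_loss(f_t, f_s)                # added to the detection loss in training
```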
For the datasets, we adopted PASCAL VOC [14] and MS COCO [15], which are widely used in the object detection field. KITTI [16], which has a similar environment to real-world data, was also used. The KITTI dataset
consists of road images at 1224×370 pixels that were captured by a single camera. Therefore,
we can say that KITTI is closer to a real environment than the existing benchmark
datasets. In addition, since there is a big difference between KITTI and the PASCAL
VOC and MS COCO data distributions, we checked the performance differences based on
training strategies through KITTI.
Fig. 1. One-stage object detector frameworks.
Fig. 2. Examples of knowledge distillation frameworks.
4. Experiments
In this section, we present the results of each comparison experiment. First, we checked
the effect of knowledge distillation based on the architecture and neck type for PASCAL
VOC. The experimental results in Table 1 show that most combinations failed to improve performance when the architecture and
neck type were different. Especially when the architecture was different, there was
little or no performance improvement, regardless of the neck. Therefore, if there
is no teacher network with the same structure, knowledge distillation is a bad choice
for performance improvement.
Next, to further observe the effect of knowledge distillation, we show experimental
results from MS COCO. In this experiment, we adopted the FPN, which was the most sensitive
in previous experiments. Our results are in Table 2. First, by replacing the teacher network with ResNet-101, we checked the tendency
when the size difference between teacher and student increased. Experimental results
showed significant performance improvement, even when there was a big difference in
the size of the architecture, which is contrary to results from the classification
task [27]. We presume this occurs because the neck mitigates the difference in feature distributions between backbone networks with a large capacity gap, thereby preventing over-constraint. Also, in the ResNet-50 and MobileNet-v2 pairing, there was improvement
in all techniques, unlike with PASCAL VOC. In particular, it is noteworthy that SKD
showed the highest degree of improvement. SKD is a knowledge distillation technique
designed to prevent over- and mis-constraints. This characteristic provides relatively
high performance by suppressing mis-constraints caused by architecture differences.
Nevertheless, it is difficult to find a clear trend in the results of each experiment,
unlike the classification task. These results suggest that existing knowledge distillation techniques focus on classification, and even though they generalize well across classification datasets, they might not across detection datasets.
Next, the transfer learning effect based on the dataset was verified. ResNet-50-FPN
was used for the experiments. We used ImageNet and MS COCO as the source datasets,
with KITTI as the target dataset. As methods of transferring external information, we used fine-tuning and AT (denoted -AT), the most generally effective of the knowledge distillation techniques above. For the pre-trained networks, we used a public ImageNet model pre-trained via semi-weakly supervised learning (ImageNet-SWSL) [28] and an MS COCO pre-trained model.
The experimental results are shown in Table 3. Compared to general benchmark datasets, objects in the KITTI dataset are sparse. For this reason, we can see that the ImageNet pre-trained model had difficulty learning. Therefore, to obtain high performance, a network pre-trained on a detection dataset, such as MS COCO, is required. However, because KITTI images are wide, each object takes on a very narrow form when the image is resized to a square input. Therefore, although using the MS COCO model as a pre-trained network improved performance, it is difficult to expect additional improvement through knowledge distillation. Consequently, this difference in data distribution must be considered when transferring knowledge.
Table 1 Object detection performance based on architectures, necks, and knowledge distillation algorithms on PASCAL VOC. The numerical value indicates mean average precision (mAP); bold indicates the best result.
| Teacher | Student | Neck | Baseline | AT | FT | SP | SKD |
|---|---|---|---|---|---|---|---|
| ResNet-50 | ResNet-18 | FPN→FPN | 72.88 | **73.55** | 73.28 | 73.02 | 73.14 |
| ResNet-50 | ResNet-18 | SSD→SSD | 69.00 | **69.24** | 68.98 | 68.73 | 68.95 |
| ResNet-50 | ResNet-18 | FPN→SSD | 69.00 | 69.04 | **69.12** | 68.74 | 68.98 |
| ResNet-50 | MobileNet-v2 | FPN→FPN | 71.24 | **71.87** | 71.34 | 71.26 | 71.26 |
| ResNet-50 | MobileNet-v2 | SSD→SSD | **64.99** | 63.80 | 64.23 | 64.81 | 64.82 |
| ResNet-50 | MobileNet-v2 | FPN→SSD | **64.99** | 64.51 | 64.49 | 64.81 | 64.82 |
|
Table 2 Object detection performance based on architectures, necks, and knowledge distillation algorithms on MS COCO. AP@0.5 indicates IOU is higher than 0.5; AR columns contain mean results; subscripts S, M, and L signify small, medium, and large objects, respectively; bold indicates the best result.
| Teacher | Student | Neck | Distill. | AP@0.5 | AP | APS | APM | APL | AR | ARS | ARM | ARL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-101 | ResNet-18 | FPN | Baseline | 39.5 | 21.9 | 5.9 | 23.5 | 35.7 | 34.2 | 10.9 | 36.5 | 53.7 |
| | | | AT | **40.9** | **23.0** | 6.1 | **24.7** | **37.8** | **35.2** | 11.1 | 37.5 | **55.4** |
| | | | FT | 40.2 | 22.4 | **6.2** | 24.4 | 37.2 | 34.9 | **11.5** | 37.2 | 55.2 |
| | | | SP | 40.5 | 22.7 | 6.1 | 24.5 | 37.1 | 35.1 | 11.4 | 37.5 | 55.2 |
| | | | SKD | 40.6 | 22.8 | **6.2** | 24.6 | 37.6 | 35.0 | **11.5** | **37.6** | 55.3 |
| ResNet-50 | MobileNet-v2 | FPN | Baseline | 37.4 | 20.1 | 4.6 | 21.6 | 32.7 | 32.7 | 9.7 | 34.5 | 51.3 |
| | | | AT | 38.1 | 20.4 | 5.3 | 22.1 | **34.3** | 33.2 | 10.0 | 35.2 | 52.9 |
| | | | FT | 37.8 | 20.3 | 5.1 | 21.8 | 33.8 | 33.1 | 10.0 | 35.1 | 52.5 |
| | | | SP | 38.0 | **20.5** | **5.4** | 22.1 | 33.9 | **33.3** | **10.1** | 35.3 | 52.2 |
| | | | SKD | **38.2** | **20.5** | **5.4** | **22.2** | 34.0 | **33.3** | **10.1** | **35.5** | **53.0** |
|
Table 3 Performance based on type of external information and default box configuration.
| External information | Default box configuration | AP@0.5 |
|---|---|---|
| ImageNet | MS COCO | 72.1 |
| ImageNet-SWSL | MS COCO | 72.8 |
| MS COCO | MS COCO | 76.6 |
| MS COCO-AT | MS COCO | 77.4 |
| MS COCO | KITTI | 80.0 |
|
Fig. 3. Pseudo code for default box configurations for the MS COCO and KITTI datasets. Note that KITTI uses only tall default boxes.
We also found that when training datasets with unusual data distributions, such as
KITTI, adjusting the configuration is more effective than applying special training
strategies like knowledge distillation. In order to fit object shapes in KITTI, we
modified default boxes as shown in Fig. 3. Those experimental results are the last
row of Table 3. Even though only the default boxes changed, it gave higher performance than knowledge
distillation. Examples of the detection results are in Fig. 4, showing that a detector
with modified default boxes detects narrow objects quite well.
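A minimal generator for such default boxes can be sketched as follows. The grid layout follows the common SSD convention; the specific scale and aspect ratios here are illustrative assumptions, not the exact values used in our experiments.

```python
import math

def default_boxes(feature_size, scale, aspect_ratios):
    """Generate normalized default boxes (cx, cy, w, h) on a square grid.

    One box per aspect ratio is centered on each cell of a
    feature_size x feature_size map. Ratio r yields w = scale*sqrt(r)
    and h = scale/sqrt(r), so r < 1 produces tall boxes.
    """
    boxes = []
    for i in range(feature_size):
        for j in range(feature_size):
            cx, cy = (j + 0.5) / feature_size, (i + 0.5) / feature_size
            for r in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(r), scale / math.sqrt(r)))
    return boxes

# Generic configuration: a mix of wide, square, and tall boxes
coco_boxes = default_boxes(4, 0.2, [0.5, 1.0, 2.0])
# KITTI-style configuration: tall boxes only (w < h), to fit narrow objects
kitti_boxes = default_boxes(4, 0.2, [1.0 / 3.0, 0.5])
```

Restricting the ratio set costs nothing at inference time, which is why this configuration change compares so favorably with knowledge distillation in Table 3.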
The experimental results in this paper show that techniques widely used to improve classification performance do not generalize well to detection tasks. On the other hand, configuring the setup with the data distribution in mind yielded effective performance improvements. Therefore, to improve the performance of a detection algorithm, a parameter search based on analysis of the target dataset should be conducted, rather than using the general configuration as-is.
5. Conclusion
In this paper, we analyzed how to improve performance practically while adjusting
various learning strategies and configurations. Experimental results showed that sometimes
a heuristic and data-oriented approach is more effective than a complex learning strategy.
This trend must be considered in the practical application stage. We expect this paper
will provide meaningful insights to researchers or developers who want to apply object
detectors to actual applications.
ACKNOWLEDGMENTS
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01389, Artificial Intelligence Convergence Research Center (Inha University)) and by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2021-0-02052) supervised by the IITP.
REFERENCES
[1] Kaiming He et al., "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[2] Mark Sandler et al., "MobileNetV2: Inverted residuals and linear bottlenecks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Wei Liu et al., "SSD: Single shot multibox detector," European Conference on Computer Vision, Springer, Cham, 2016.
[4] Tsung-Yi Lin et al., "Feature pyramid networks for object detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[5] Jun Ho Choi et al., "Multi-scale Non-local Feature Enhancement Network for Robust Small-object Detection," IEIE Transactions on Smart Processing & Computing, Vol. 9, No. 4, pp. 274-283, 2020.
[6] Olaf Ronneberger, Philipp Fischer, Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2015.
[7] Jonathan Long, Evan Shelhamer, Trevor Darrell, "Fully convolutional networks for semantic segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[8] Mingxing Tan, Ruoming Pang, Quoc V. Le, "EfficientDet: Scalable and efficient object detection," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[9] Burak Uzkent, Christopher Yeh, Stefano Ermon, "Efficient object detection in large images using deep reinforcement learning," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020.
[10] Tran Ngoc Quang, Seunghyun Lee, Byung Cheol Song, "Object Detection Using Improved Bi-Directional Feature Pyramid Network," Electronics, Vol. 10, No. 6, p. 746, 2021.
[11] Donggeun Kim et al., "Real-time Robust Object Detection Using an Adjacent Feature Fusion-based Single Shot Multibox Detector," IEIE Transactions on Smart Processing & Computing, Vol. 9, No. 1, pp. 22-27, 2020.
[12] Guobin Chen et al., "Learning efficient object detection models with knowledge distillation," Advances in Neural Information Processing Systems, 2017.
[13] Kuan-Hung Shih, Ching-Te Chiu, Yen-Yu Pu, "Real-time object detection via pruning and a concatenated multi-feature assisted region proposal network," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019.
[14] Mark Everingham et al., "The PASCAL visual object classes challenge: A retrospective," International Journal of Computer Vision, Vol. 111, No. 1, pp. 98-136, 2015.
[15] Tsung-Yi Lin et al., "Microsoft COCO: Common objects in context," European Conference on Computer Vision, Springer, Cham, 2014.
[16] Andreas Geiger, Philip Lenz, Raquel Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012.
[17] Shaoqing Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, Vol. 28, pp. 91-99, 2015.
[18] Kaiming He et al., "Mask R-CNN," Proceedings of the IEEE International Conference on Computer Vision, 2017.
[19] Yanwei Pang et al., "Efficient featurized image pyramid network for single shot detector," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[20] Jean-Bastien Grill et al., "Bootstrap Your Own Latent: A new approach to self-supervised learning," Neural Information Processing Systems, 2020.
[21] Chunyuan Li et al., "Efficient Self-supervised Vision Transformers for Representation Learning," arXiv preprint arXiv:2106.09785, 2021.
[22] Sergey Zagoruyko, Nikos Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016.
[23] Jangho Kim, SeongUk Park, Nojun Kwak, "Paraphrasing complex network: Network compression via factor transfer," Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[24] Frederick Tung, Greg Mori, "Similarity-preserving knowledge distillation," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[25] Defang Chen et al., "Cross-layer distillation with semantic calibration," Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, No. 8, 2021.
[26] Adam Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, Vol. 32, pp. 8026-8037, 2019.
[27] Seyed Iman Mirzadeh et al., "Improved knowledge distillation via teacher assistant," Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 4, 2020.
[28] I. Zeki Yalniz et al., "Billion-scale semi-supervised learning for image classification," arXiv preprint arXiv:1905.00546, 2019.
Author
Seunghyun Lee received a B.S. in electronic engineering from Inha University, Incheon,
South Korea, in 2017, where he is currently working toward a combined degree in electronic
engineering. His research interests include computer vision and machine learning.
Sungwook Lee received a B.S. in electronic engineering from Inha University, Incheon,
South Korea, in 2020, where he is currently working toward an M.S. in electrical
and computer engineering. His research interests include computer vision and deep
learning.
Byung Cheol Song received a B.S., an M.S., and a Ph.D. in electrical engineering
from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea,
in 1994, 1996, and 2001, respectively. From 2001 to 2008, he was a Senior Engineer
with the Digital Media R&D Center, Samsung Electronics Company Ltd., Suwon, South
Korea. In 2008, he joined the Department of Electronic Engineering, Inha University,
Incheon, South Korea, and is currently a professor. His research interests include
the general areas of image processing and computer vision.