
1. Department of Electronic Engineering, Inha University, Incheon, Korea (lsh910703@gmail.com, leesw0623@naver.com, bcsong@inha.ac.kr)



Keywords: Object detection, deep learning

1. Introduction

The convolutional neural network (CNN) has proved outstanding in the field of computer vision and has become an essential technology in various fields, such as image classification [1,2], object detection [3-5], and image segmentation [6,7]. In particular, the CNN has dramatically improved the performance of object detection to a level that can be applied to actual applications. Accordingly, recent object detection research has extended beyond improving performance to reducing costs, through efficient structures [8-11] or new learning methods [12,13], for real-time operation on edge devices.

Most object detection studies focus on improving performance on benchmark datasets such as PASCAL VOC [14] and MS COCO [15]. A common approach is to use the same configuration on all datasets to make comparisons as fair as possible. However, this approach does not consider the characteristics of the objects in each dataset. In addition, analysis for real-world environments is rarely conducted, which can make such studies seem impractical to researchers and developers who apply object detectors to their own applications.

This paper analyzes performance changes according to the configuration and learning strategy of a basic object detection algorithm. The analysis confirms that a configuration considering the data distribution influences performance far more than the complexity of the algorithm, e.g., knowledge distillation [12]. For example, we show that performance on the KITTI dataset [16] can be improved by up to 2.6% AP simply by modifying the configuration. We expect the results of this study to provide meaningful insights for researchers who introduce object detection into applications.

2. Related Work

Object detection algorithms are classified into two-stage object detectors (e.g., Faster-RCNN [17] and Mask-RCNN [18]) and one-stage object detectors (e.g., SSD [3] and Bi-FPN [8]). In this paper, we analyze the one-stage object detector, which is used more frequently in practice due to its light weight.

The first one-stage object detector, the SSD, extracts several feature maps from a backbone CNN architecture. Classification and localization are then performed by feeding them to the detection heads. For efficient detection, the SSD utilizes several default boxes. Follow-up techniques improve performance by fusing and enhancing the feature maps while maintaining this framework. Representatively, the FPN [4] fuses feature maps of different sizes by adding top-down and bottom-up pathways (so-called necks) before feeding them to the detection heads. This framework improves semantic information and location information in large feature maps and small feature maps, respectively. The Bi-FPN [8] goes further, adding residual connections to the neck to enable aggregation of higher-level feature information. Recently, a method of directly injecting semantic information into an image was proposed [19]. The necks proposed so far can successfully improve the performance of SSD-style detectors. However, because they require complex feature aggregation, they incur burdensome memory costs. Therefore, when applying an object detection algorithm in a real-time application, it is necessary to select an appropriate neck considering the cost.
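As an illustration of the top-down pathway described above, the following is a minimal NumPy sketch of FPN-style feature fusion. The shapes, the channel count, and the omission of the lateral 1x1 convolutions are simplifications for clarity, not the implementation used in the experiments.

```python
import numpy as np

def upsample2x(fmap):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fpn_topdown(features):
    """Fuse backbone feature maps top-down: the smallest (most semantic)
    map is upsampled and added to the next-larger one, and so on.
    `features` is ordered large -> small; fused maps come back in the
    same order. Lateral 1x1 convolutions are omitted for brevity."""
    fused = [features[-1]]  # start from the smallest, most semantic map
    for fmap in reversed(features[:-1]):
        fused.append(fmap + upsample2x(fused[-1]))
    return list(reversed(fused))  # back to large -> small order

# Illustrative pyramid: 256-channel maps at 64x64, 32x32, and 16x16.
pyramid = [np.random.rand(256, s, s) for s in (64, 32, 16)]
fused = fpn_topdown(pyramid)
print([f.shape for f in fused])  # each fused level keeps its original shape
```

Each fused level is then fed to its own detection head, which is why the neck's memory cost grows with the number and size of the pyramid levels.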

Another way to improve the performance of an object detector is to utilize external information. For example, the backbone network can be improved to provide a better representation via self-supervision [20,21]: training on large datasets, such as ImageNet, with self-supervised tasks yields better visual representations. Another example is knowledge distillation [12], a technique that improves performance by transferring information from a larger network to a smaller target network. Since these methods do not change the architecture, they add no cost at inference. However, knowledge distillation trains two networks, which incurs a high training cost. In addition, if there is a great difference between the source domain of the external information and the target domain, the expected improvement may not be achieved.

3. Method

In this section, we introduce the various performance-improvement methods for object detectors that are analyzed experimentally in this paper.

The neck type is one of the essential factors determining the performance of one-stage object detectors. Therefore, it is necessary to carefully decide which neck to apply while staying within the available cost budget. We used the most basic structures, the SSD and the FPN, for our experiments; their brief structures are shown in Fig. 1. Since their performance differences have already been discussed in previous papers, we instead analyze the differences in their feature maps through knowledge distillation.

Knowledge distillation is a technique to improve performance by transferring information from one network to another. To apply knowledge distillation to the object detector, we selected and analyzed the feature transfer techniques AT [22], FT [23], SP [24], and SKD [25]. The feature sensing points and the structure for knowledge distillation are shown in Fig. 2. To distill knowledge, we extract the feature maps F that are input to the detection heads in the SSD and the FPN and feed them to a distillation module D. The knowledge distilled from the teacher and student networks is then matched through an objective function O. This process is expressed as follows:

$L^{KD} = O(D(F^{teacher}), D(F^{student}))$ (1)

For example, AT normalizes the teacher's and the student's feature maps and minimizes the L2 distance between them. Therefore, each component can be expressed as $D(x)=x/\|x\|_2$ and $O(x,y)=\|x-y\|_2^2$. For the other techniques, please refer to the original papers.
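The AT objective above can be sketched in a few lines of NumPy. Following the original AT formulation, the sketch first collapses each (C, H, W) feature map into a spatial attention map (channel-wise sum of squares) before normalizing, so teacher and student may have different channel counts; the shapes are illustrative.

```python
import numpy as np

def at_map(fmap):
    """D(x): collapse a (C, H, W) feature map into a flattened spatial
    attention vector (sum of squared activations over channels), then
    L2-normalize it."""
    att = (fmap ** 2).sum(axis=0).ravel()  # (H*W,) attention vector
    return att / np.linalg.norm(att)

def at_loss(f_teacher, f_student):
    """O(x, y) = ||x - y||_2^2 between normalized attention maps."""
    return float(((at_map(f_teacher) - at_map(f_student)) ** 2).sum())

teacher = np.random.rand(256, 8, 8)
student = np.random.rand(128, 8, 8)  # channel counts may differ
print(at_loss(teacher, student))     # scalar distillation loss
```

Because only the spatial layout of activation energy is matched, this loss can be added to the detection loss without any architectural change to the student.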

For the datasets, we adopted PASCAL VOC [14] and MS COCO [15], which are widely used in the object detection field. KITTI [16], which has an environment similar to real-world data, was also used. The KITTI dataset consists of 1224x370 road images captured by a single camera, so we can say that KITTI is closer to a real environment than the existing benchmark datasets. In addition, since the KITTI data distribution differs greatly from those of PASCAL VOC and MS COCO, we used KITTI to check performance differences based on training strategies.

Fig. 1. One-stage object detector frameworks.
../../Resources/ieie/IEIESPC.2021.11.1.34/fig1.png
Fig. 2. Examples of knowledge distillation frameworks.
../../Resources/ieie/IEIESPC.2021.11.1.34/fig2.png

4. Experiments

In this section, we present the results of each comparison experiment. First, we checked the effect of knowledge distillation based on the architecture and neck type on PASCAL VOC. The experimental results in Table 1 show that most combinations failed to improve performance when the architecture or neck type differed between teacher and student. Especially when the architectures differed, there was little or no improvement, regardless of the neck. Therefore, if no teacher network with the same structure is available, knowledge distillation is a poor choice for performance improvement.

Next, to further observe the effect of knowledge distillation, we show experimental results from MS COCO. In this experiment, we adopted the FPN, which was the most sensitive in previous experiments. Our results are in Table 2. First, by replacing the teacher network with ResNet-101, we checked the tendency when the size difference between teacher and student increased. Experimental results showed significant performance improvement, even when there was a big difference in the size of the architecture, which is contrary to results from the classification task [27]. This phenomenon is presumed to occur because over-constraints are prevented by mitigating the feature distribution difference in backbone networks with large-capacity differences owing to the neck. Also, in the ResNet-50 and MobileNet-v2 pairing, there was improvement in all techniques, unlike with PASCAL VOC. In particular, it is noteworthy that SKD showed the highest degree of improvement. SKD is a knowledge distillation technique designed to prevent over- and mis-constraints. This characteristic provides relatively high performance by suppressing mis-constraints caused by architecture differences. Nevertheless, it is difficult to find a clear trend in the results of each experiment, unlike the classification task. These results suggest that existing knowledge distillation techniques focus on classification, and even though they are generalized in classification datasets, they might not be in detection datasets.

Next, the transfer learning effect based on the dataset was verified. ResNet-50-FPN was used for the experiments, with ImageNet and MS COCO as the source datasets and KITTI as the target dataset. To transfer external information, we used fine-tuning and AT (denoted -AT), the most generally effective of the knowledge distillation techniques above. For the pre-trained networks, we used a public ImageNet model pre-trained with semi-weakly supervised learning (ImageNet-SWSL) [28] and an MS COCO pre-trained model.

The experimental results are shown in Table 3. Compared to the general benchmark datasets, objects in the KITTI dataset are sparse. For this reason, we can see that the ImageNet pre-trained model had difficulty learning. Therefore, to obtain high performance, a network pre-trained on a detection dataset, such as MS COCO, is required. However, because KITTI images are very wide, each object becomes very narrow when the image is resized to the network input. Therefore, although an MS COCO pre-trained network improved performance, additional improvement through knowledge distillation is difficult to expect. When transferring knowledge, this difference in data distribution must be considered.
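A back-of-the-envelope calculation illustrates how strongly the wide KITTI frames are distorted. It assumes a hypothetical square 300x300 detector input (SSD-style; the exact input size is an assumption, not stated in the experiments):

```python
# KITTI frames are 1224x370. Resizing to a square detector input
# scales widths and heights by different factors, so objects that
# were roughly square become tall and narrow.
src_w, src_h = 1224, 370
dst = 300                       # hypothetical square input size
w_scale = dst / src_w           # horizontal shrink factor
h_scale = dst / src_h           # vertical shrink factor
distortion = h_scale / w_scale  # == src_w / src_h, independent of dst
print(round(distortion, 2))     # ~3.31x narrower than the original shape
```

Note that the distortion equals the source aspect ratio regardless of the chosen square size, which is why default boxes tuned for roughly square MS COCO objects fit KITTI poorly.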

Table 1 Object detection performance based on architectures, necks, and knowledge distillation algorithms on PASCAL VOC. The numerical value indicates mean average precision (mAP); bold indicates the best result.

| Teacher | Student | Neck | Baseline | AT | FT | SP | SKD |
|---|---|---|---|---|---|---|---|
| ResNet-50 | ResNet-18 | FPN→FPN | 72.88 | 73.55 | 73.28 | 73.02 | 73.14 |
| | | SSD→SSD | 69.00 | 69.24 | 68.98 | 68.73 | 68.95 |
| | | FPN→SSD | 69.00 | 69.04 | 69.12 | 68.74 | 68.98 |
| ResNet-50 | MobileNet-v2 | FPN→FPN | 71.24 | 71.87 | 71.34 | 71.26 | 71.26 |
| | | SSD→SSD | 64.99 | 63.80 | 64.23 | 64.81 | 64.82 |
| | | FPN→SSD | 64.99 | 64.51 | 64.49 | 64.81 | 64.82 |

Table 2 Object detection performance based on architectures, necks, and knowledge distillation algorithms on MS COCO. AP@0.5 indicates IOU is higher than 0.5; AR columns contain mean results; subscripts S, M, and L signify small, medium, and large objects, respectively; bold indicates the best result.

| Teacher | Student | Neck | Distill. | AP@0.5 | AP | AP_S | AP_M | AP_L | AR | AR_S | AR_M | AR_L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-101 | ResNet-18 | FPN | Baseline | 39.5 | 21.9 | 5.9 | 23.5 | 35.7 | 34.2 | 10.9 | 36.5 | 53.7 |
| | | | AT | 40.9 | 23.0 | 6.1 | 24.7 | 37.8 | 35.2 | 11.1 | 37.5 | 55.4 |
| | | | FT | 40.2 | 22.4 | 6.2 | 24.4 | 37.2 | 34.9 | 11.5 | 37.2 | 55.2 |
| | | | SP | 40.5 | 22.7 | 6.1 | 24.5 | 37.1 | 35.1 | 11.4 | 37.5 | 55.2 |
| | | | SKD | 40.6 | 22.8 | 6.2 | 24.6 | 37.6 | 35.0 | 11.5 | 37.6 | 55.3 |
| ResNet-50 | MobileNet-v2 | FPN | Baseline | 37.4 | 20.1 | 4.6 | 21.6 | 32.7 | 32.7 | 9.7 | 34.5 | 51.3 |
| | | | AT | 38.1 | 20.4 | 5.3 | 22.1 | 34.3 | 33.2 | 10.0 | 35.2 | 52.9 |
| | | | FT | 37.8 | 20.3 | 5.1 | 21.8 | 33.8 | 33.1 | 10.0 | 35.1 | 52.5 |
| | | | SP | 38.0 | 20.5 | 5.4 | 22.1 | 33.9 | 33.3 | 10.1 | 35.3 | 52.2 |
| | | | SKD | 38.2 | 20.5 | 5.4 | 22.2 | 34.0 | 33.3 | 10.1 | 35.5 | 53.0 |

Table 3 Performance based on type of external information and default box configuration.

| External information | Default box configuration | AP@0.5 |
|---|---|---|
| ImageNet | MS COCO | 72.1 |
| ImageNet-SWSL | MS COCO | 72.8 |
| MS COCO | MS COCO | 76.6 |
| MS COCO-AT | MS COCO | 77.4 |
| MS COCO | KITTI | 80.0 |

Fig. 3. Pseudo code for default box configurations for the MS COCO and KITTI datasets. Note that KITTI uses only tall default boxes.
../../Resources/ieie/IEIESPC.2021.11.1.34/fig3.png

We also found that when training on datasets with unusual data distributions, such as KITTI, adjusting the configuration is more effective than applying special training strategies like knowledge distillation. To fit the object shapes in KITTI, we modified the default boxes as shown in Fig. 3; those experimental results are in the last row of Table 3. Even though only the default boxes changed, this gave higher performance than knowledge distillation. Examples of the detection results in Fig. 4 show that a detector with modified default boxes detects narrow objects quite well.
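The default box modification can be sketched as follows. The scale and aspect-ratio values here are hypothetical, not the exact configuration of Fig. 3; the point is that a KITTI-style configuration keeps only tall boxes (w/h ≤ 1), while an MS COCO-style configuration mixes wide, square, and tall boxes.

```python
import numpy as np

def default_boxes(scale, aspect_ratios):
    """Generate (w, h) pairs for one feature-map level. For aspect
    ratio a = w/h, the SSD convention is w = scale*sqrt(a) and
    h = scale/sqrt(a), so the box area stays scale**2."""
    return [(scale * np.sqrt(a), scale / np.sqrt(a)) for a in aspect_ratios]

# MS COCO-style configuration: wide, square, and tall boxes.
coco_boxes = default_boxes(0.2, [1/3, 1/2, 1, 2, 3])

# KITTI-style configuration: tall boxes only, matching the narrow
# shapes objects take after the wide frames are resized.
kitti_boxes = default_boxes(0.2, [1/3, 1/2, 1])

for w, h in kitti_boxes:
    assert h >= w  # every KITTI box is at least as tall as it is wide
```

Because default boxes only affect target assignment during training and decoding at inference, this change adds no runtime cost, unlike distillation, which requires training a second network.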

The experimental results in this paper show that widely used techniques for improving classification performance do not generalize well to detection tasks. On the other hand, a configuration set up with the data distribution in mind showed an effective performance improvement. Therefore, to improve the performance of a detection algorithm, a parameter search based on analysis of the target dataset should be conducted, rather than using the general configuration as-is.

5. Conclusion

In this paper, we analyzed how to improve performance practically while adjusting various learning strategies and configurations. Experimental results showed that sometimes a heuristic and data-oriented approach is more effective than a complex learning strategy. This trend must be considered in the practical application stage. We expect this paper will provide meaningful insights to researchers or developers who want to apply object detectors to actual applications.

ACKNOWLEDGMENTS

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01389, Artificial Intelligence Convergence Research Center (Inha University)), and by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2021-0-02052) supervised by the IITP.

REFERENCES

[1] K. He et al., 2016, "Deep residual learning for image recognition," Proc. IEEE Conference on Computer Vision and Pattern Recognition.
[2] M. Sandler et al., 2018, "MobileNetV2: Inverted residuals and linear bottlenecks," Proc. IEEE Conference on Computer Vision and Pattern Recognition.
[3] W. Liu et al., 2016, "SSD: Single shot multibox detector," European Conference on Computer Vision, Springer, Cham.
[4] T.-Y. Lin et al., 2017, "Feature pyramid networks for object detection," Proc. IEEE Conference on Computer Vision and Pattern Recognition.
[5] J. H. Choi et al., 2020, "Multi-scale non-local feature enhancement network for robust small-object detection," IEIE Transactions on Smart Processing & Computing, vol. 9, no. 4, pp. 274-283.
[6] O. Ronneberger, P. Fischer, and T. Brox, 2015, "U-Net: Convolutional networks for biomedical image segmentation," International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham.
[7] J. Long, E. Shelhamer, and T. Darrell, 2015, "Fully convolutional networks for semantic segmentation," Proc. IEEE Conference on Computer Vision and Pattern Recognition.
[8] M. Tan, R. Pang, and Q. V. Le, 2020, "EfficientDet: Scalable and efficient object detection," Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[9] B. Uzkent, C. Yeh, and S. Ermon, 2020, "Efficient object detection in large images using deep reinforcement learning," Proc. IEEE/CVF Winter Conference on Applications of Computer Vision.
[10] T. N. Quang, S. Lee, and B. C. Song, 2021, "Object detection using improved bi-directional feature pyramid network," Electronics, vol. 10, no. 6, p. 746.
[11] D. Kim et al., 2020, "Real-time robust object detection using an adjacent feature fusion-based single shot multibox detector," IEIE Transactions on Smart Processing & Computing, vol. 9, no. 1, pp. 22-27.
[12] G. Chen et al., 2017, "Learning efficient object detection models with knowledge distillation," Advances in Neural Information Processing Systems.
[13] K.-H. Shih, C.-T. Chiu, and Y.-Y. Pu, 2019, "Real-time object detection via pruning and a concatenated multi-feature assisted region proposal network," ICASSP 2019, IEEE.
[14] M. Everingham et al., 2015, "The PASCAL visual object classes challenge: A retrospective," International Journal of Computer Vision, vol. 111, no. 1, pp. 98-136.
[15] T.-Y. Lin et al., 2014, "Microsoft COCO: Common objects in context," European Conference on Computer Vision, Springer, Cham.
[16] A. Geiger, P. Lenz, and R. Urtasun, 2012, "Are we ready for autonomous driving? The KITTI vision benchmark suite," IEEE Conference on Computer Vision and Pattern Recognition.
[17] S. Ren et al., 2015, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, vol. 28, pp. 91-99.
[18] K. He et al., 2017, "Mask R-CNN," Proc. IEEE International Conference on Computer Vision.
[19] Y. Pang et al., 2019, "Efficient featurized image pyramid network for single shot detector," Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition.
[20] J.-B. Grill et al., 2020, "Bootstrap your own latent: A new approach to self-supervised learning," Neural Information Processing Systems.
[21] C. Li et al., 2021, "Efficient self-supervised vision transformers for representation learning," arXiv preprint arXiv:2106.09785.
[22] S. Zagoruyko and N. Komodakis, 2016, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928.
[23] J. Kim, S. Park, and N. Kwak, 2018, "Paraphrasing complex network: Network compression via factor transfer," Proc. 32nd International Conference on Neural Information Processing Systems.
[24] F. Tung and G. Mori, 2019, "Similarity-preserving knowledge distillation," Proc. IEEE/CVF International Conference on Computer Vision.
[25] D. Chen et al., 2021, "Cross-layer distillation with semantic calibration," Proc. AAAI Conference on Artificial Intelligence, vol. 35, no. 8.
[26] A. Paszke et al., 2019, "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, pp. 8026-8037.
[27] S. I. Mirzadeh et al., 2020, "Improved knowledge distillation via teacher assistant," Proc. AAAI Conference on Artificial Intelligence, vol. 34, no. 4.
[28] I. Z. Yalniz et al., 2019, "Billion-scale semi-supervised learning for image classification," arXiv preprint arXiv:1905.00546.

Author

Seunghyun Lee
../../Resources/ieie/IEIESPC.2021.11.1.34/au1.png

Seunghyun Lee received a B.S. in electronic engineering from Inha University, Incheon, South Korea, in 2017, where he is currently working toward a combined degree in electronic engineering. His research interests include computer vision and machine learning.

Sungwook Lee
../../Resources/ieie/IEIESPC.2021.11.1.34/au2.png

Sungwook Lee received a B.S. in electronic engineering from Inha University, Incheon, South Korea, in 2020, where he is currently working toward an M.S. in electrical and computer engineering. His research interests include computer vision and deep learning.

Byung Cheol Song
../../Resources/ieie/IEIESPC.2021.11.1.34/au3.png

Byung Cheol Song received a B.S., an M.S., and a Ph.D. in electrical engineering from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea, in 1994, 1996, and 2001, respectively. From 2001 to 2008, he was a Senior Engineer with the Digital Media R&D Center, Samsung Electronics Company Ltd., Suwon, South Korea. In 2008, he joined the Department of Electronic Engineering, Inha University, Incheon, South Korea, and is currently a professor. His research interests include the general areas of image processing and computer vision.