Seunghyun Lee1, Sungwook Lee1, Byung Cheol Song1*
(Department of Electronic Engineering, Inha University, Incheon, Korea; lsh910703@gmail.com, leesw0623@naver.com, bcsong@inha.ac.kr)
Copyright © The Institute of Electronics and Information Engineers (IEIE)
Keywords
Object detection, Deep learning
1. Introduction
The convolutional neural network (CNN) has proved outstanding in the field of computer
vision and has become an essential technology in various fields, such as image classification
[1,2], object detection [3-5], and image segmentation [6,7]. In addition, the CNN has dramatically improved the performance of object detection,
which is highly useful, to a level that can be applied to actual applications. Accordingly,
recent research on object detection algorithms has extended beyond improving performance to reducing costs through efficient structures [8-11] or new learning methods [12,13] for real-time operation on edge devices.
Most object detection studies focus on improving performance with benchmark datasets
such as PASCAL VOC [14] and MS COCO [15]. In this case, a common approach is to use the same configuration on all datasets
to make the comparison as fair as possible. However, the problem with that approach is that it does not consider the characteristics of the objects in each dataset. In addition, analysis for real-world environments is rarely conducted, which may make such results seem impractical to researchers and developers who apply object detectors to their applications.
This paper analyzes performance change according to the configuration and learning
strategy in a basic object detection algorithm. The analysis confirms that a configuration
considering data distribution is a variable that significantly influences performance
improvement, rather than the complexity of the algorithm, e.g., knowledge distillation
[12]. For example, it was shown that performance can be improved by up to 2.6% AP on the KITTI dataset [16] simply by modifying the configuration. We expect that the results of this study will
provide meaningful insights for researchers who introduce object detection into applications.
2. Related Work
Object detection algorithms are classified into the two-stage object detector (e.g.,
Faster-RCNN [17], Mask-RCNN [18]) and the one-stage object detector (e.g., SSD [3] and Bi-FPN [8]). In this paper, we analyze the one-stage object detector, which is used more frequently in practice due to its lightness.
The first one-stage object detector, the SSD, senses several feature maps in a backbone
CNN architecture. Then, classification and localization are performed by feeding them
to the detection heads. Here, for efficient detection, the SSD utilizes several default
boxes. After that, follow-up techniques to the SSD improve performance by fusing and enhancing the feature maps while maintaining this framework. Representatively, the FPN [4] fuses feature maps of different sizes by adding top-down and bottom-up pathways (so-called
necks) before feeding them to the detection heads. This framework improves semantic
information and location information in large feature maps and small feature maps,
respectively. The Bi-FPN [8] goes further and adds a residual connection to the neck to enable aggregation of
higher-level feature information. Recently, a method of directly injecting semantic
information into an image was proposed [19]. The necks proposed so far can successfully improve the performance of SSDs. However,
because they require complex feature aggregation, they incur burdensome memory costs.
Therefore, when applying the object detection algorithm to a real-time application,
it is necessary to select an appropriate neck considering the cost.
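To make the neck trade-off concrete, the top-down pathway of the FPN can be sketched in a few lines. The following is a minimal illustrative NumPy version written for this explanation, not the implementation used in our experiments: it assumes the channel counts of all pyramid levels already match and omits the 1x1 lateral and 3x3 smoothing convolutions of the real FPN.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(features):
    """Fuse a feature pyramid FPN-style along the top-down pathway.

    `features` is a list of (C, H, W) maps ordered fine-to-coarse, each
    level half the spatial size of the previous one. Each fused level is
    the backbone map plus the upsampled, already-fused coarser level.
    """
    fused = [features[-1]]                       # start from the coarsest map
    for f in reversed(features[:-1]):
        fused.append(f + upsample2x(fused[-1]))  # upsample and add (top-down)
    return list(reversed(fused))                 # return in fine-to-coarse order

# Toy 3-level pyramid with matching channel counts
p3, p4, p5 = np.ones((8, 16, 16)), np.ones((8, 8, 8)), np.ones((8, 4, 4))
outs = fpn_top_down([p3, p4, p5])  # outs[0] now mixes all three levels
```

Because every level participates in fusion, all pyramid maps must be held in memory at once, which is the memory cost discussed above.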
Another way to improve the performance of the object detector is to utilize external
information. For example, there are ways to improve the backbone network to provide
a better representation [20,21] via self-supervision. One way is to train on large datasets, such as ImageNet, through self-supervision tasks to obtain better visual representations. Another example is
knowledge distillation [12]. Knowledge distillation is a technique to improve performance by transferring information
from a more extensive network to a smaller target network. Since they do not change
the architecture, they have the advantage of no additional cost for inference. However,
knowledge distillation utilizes two networks during training, which incurs a high cost. In addition,
if there is a great difference between the source domain and the target domain of
the external information, the expected improvement may not be achieved.
3. Method
In this section, we introduce various methods to improve the performance of the object
detectors analyzed experimentally in this paper.
The neck type is one of the essential factors determining the performance of one-stage
object detectors. Therefore, it is necessary to decide which neck to apply, carefully
staying within the available cost budget. We used the most basic structures of the SSD and the FPN for our experiments; their brief structures are shown in Fig. 1. Since performance differences have already been discussed in previous papers, we
analyze the differences in feature maps with two necks through knowledge distillation.
Knowledge distillation is a technique to improve performance by transferring information
from one network to another. We selected and analyzed feature transfer techniques,
i.e., AT [22], FT [23], SP [24], and SKD [25] to apply knowledge distillation to the object detector. The feature sensing points
and structure diagram for knowledge distillation are shown in Fig. 2. To distill knowledge, we sense feature maps F for input to the detection heads in
the SSD and the FPN and input them to each detection module, D. Then, the distilled
knowledge from the teacher and student network is trained through objective function
O. This process is expressed as follows:
$\mathcal{L}=O\left(D\left(F_{T}\right), D\left(F_{S}\right)\right)$
where $F_{T}$ and $F_{S}$ denote the sensed feature maps of the teacher and student networks, respectively.
For example, AT normalizes the teacher's and the student's feature maps and minimizes their L2 distance. Therefore, each component can be expressed as $D(x)=x /\|x\|$ and $O(x, y)=\|x-y\|_{2}^{2}$. For other techniques, please refer to the previous papers.
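As a concrete illustration, AT can be sketched as follows. This is a minimal NumPy version written for this explanation: `at_map` plays the role of D (a channel-wise squared sum followed by L2 normalization, following the AT paper), and `at_loss` plays the role of O.

```python
import numpy as np

def at_map(feature):
    """D(x) = x / |x|: spatial attention map, L2-normalized.

    `feature` is a (C, H, W) map; channels are collapsed by a squared
    sum, so teacher and student may have different channel counts.
    """
    a = (feature ** 2).sum(axis=0).ravel()   # (H*W,) spatial attention
    return a / (np.linalg.norm(a) + 1e-12)   # normalize to unit length

def at_loss(f_teacher, f_student):
    """O(x, y) = ||x - y||_2^2 between normalized attention maps."""
    d = at_map(f_teacher) - at_map(f_student)
    return float((d ** 2).sum())

rng = np.random.default_rng(0)
f_t = rng.standard_normal((64, 8, 8))   # teacher feature map (more channels)
f_s = rng.standard_normal((16, 8, 8))   # student feature map
loss = at_loss(f_t, f_s)                # added to the detection loss in training
```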
For the datasets, we adopted PASCAL VOC [14] and MS COCO [15], which are widely used in the object detection field. KITTI [16], which has a similar environment to real-world data, was also used. The KITTI dataset
consists of road images at 1224×370 pixels that were captured by a single camera. Therefore,
we can say that KITTI is closer to a real environment than the existing benchmark
datasets. In addition, since there is a big difference between KITTI and the PASCAL
VOC and MS COCO data distributions, we checked the performance differences based on
training strategies through KITTI.
Fig. 1. One-stage object detector frameworks.
Fig. 2. Examples of knowledge distillation frameworks.
4. Experiments
In this section, we present the results of each comparison experiment. First, we checked
the effect of knowledge distillation based on the architecture and neck type for PASCAL
VOC. The experimental results in Table 1 show that most combinations failed to improve performance when the architecture and
neck type were different. Especially when the architecture was different, there was
little or no performance improvement, regardless of the neck. Therefore, if there
is no teacher network with the same structure, knowledge distillation is a bad choice
for performance improvement.
Next, to further observe the effect of knowledge distillation, we show experimental
results from MS COCO. In this experiment, we adopted the FPN, which was the most sensitive
in previous experiments. Our results are in Table 2. First, by replacing the teacher network with ResNet-101, we checked the tendency
when the size difference between teacher and student increased. Experimental results
showed significant performance improvement, even when there was a big difference in
the size of the architecture, which is contrary to results from the classification
task [27]. We presume this occurs because the neck mitigates the difference in feature distributions between backbone networks with a large capacity gap, thereby preventing over-constraint. Also, in the ResNet-50 and MobileNet-v2 pairing, there was improvement
in all techniques, unlike with PASCAL VOC. In particular, it is noteworthy that SKD
showed the highest degree of improvement. SKD is a knowledge distillation technique
designed to prevent over- and mis-constraints. This characteristic provides relatively
high performance by suppressing mis-constraints caused by architecture differences.
Nevertheless, it is difficult to find a clear trend in the results of each experiment,
unlike the classification task. These results suggest that existing knowledge distillation techniques focus on classification, and even though they generalize well across classification datasets, they might not across detection datasets.
Next, the transfer learning effect based on the dataset was verified. ResNet-50-FPN
was used for the experiments. We used ImageNet and MS COCO as the source datasets,
with KITTI as the target dataset. As methods of transferring external information, we used fine-tuning and AT (denoted -AT), the most generally effective of the knowledge distillation techniques above. For the pre-trained networks, we used a public ImageNet model pre-trained via semi-weakly supervised learning (ImageNet-SWSL) [28] and an MS COCO pre-trained model.
The experimental results are shown in Table 3. Compared to general benchmark datasets, objects in the KITTI dataset are sparse. For this reason, we can see that the ImageNet pre-trained model had difficulty learning. Therefore, to obtain high performance, a network pre-trained on a detection dataset, such as MS COCO, is required. However, because KITTI images are wide, each object takes on a very narrow form when the image is resized to a square input. Therefore, although using the MS COCO model as a pre-trained network improved performance, it is difficult to expect additional improvement through knowledge distillation. Consequently, this difference in data distribution must be considered when transferring knowledge.
Table 1 Object detection performance based on architectures, necks, and knowledge distillation algorithms on PASCAL VOC. The numerical value indicates mean average precision (mAP); bold indicates the best result.
| Teacher | Student | Neck | Baseline | AT | FT | SP | SKD |
|---|---|---|---|---|---|---|---|
| ResNet-50 | ResNet-18 | FPN→FPN | 72.88 | **73.55** | 73.28 | 73.02 | 73.14 |
| ResNet-50 | ResNet-18 | SSD→SSD | 69.00 | **69.24** | 68.98 | 68.73 | 68.95 |
| ResNet-50 | ResNet-18 | FPN→SSD | 69.00 | 69.04 | **69.12** | 68.74 | 68.98 |
| ResNet-50 | MobileNet-v2 | FPN→FPN | 71.24 | **71.87** | 71.34 | 71.26 | 71.26 |
| ResNet-50 | MobileNet-v2 | SSD→SSD | **64.99** | 63.80 | 64.23 | 64.81 | 64.82 |
| ResNet-50 | MobileNet-v2 | FPN→SSD | **64.99** | 64.51 | 64.49 | 64.81 | 64.82 |
|
Table 2 Object detection performance based on architectures, necks, and knowledge distillation algorithms on MS COCO. AP@0.5 indicates IOU is higher than 0.5; AR columns contain mean results; subscripts S, M, and L signify small, medium, and large objects, respectively; bold indicates the best result.
| Teacher | Student | Neck | Distill. | AP@0.5 | AP | APS | APM | APL | AR | ARS | ARM | ARL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ResNet-101 | ResNet-18 | FPN | Baseline | 39.5 | 21.9 | 5.9 | 23.5 | 35.7 | 34.2 | 10.9 | 36.5 | 53.7 |
| | | | AT | **40.9** | **23.0** | 6.1 | **24.7** | **37.8** | **35.2** | 11.1 | 37.5 | **55.4** |
| | | | FT | 40.2 | 22.4 | **6.2** | 24.4 | 37.2 | 34.9 | **11.5** | 37.2 | 55.2 |
| | | | SP | 40.5 | 22.7 | 6.1 | 24.5 | 37.1 | 35.1 | 11.4 | 37.5 | 55.2 |
| | | | SKD | 40.6 | 22.8 | **6.2** | 24.6 | 37.6 | 35.0 | **11.5** | **37.6** | 55.3 |
| ResNet-50 | MobileNet-v2 | FPN | Baseline | 37.4 | 20.1 | 4.6 | 21.6 | 32.7 | 32.7 | 9.7 | 34.5 | 51.3 |
| | | | AT | 38.1 | 20.4 | 5.3 | 22.1 | **34.3** | 33.2 | 10.0 | 35.2 | 52.9 |
| | | | FT | 37.8 | 20.3 | 5.1 | 21.8 | 33.8 | 33.1 | 10.0 | 35.1 | 52.5 |
| | | | SP | 38.0 | **20.5** | **5.4** | 22.1 | 33.9 | **33.3** | **10.1** | 35.3 | 52.2 |
| | | | SKD | **38.2** | **20.5** | **5.4** | **22.2** | 34.0 | **33.3** | **10.1** | **35.5** | **53.0** |
|
Table 3 Performance based on type of external information and default box configuration.
| External information | Default box configuration | AP@0.5 |
|---|---|---|
| ImageNet | MS COCO | 72.1 |
| ImageNet-SWSL | MS COCO | 72.8 |
| MS COCO | MS COCO | 76.6 |
| MS COCO-AT | MS COCO | 77.4 |
| MS COCO | KITTI | 80.0 |
|
Fig. 3. Pseudo code for default box configurations for the MS COCO and KITTI datasets. Note that KITTI uses only tall default boxes.
We also found that when training datasets with unusual data distributions, such as
KITTI, adjusting the configuration is more effective than applying special training
strategies like knowledge distillation. In order to fit object shapes in KITTI, we
modified default boxes as shown in Fig. 3. Those experimental results are the last
row of Table 3. Even though only the default boxes changed, it gave higher performance than knowledge
distillation. Examples of the detection results are in Fig. 4, showing that a detector
with modified default boxes detects narrow objects quite well.
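A minimal generator for such default boxes can be sketched as follows. The grid layout follows the common SSD convention; the specific scale and aspect ratios here are illustrative assumptions, not the exact values used in our experiments.

```python
import math

def default_boxes(feature_size, scale, aspect_ratios):
    """Generate normalized default boxes (cx, cy, w, h) on a square grid.

    One box per aspect ratio is centered on each cell of a
    feature_size x feature_size map. Ratio r yields w = scale*sqrt(r)
    and h = scale/sqrt(r), so r < 1 produces tall boxes.
    """
    boxes = []
    for i in range(feature_size):
        for j in range(feature_size):
            cx, cy = (j + 0.5) / feature_size, (i + 0.5) / feature_size
            for r in aspect_ratios:
                boxes.append((cx, cy, scale * math.sqrt(r), scale / math.sqrt(r)))
    return boxes

# Generic configuration: a mix of wide, square, and tall boxes
coco_boxes = default_boxes(4, 0.2, [0.5, 1.0, 2.0])
# KITTI-style configuration: tall boxes only (w < h), to fit narrow objects
kitti_boxes = default_boxes(4, 0.2, [1.0 / 3.0, 0.5])
```

Restricting the ratio set costs nothing at inference time, which is why this configuration change compares so favorably with knowledge distillation in Table 3.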
The experimental results in this paper show that techniques widely used to improve classification performance do not generalize well to detection tasks. On the other hand, configuring the setup with the data distribution in mind yielded effective performance improvements. Therefore, to improve the performance of a detection algorithm, a parameter search based on analysis of the target dataset should be conducted, rather than using the general configuration as-is.
5. Conclusion
In this paper, we analyzed how to improve performance practically while adjusting
various learning strategies and configurations. Experimental results showed that sometimes
a heuristic and data-oriented approach is more effective than a complex learning strategy.
This trend must be considered in the practical application stage. We expect this paper
will provide meaningful insights to researchers or developers who want to apply object
detectors to actual applications.
ACKNOWLEDGMENTS
This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (2020-0-01389, Artificial Intelligence Convergence Research Center (Inha University)) and by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2021-0-02052) supervised by the IITP.
REFERENCES
[1] Kaiming He et al., "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[2] Mark Sandler et al., "MobileNetV2: Inverted residuals and linear bottlenecks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Wei Liu et al., "SSD: Single shot multibox detector," European Conference on Computer Vision, Springer, Cham, 2016.
[4] Tsung-Yi Lin et al., "Feature pyramid networks for object detection," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[5] Jun Ho Choi et al., "Multi-scale Non-local Feature Enhancement Network for Robust Small-object Detection," IEIE Transactions on Smart Processing & Computing, Vol. 9, No. 4, pp. 274-283, 2020.
[6] Olaf Ronneberger, Philipp Fischer, Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, Cham, 2015.
[7] Jonathan Long, Evan Shelhamer, Trevor Darrell, "Fully convolutional networks for semantic segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[8] Mingxing Tan, Ruoming Pang, Quoc V. Le, "EfficientDet: Scalable and efficient object detection," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.
[9] Burak Uzkent, Christopher Yeh, Stefano Ermon, "Efficient object detection in large images using deep reinforcement learning," Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2020.
[10] Tran Ngoc Quang, Seunghyun Lee, Byung Cheol Song, "Object Detection Using Improved Bi-Directional Feature Pyramid Network," Electronics, Vol. 10, No. 6, p. 746, 2021.
[11] Donggeun Kim et al., "Real-time Robust Object Detection Using an Adjacent Feature Fusion-based Single Shot Multibox Detector," IEIE Transactions on Smart Processing & Computing, Vol. 9, No. 1, pp. 22-27, 2020.
[12] Guobin Chen et al., "Learning efficient object detection models with knowledge distillation," Advances in Neural Information Processing Systems, 2017.
[13] Kuan-Hung Shih, Ching-Te Chiu, Yen-Yu Pu, "Real-time object detection via pruning and a concatenated multi-feature assisted region proposal network," ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019.
[14] Mark Everingham et al., "The PASCAL visual object classes challenge: A retrospective," International Journal of Computer Vision, Vol. 111, No. 1, pp. 98-136, 2015.
[15] Tsung-Yi Lin et al., "Microsoft COCO: Common objects in context," European Conference on Computer Vision, Springer, Cham, 2014.
[16] Andreas Geiger, Philip Lenz, Raquel Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, 2012.
[17] Shaoqing Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, Vol. 28, pp. 91-99, 2015.
[18] Kaiming He et al., "Mask R-CNN," Proceedings of the IEEE International Conference on Computer Vision, 2017.
[19] Yanwei Pang et al., "Efficient featurized image pyramid network for single shot detector," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019.
[20] Jean-Bastien Grill et al., "Bootstrap Your Own Latent: A new approach to self-supervised learning," Neural Information Processing Systems, 2020.
[21] Chunyuan Li et al., "Efficient Self-supervised Vision Transformers for Representation Learning," arXiv preprint arXiv:2106.09785, 2021.
[22] Sergey Zagoruyko, Nikos Komodakis, "Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer," arXiv preprint arXiv:1612.03928, 2016.
[23] Jangho Kim, SeongUk Park, Nojun Kwak, "Paraphrasing complex network: Network compression via factor transfer," Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018.
[24] Frederick Tung, Greg Mori, "Similarity-preserving knowledge distillation," Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019.
[25] Defang Chen et al., "Cross-layer distillation with semantic calibration," Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, No. 8, 2021.
[26] Adam Paszke et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, Vol. 32, pp. 8026-8037, 2019.
[27] Seyed Iman Mirzadeh et al., "Improved knowledge distillation via teacher assistant," Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, No. 4, 2020.
[28] I. Zeki Yalniz et al., "Billion-scale semi-supervised learning for image classification," arXiv preprint arXiv:1905.00546, 2019.
Author
Seunghyun Lee received a B.S. in electronic engineering from Inha University, Incheon,
South Korea, in 2017, where he is currently working toward a combined degree in electronic
engineering. His research interests include computer vision and machine learning.
Sungwook Lee received a B.S. in electronic engineering from Inha University, Incheon,
South Korea, in 2020, where he is currently working toward an M.S. in electrical
and computer engineering. His research interests include computer vision and deep
learning.
Byung Cheol Song received a B.S., an M.S., and a Ph.D. in electrical engineering
from the Korea Advanced Institute of Science and Technology, Daejeon, South Korea,
in 1994, 1996, and 2001, respectively. From 2001 to 2008, he was a Senior Engineer
with the Digital Media R&D Center, Samsung Electronics Company Ltd., Suwon, South
Korea. In 2008, he joined the Department of Electronic Engineering, Inha University,
Incheon, South Korea, and is currently a professor. His research interests include
the general areas of image processing and computer vision.