
(Department of Computer Engineering, Kwangwoon University, Seoul 01897, Korea)



Keywords: Multi-object tracking, Swimming dataset, Scene detection network, cIoU

1. Introduction

Recent advances in artificial intelligence have significantly impacted various fields, with multi-object tracking (MOT) being one of the most widely used applications in image processing. MOT involves monitoring multiple objects simultaneously within continuous frames. It is utilized in embedded systems across diverse fields to track objects and detect abnormal behavior [1], and plays a crucial role in autonomous driving [2].

An essential objective is to identify and differentiate multiple objects. Achieving this requires not only effective detection of specific objects but also continuous, prolonged tracking of their presence. It is crucial that a detected object remain distinguishable from other objects throughout the tracking process. If an object is temporarily undetected during tracking, or cannot be confirmed as the same object from the previous frame, it is recognized as a new object, significantly diminishing tracking accuracy.

Leveraging diverse design strategies, deep learning networks have demonstrated outstanding performance in object detection. Their superior performance ensures robust capabilities in simultaneously and persistently tracking multiple objects. In this tracking-by-detection approach, the tracker receives the coordinates of each object from the detector in every frame. State-of-the-art (SOTA) tracking networks employ one-stage detectors [3,4] to enhance accuracy, capitalizing on their speed and high precision [5-7].

While existing deep learning tracking networks primarily focus on videos featuring static surroundings and well-defined object boundaries, such as pedestrians on roads [8,9], this paper introduces a unique dataset centered on swimming [10,11]. In this dataset, the surrounding environment is highly dynamic, and occlusion is particularly severe. Several distinctions set the swimming dataset apart from conventional MOT datasets. First, frequent occlusion is caused not only by other objects but also by dynamic surroundings such as water and intense spray. Second, even mild spray can introduce noise, posing challenges for object detection. Third, swimmers exhibit substantial posture deformations owing to rapid movements, especially during departure and turns, and even while swimming in a straight line. These distinctive characteristics are responsible for a more significant degradation in tracking accuracy compared to other datasets.

In this paper, the FairMOT [12] tracking-by-detection method is employed as a baseline architecture. In addition to FairMOT’s object detection and Re-ID responsible for adjusting object IDs, a scene detection network accounting for environmental changes to object detection and tracking is integrated into CenterNet [4] (the feature extraction network of FairMOT). This additional network adjusts the value of the IoU metric at the object detection stage and the predicted object location of the Kalman filter.

In object detection, a modified IoU metric (cIoU) [13] transforms the existing IoU metric to consider not only the overlap between the detected bounding box and the predicted bounding box but also additional factors such as center-point distance and aspect ratio. In cIoU, a scene-specific metric is achieved by assigning weights to each input parameter based on the scene information detected for each frame. Performance is further enhanced by adaptively adjusting the ratio at which the Kalman filter's predicted object location is reflected in the bounding box detected by the network.

2. Related Work

2.1 CenterNet

CenterNet serves as the detection branch of the FairMOT network. It employs an anchor-free method for object detection, where the key point representing the center of the object is identified as the peak point in a heatmap generated using a Gaussian kernel. The height and width of the object are then determined based on these center points. The object's bounding box is predicted using the calculated height and width, and the offset of the center point is predicted in order to mitigate errors introduced by the stride applied during key point generation.

CenterNet's architecture utilizes backbone networks such as ResNet [14], Hourglass [15], and DLA [16] to predict key points on a heatmap. Features from these backbone networks are transmitted to each head, which predicts heatmap, box size, and offset. Each head trains values through respective losses, contributing to the final prediction of the object's bounding box.
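To make this decoding step concrete, the following is a minimal PyTorch sketch of anchor-free decoding in the CenterNet style: local heatmap peaks are isolated with a 3 × 3 max-pool, and the box size and sub-pixel offset are gathered at each peak. The function name, the top-k cutoff, and the assumption that the heatmap has already passed through a sigmoid are ours, not the reference implementation's.

```python
import torch
import torch.nn.functional as F

def decode_centernet(heatmap, wh, offset, k=10):
    """Decode top-k boxes from CenterNet-style head outputs.

    heatmap: (n, c, h, w) class heatmap, assumed already sigmoid-activated
    wh:      (n, 2, h, w) predicted box width/height at each location
    offset:  (n, 2, h, w) sub-pixel offset of each center point
    """
    # 3x3 max-pool NMS: keep only local peaks of the heatmap
    pooled = F.max_pool2d(heatmap, kernel_size=3, stride=1, padding=1)
    heatmap = heatmap * (pooled == heatmap).float()

    n, c, h, w = heatmap.shape
    scores, inds = heatmap.view(n, -1).topk(k)   # top-k over classes and space
    classes = torch.div(inds, h * w, rounding_mode="floor")
    spatial = inds % (h * w)
    ys = torch.div(spatial, w, rounding_mode="floor").float()
    xs = (spatial % w).float()

    # gather per-peak box size and apply the sub-pixel center correction
    wh, offset = wh.view(n, 2, -1), offset.view(n, 2, -1)
    bw = wh[:, 0, :].gather(1, spatial)
    bh = wh[:, 1, :].gather(1, spatial)
    xs = xs + offset[:, 0, :].gather(1, spatial)
    ys = ys + offset[:, 1, :].gather(1, spatial)

    # boxes are in feature-map coordinates; multiplying by the output
    # stride (4 in CenterNet) maps them back to input-image coordinates
    boxes = torch.stack([xs - bw / 2, ys - bh / 2,
                         xs + bw / 2, ys + bh / 2], dim=2)
    return boxes, scores, classes
```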

2.2 Deep SORT

In Deep SORT [17] (a tracking algorithm based on the tracking-by-detection method), improvements were made to address the ID-switching problem in SORT [18]. Re-ID features output by a network are applied to the matching algorithm, enhancing the handling of bounding boxes from the detection network.

Deep SORT tracks each object's location by computing the distance between the Kalman filter's prediction for an existing track and the detections in the current frame. For this purpose, Deep SORT performs cascade matching and IoU matching. Cascade matching assesses appearance similarity through cosine distance and assigns matches using the Hungarian algorithm. IoU matching then computes a cost of 1 − IoU for the remaining unmatched detections and predictions, with matching again done through the Hungarian algorithm based on these cost values.
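The IoU-matching stage described above reduces to building a cost matrix of 1 − IoU and solving the assignment with the Hungarian algorithm. The sketch below assumes boxes in (x1, y1, x2, y2) format and a gating threshold of 0.7 (close to Deep SORT's default, though the exact value here is an assumption).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def iou_match(track_boxes, det_boxes, max_cost=0.7):
    """Match Kalman-predicted track boxes to detections with cost 1 - IoU."""
    if not track_boxes or not det_boxes:
        return []
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)        # Hungarian algorithm
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]
```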

2.3 Swimmer Tracking

In one study, three modules were developed for swimmer detection and tracking: background modeling, swimmer detection, and swimmer tracking [19]. Swimmers and the swimming-pool background are separated using background modeling, and pixels belonging to a swimmer are grouped through a mean-shift clustering algorithm. Following this image pre-processing, swimmers are detected using a cascaded boosting learning algorithm (a type of machine learning). However, several issues arose in the detection results. Accuracy varied with the pixel size of the detector; in [19], using a detector size of 10 × 10 pixels, the lowest observed hit rate was 56%. Detection accuracy was also affected by dynamic background changes. Moreover, incorrect detections were attributed to spray, a characteristic of the swimming setting. To address false detections caused by spray, an appropriate threshold had to be set and pre-processing implemented. In [19], a significant difference was observed between pre-processed and non-pre-processed images, with bounding boxes overlapping even in the pre-processed image.

The study in [11] proposed a network system designed to automatically count a swimmer's strokes from overhead race videos (ORVs). ORVs are captured for viewing or analytical purposes with broadcast or professional camera equipment, encompassing scenarios with and without camera movement. In [11], a network based on the VGG16 architecture [20] was trained to predict swimming strokes. The stroke cycle was modeled as a sine curve in the range [0, 1], with 1 corresponding to the stroke passing the ear and 0 to it passing the body. Additionally, YOLOv3 [21] was employed to detect the swimmers. For a comparative analysis based on network size, both YOLOv3-416-tiny using Darknet15 as the backbone and YOLOv3-416 using Darknet53 as the backbone were utilized and compared. A model pretrained on the COCO dataset was employed for training, and the SORT algorithm was implemented for tracking. The tracking results revealed a high multi-object tracking accuracy (MOTA) score of 89.34% for the training set, but the MOTA score for the test set was significantly lower at 11.21%. While the system performed well for the swimming class, which had the most data and was the easiest to track, other classes exhibited low MOTA values due to data scarcity, changes in camera viewpoint, water refraction, and occlusion by individuals near the swimming pool.

3. The Proposed Method

3.1 Scene Detection Network

We incorporated a scene detection head into the existing FairMOT network to classify the scene in every frame. Following the backbone network, FairMOT is split into heatmap, box-size, offset, and Re-ID heads to generate output values. We introduced an additional head for classifying swimming-event scenes and applied per-frame class labels to the swimming dataset to train the network to classify the current input image. The overall network architecture is shown in Fig. 1. Classes were categorized into Swimming, Turning, Diving, Finish, and On-block. The Swimming class (Fig. 2) encompasses scenes after swimmers dive into the water (i.e., bodies submerged): they swim until just before the return point, swim again after the turn, and finish swimming at the endpoint. However, since swimmers race at varying speeds, the moment they initiate the turn differs. Consequently, in such a scene, the class switches from Swimming to Turning.

Fig. 1. The Architecture of the Entire Network.
../../Resources/ieie/IEIESPC.2024.13.4.337/fig1.png
Fig. 2. An Example of the Swimming Class.
../../Resources/ieie/IEIESPC.2024.13.4.337/fig2.png
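A minimal sketch of how such a scene head could sit alongside FairMOT's existing heads is shown below; the layer widths, the global-average-pooling design, and the 64-channel input (the output width of FairMOT's DLA-34 backbone) are our assumptions rather than the exact configuration.

```python
import torch.nn as nn

SCENE_CLASSES = ["Swimming", "Turning", "Diving", "Finish", "On-block"]

class SceneHead(nn.Module):
    """Frame-level scene classifier fed by the shared backbone features,
    added next to the heatmap, box-size, offset, and Re-ID heads."""

    def __init__(self, in_channels=64, num_classes=len(SCENE_CLASSES)):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse the spatial dimensions
        self.fc = nn.Linear(128, num_classes)

    def forward(self, features):
        x = self.pool(self.conv(features)).flatten(1)
        return self.fc(x)                     # per-frame class logits

# trained with the per-frame labels via standard cross-entropy:
# loss = nn.CrossEntropyLoss()(scene_head(features), frame_labels)
```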

In the freestyle competition, when swimmers reach the return point (Fig. 3), the Turning class runs from the moment swimmers begin to turn their bodies while putting their heads into the water until the last swimmer reaches that point. For the breaststroke competition, the Turning class starts when swimmers reach out and touch the wall, and lasts until they push off the return point with their feet and resume the breaststroke.

Fig. 3. An Example of the Turning Class.
../../Resources/ieie/IEIESPC.2024.13.4.337/fig3.png

As depicted in Fig. 4, a scene is classified as Diving from the moment the swimmers' hands leave the blocks until their whole bodies enter the water.

Fig. 4. An Example of the Diving Class.
../../Resources/ieie/IEIESPC.2024.13.4.337/fig4.png

As shown in Fig. 5, the Finish class runs from the moment swimmers stop at the end of the race through to the moment they stand up while reaching out to the wall.

Fig. 5. An Example of the Finish Class.
../../Resources/ieie/IEIESPC.2024.13.4.337/fig5.png

The On-block class runs from the moment swimmers wait on the starting blocks before the start of the race until their hands leave the blocks, as shown in Fig. 6. In the backstroke, the preparation posture differs in the On-block class because swimmers hold the rods under the blocks and wait, as shown in Fig. 7; the class lasts until the moment they release the rods.

Fig. 6. An Example of the On-block Class.
../../Resources/ieie/IEIESPC.2024.13.4.337/fig6.png
Fig. 7. An Example of the Backstroke On-block Class.
../../Resources/ieie/IEIESPC.2024.13.4.337/fig7.png

3.2 The cIoU Metric

When tracking within the FairMOT network, the subsequent movement of an object is predicted using cascade matching and IoU matching in Deep SORT. We propose a weighted cIoU formula that extends the IoU used in Deep SORT's IoU matching to additionally consider the distance between the center points of the two bounding boxes and their aspect ratios. As shown in (1), we multiply the center-point distance term by $\left(1-IoU\right)^{2}$ to reduce the penalty for the distance between the bounding boxes' center points when the IoU is high.

(1)
$cIoU=IoU-\left(1-IoU\right)^{2}\cdot d_{center}-\omega \cdot v_{aspect}$

where $d_{center}$ denotes the center-point distance term and $v_{aspect}$ the aspect-ratio term.

Conversely, if the IoU is low, indicating significant inconsistency, the pair can be excluded from matching through the penalty imposed by the distance between the bounding boxes' center points. Moreover, by introducing the weight $\omega$, a penalty is applied during matching when the aspect ratio undergoes significant changes, achieving higher accuracy.

We also propose adjusting the weights of this formula according to the class of the detected scene. Once the scene detection network introduced above classifies the input image, the optimal $\omega$ value for that class, determined through a grid search, is applied in the cIoU matching step.
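The sketch below implements Eq. (1) with a scene-dependent $\omega$. The normalization of the center-distance and aspect-ratio terms follows the DIoU/CIoU convention [13], which we assume here, and the per-class weights are those found by the grid search reported in Section 4.

```python
import math

# per-class aspect-ratio weights from the grid search (Section 4)
OMEGA = {"Swimming": 1, "Turning": 1, "Diving": 41, "Finish": 1, "On-block": 41}

def ciou(a, b, scene_class):
    """Scene-weighted cIoU of Eq. (1); boxes are (x1, y1, x2, y2)."""
    # plain IoU
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)

    # center-point distance, normalized by the enclosing-box diagonal [13]
    cxa, cya = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cxb, cyb = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    d_center = ((cxa - cxb) ** 2 + (cya - cyb) ** 2) / (cw ** 2 + ch ** 2 + 1e-9)

    # aspect-ratio consistency term, as in CIoU [13]
    wa, ha = a[2] - a[0], a[3] - a[1]
    wb, hb = b[2] - b[0], b[3] - b[1]
    v = (4 / math.pi ** 2) * (math.atan(wa / (ha + 1e-9))
                              - math.atan(wb / (hb + 1e-9))) ** 2

    # Eq. (1): a high IoU damps the center penalty; omega depends on the scene
    return iou - (1 - iou) ** 2 * d_center - OMEGA[scene_class] * v
```

In the matching step, the cost $1-IoU$ is then replaced by $1-cIoU$, so a large center offset or aspect-ratio change pushes an implausible pair out of the assignment.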

3.3 Adjusting the Kalman Filter Hyper-parameter

We propose a method for adjusting position and velocity (hyper-parameters of the Kalman filter) based on the scene class. We first determine the optimal hyper-parameter value for each class through a grid search over the scaling values 100, 10, 1, 1/10, and 1/100. Table 1 displays the optimal Kalman filter hyper-parameter values for each class identified through the grid search. Afterward, when the input image is classified by the scene detection network, the Kalman filter's hyper-parameters are set to the optimal values for that class.
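As an illustration, the scene-dependent adjustment could be wired in as below. The base noise weights (1/20 and 1/160) and the attribute names follow the public Deep SORT reference implementation, and should be treated as assumptions about how the adjustment is plugged in.

```python
# multiplicative scaling factors from Table 1
# (found by a grid search over 100, 10, 1, 1/10, and 1/100)
KF_SCALE = {
    "Swimming": (100, 10),
    "Turning":  (100, 100),
    "Diving":   (1 / 100, 1 / 100),
    "Finish":   (100, 10),
    "On-block": (1 / 100, 1 / 100),
}

def apply_scene_hyperparams(kf, scene_class):
    """Rescale the position/velocity noise weights of a Deep SORT
    KalmanFilter instance according to the detected scene class."""
    pos_scale, vel_scale = KF_SCALE[scene_class]
    kf._std_weight_position = (1.0 / 20) * pos_scale    # base values assumed
    kf._std_weight_velocity = (1.0 / 160) * vel_scale   # from the reference code
```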

Table 1. Optimized Hyper-parameters for each Class.

Class        Position   Velocity
Swimming       100        10
Turning        100        100
Diving        1/100      1/100
Finish         100        10
On-block      1/100      1/100

4. Experiments

As shown in Table 2, we initially conducted experiments on the sequence videos, grouped by scene class, using the existing FairMOT, and measured tracking performance with standard multi-object tracking metrics.

Table 2. Metrics from FairMOT Tracking.

Class                        Video     IDF1    IDP     IDR     MT  PT  ML  MOTA    MOTP
Swimming                     SWIM-02   79.7%   81.6%   78.0%   7   1   0   85.4%   23.8%
Swimming                     SWIM-03   86.5%   88.5%   84.6%   7   1   0   83.4%   23.5%
Turning, Swimming            SWIM-07   49.4%   59.6%   42.2%   3   4   1   47.5%   28.3%
Turning, Swimming            SWIM-08   49.3%   54.7%   45.0%   3   5   0   53.5%   30.6%
On-block, Diving, Swimming   SWIM-27   35.3%   41.8%   30.5%   1   7   0   34.9%   27.4%
On-block, Diving, Swimming   SWIM-31   42.2%   48.2%   37.5%   3   5   0   51.3%   22.3%
Swimming, Finish             SWIM-06   52.3%   57.5%   48.0%   2   6   0   52.2%   26.8%
Swimming, Finish             SWIM-14   66.4%   72.6%   61.1%   3   5   0   47.4%   28.9%

In SWIM-02 and -03, which consist solely of the Swimming class, the MOTA scores were high at 85.4% and 83.4%, respectively. In these sequences, swimmers move consistently without significant changes in motion, allowing for stable tracking.

In SWIM-07 and -08 (sequences where swimmers reach the return point and turn), MOTA scores were 47.5% and 53.5%, respectively. The lower MOTA scores compared to the Swimming-only sequences can be attributed to the side-on camera angle, as shown in Fig. 2: swimmers farther from the camera are less visible than those closer to it, which makes tracking them challenging. Additionally, when swimmers turn, they are covered by spray from movements in the water, leading to ID switching due to the significant changes in motion.

SWIM-27 and -31, with MOTA scores of 34.9% and 51.3%, respectively, are sequences that include the On-block, Diving, and Swimming classes. In these sequences, ID switching occurred most often at the transition from On-block to Diving. When swimmers extend their bodies from a crouching position into the dive, significant changes in features and IoU take place, leading to failed matches in both cascade matching and IoU matching in Deep SORT. Owing to these significant changes in motion, the IDs of some swimmers switched not only at the transition from On-block to Diving but also during the diving motion itself.

SWIM-31 exhibited a higher MOTA score than SWIM-27. The reason is that, although swimmers' IDs changed at the transition from On-block to Diving, the new IDs were subsequently well maintained: the ID changed at the moment the class switched due to the change in IoU, but tracking was afterward sustained through Re-ID features. The difference was that Re-ID features were easy to find in SWIM-31 because the training data contained more male swimmers with bare torsos, whereas in SWIM-27 Re-ID features were comparatively harder to find because the female swimmers wore full-body swimsuits.

SWIM-06 and -14 are sequences that contain scenes at the end of the swimming competition, and both conclude with the Finish class. The two sequences showed MOTA scores of 52.2% and 47.4%, respectively. When shooting the Finish class, the camera composition changed to a position beside the swimming pool, so one swimmer was covered by the referee, leading to ID switching. ID switching also occurred due to spray in the swimming scene, and further ID switching occurred due to changes in motion and features as the Swimming class transitioned to the Finish class.

We experimented by changing the IoU formula to the cIoU formula in Deep SORT's IoU matching and by weighting the aspect ratio. We first used a grid search to find the optimal aspect-ratio weight for each class; a sketch of this search is given below. The Diving and On-block classes showed optimal MOTA values with a weight of 41, while the remaining classes performed best with a weight of 1. Based on these values, we set the optimal weight according to the scene through the scene detection network. We show the results in Table 3. SWIM-02, -03, -06, and -14 do not appear to have been affected in IoU matching, whereas the MOTA score slightly increased in SWIM-07, -27, and -31. SWIM-02, -03, -06, and -14 are sequences in which significant IoU changes do not occur, so they were not affected by the cIoU formula change. SWIM-07, -27, and -31 contain classes with large IoU changes, such as On-block, Diving, and Turning, so the cIoU formula affected performance.
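In the sketch of that per-class search below, run_tracker is a hypothetical callback that runs the tracker on the validation sequences of one class with a given weight and returns the MOTA score, and the candidate list passed in would be illustrative.

```python
def grid_search_omega(candidates, run_tracker):
    """For each scene class, pick the aspect-ratio weight maximizing MOTA."""
    best = {}
    for cls in ["Swimming", "Turning", "Diving", "Finish", "On-block"]:
        best[cls] = max(candidates, key=lambda w: run_tracker(cls, w))
    return best
```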

Table 3. Metrics from cIoU’s Weight Adjustment Tracking.

Class                        Video     IDF1    IDP     IDR     MT  PT  ML  MOTA    MOTP
Swimming                     SWIM-02   79.7%   81.6%   78.0%   7   1   0   85.4%   23.8%
Swimming                     SWIM-03   86.5%   88.5%   84.6%   7   1   0   83.4%   23.5%
Turning, Swimming            SWIM-07   49.0%   59.0%   41.9%   3   4   1   47.7%   28.4%
Turning, Swimming            SWIM-08   49.3%   54.7%   45.0%   3   5   0   53.5%   30.6%
On-block, Diving, Swimming   SWIM-27   37.8%   45.0%   32.6%   1   6   1   35.5%   26.8%
On-block, Diving, Swimming   SWIM-31   41.0%   46.9%   36.5%   3   5   0   51.8%   21.9%
Swimming, Finish             SWIM-06   52.3%   57.5%   48.0%   2   6   0   52.2%   26.8%
Swimming, Finish             SWIM-14   66.2%   72.5%   60.9%   3   5   0   47.2%   28.9%

In addition, through the scene detection network, the Kalman filter's hyper-parameters (position and velocity) can be set per scene, as shown in Table 1. We present the results in Table 4. SWIM-02 and -03, which only contain the Swimming class, showed improved IDF1 and MOTA scores when the hyper-parameters optimal for the Swimming class were applied. In SWIM-07 and -08, where Turning and Swimming appear, IDF1 improved significantly, and in SWIM-07 the MOTA score improved by 5.2 percentage points over the original FairMOT. In SWIM-27 and -31, adjusting the Kalman filter's hyper-parameters was not significantly effective. As seen in Table 1, the Diving and On-block classes were not strongly affected by the Kalman filter hyper-parameter weights, and an inappropriate hyper-parameter value at the switch from Diving to Swimming caused performance degradation. In SWIM-06 and -14, hyper-parameter adjustment improved IDF1, and IDs were maintained even when a swimmer was occluded by the referee. ID switching also improved at the moment Swimming changed to Finish.

Table 4. Metrics from cIoU’s Weight and the Kalman Filter Hyper-parameter Adjustment.

Class                        Video     IDF1    IDP     IDR     MT  PT  ML  MOTA    MOTP
Swimming                     SWIM-02   88.3%   90.2%   86.6%   8   0   0   87.7%   24.2%
Swimming                     SWIM-03   91.9%   93.8%   90.1%   7   1   0   84.1%   23.4%
Turning, Swimming            SWIM-07   71.2%   84.4%   61.6%   4   3   1   52.7%   28.2%
Turning, Swimming            SWIM-08   73.9%   80.3%   68.5%   2   6   0   53.4%   32.4%
On-block, Diving, Swimming   SWIM-27   37.5%   45.5%   32.0%   1   6   1   35.5%   26.7%
On-block, Diving, Swimming   SWIM-31   40.2%   46.9%   35.2%   3   5   0   46.9%   21.9%
Swimming, Finish             SWIM-06   62.5%   68.0%   57.8%   2   6   0   47.0%   28.0%
Swimming, Finish             SWIM-14   70.4%   75.8%   65.8%   2   6   0   48.1%   28.8%

5. Conclusion

In this paper, we proposed a method to enhance tracking performance on swimming datasets with the FairMOT tracking network by incorporating scene information into the association step. Modifying the IoU formula in Deep SORT's IoU matching into the cIoU formula improved MOTA scores by up to 0.6 percentage points in specific sequences. Additionally, the MOTA score improved by up to 5.2 percentage points through class-based adjustment of the Kalman filter's hyper-parameters.

In future studies, we intend to enhance multi-object tracking performance by individually analyzing and considering the scene characteristics of each object, especially when dealing with swimmers detected in various states of motion, such as Turning.

ACKNOWLEDGMENTS

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF-2021R1F1A1060183), by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0017124, HRD Program for Industrial Innovation), and by a Research Grant from Kwangwoon University in 2021.

REFERENCES

[1] Shehzed, Ahsan, Ahmad Jalal, and Kibum Kim. "Multi-person tracking in smart surveillance system for crowd counting and normal/abnormal events detection." 2019 International Conference on Applied and Engineering Mathematics (ICAEM). IEEE, 2019.
[2] Guo, Lie, et al. "Pedestrian tracking based on CamShift with Kalman prediction for autonomous vehicles." International Journal of Advanced Robotic Systems 13.3 (2016): 120.
[3] Redmon, Joseph, et al. "You only look once: Unified, real-time object detection." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[4] Zhou, Xingyi, Dequan Wang, and Philipp Krähenbühl. "Objects as points." arXiv preprint arXiv:1904.07850 (2019).
[5] Aharon, Nir, Roy Orfaig, and Ben-Zion Bobrovsky. "BoT-SORT: Robust associations multi-pedestrian tracking." arXiv preprint arXiv:2206.14651 (2022).
[6] Wang, Yu-Hsiang. "SMILEtrack: SiMIlarity LEarning for Multiple Object Tracking." arXiv preprint arXiv:2211.08824 (2022).
[7] Maggiolino, Gerard, et al. "Deep OC-SORT: Multi-pedestrian tracking by adaptive re-identification." arXiv preprint arXiv:2302.11813 (2023).
[8] Milan, Anton, et al. "MOT16: A benchmark for multi-object tracking." arXiv preprint arXiv:1603.00831 (2016).
[9] Dendorfer, Patrick, et al. "MOT20: A benchmark for multi object tracking in crowded scenes." arXiv preprint arXiv:2003.09003 (2020).
[10] Woinoski, Timothy, Alon Harell, and Ivan V. Bajić. "Towards automated swimming analytics using deep neural networks." arXiv preprint arXiv:2001.04433 (2020).
[11] Woinoski, Timothy, and Ivan V. Bajić. "Swimmer stroke rate estimation from overhead race video." 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2021.
[12] Zhang, Yifu, et al. "FairMOT: On the fairness of detection and re-identification in multiple object tracking." International Journal of Computer Vision 129 (2021): 3069-3087.
[13] Zheng, Zhaohui, et al. "Distance-IoU loss: Faster and better learning for bounding box regression." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. No. 07. 2020.
[14] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
[15] Newell, Alejandro, Kaiyu Yang, and Jia Deng. "Stacked hourglass networks for human pose estimation." Computer Vision - ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII. Springer International Publishing, 2016.
[16] Yu, Fisher, et al. "Deep layer aggregation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
[17] Wojke, Nicolai, Alex Bewley, and Dietrich Paulus. "Simple online and realtime tracking with a deep association metric." 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017.
[18] Bewley, Alex, et al. "Simple online and realtime tracking." 2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016.
[19] Sha, Long, et al. "Understanding and analyzing a large collection of archived swimming videos." IEEE Winter Conference on Applications of Computer Vision. IEEE, 2014.
[20] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[21] Redmon, Joseph, and Ali Farhadi. "YOLOv3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018).

Author

Dong-yeon Shin
../../Resources/ieie/IEIESPC.2024.13.4.337/au1.png

Dong-yeon Shin received a B.S. in Electronic Engineering from Korea National University of Transportation in 2022. Currently, he is pursuing an M.S. in Computer Engineering from Kwangwoon University, South Korea. His research interests include multi-object tracking and deep learning.

Seong-won Lee
../../Resources/ieie/IEIESPC.2024.13.4.337/au2.png

Seong-won Lee (Member, IEEE) received a B.Sc. and an M.Sc. in control and instrumentation engineering from Seoul National University, South Korea, in 1988 and 1990, respectively, and a Ph.D. in electrical engineering from the University of Southern California in 2003. From 1990 to 2004, he worked on VLSI/SoC design at Samsung Electronics Company Ltd., South Korea. Since March 2005, he has been a Professor with the Department of Computer Engineering, Kwangwoon University, Seoul, South Korea. His research interests include image signal processing, signal processing SoC, edge AI systems, and computer architectures.