3.1 Experimental Setup
The experiments were conducted under the TensorFlow framework [18] on a Linux system. Tensor format changes are performed with the reshape function. The first two convolutional layers of the 3D-CNN use 3${\times}$3 kernels, and the last two use 1${\times}$1 kernels. Max-pooling is applied, and the network is trained with the Adam optimization algorithm at a learning rate of 0.001.
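For reference, the following is a minimal tf.keras sketch of this configuration. The kernel sizes, max-pooling, optimizer, and learning rate follow the description above; the filter counts, clip size, number of output classes, and the extension of the 3${\times}$3 kernels over the temporal axis are assumptions made purely for illustration.

```python
import tensorflow as tf

NUM_CLASSES = 12  # assumed: one class per dance movement in DanceDB

model = tf.keras.Sequential([
    # Frames are stacked into 5-D clips (batch, frames, height, width, channels);
    # in practice this tensor-format change is done with tf.reshape.
    # First two layers: 3x3 kernels (the temporal depth of 3 is an assumption).
    tf.keras.layers.Conv3D(32, kernel_size=3, padding="same", activation="relu",
                           input_shape=(16, 120, 90, 3)),  # assumed clip size
    tf.keras.layers.Conv3D(32, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.MaxPool3D(pool_size=2),
    # Last two layers: 1x1 kernels.
    tf.keras.layers.Conv3D(64, kernel_size=1, activation="relu"),
    tf.keras.layers.Conv3D(64, kernel_size=1, activation="relu"),
    tf.keras.layers.GlobalAveragePooling3D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```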
There were two kinds of experimental data. The first was the DanceDB dataset [19], which contained 48 dance videos involving 12 dance movements. Each movement was
labeled with an emotion tag (e.g., scared, angry). The frame rate of the videos was
20 frames per second, and the size of each frame was 480${\times}$360.
The second was a self-built dataset containing 96 dance videos recorded by six students majoring in dance, all in the same setting. These videos covered three types of dance, each with complex and variable movements. The frame rate of these videos was 30 FPS, and the size of each frame was again 480${\times}$360. Some example frames are shown in Fig. 3.
All the videos had key frames marked by professional dance teachers, and the following
indicators were selected to assess the effects of key frame extraction.
(1) Recall ratio: the number of key frames correctly extracted, divided by the number
of key frames correctly extracted plus the number of missed frames.
(2) Precision ratio: the number of key frames correctly extracted, divided by the
number of key frames correctly extracted plus the number of frames falsely detected.
(3) Deletion factor: the number of frames falsely detected, divided by the number
of key frames correctly extracted.
The effectiveness of movement recognition was evaluated using accuracy: the number
of videos correctly recognized divided by the total number of videos.
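Written compactly, with $N_c$ the number of correctly extracted key frames, $N_m$ the number of missed key frames, $N_f$ the number of falsely detected frames, $V_c$ the number of correctly recognized videos, and $V$ the total number of videos (this notation is introduced here only for brevity), the indicators are

$$\text{Recall} = \frac{N_c}{N_c + N_m}, \qquad \text{Precision} = \frac{N_c}{N_c + N_f}, \qquad \text{Deletion factor} = \frac{N_f}{N_c}, \qquad \text{Accuracy} = \frac{V_c}{V}.$$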
Fig. 3. Example dance video frames from the self-built dataset.
3.2 Analysis of Results
First, the key frame extraction method from this paper was compared with the following
two methods:
① The color feature-based approach proposed by Jadhav and Jadhav [20], and
② The scale-invariant feature transform approach proposed by Hannane et al. [21].
The key frame extraction results of the three methods are shown in Table 2.
Table 2 shows, first, that the recall ratios of the Jadhav and Jadhav and Hannane et al. methods were below 80% on the DanceDB dataset, while the recall ratio of the proposed multi-feature fusion method was 82.27% (10.82 percentage points higher than Jadhav and Jadhav, and 8.81 points higher than Hannane et al.). Second, the precision ratios of the Jadhav and Jadhav and Hannane et al. methods were below 70%, while the precision ratio of multi-feature fusion was 72.84% (5.97 points higher than Jadhav and Jadhav and 4.65 points higher than Hannane et al.). Third, the deletion factor of the multi-feature fusion method on the DanceDB dataset was 3.01, which was 1.41 lower than that of Jadhav and Jadhav and 0.55 lower than that of Hannane et al.
Compared with the DanceDB dataset, the recall and precision ratios of all the methods improved to some extent on the self-built dataset, and the deletion factors were also smaller, which may be due to the relatively small number of dance types in the self-built dataset. On this dataset, the recall ratio of the multi-feature fusion method was 84.82% and the precision ratio was 81.07%, both above 80% and significantly higher than those of the other two methods. The deletion factor of multi-feature fusion was 2.25, which was also significantly smaller than that of the other two methods. Based on the results on the two datasets, the multi-feature fusion method missed and falsely detected fewer key frames and produced better extraction results.
Taking the dance Searching as an example, the key frames output by the three methods are presented in Fig. 4. The Jadhav and Jadhav and Hannane et al. methods extracted more key frames, but some of them were unclear. The key frames extracted by the multi-feature fusion method described the movement changes in the dance video completely, provided a good overview of the video, and showed the movements clearly. Therefore, the multi-feature fusion method can be used to provide movement recognition services.
Based on the extracted key frames, movement recognition was then analyzed to compare the effects of different features and different classifiers on dance video movement recognition. The compared methods were
Method 1: spatial features only plus the softmax classifier,
Method 2: temporal features only plus the softmax classifier, and
Method 3: spatio-temporal features plus the SVM classifier [22].
The accuracy of these methods and of the multi-feature fusion method is compared in Fig. 5.
According to Fig. 5, movement recognition accuracy was low whenever only one feature type was used. The multi-feature fusion method achieved an accuracy of 42.67% on the DanceDB dataset, 11 percentage points higher than Method 1 and 8.71 points higher than Method 2. On the self-built dataset, multi-feature fusion achieved an accuracy of 50.64%, 17.26 points higher than Method 1 and 14.72 points higher than Method 2. This indicates that, with the same classifier, the choice of extracted features has a clear influence on movement recognition: using only spatial or only temporal features lowered recognition accuracy, while combining spatio-temporal features produced better recognition of dance video movements.
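As a hedged illustration only (the paper's exact fusion operator is not restated in this section), one simple way to combine two per-video feature vectors into a spatio-temporal representation is concatenation along the feature axis:

```python
import tensorflow as tf

# Hypothetical per-video feature vectors, e.g. 128-D each for a batch of 8 videos.
spatial_feat = tf.random.uniform((8, 128))
temporal_feat = tf.random.uniform((8, 128))

# Concatenate along the feature axis to obtain a spatio-temporal representation.
fused = tf.concat([spatial_feat, temporal_feat], axis=-1)  # shape (8, 256)
```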
Comparing Method 3 with the multi-feature fusion method, the difference in classifiers led to a difference in accuracy. Recognition accuracy of Method 3 on the DanceDB dataset was 38.56% (4.11 percentage points lower than the method proposed in this paper), and on the self-built dataset it was 40.11% (10.53 points lower than the proposed method). This indicates that the softmax classifier distinguished the different dance movements better than the SVM classifier. The SVM required considerable computation time for multi-class recognition, and the selection of its kernel function and parameters depended on manual experience, which introduces some arbitrariness. Therefore, the softmax classifier was more reliable.
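To make the classifier comparison concrete, the following hypothetical sketch contrasts the two heads on the same fused feature vectors. The feature dimensionality, number of classes, random placeholder data, and the RBF kernel choice are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

# Placeholder fused spatio-temporal features: 96 videos, 256-D each (assumed sizes).
features = np.random.rand(96, 256).astype("float32")
labels = np.random.randint(0, 3, size=96)  # assumed: 3 dance types

# (a) Softmax head, trained with Adam as in the proposed method.
softmax_head = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation="softmax", input_shape=(256,)),
])
softmax_head.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
softmax_head.fit(features, labels, epochs=10, verbose=0)

# (b) SVM head (Method 3): kernel and hyperparameters must be chosen manually.
svm_head = SVC(kernel="rbf", C=1.0, gamma="scale")
svm_head.fit(features, labels)
```

The softmax head is trained jointly by gradient descent, whereas the SVM's kernel and regularization parameters have to be tuned by hand, which is the source of the arbitrariness noted above.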
The proposed multi-feature fusion method was also compared with the movement identification approach based on trajectory feature fusion proposed by Megrhi et al. [23]; the results are presented in Fig. 6.
Fig. 6 shows that the movement recognition accuracy of the method proposed in this paper was significantly higher on both dance video datasets. The accuracy of both methods was higher on the self-built dataset, which may be because it contained more videos and fewer dance types. On DanceDB, the recognition accuracy of the Megrhi et al. method was 39.52% (3.15 percentage points lower than multi-feature fusion), and on the self-built dataset it was 46.62% (4.02 points lower than multi-feature fusion). These results demonstrate that the multi-feature fusion method is effective at identifying movements in dance videos.
The recognition accuracies of the Megrhi et al. method and the multi-feature fusion method for the individual dances in the self-built dataset were further analyzed; the results are shown in Fig. 7.
Fig. 7 shows that the accuracy of the Megrhi et al. method was below 50% for all three dances, with the lowest accuracy (45.12%) on Green Silk Gauze Skirt and the highest (47.96%) on Memories of the South. Compared with the Megrhi et al. method, the recognition accuracy of multi-feature fusion was 5, 2.55, and 4.51 percentage points higher on the three dances, which demonstrates the reliability of the multi-feature fusion method in recognizing different types of dance movements.
Table 2. Comparison of key frame extraction effects.

| Dataset | Method | Recall ratio | Precision ratio | Deletion factor |
| DanceDB | Jadhav and Jadhav | 71.45% | 66.87% | 4.42 |
| DanceDB | Hannane et al. | 73.46% | 68.19% | 3.56 |
| DanceDB | Multi-feature fusion | 82.27% | 72.84% | 3.01 |
| Self-built dataset | Jadhav and Jadhav | 73.06% | 71.29% | 2.77 |
| Self-built dataset | Hannane et al. | 75.77% | 73.36% | 2.41 |
| Self-built dataset | Multi-feature fusion | 84.82% | 81.07% | 2.25 |
Fig. 4. Comparison of key frame extraction results.
Fig. 5. Comparison of accuracy from dance video movement recognition.
Fig. 6. Comparison of recognition accuracy from multi-feature fusion and trajectory feature fusion.
Fig. 7. Accuracy comparison with the self-built set.