



Deep learning, Convolutional neural network, Movement recognition

1. Introduction

Recognizing movements in videos is a long-standing problem in the field of computer vision [1]. A computer’s ability to recognize human movements has practical significance for people’s lives. For example, movements can be used to command a computer to perform certain tasks, which has drawn attention to the recognition of human body movements. With the continuous development of science and technology, deep learning, a branch of machine learning, is gaining more and more attention. The technology for human action recognition based on deep learning is also progressing, with methods such as the graph convolutional network (GCN) [2], the deep recurrent neural network (RNN) [3], and the convolutional neural network (CNN) [4], and most existing approaches to recognizing human actions now use deep learning techniques. Previously, traditional computer vision methods that rely on manually extracted features were commonly applied to recognizing human movements; however, changes in the environments where movements occur can affect their recognition results. Nowadays, human movement recognition is mainly used in sports training [5], human-computer interaction [6], intelligent monitoring [7], elderly care, etc. Yu et al. tested whether a CNN can recognize human behaviors in videos and found that, by recognizing those behaviors, the CNN could determine whether people are in a critical state and give a timely warning, thus playing a big role in pre-hospital first aid [8]. Xu and Qiu automatically extracted features of daily human activity using a CNN algorithm and found it could recognize six human activities, including sitting, standing, walking, jogging, and going up stairs [9]. Liu et al. found that the accuracy of a 3D CNN in recognizing movements from the EgoGesture dataset was 72.4%, surpassing current dynamic gesture recognition methods and confirming its effectiveness [10]. Wang and Zhang proposed a CNN with multidimensional serial feature extraction modules for occluded faces and combined it with a deep learning method to improve the recognition rate [11].

A review of research by domestic and international scholars shows that most techniques use pictures or videos for human action recognition. Videos were chosen here because they can show the series of movements that people make when giving commands to computers in their daily lives. Taking ballet movements as examples, this paper uses a CNN method to recognize and analyze those movements and optimizes the training of the CNN with a particle swarm optimization (PSO) algorithm. The optimized CNN, traditional CNN, and support vector machine (SVM) methods are then compared on 1,000 ballet movement videos, providing an effective reference for movement recognition and training support.

2. Dance Movement Recognition Methods

2.1 Convolutional Neural Networks

The CNN [12] is one of the most typical and frequently used approaches in deep learning. Its tolerance to rotation, translation, and size scaling of the input makes it highly suitable for processing image data. The most important parts of the CNN approach are briefly introduced here. The convolutional layer extracts features from the input and is the core of the CNN approach. The convolution is executed multiple times, each time with a different convolutional kernel, so that different features are extracted. Its calculation is as follows:

(1)
$y_{mn}=f\left(\sum _{i}\sum _{j}x_{m+i,n+j}w_{ij}+b\right)$,

where $x_{m+i,n+j}$ represents the image pixel value at point $(m+i, n+j)$, $w_{ij}$ represents the weight of the convolutional kernel at point $(i, j)$, $b$ represents the bias of the layer, $f$ is the network activation function, and $y_{mn}$ is the output value at point $(m, n)$. The sums run over all positions $(i, j)$ of the kernel.
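As a minimal sketch of Eq. (1), the following pure-Python function slides one kernel over a small image. The function name `conv2d_valid`, the toy image, the kernel values, and the bias are all illustrative assumptions, as is the choice to evaluate only positions where the kernel fits entirely inside the image. ReLU is used for $f$ because it is the activation adopted in this paper.

```python
def relu(v):
    """Activation f: pass positive values through, clamp the rest to 0."""
    return v if v > 0 else 0.0


def conv2d_valid(image, kernel, bias=0.0, activation=relu):
    """Evaluate Eq. (1) at every (m, n) where the kernel fits inside the image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for m in range(out_h):
        row = []
        for n in range(out_w):
            # Double sum over kernel positions (i, j): x[m+i][n+j] * w[i][j]
            s = sum(image[m + i][n + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(activation(s + bias))
        output.append(row)
    return output


# Hypothetical 3x3 image and 2x2 kernel, chosen only for illustration.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
feature_map = conv2d_valid(image, kernel, bias=5.0)
```

Running the same kernel with a different bias or a different kernel changes which features survive the activation, which is why a convolutional layer usually applies several kernels.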

The pooling layer is mainly responsible for subsampling the dance movement feature map obtained in the convolutional layer. It can compress a large amount of image data while maintaining the scale feature. The specific calculation is:

(2)
$y_{mn}=f\left(w\frac{1}{S_{1}S_{2}}\sum _{j=0}^{S_{2}-1}\sum _{i=0}^{S_{1}-1}x_{mS_{1}+i,\,nS_{2}+j}+b\right)$,

where $x_{mS_{1}+i,\,nS_{2}+j}$ represents the pixel value at point $\left(mS_{1}+i,\,nS_{2}+j\right)$, $S_{1}\times S_{2}$ is the size of the pooling window, $w$ and $b$ are the weight and bias of the layer, and $y_{mn}$ represents the output value after the pooling operation.
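The pooling calculation can be sketched in a few lines of pure Python. This is an illustrative reading of Eq. (2) as average pooling over non-overlapping windows of size $S_1 \times S_2$; the function name `avg_pool`, the defaults $w=1$ and $b=0$, the identity activation, and the toy feature map are assumptions made for the example.

```python
def avg_pool(feature_map, s1, s2, w=1.0, b=0.0, f=lambda v: v):
    """Average each non-overlapping s1 x s2 window, then apply w, b, and f."""
    out_h = len(feature_map) // s1
    out_w = len(feature_map[0]) // s2
    pooled = []
    for m in range(out_h):
        row = []
        for n in range(out_w):
            # Sum of x[m*s1 + i][n*s2 + j] over the window
            window_sum = sum(feature_map[m * s1 + i][n * s2 + j]
                             for j in range(s2) for i in range(s1))
            row.append(f(w * window_sum / (s1 * s2) + b))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map reduced to 2x2 by 2x2 windows.
fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
pooled = avg_pool(fm, 2, 2)
```

Each output value summarizes four input values, so the feature map shrinks to a quarter of its size while keeping the coarse spatial layout.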

The fully connected layer combines the extracted feature maps and uses these features to classify input images based on the training data. Every neuron in this layer is connected to every neuron of the previous layer. The input of this layer is the vector output from the previous layer, and the output of this layer is the final output of the CNN.

The activation function is a crucial component of a CNN. Without an activation function, the network can only represent a linear relationship between its input and output, so an activation function is added as a nonlinear unit to model nonlinear structure. Commonly used activation functions include sigmoid, tanh, and ReLU [13]. ReLU compares its input with 0 and outputs the larger of the two, so when the input value is less than or equal to 0, the output value is 0. The function only needs to judge whether its input value $x$ is positive, which requires little computation, so its network convergence speed surpasses that of the other two functions. This paper uses ReLU because it both reduces overfitting and improves network convergence speed. Its formula is

(3)
$\mathrm{ReLU}\left(x\right)=\left\{\begin{array}{ll} x, & x\geq 0\\ 0, & x<0 \end{array}\right.$.
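The role of the nonlinearity can be illustrated with a small sketch. It implements Eq. (3) and compares a two-layer scalar network with and without ReLU between the layers; the weights and biases are hypothetical values chosen only so that the arithmetic is easy to follow.

```python
def relu(x):
    """Eq. (3): return x for x >= 0 and 0 otherwise."""
    return x if x >= 0 else 0.0


def two_layer(x, activation=None):
    """Two stacked scalar layers with illustrative (hypothetical) weights."""
    w1, b1, w2, b2 = 2.0, -1.0, -3.0, 0.5
    h = w1 * x + b1
    if activation is not None:
        h = activation(h)
    return w2 * h + b2


# Without an activation, the stack collapses to one linear map: -6x + 3.5.
linear_outputs = [two_layer(x) for x in (0.0, 1.0, 2.0)]
# With ReLU between the layers, the mapping is no longer a straight line.
relu_outputs = [two_layer(x, activation=relu) for x in (0.0, 1.0, 2.0)]
```

For equally spaced inputs, a linear map gives equally spaced outputs; the ReLU version breaks that pattern because the hidden value is clamped to 0 for the first input.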

When a trained model predicts unseen data, its error on the test set may be much larger than on the training set, indicating poor prediction performance; this phenomenon is called overfitting. To prevent overfitting, this paper adopts the dropout method [14] to optimize the model. The dropout principle is as follows. During network training, some randomly selected neurons are temporarily removed, while the remaining neurons continue to participate in training without any changes. The removed neurons are restored with their previous values in subsequent training iterations. This operation reduces the correlation between neurons and prevents the network from overfitting to local features. The dropout rate was set to 0.5.
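A minimal sketch of the dropout idea is given below, assuming the rate of 0.5 stated above. The function name `dropout` and the example activations are hypothetical. Rescaling the kept activations by 1/(1 - rate) ("inverted dropout") is a common convention in deep learning frameworks and is an assumption here, since the text does not say whether the authors rescale.

```python
import random


def dropout(activations, rate=0.5, training=True, rng=None):
    """Zero each activation with probability `rate` during training.

    Kept activations are divided by (1 - rate) so the expected sum is
    unchanged (assumed convention, not stated in the paper). Outside
    training, every neuron is kept and values pass through untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


layer_output = [0.2, 1.5, 0.7, 3.0, 0.9, 2.1]  # hypothetical activations
dropped = dropout(layer_output, rate=0.5, rng=random.Random(0))
restored = dropout(layer_output, training=False)
```

A new random mask is drawn on every call, so a neuron removed in one training step can take part again in the next one.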

2.2 Improving Movement Recognition

The above CNN method is able to recognize ballet movements, but the traditional CNN is prone to overfitting during training. To improve the recognition performance of the CNN, the PSO algorithm replaces the backward adjustment of weight parameters in the traditional CNN method. The forward calculation of the improved CNN method during training is the same as in the traditional CNN. When the error obtained from the forward calculation has not converged to within the set range, the iterative formula of the PSO algorithm is used to update the position and velocity of each particle in the swarm. The coordinates of each particle represent one candidate set of parameters, and the iterative formula of the PSO algorithm is

(4)
$\left\{\begin{array}{l} v_{i}(t+1)=\varpi v_{i}(t)+c_{1}r_{1}(P_{i}(t)-x_{i}(t))+c_{2}r_{2}(G_{g}(t)-x_{i}(t))\\ x_{i}(t+1)=x_{i}(t)+v_{i}(t+1) \end{array}\right.$,

where $v_{i}(t+1)$ and $x_{i}(t+1)$ are the velocity and position of particle $i$ after one iteration, $v_{i}(t)$ and $x_{i}(t)$ are the velocity and position of particle $i$ before the iteration, $\varpi $ is the inertia weight of the particle, $c_{1}$ and $c_{2}$ are learning factors, $r_{1}$ and $r_{2}$ are random numbers between 0 and 1, $P_{i}(t)$ denotes the best position experienced by particle $i$ (excluding positions that exceed the limit), and $G_{g}(t)$ is the best position experienced by the particle population after excluding particles that exceed the limit. After the iteration, the parameter set represented by each particle is substituted into the CNN, which performs the forward calculation again, and the process is repeated until the error converges to within the preset range.
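The update in Eq. (4) can be sketched as follows. This is an illustrative toy, not the authors' implementation: `toy_error` is a hypothetical quadratic that stands in for the CNN forward-pass error, both learning factors are assumed to be 1.5, the initial positions are drawn from [-1, 1] with zero initial velocity, $r_1$ and $r_2$ are drawn per dimension, and the position limits mentioned above are not modeled. The swarm size of 20 and inertia weight of 0.8 follow the settings reported in Section 3.2.

```python
import random


def pso_minimize(error_fn, dim, swarm_size=20, inertia=0.8,
                 c1=1.5, c2=1.5, iterations=50, seed=0):
    """Minimize error_fn over `dim` parameters with the update of Eq. (4)."""
    rng = random.Random(seed)
    positions = [[rng.uniform(-1.0, 1.0) for _ in range(dim)]
                 for _ in range(swarm_size)]
    velocities = [[0.0] * dim for _ in range(swarm_size)]
    personal_best = [p[:] for p in positions]           # P_i
    personal_err = [error_fn(p) for p in positions]
    g = min(range(swarm_size), key=lambda k: personal_err[k])
    global_best, global_err = personal_best[g][:], personal_err[g]  # G_g

    for _ in range(iterations):
        for k in range(swarm_size):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                velocities[k][d] = (
                    inertia * velocities[k][d]
                    + c1 * r1 * (personal_best[k][d] - positions[k][d])
                    + c2 * r2 * (global_best[d] - positions[k][d])
                )
                positions[k][d] += velocities[k][d]
            err = error_fn(positions[k])
            if err < personal_err[k]:
                personal_best[k], personal_err[k] = positions[k][:], err
                if err < global_err:
                    global_best, global_err = positions[k][:], err
    return global_best, global_err


def toy_error(params):
    """Hypothetical stand-in for the CNN error: distance to a fixed target."""
    target = (0.5, -0.25, 0.1)
    return sum((p - t) ** 2 for p, t in zip(params, target))


best_params, best_error = pso_minimize(toy_error, dim=3)
```

In the setting described above, `error_fn` would run the CNN forward calculation with the weights encoded by a particle and return the resulting error, so no gradient is needed for the update.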

The improved CNN approach autonomously extracts features from input images of real dance movements, eliminating the need for manual feature extraction. The specific process of dance movement recognition using the CNN approach is as follows. (1) A sufficient number of dance action videos are collected to form an initial dataset. (2) In data pre-processing, the videos are segmented and denoised, and image frames are extracted from them and adjusted in size, color, etc. to meet the input requirements of the network model. (3) The processed image frames are divided into two parts, one of which is used as the training set; the network model is trained on it, with its parameters adjusted constantly until optimal performance is achieved. (4) After the optimal network model is obtained, the remaining image frames are used as the test set, and the dance movements in them are recognized by the network model.

3. Experiment Analysis

3.1 Image Pre-processing

Pre-processing the images in dance action videos is an important step in the recognition system because it affects CNN training and testing. To ensure the validity of the experimental results, this paper randomly selected 1,000 ballet training video samples as a dataset. The dance movements in the videos were divided into five main categories: drawing circles with legs (Fig. 1(1)), small kicks (Fig. 1(2)), two-position mid-jumps (Fig. 1(3)), single-leg squats (Fig. 1(4)), and large squats (Fig. 1(5)). The number of videos of each type is shown in Table 1. Ballet was chosen as the focus of recognition because different types of dance have their own characteristics, and training and recognizing all of them would result in an overwhelming workload. Ballet is a graceful art form that originated in the Italian Renaissance and was developed and perfected in France. It combines elements of dance, music, and drama while highlighting dancers’ body posture, technique, and expression. As a classical dance style, it not only showcases elegance and artistic beauty but also carries a historical and cultural heritage. The captured dance videos underwent pre-processing that mainly involved noise removal, contrast adjustment, grayscale transformation, video length reduction, and uniform resizing to 224 ${\times}$ 224. Additionally, image processing operations such as Gaussian blur were applied to prevent image blur after cropping. These pre-processing operations ensured that the dance videos in the dataset had similar resolutions, durations, contrast, and sizes, allowing faster training of the CNN model.
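Two of the pre-processing steps above, grayscale transformation and uniform resizing to 224 × 224, can be sketched in pure Python. The paper does not state which grayscale weights or interpolation method were used, so the BT.601 luma weights and nearest-neighbor resizing below are assumptions, and the function names and the tiny toy frame are hypothetical. Noise removal, contrast adjustment, and Gaussian blur are not shown.

```python
def to_grayscale(rgb_frame):
    """Convert rows of (r, g, b) pixels to luma using BT.601 weights (assumed)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_frame]


def resize_nearest(frame, out_h=224, out_w=224):
    """Resize a 2D frame by nearest-neighbor sampling (assumed method)."""
    in_h, in_w = len(frame), len(frame[0])
    return [[frame[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


# Hypothetical 2x3 RGB frame standing in for one extracted video frame.
frame = [[(255, 0, 0), (0, 255, 0), (0, 0, 255)],
         [(0, 0, 0), (255, 255, 255), (128, 128, 128)]]
gray = to_grayscale(frame)
resized = resize_nearest(gray)
```

In practice an image library would normally handle these steps; the sketch only makes the shape change explicit: every frame ends up as a 224 × 224 single-channel array.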

Fig. 1. Five types of dance movement.

Table 1. The Dance Video Input.

| Movement | Number of videos | Proportion |
|---|---|---|
| Drawing circles with legs | 189 | 18.9% |
| Small kick | 213 | 21.3% |
| Two-position mid-jump | 147 | 14.7% |
| Single-leg squat | 218 | 21.8% |
| Large squat | 233 | 23.3% |

3.2 Experiment Design

To verify the performance of the optimized CNN method in recognizing ballet movements, traditional CNN and SVM methods were also tested. The optimized CNN was obtained by adding PSO to the traditional CNN, so the two methods were compared to test how much PSO improves the recognition performance of the CNN. The SVM method is a traditional machine learning classification algorithm. Because recognizing ballet movements can also be regarded as classifying dance movement types, the SVM was used to verify the performance of the optimized CNN method against other recognition algorithms [15]. The 1,000 ballet videos in the initial dataset were divided into a training set and a test set at an 8:2 ratio. The parameters of the traditional CNN are shown in Table 2, and the optimized CNN used the same parameters. The PSO parameters were as follows: particle swarm size, 20; learning factor, 1.5; and inertia weight, 0.8. For the SVM method, the kernel was a sigmoid function, and the penalty factor was set to 1.
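The 8:2 division can be sketched as a simple shuffled split over the per-class video counts from Table 1. The paper does not say whether the split was stratified by movement type or how it was randomized, so the plain random shuffle, the fixed seed, and the names `VIDEO_COUNTS` and `split_dataset` are assumptions made for illustration.

```python
import random

# Per-class video counts as reported in Table 1.
VIDEO_COUNTS = {
    "drawing circles with legs": 189,
    "small kick": 213,
    "two-position mid-jump": 147,
    "single-leg squat": 218,
    "large squat": 233,
}


def split_dataset(counts, train_ratio=0.8, seed=0):
    """Shuffle (label, index) pairs and cut them at the training ratio."""
    samples = [(label, idx)
               for label, n in counts.items()
               for idx in range(n)]
    random.Random(seed).shuffle(samples)
    cut = round(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]


train_set, test_set = split_dataset(VIDEO_COUNTS)
```

With 1,000 videos this yields 800 training samples and 200 test samples; a stratified split would additionally keep the class proportions of Table 1 in both parts.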

Table 2. Initial CNN Parameter Settings.

| Parameter | Value |
|---|---|
| Batch size | 5 |
| Learning rate | 0.001 |
| Optimizer | Adam |
| Activation function | ReLU |
| Number of iterations | 30 |

3.3 Analysis of Results

The SVM method computed the support vector hyperplane directly from the data in the training set, unlike the step-by-step iterative training of the traditional and optimized CNN methods. Fig. 2 shows the error convergence curves of the traditional and optimized CNN methods during training. The recognition error of both methods decreased and stabilized as the number of iterations increased. The optimized CNN converged faster, stabilizing after about five iterations, whereas the traditional CNN stabilized after about 20 iterations. After convergence, the error of the optimized CNN method was smaller than that of the traditional CNN.

Fig. 2. The convergence curves of the traditional and optimized CNN methods.

The impact of the number of consecutive frames on recognition accuracy was determined first. The 1,000 dance videos were converted to images, and six sets of consecutive frames, from 5 to 30 frames in steps of five, were extracted for experimentation. Fig. 3 shows that as the number of consecutive frames increased, recognition accuracy also increased and gradually stabilized, with very little gain after the number of frames exceeded 25. More consecutive frames require more computation and more time. Therefore, considering the running time of the experiment, the final number of frames was set at 25.
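The six frame-count settings can be written down as a short sketch. The helper `consecutive_clip`, the choice to start each clip at the first frame, and the 120-frame stand-in video are assumptions; the text does not state where in each video the consecutive frames were taken from.

```python
# Candidate numbers of consecutive frames: 5, 10, 15, 20, 25, 30.
FRAME_COUNTS = list(range(5, 31, 5))


def consecutive_clip(frames, count, start=0):
    """Return `count` consecutive frames beginning at index `start`."""
    if start + count > len(frames):
        raise ValueError("video is too short for the requested clip")
    return frames[start:start + count]


# Hypothetical video represented by its frame indices.
video = list(range(120))
clips = {count: consecutive_clip(video, count) for count in FRAME_COUNTS}
```

Each clip length would then be fed to the same network so that accuracy and running time can be compared across the six settings.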

Fig. 3. The influence of the number of consecutive frames from an image on recognition accuracy.

The recognition accuracy and speed of the three algorithms are shown in Table 3. The SVM method had the lowest accuracy and was the slowest. The traditional CNN method was more accurate and faster, and the optimized CNN method had the highest accuracy and the shortest recognition time.

Table 3. Results of ballet movement recognition by the different methods.

| Movement | Accuracy (%): SVM | Accuracy (%): Traditional CNN | Accuracy (%): Optimized CNN | Speed (s): SVM | Speed (s): Traditional CNN | Speed (s): Optimized CNN |
|---|---|---|---|---|---|---|
| Drawing circles with legs | 82.03 | 90.31 | 95.79 | 3.47 | 2.03 | 1.13 |
| Small kick | 85.69 | 91.22 | 96.62 | 3.56 | 2.15 | 1.69 |
| Two-position mid-jump | 85.27 | 89.45 | 94.27 | 2.98 | 1.87 | 1.02 |
| Single-leg squat | 86.16 | 92.32 | 97.09 | 3.19 | 2.36 | 1.55 |
| Large squat | 81.72 | 89.24 | 94.51 | 3.41 | 2.47 | 1.36 |
| Average | 84.17 | 90.16 | 95.66 | 3.32 | 2.68 | 1.35 |

4. Discussion

As an art form, dance not only cultivates the emotions but also exercises the body. Traditionally, a coach helps correct movements during dance practice, but coaches bring their own habits to the teaching process and have limited energy, which may result in incorrect movements by the dancers and limited teaching efficiency. As intelligent algorithms have developed, they have gradually been applied in the field of image recognition, and this paper applied them to recognize dance movements. To facilitate the research, this paper focused on ballet, used a CNN to recognize movements, and introduced the PSO algorithm to adjust the weight parameters and improve the recognition performance of the CNN method. The optimized CNN method was then compared with the traditional CNN and SVM methods, with the results shown above. In that comparison, the optimized CNN method had the highest efficiency and recognition accuracy for ballet movements, the traditional CNN method was second, and the SVM method was last. The reasons are as follows. When recognizing images of dance movements, the SVM method first had to obtain the image features. Since the features were extracted manually, the information contained in them was not comprehensive enough. Although the SVM method used a kernel function to project the image features into a high-dimensional space, it was still difficult to fit the SVM hyperplane effectively to the nonlinear features of the images. In contrast, the traditional CNN automatically extracted image features using convolutional kernels and combined the features from more than one kernel into global convolutional features, which made full use of the image feature information. Its activation function effectively fit the nonlinear relationships, so it provided better recognition accuracy and efficiency. The optimized CNN, which used the PSO algorithm to adjust the parameters, retained the advantages of the traditional CNN method. Moreover, iterating the particle swarm with PSO helped avoid overfitting during CNN training, so its recognition accuracy and efficiency were the highest.

5. Conclusion

This paper introduced human movement recognition via CNN approaches. Based on deep learning, a CNN was used to identify ballet movements, and its training was optimized using the PSO algorithm. Then, 1,000 ballet movement videos were used as the dataset for comparing the optimized CNN, traditional CNN, and SVM methods. Compared to the traditional CNN, the optimized method converged faster during training, with less error after convergence. The SVM method showed the lowest efficiency and recognition accuracy for the ballet movements, the traditional CNN was higher on both, and the optimized CNN method was the highest.

This paper used the convolutional kernels in the CNN method to automatically extract image features for recognizing ballet movements. To improve the recognition performance of the CNN algorithm, the PSO algorithm was introduced to adjust the weight parameters during the training process, which provides an effective reference for the intelligent recognition of ballet movements and for assisted practice of dance movements. The limitation of this paper is that the optimized CNN method was only used for ballet movement recognition, so a future research direction is to generalize the CNN algorithm to recognize other kinds of dance movements.

REFERENCES

[1] A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad, and S. W. Baik, "Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features," IEEE Access, Vol. 6, No. 99, pp. 1155-1166, 2018.
[2] B. K. Gao, L. Dong, H. B. Bi, and Y. Z. Bi, "Focus on temporal graph convolutional networks with unified attention for skeleton-based action recognition," Applied Intelligence, Vol. 52, pp. 5608-5616, 2022.
[3] H. Wang and L. Wang, "Learning content and style: Joint action recognition and person identification from human skeletons," Pattern Recognition, Vol. 81, pp. 23-35, Sep. 2018.
[4] Z. Yang, Y. Li, J. Yang, and J. Luo, "Action Recognition with Spatio-Temporal Visual Attention on Skeleton Image Sequences," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 29, No. 8, pp. 2405-2415, Aug. 2018.
[5] F. Malawski and B. Kwolek, "Recognition of Action Dynamics in Fencing Using Multimodal Cues," Image and Vision Computing, Vol. 75, No. JUL., pp. 1-10, May 2018.
[6] Y. Xu, J. Cheng, L. Wang, H. Xia, F. Liu, and D. Tao, "Ensemble One-dimensional Convolution Neural Networks for Skeleton-based Action Recognition," IEEE Signal Processing Letters, Vol. 25, No. 7, pp. 1044-1048, Jan. 2018.
[7] H. Zhang, M. Xin, S. Wang, Y. Yang, L. Zhang, and H. Wang, "End-to-end temporal attention extraction and human action recognition," Machine Vision and Applications, Vol. 29, No. 7, pp. 1127-1142, 2018.
[8] Q. Yu, P. Jiang, Y. Wang, and Z. Wang, "Research on first aid measures based on convolutional neural network recognition human actions," Chinese Critical Care Medicine, Vol. 32, No. 11, pp. 1385-1387, Nov. 2020.
[9] Y. Xu and T. T. Qiu, "Human Activity Recognition and Embedded Application Based on Convolutional Neural Network," Journal of Artificial Intelligence and Technology, Vol. 2021, No. 1, pp. 51-60, Dec. 2021.
[10] Y. Liu, D. Jiang, H. Duan, Y. Sun, and G. Li, "Dynamic Gesture Recognition Algorithm Based on 3D Convolutional Neural Network," Computational Intelligence and Neuroscience, Vol. 2021, No. 12, pp. 1-12, 2021.
[11] X. Wang and W. Zhang, "Anti-occlusion face recognition algorithm based on a deep convolutional neural network," Computers & Electrical Engineering, Vol. 96, pp. 1-12, 2021.
[12] X. Ran, Z. Shan, Y. Shi, and C. Lin, "Short-Term Travel Time Prediction: A Spatiotemporal Deep Learning Approach," International Journal of Information Technology & Decision Making (IJITDM), Vol. 18, No. 04, pp. 1087-1111, Apr. 2019.
[13] J. M. Kudari, A. Jebakumari, and S. Kumar, "Adlin Jebakumari S and Sushma B S, Image Classifier Using the Adam Optimizer and the Relu Activation Function," International Journal of Advanced Research in Engineering & Technology, Vol. 12, No. 3, pp. 56-60, Mar. 2021.
[14] A. Poernomo and D. K. Kang, "Biased Dropout and Crossmap Dropout: Learning towards effective dropout regularization in convolutional neural network," Neural Networks, Vol. 104, pp. 60-67, Apr. 2018.
[15] S. Mehrang, J. Pietilä, and I. Korhonen, "An Activity Recognition Framework Deploying the Random Forest Classifier and A Single Optical Heart Rate Monitoring and Triaxial Accelerometer Wrist-Band," Sensors, Vol. 18, No. 2, pp. 1-13, Feb. 2018.

Guiheng Zhi

Guiheng Zhi, born in October 1983, graduated from Guangxi Arts University in 2007 with a major in choreography and then stayed there to teach. He studied at Xiamen University in 2012 and obtained a master's degree in engineering in 2014. He is a lecturer and teaches courses that include basic training of classical dance, national folk dance, modern dance, and choreography techniques. His professional research directions are performance and choreography.