
  1. (College of General Education, Guangxi Vocational College of Water Resources and Electric Power, Nanning 530105, China)



AI, Tennis training, Action recognition

1. Introduction

Tennis has developed rapidly in recent years and has received extensive attention from all walks of life. At present, tennis has become an elegant sport for all ages and is especially favored by college students [1]. Tennis is a very professional sport with strong technical requirements. Among ball games, its requirements for movement and technical standards are especially strict. This is reflected in the large gap in technical level between amateur and professional players [2]. Professional athletes need to be trained from an early age, and it takes many years to reach a high level and to maintain the highest level.

In the actual tennis teaching process, factors both internal and external to students can easily cause a series of wrong actions, which greatly reduces students’ tennis level. This affects students’ enthusiasm for learning and training in tennis and also hinders the improvement of tennis teaching quality [3]. These mistakes are inevitable in the process of skill acquisition and are a normal phenomenon.

Especially in the first stage of mastery of technical movements, students are most prone to doing all kinds of wrong technical movements [4]. Therefore, in the teaching process, teachers must be good at finding and correcting students’ wrong actions in time so as to avoid the formation of wrong patterns, which will affect students’ interest in learning and the effect of teaching tennis skills. In the process of preliminarily mastering technical movements, if students’ wrong movements are not corrected in time, it is easy for students to form wrong technical habits, which will adversely affect their future tennis learning and improvement [5]. Based on this, this paper discusses the recognition of tennis training errors.

In recent years, motion recognition has become one of the most relevant topics in the field of artificial intelligence (AI) and computer vision and has been widely used in various fields. AI has brought great changes to the traditional sports industry [6]. DL (Deep learning) is a research hotspot in the field of AI, and various research results based on DL methods have been applied to practice [7].

At present, the mainstream research on human motion recognition is based on DL. Before the application of DL, the computer vision method of manually extracting features was widely used in human motion recognition [8,9]. It is of great significance to construct a method for recognizing tennis error training actions with excellent performance. As an important branch of computer vision, video-based motion recognition studies how to recognize specific human actions from specified video sequences [10].

On the basis of successfully realizing motion capture and feature extraction, motion recognition automatically recognizes human motion by analyzing the obtained human motion feature parameters. Motion recognition technology has significance and broad application prospects in man-machine interface, intelligent monitoring, and sports analysis [11]. In this study, a recognition method of wrong tennis training action was constructed based on AI technology. The random projection algorithm was used to reduce the dimension of feature vectors, and a CNN (convolutional neural network) model was used to learn the training samples after the dimension reduction to build a model for recognition of tennis training error. The preprocessing of the data in this study includes converting the original Cartesian coordinate system into a cylindrical coordinate system and normalizing the time of the skeletal motion sequence. We deepened the depth of the original NN (neural network), added a batch standardization layer after each convolution layer, and redesigned the network structure.
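The coordinate conversion mentioned above can be illustrated with a minimal Python sketch. It assumes the standard cylindrical convention $(\rho, \varphi, z)$ with the $z$ axis as the cylinder axis; the choice of axis, the function names, and the sample joint positions are assumptions for illustration and are not specified in this paper.

```python
import math

def cartesian_to_cylindrical(x, y, z):
    """Convert one 3D joint position to cylindrical coordinates (rho, phi, z)."""
    rho = math.hypot(x, y)   # radial distance from the z axis
    phi = math.atan2(y, x)   # azimuth angle in radians, in (-pi, pi]
    return rho, phi, z

def convert_frame(joints):
    """Convert every joint of one skeleton frame."""
    return [cartesian_to_cylindrical(*joint) for joint in joints]

# Two hypothetical joint positions (not data from this study).
frame = [(1.0, 0.0, 0.5), (0.0, 2.0, -1.0)]
cyl = convert_frame(frame)
```

In this convention the height coordinate is left unchanged, and only the horizontal position is re-expressed as a radius and an angle.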

2. Related Work

Hu et al. proposed a method of using the explicit learning method to train the basic skills of serving and using the implicit learning method to improve the skills [12]. Chen et al. pointed out that in the process of practical teaching, complex actions are usually decomposed. For example, the forehand technique is divided into four parts: lead, swing, hit, and follow-swing [13]. These four links are complementary to each other. Once there is an error in a previous link, the effect of a latter link will be unsatisfactory.

Liu et al. proposed a method combining a conditional random field and conditional probability density propagation for action segmentation and recognition of continuous human actions [14]. This method decomposes continuous action recognition into a divide-and-conquer method for individual action recognition. Yu et al. proposed an upper-body human object detection algorithm without pose constraints [15]. Marlon et al. proposed a multimodal-based human motion recognition method and implemented a motion data acquisition system and a human motion recognition system [16].

Nazir et al. proposed a feature dimensionality reduction and Gaussian mixture model for sports action recognition [17,18]. Ramezani et al. used ResNet-34 with a deeper network structure to improve the NN and conducted experiments to verify that the change of the network structure can indeed slightly improve the recognition accuracy [19]. Xu et al. believe that there are both subjective and objective reasons for the occurrence of wrong technical movements in tennis. Subjective reasons refer to students’ subjective willingness to learn tennis skills, and objective reasons refer to teachers’ teaching behaviors, teaching methods, and teaching environment factors [20]. In tennis teaching, teachers should be good at discovering and correcting students’ mistakes in a timely manner, and at the same time, they should also cultivate good independent learning skills and learning enthusiasm.

Yong et al. pointed out that establishing explicit rules by developing a knowledge base of tennis technique and combining it with an artificial NN to analyze technical movements will be a future trend in tennis technique diagnosis. The application of AI to movement technique analysis will represent significant progress in sports biomechanics [21]. Lin et al. considered the limitation of observation sequences and the label bias problem in traditional probabilistic models and proposed a method to describe a human action feature sequence based on a conditional random field model [22].

3. Methodology

3.1 Technology and Theoretical Basis of Motion Recognition

Tennis is extremely demanding, not only in terms of technique but also in terms of physical condition [23]. Because its technical actions are complicated, mistakes are easily made during learning. Getting started is particularly difficult for beginners, and incorrect technical movements may even cause sports injuries such as muscle strain. If wrong actions are not corrected in time at an early stage, students will easily form wrong technical habits, which will affect the mastery and improvement of technique in the next stage [24]. Therefore, in the process of teaching, teachers should diagnose and analyze students’ wrong actions in time and propose correction methods so as to better improve the quality of tennis teaching.

Human motion recognition based on video is a basic subject in computer vision research. A DL model is a nonlinear network model with many hidden layers. Through training with large-scale original data, the network can extract the features that can best express the original data and then predict or classify samples. The DL model architecture is shown in Fig. 1. With rapid development, DL technology has an advantage in the fields of computer vision, natural language processing, and so on [25].

Fig. 1. DL model architecture diagram.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig1.png

A CNN extracts features hierarchically, from its bottom (input-side) layers to its top layers. In the lower layers, it learns simple edge and color features, such as lines, curves, and colors. As the hierarchy deepens, it gradually learns more complex features, such as shapes, object parts, and ultimately complete objects. This bottom-up feature extraction is the strength of a CNN, as it allows the network to automatically find the most useful features for a recognition task. A CNN is also fairly robust to changes in image size, rotation, and flipping, which helps it perform well in many tasks, especially image recognition. Compared with the traditional way of extracting features manually, a CNN automatically learns richer and more abstract features directly from the data. An NN can approximate any nonlinear continuous function with arbitrary precision, which matters because many modeling problems are highly nonlinear. With the continuous development of DL technology, CNNs have been widely used by researchers, and their effectiveness has been well verified in many network models. A CNN adopts partial (local) connections: each neuron is connected to only some of the neurons in the previous layer. Generally, a CNN consists of three parts: a convolution layer for extracting features, a pooling layer for reducing the size of the feature map, and a fully connected layer. A CNN is characterized by its convolution operation, and the convolution process is shown in Fig. 2.

Fig. 2. Convolution process.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig2.png

Within a pixel block of the image, each pixel value is multiplied by the corresponding weight of the convolution kernel, and the products are summed to yield one output value for that block. The convolution kernel is then shifted to select a new pixel block, and the computation is repeated until the entire image has been covered. This series of operations is referred to as the initial convolution process.
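The sliding-window computation described above can be sketched in a few lines of Python. This is a minimal single-channel illustration with no padding and a stride of 1. As is usual in CNNs, the kernel is not flipped. The function name `conv2d_valid` and the small input values are hypothetical.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the output map."""
    height, width = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for i in range(height - kh + 1):
        row = []
        for j in range(width - kw + 1):
            acc = 0.0
            # Multiply each pixel of the current block by the matching kernel weight.
            for a in range(kh):
                for b in range(kw):
                    acc += image[i + a][j + b] * kernel[a][b]
            row.append(acc)
        output.append(row)
    return output

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
result = conv2d_valid(image, kernel)  # a 2 x 2 output map
```

A 3 x 3 input and a 2 x 2 kernel give a 2 x 2 output, because the kernel fits in two positions along each axis.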

3.2 Human Motion Recognition Method

Human tracking is a technology that uses various sensors, algorithms, and computer vision techniques to identify and track the position and posture of the human body in space in real time. By establishing a motion model of the human body, it is possible to track and recognize the human body. For example, a 3D human model can be used to simulate human motion, and algorithms can be used to fit actual human motion data. By comparing the motion models over several consecutive frames, the correspondence between human bodies or joint points can be determined. There are many ways of matching, such as location-based matching, material and color-based matching, and speed-based matching. The feature extraction method and the recognition algorithm are the two most important parts of the recognition process. A diagram of human motion recognition is shown in Fig. 3.

Fig. 3. Diagram of human motion recognition.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig3.png

After different features are used to represent human movements, the recognition of human movements becomes a pattern classification problem. Classifiers can be divided into linear classifiers and nonlinear classifiers according to different classification planes. A nonlinear classification algorithm is difficult to solve, and a large number of human motion classification methods use a linear classifier.

In the task of human motion recognition, data modalities are generally divided into three categories: video data, depth images, and skeletal motion sequences. Different algorithms or models are designed according to the data modality of the recognition task. The core idea of the method is to extract the whole human body contour (which includes motion features, the whole structure, and the external shape of the human body). We make use of these three characteristics in the model. Finally, the motion recognition is completed by the constructed model.

3.3 Construction of Model

In the skeletal motion sequence modality, each sample represents a moving individual by a human skeleton with 25 joints, and the change in the 3D positions of the joints over time represents the human movement. The joint coordinates are normalized so that the coordinate data are not affected by scale. A batch standardization layer was used to optimize the model and enhance the generalization ability of the network. It normalizes the scattered activations to a common distribution, accelerates the convergence of the loss function, and helps to mitigate vanishing gradients so that gradients propagate through the network.
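As a rough illustration of what a batch standardization (batch normalization) layer computes, the following Python sketch normalizes one feature over a mini-batch to zero mean and unit variance and then applies a scale and shift. The parameter names `gamma`, `beta`, and `eps`, their default values, and the sample activations are assumptions for illustration, not values reported in this study.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n   # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

activations = [2.0, 4.0, 6.0, 8.0]   # hypothetical activations of one channel
normalized = batch_norm(activations)
```

In a trained network, `gamma` and `beta` would be learned parameters, and running statistics would replace the batch statistics at inference time.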

We collected tennis training action videos and extracted tennis training action features. Consider a convolution operation on a 2D input of size $N\times N$ with a convolution kernel of size $k\times k$, stride $s$, and zero padding $p$. If the output size after convolution is $m\times m$, the calculation of $m$ is as follows:

(1)
$ m=\left[\frac{N+2p-k}{s}\right]+1 $

$\left[\cdot \right]$ denotes rounding down. With suitable convolution parameters (for example, $s=1$ and $p=(k-1)/2$ for odd $k$), the output keeps the size of the input, so a convolution does not necessarily reduce the input dimension, although it does in most cases. In convolutional networks, down-sampling is generally performed by the pooling layer, which also achieves feature fusion and smoothing.
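The output-size relation can be checked numerically with a short Python sketch. It uses the standard form $\left[(N+2p-k)/s\right]+1$. The function name `conv_output_size` and the example sizes are hypothetical.

```python
def conv_output_size(n, k, s=1, p=0):
    """Output side length for an n x n input, k x k kernel, stride s, zero padding p."""
    return (n + 2 * p - k) // s + 1

same = conv_output_size(32, 3, s=1, p=1)   # padding chosen so the size is preserved
down = conv_output_size(32, 3, s=2, p=1)   # stride 2 roughly halves the size
```

With a 3 x 3 kernel, a padding of 1 and a stride of 1 keep a 32 x 32 input at 32 x 32, whereas a stride of 2 reduces it to 16 x 16.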

Fig. 4. ReLU function diagram.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig4.png

The ReLU function is shown in Fig. 4. When the input value is negative, its output is 0. When the input is positive, the output is the input. Its expression is as follows:

(2)
$ \mathrm{ReLU}\left(x\right)=\max \left(0,x\right) $

A sigmoid function maps the input to the interval (0,1) so that the information passed through the network does not diverge. Its expression is as follows:

(3)
$ Sig\left(x\right)=\frac{1}{1+e^{-x}} $

The tanh function maps the input to the interval (-1,1), and its average value is 0. The function expression is as follows:

(4)
$ \mathrm{Tanh}\left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}} $
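For reference, the three activation functions in Eqs. (2)-(4) can be written directly in Python. This is only an illustrative sketch; the function names are chosen here for readability.

```python
import math

def relu(x):
    """Eq. (2): zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def sigmoid(x):
    """Eq. (3): maps any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Eq. (4): maps any real input into the interval (-1, 1)."""
    ex, enx = math.exp(x), math.exp(-x)
    return (ex - enx) / (ex + enx)

samples = [-2.0, 0.0, 2.0]
outputs = [(relu(v), sigmoid(v), tanh(v)) for v in samples]
```

At an input of 0, the sigmoid function returns 0.5 and the tanh function returns 0, which reflects their different output ranges.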

Suppose that in a 2D CNN, the input tensor of layer $l$ has size $C^{l}\times H^{l}\times W^{l}$, where $C^{l}$ is the number of input channels of layer $l$. The size of a single convolution kernel is $C^{l}\times h^{l}\times w^{l}$. Then, for a convolutional layer with $C^{l+1}$ hidden neurons, the output at the corresponding position is as follows:

(5)
$ y_{d,{i^{l}},{j^{l}}}=\sigma \left(\sum _{c^{l}=0}^{C^{l}-1}\sum _{i=0}^{h^{l}-1}\sum _{j=0}^{w^{l}-1}p_{d,{c^{l}},i,j}\times x_{{c^{l}},{i^{l}}+i,{j^{l}}+j}+b_{d}\right) $

$d$ is the index of the neuron in layer $l$, and $i^{l}$ and $j^{l}$ represent the location information. With zero-based indices, the constraints are Eqs. (6) and (7):

(6)
$ 0\leq i^{l}\leq H^{l}-h^{l} $
(7)
$ 0\leq j^{l}\leq W^{l}-w^{l} $

$p$ is the convolution kernel parameter, $b$ is the bias parameter in the convolution, and $\sigma \left(\cdot \right)$ is the activation function.
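To make the indexing in Eq. (5) concrete, the following Python sketch evaluates one convolutional layer with several input channels and several kernels, using zero-based indices, no padding, and a stride of 1. The function name `conv_layer`, the nested-list tensor layout, and the tiny example tensors are assumptions made for illustration.

```python
def conv_layer(x, kernels, biases, activation):
    """x: [C][H][W]; kernels: [D][C][h][w]; biases: [D]. Returns [D][H-h+1][W-w+1]."""
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    h, w = len(kernels[0][0]), len(kernels[0][0][0])
    output = []
    for d, kern in enumerate(kernels):            # one output map per kernel d
        fmap = []
        for il in range(height - h + 1):          # output row index i^l
            row = []
            for jl in range(width - w + 1):       # output column index j^l
                acc = biases[d]
                for c in range(channels):
                    for i in range(h):
                        for j in range(w):
                            acc += kern[c][i][j] * x[c][il + i][jl + j]
                row.append(activation(acc))
            fmap.append(row)
        output.append(fmap)
    return output

def relu(v):
    return max(0.0, v)

x = [[[1.0, 2.0], [3.0, 4.0]],        # channel 0
     [[0.0, 1.0], [1.0, 0.0]]]        # channel 1
kernels = [[[[1.0]], [[1.0]]],        # kernel 0: sums both channels
           [[[-1.0]], [[0.0]]]]       # kernel 1: negates channel 0
biases = [0.5, 0.0]
y = conv_layer(x, kernels, biases, relu)
```

With 1 x 1 kernels the output keeps the 2 x 2 spatial size, and the second output map is all zeros because ReLU clips the negated inputs.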

Given an image $P$ of size $M\times N$, the image matrix can be regarded as a one-dimensional vector in row-major order, and a one-dimensional label vector is defined:

(8)
$ A=\left(A_{1},\ldots ,A_{i},\ldots ,A_{M\times N}\right) $

Each element $A_{i}$ of $A$ takes a value in $\left\{0,1\right\}$. The image segmentation effect can be evaluated by calculating the cost function of the labeling $A$.

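(9)
$ E\left(A\right)=\lambda \cdot R\left(A\right)+B\left(A\right) $
(10)
$ R\left(A\right)=\sum _{i=1}^{M\times N}R_{i}\left(A_{i}\right) $
(11)
$ B\left(A\right)=\sum _{\left\{i,j\right\}}B_{ij}\delta \left(A_{i},A_{j}\right) $

Here, $R\left(A\right)$ is the regional term, $B\left(A\right)$ is the boundary term summed over pairs $\left\{i,j\right\}$ of neighboring pixels, and $\lambda$ weights the regional term against the boundary term. In the usual formulation of this cost function, $\delta \left(A_{i},A_{j}\right)$ equals 1 when $A_{i}\neq A_{j}$ and 0 otherwise.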
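The cost function can be evaluated for a small labeling with the Python sketch below. It assumes that $\delta \left(A_{i},A_{j}\right)$ is 1 when the two labels differ and 0 otherwise, which is the usual convention for this kind of energy but is an assumption here. The function name, the per-pixel costs, and the neighbor weights are hypothetical.

```python
def segmentation_energy(labels, region_cost, boundary_cost, lam=1.0):
    """E(A) = lam * R(A) + B(A) for a flattened binary labeling.

    labels: list of 0/1 labels, one per pixel (row-major order).
    region_cost[i][a]: cost R_i of giving label a to pixel i.
    boundary_cost: dict mapping neighboring pixel pairs (i, j) to B_ij.
    """
    region = sum(region_cost[i][a] for i, a in enumerate(labels))
    # Assumed delta: a pair contributes only when its two labels differ.
    boundary = sum(b for (i, j), b in boundary_cost.items() if labels[i] != labels[j])
    return lam * region + boundary

labels = [0, 0, 1]
region_cost = [[0.1, 0.9], [0.2, 0.8], [0.7, 0.3]]
boundary_cost = {(0, 1): 1.0, (1, 2): 0.5}
energy = segmentation_energy(labels, region_cost, boundary_cost, lam=2.0)
```

Only the pair (1, 2) crosses a label boundary in this example, so it is the only pair that adds to the boundary term.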

Let $p_{\lambda }$ be a probability density function with parameter vector $\lambda =\left[\lambda _{1},\lambda _{2},\lambda _{3},\ldots ,\lambda _{M}\right]$ of $M$ parameters, and let $X=\left[x_{t},t=1,2,3,\ldots ,T\right]$ be the effective features of a tennis training action video, where $d$ is the feature dimension after dimension reduction. The density is modeled with $k$ Gaussian units, whose parameter set is:

(12)
$ \lambda =\left\{w_{i},u_{i},\Sigma _{i}\right\}\,\,\,i=1,2,3,\ldots ,k $

Its Gaussian mixture model is:

(13)
$ p_{\lambda }\left(x_{t}\right)=\sum _{i=1}^{k}w_{i}p_{i}\left(x_{t}\right) $

$w_{i}$, $u_{i}$, and $\Sigma _{i}$ are the mixture weight, mean vector, and covariance matrix of the $i$th Gaussian unit, respectively, and $p_{i}\left(x_{t}\right)$ is the density of the $i$th Gaussian unit evaluated at $x_{t}$. According to Bayes' rule, the probability that $x_{t}$ is assigned to the $i$th Gaussian unit is:

(14)
$ r_{t}\left(i\right)=w_{i}p_{i}\left(x_{t}\right)/\sum _{j=1}^{k}w_{j}p_{j}\left(x_{t}\right) $

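Then, the gradients of the log-likelihood $\mathrm{\ell }_{\lambda }\left(X\right)=\sum _{t=1}^{T}\log p_{\lambda }\left(x_{t}\right)$ with respect to the parameters $\lambda =\left\{w_{i},u_{i},\Sigma _{i}\right\}\,\,\,i=1,2,3,\ldots ,k$ are expressed as: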

(15)
$ \frac{\partial \mathrm{\ell }_{\lambda }\left(X\right)}{\partial w_{i}}=\sum _{t=1}^{T}\left[\frac{r_{t}\left(i\right)}{w_{i}}-\frac{r_{t}\left(1\right)}{w_{1}}\right] $
(16)
$ \frac{\partial \mathrm{\ell }_{\lambda }\left(X\right)}{\partial u_{i}^{k}}=\sum _{t=1}^{T}r_{t}\left(i\right)\left[\frac{x_{t}^{k}-u_{i}^{k}}{\left(\sigma _{i}^{k}\right)^{2}}\right] $
(17)
$ \frac{\partial \mathrm{\ell }_{\lambda }\left(X\right)}{\partial \sigma _{i}^{k}}=\sum _{t=1}^{T}r_{t}\left(i\right)\left[\frac{\left(x_{t}^{k}-u_{i}^{k}\right)^{2}}{\left(\sigma _{i}^{k}\right)^{3}}-\frac{1}{\sigma _{i}^{k}}\right] $

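where the superscript $k$ in $x_{t}^{k}$, $u_{i}^{k}$, and $\sigma _{i}^{k}$ indexes the vector dimension, and $\sigma _{i}^{k}$ represents the standard deviation of that dimension in the covariance matrix $\Sigma _{i}$, which is taken to be diagonal.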
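The assignment probability in Eq. (14) and the mean gradient in Eq. (16) can be illustrated with the following Python sketch for a Gaussian mixture with diagonal covariance. The function names, the two-unit mixture, and the three sample points are hypothetical and serve only to show the computation.

```python
import math

def gaussian_diag(x, mean, std):
    """Density of a diagonal-covariance Gaussian evaluated at point x."""
    p = 1.0
    for xd, md, sd in zip(x, mean, std):
        p *= math.exp(-0.5 * ((xd - md) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
    return p

def responsibilities(x, weights, means, stds):
    """Eq. (14): probability that x is assigned to each Gaussian unit."""
    joint = [w * gaussian_diag(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

def grad_mean(X, i, weights, means, stds):
    """Eq. (16): gradient of the log-likelihood w.r.t. each dimension of the i-th mean."""
    dims = len(means[i])
    grad = [0.0] * dims
    for x in X:
        r = responsibilities(x, weights, means, stds)[i]
        for dim in range(dims):
            grad[dim] += r * (x[dim] - means[i][dim]) / stds[i][dim] ** 2
    return grad

weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
stds = [[1.0, 1.0], [1.0, 1.0]]
X = [[0.0, 0.0], [4.0, 4.0], [2.0, 2.0]]
r_mid = responsibilities([2.0, 2.0], weights, means, stds)
g0 = grad_mean(X, 0, weights, means, stds)
```

A point halfway between the two means is shared equally by both units, and it is this shared point that dominates the gradient of the first mean in the example.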

The identification process essentially selects a model from the set of models that best describes the observed signal. We first perform feature extraction on the input action sequence to obtain an observation feature sequence $O$ corresponding to the action sequence. We then calculate the probability $P\left(S\left| O,\theta \right.\right)$ corresponding to each model in the model set. Finally, the category of the action is determined according to the following maximum likelihood equation:

(18)
$ S^{\ast }=\arg \max \,P\left(S\left| O,\theta \right.\right) $

We calculate the conditional probability $P\left(S\left| O\right.\right)$ of the label sequence $S$ given the observation sequence $O$ and use the forward-backward dynamic programming algorithm to find the action labeling with the highest probability as the recognition result of this step. In this study, after 3D pooling, both the spatial size and the temporal size of the feature map are reduced, which greatly reduces the amount of calculation in the subsequent network. The max pooling operation can effectively reduce the number of parameters and the computational complexity of the network. This enables the network to process input data faster, improving runtime efficiency.
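The size reduction achieved by 3D max pooling can be seen in the Python sketch below, which pools a small volume indexed as time x height x width with non-overlapping cubic windows. The window size of 2, the function name `max_pool3d`, and the synthetic volume are assumptions for illustration. The pooling configuration actually used in this study is not specified here.

```python
def max_pool3d(volume, size=2):
    """Max pooling over non-overlapping size x size x size windows of a [T][H][W] volume."""
    frames, height, width = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for t in range(0, frames - size + 1, size):
        plane = []
        for i in range(0, height - size + 1, size):
            row = []
            for j in range(0, width - size + 1, size):
                # Keep only the largest value inside the current window.
                row.append(max(
                    volume[t + a][i + b][j + c]
                    for a in range(size)
                    for b in range(size)
                    for c in range(size)
                ))
            plane.append(row)
        pooled.append(plane)
    return pooled

# Synthetic 2 x 4 x 4 volume whose values increase with t, i, and j.
volume = [[[t * 16 + i * 4 + j for j in range(4)] for i in range(4)] for t in range(2)]
pooled = max_pool3d(volume)
```

The 32 input values are reduced to 4 outputs, which shows how pooling shrinks both the temporal and the spatial size passed to later layers.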

4. Result Analysis and Discussion

To analyze the recognition of tennis training actions by feature reduction and the Gaussian mixture model, this study used 8,564 tennis training video samples with complex backgrounds and a wide viewing angle. Of these, 5,000 samples were selected as the training set, and the remaining 3,564 samples were used as the test set. Images were cropped at random positions, and the cropping scheme was designed to ensure that the final cropped image still contains human body information.

At the same time, to enhance the network's generalization ability, prevent overfitting during training, and reduce the network's sensitivity to noise, 10% of the cropped images were also flipped, rotated, and scaled. Before the formal training of the model, the parameters of the model were initialized. In the convolution and batch standardization layers, the convolution kernel parameters and batch standardization parameters were all initialized as random numbers with a mean value of 0.02 and a standard deviation of 1, and the offset parameter was set to 0. The training of the model is shown in Fig. 5.

Fig. 5. Model training.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig5.png

The movement of a human body needs corresponding characteristic parameters to describe it, and different movements may focus on different characteristic parameters, so it is necessary to use features that are as effective and appropriate as possible to express complex movement characteristics. Because the dynamic range distribution of different dimension data is very different, this study normalized each dimension data to make it a unit vector with a modulus of 1, which is convenient for subsequent calculation.
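A minimal Python sketch of scaling a feature vector to unit modulus is given below. It assumes that the normalization is an L2 normalization of each feature vector, which is one plausible reading of the description above. The function name, the `eps` guard, and the sample vector are hypothetical.

```python
import math

def l2_normalize(vec, eps=1e-12):
    """Scale a feature vector so that its modulus (L2 norm) equals 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]   # eps avoids division by zero

feature = [3.0, 4.0]
unit = l2_normalize(feature)
unit_norm = math.sqrt(sum(v * v for v in unit))
```

After normalization, features with very different dynamic ranges lie on the same scale, which simplifies the subsequent calculations.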

After passing through the classifier, features are mapped to a vector of classification scores. This vector represents the probability distribution of the samples corresponding to the motion features across all motion categories. The index position corresponding to the maximum probability value is then selected as the predicted classification result of the model. The classification accuracy of different algorithms is shown in Fig. 6.
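The final prediction step can be sketched in Python as follows. The sketch assumes that the classification scores are turned into a probability distribution with a softmax function, which is a common choice but is not stated explicitly above. The class names follow the action types in Table 1, and the scores are hypothetical.

```python
import math

def softmax(scores):
    """Convert raw classification scores into a probability distribution."""
    top = max(scores)                            # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(scores):
    """Return the index of the most probable class and the full distribution."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

classes = ["serve", "throw a ball", "strike a ball", "swing action"]
idx, probs = predict([1.2, 0.3, 2.5, 0.1])
label = classes[idx]
```

The index of the largest probability is used to look up the predicted action category.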

Fig. 6. Classification accuracy of different algorithms.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig6.png

In image recognition or object recognition tasks, skeleton information usually refers to the internal structural information of an object, which reflects its overall shape and structural characteristics. Depth information refers to the distance between an object and the camera, which reflects the three-dimensional shape of the object. When the skeleton information is inaccurate, interference factors such as large shape changes and an unstable structure can disturb the recognition algorithm and lower the recognition rate. Depth information provides a more stable shape description, so using it can reduce the interference caused by skeleton instability and improve recognition accuracy. The tennis training action recognition results of this method and a previous method are shown in Table 1.

Table 1. Accuracy of tennis training action recognition.

| Action type | Training sample: LPMR + Single-stream method | Training sample: Methods of this study | Test sample: LPMR + Single-stream method | Test sample: Methods of this study |
|---|---|---|---|---|
| Serve | 89.54% | 93.54% | 82.34% | 93.64% |
| Throw a ball | 82.31% | 94.16% | 85.67% | 94.12% |
| Strike a ball | 85.34% | 94.87% | 88.29% | 94.25% |
| Swing action | 88.63% | 93.95% | 80.15% | 93.47% |

All the action videos in this database were preprocessed by decomposing them into images stored in folders. The decomposed action videos were organized into a training set of non-overlapping 32-frame clips, and a cropped 32-frame clip was randomly selected as the input of the network during the training process. Fig. 7 shows the errors of different algorithms on the training set, Fig. 8 shows the errors of different algorithms on the test set, and Fig. 9 shows the time taken by different algorithms. Using the MATLAB platform, the recognition efficiency of different methods on tennis training actions was tested and evaluated by the running time. The computing times of feature dimensionality reduction for the different methods are shown in Table 2.

Table 2. Dimension reduction time of tennis training methods.

| Action type | Training sample: LPMR + Single-stream method | Training sample: Methods of this study | Test sample: LPMR + Single-stream method | Test sample: Methods of this study |
|---|---|---|---|---|
| Serve | 6.54 | 5.14 | 8.14 | 5.75 |
| Throw a ball | 8.14 | 4.53 | 6.89 | 4.01 |
| Strike a ball | 7.83 | 5.47 | 7.87 | 3.78 |
| Swing action | 6.71 | 4.61 | 8.72 | 5.40 |

Fig. 7. Errors of different algorithms on training set.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig7.png
Fig. 8. Errors of different algorithms on the test set.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig8.png
Fig. 9. Response time of different algorithms.
../../Resources/ieie/IEIESPC.2024.13.3.303/fig9.png

In order to further verify the performance of this model, four groups of experiments were conducted on the NTU-RGB+D dataset and UTD-MHAD dataset, including three groups of control experiments and one group of basic experiments. The experimental results and model performance were evaluated and analyzed according to the experimental evaluation indexes. Table 3 shows the recognition accuracy of different models on the NTU-RGB+D dataset. Table 4 shows the recognition accuracy of different models on the UTD-MHAD dataset.

Table 3. Recognition accuracy of different models on NTU-RGB+D dataset.

| Model | NTU-RGB+D |
|---|---|
| Single_stream | 73.69% |
| Multi_stream+WF | 69.84% |
| OR+Multi_stream+AF | 79.65% |
| Methods of this study | 95.34% |

Table 4. Recognition accuracy of different models on UTD-MHAD dataset.

| Model | UTD-MHAD |
|---|---|
| Single_stream | 77.32% |
| Multi_stream+WF | 70.56% |
| OR+Multi_stream+AF | 82.37% |
| Methods of this study | 94.12% |

The proposed model achieved the highest recognition accuracy on both datasets: 95.34% on the NTU-RGB+D dataset and 94.12% on the UTD-MHAD dataset. Compared with the best of the other models in Tables 3 and 4 (79.65% and 82.37%, respectively), the accuracy improved by 15.69 and 11.75 percentage points, which supports the superiority of this model.

5. Conclusion

Based on AI technology, this study constructed a recognition method for tennis error training action. In this study, the depth of the original NN was deepened, a batch standardization layer was added after each convolution layer, and the network structure was redesigned. Compared with other methods, the proposed recognition method for tennis error training action showed good recognition accuracy on the NTU-RGB+D dataset (95.34%) and UTD-MHAD dataset (94.12%). The methods proposed in this study were superior to some other methods and were comparable to those with a high recognition rate. This result verifies the superiority of the model in this study.

The model proposed in this study can provide some technical support for the recognition of wrong tennis training actions, improve the tennis teaching effect, and improve students’ learning level. It lays a good foundation for further research. However, in practical applications, there may be complex scenes that are difficult to identify. Therefore, the model needs to be sensitive enough to subtle movements. At the same time, the features extracted from the model need to have sufficient discriminability.

REFERENCES

[1] Yi Y, Zheng Z, Lin M. Realistic action recognition with salient foreground trajectories. Expert Systems with Applications, 2017, 75(JUN.): 44-55.
[2] Han Y, Yang Y, Wu F, et al. Compact and Discriminative Descriptor Inference Using Multi-Cues. IEEE Trans Image Process, 2015, 24(12): 5114-5126.
[3] Wen Z, Wang C, Xiao B, et al. Human action recognition using weighted pooling. IET Computer Vision, 2014, 8(6): 579-587.
[4] Feng L, Zhao Y, Zhao W, et al. A comparative review of graph convolutional networks for human skeleton-based action recognition. Artificial Intelligence Review, 2022, 55(5): 4275-4305.
[5] Rodrigues A, Pereira A S, Rui M, et al. Using Artificial Intelligence for Pattern Recognition in a Sports Context. Sensors, 2020, 20(11): 3040.
[6] Zhang S, Gao C, Jing Z, et al. Discriminative Part Selection for Human Action Recognition. IEEE Transactions on Multimedia, 2017, 20(99): 769-780.
[7] Wang B, Yu L, Xiao W, et al. Position and locality constrained soft coding for human action recognition. Journal of Electronic Imaging, 2013, 22(4): 041118.
[8] Wu D. Online position recognition and correction method for sports athletes. Cognitive Systems Research, 2018, 52(DEC.): 174-181.
[9] Yong D, Yun F, Liang W. Representation Learning of Temporal Dynamics for Skeleton-Based Action Recognition. IEEE Transactions on Image Processing, 2016, 25(7): 3010-3022.
[10] Ma S, Bargal S A, Zhang J, et al. Do Less and Achieve More: Training CNNs for Action Recognition Utilizing Action Images from the Web. Pattern Recognition, 2015, 68: 334-345.
[11] Niu L, Li W, Xu D. Exploiting Privileged Information from Web Data for Action and Event Recognition. International Journal of Computer Vision, 2016, 118(2): 130-150.
[12] Hu B, Yuan J, Wu Y. Discriminative Action States Discovery for Online Action Recognition. IEEE Signal Processing Letters, 2016, 23(10): 1374-1378.
[13] Chen C, Jafari R, Kehtarnavaz N. Improving Human Action Recognition Using Fusion of Depth Camera and Inertial Sensors. IEEE Transactions on Human-Machine Systems, 2015, 45(1): 51-61.
[14] Liu L, Shao L, Li X, et al. Learning Spatio-Temporal Representations for Action Recognition: A Genetic Programming Approach. IEEE Transactions on Cybernetics, 2015, 46(1): 158-170.
[15] Yu K, Yun F. Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition. International Journal of Computer Vision, 2017, 123(3): 350-371.
[16] Marlon, F, Alcantara, et al. Real-time action recognition using a multilayer descriptor with variable size. Journal of Electronic Imaging, 2016, 25(1): 13020-13020.
[17] Nazir S, Yousaf M H, Nebel J C, et al. A Bag of Expression framework for improved human action recognition. Pattern Recognition Letters, 2018, 103(FEB.1): 39-45.
[18] Liu Y, Dong H, Wang L. Trampoline Motion Decomposition Method Based on Deep Learning Image Recognition. Scientific Programming, 2021, 2021(9): 1-8.
[19] Ramezani M, Yaghmaee F. A review on human action analysis in videos for retrieval applications. Artificial Intelligence Review, 2016, 46(4): 485-514.
[20] Xu W, Miao Z, Yu J, et al. Action Recognition and Localization with Spatial and Temporal Contexts. Neurocomputing, 2019, 333(MAR.14): 351-363.
[21] Yong B, Zhang G, Chen H, et al. Intelligent monitor system based on cloud and convolutional neural networks. Journal of Supercomputing, 2017, 73(7): 3260-3276.
[22] Lin B, Fang B, Yang W, et al. Human Action Recognition Based on Spatio-temporal Three-Dimensional Scattering Transform Descriptor and An Improved VLAD Feature Encoding Algorithm. Neurocomputing, 2018, 348(JUL.5): 145-157.
[23] Lemieux N, Noumeir R. A Hierarchical Learning Approach for Human Action Recognition. Sensors, 2020, 20(17): 4946.
[24] Wang T, Li J, Wu H N, et al. ResLNet: deep residual LSTM network with longer input for action recognition. Frontiers of Computer Science, 2022, 16(6): 1-9.
[25] Merler M, Mac K, Joshi D, et al. Automatic Curation of Sports Highlights Using Multimodal Excitement Features. IEEE Transactions on Multimedia, 2019, 21(5): 1147-1160.
Yuandong Li
../../Resources/ieie/IEIESPC.2024.13.3.303/au1.png

Yuandong Li was born in Xinxiang, Henan, China, in 1989. From 2007 to 2011, Li studied at Henan Agricultural University, receiving a bachelor's degree in 2011. He received his master's degree from Wuhan Sport University, China. He now works at Guangxi Vocational College of Water Resources and Electric Power and is studying in the Faculty of Social Sciences and Liberal Arts, UCSI University, Kuala Lumpur. His research interests include physical health and promotion, and scientific training methods and guidance.

Qiong Wang
../../Resources/ieie/IEIESPC.2024.13.3.303/au2.png

Qiong Wang was born in Zhoukou, Henan, China, in 1989. From 2007 to 2011, she studied at Zhoukou Normal University and received her bachelor's degree in 2011. From 2011 to 2014, she studied at Guilin University of Electronic Technology and received her master's degree in 2014. She now works at Guangxi Vocational College of Water Resources and Electric Power. Her research interests include basic mathematics and information and computing science.