LiYuandong*
WangQiong1
(College of General Education, Guangxi Vocational College of Water Resources and Electric
Power, Nanning 530105, China)
Copyright © The Institute of Electronics and Information Engineers(IEIE)
Keywords
AI, Tennis training, Action recognition
1. Introduction
Tennis has developed rapidly in recent years and has received extensive attention
from all walks of life. It has become an elegant sport for all ages
and is especially favored by college students [1]. Tennis is a highly technical sport: among
ball games, its requirements for movements and technical standards are extremely
strict. This is reflected in the large gap between the technical
levels of amateur and professional players [2]. Professional athletes need to be trained from an early age, and it takes many years
to reach and then maintain a high level.
In the actual tennis teaching process, factors related to the students themselves and external factors
can easily cause a series of wrong actions, which greatly lowers students’ tennis
level. This affects students’ enthusiasm for learning and training in
tennis and also hinders the improvement of tennis teaching quality [3]. Such mistakes are inevitable in the process of skill acquisition and are a normal
phenomenon.
Especially in the first stage of mastery of technical movements, students are most
prone to doing all kinds of wrong technical movements [4]. Therefore, in the teaching process, teachers must be good at finding and correcting
students’ wrong actions in time so as to avoid the formation of wrong patterns, which
will affect students’ interest in learning and the effect of teaching tennis skills.
In the process of preliminarily mastering technical movements, if students’ wrong
movements are not corrected in time, it is easy for students to form wrong technical
habits, which will adversely affect their future tennis learning and improvement [5]. Based on this, this paper discusses the recognition of tennis training errors.
In recent years, motion recognition has become one of the most active topics in
the fields of artificial intelligence (AI) and computer vision and has been widely
used in various applications. AI has brought great changes to the traditional sports industry
[6]. Deep learning (DL) is a research hotspot in the field of AI, and various research
results based on DL methods have been applied in practice [7].
At present, the mainstream research on human motion recognition is based on DL. Before
the application of DL, the computer vision method of manually extracting features
was widely used in human motion recognition [8,9]. It is of great significance to construct a method for recognizing tennis error training
actions with excellent performance. As an important branch of computer vision, video-based
motion recognition studies how to recognize specific human actions from specified
video sequences [10].
On the basis of successfully realizing motion capture and feature extraction, motion
recognition automatically recognizes human motion by analyzing the obtained human
motion feature parameters. Motion recognition technology has significance and broad
application prospects in man-machine interface, intelligent monitoring, and sports
analysis [11]. In this study, a method for recognizing wrong tennis training actions was constructed
based on AI technology. A random projection algorithm was used to reduce the dimension
of the feature vectors, and a convolutional neural network (CNN) was trained on
the dimension-reduced samples to build a model for recognizing
tennis training errors. The preprocessing of the data includes converting
the original Cartesian coordinates into cylindrical coordinates and
normalizing the duration of the skeletal motion sequences. We deepened the
original neural network (NN), added a batch normalization layer after each convolution
layer, and redesigned the network structure.
2. Related Work
Hu et al. proposed a method of using the explicit learning method to train the basic
skills of serving and using the implicit learning method to improve the skills [12]. Chen et al. pointed out that in the process of practical teaching, complex actions
are usually decomposed. For example, the forehand technique is divided into four parts:
lead, swing, hit, and follow-swing [13]. These four links are complementary to each other. Once there is an error in a previous
link, the effect of a latter link will be unsatisfactory.
Liu et al. proposed a method combining a conditional random field and conditional
probability density propagation for action segmentation and recognition of continuous
human actions [14]. This method takes a divide-and-conquer approach, decomposing continuous action recognition
into the recognition of individual actions. Yu et al. proposed an upper-body human object detection
algorithm without pose constraints [15]. Marlon et al. proposed a multimodal-based human motion recognition method and implemented
a motion data acquisition system and a human motion recognition system [16].
Nazir et al. proposed a feature dimensionality reduction and Gaussian mixture model
for sports action recognition [17,18]. Ramezani et al. used ResNet-34 with a deeper network structure to improve the NN
and conducted experiments to verify that the change of the network structure can indeed
slightly improve the recognition accuracy [19]. Xu et al. believe that there are both subjective and objective reasons for the occurrence
of wrong technical movements in tennis. Subjective reasons refer to students’ subjective
willingness to learn tennis skills, and objective reasons refer to teachers’ teaching
behaviors, teaching methods, and teaching environment factors [20]. In tennis teaching, teachers should be good at discovering and correcting students’
mistakes in a timely manner, and at the same time, they should also cultivate good
independent learning skills and learning enthusiasm.
Yong et al. pointed out that establishing explicit rules through the development of
the knowledge base of tennis technique and combining it with an artificial NN to analyze
technical movements will be a development trend in tennis technique diagnosis in the
future. The application of AI to movement technique analysis will represent significant
progress in sports biomechanics [21]. Lin et al. considered the limitations on the observation sequence and the
"label bias" problem in traditional probabilistic models, and proposed a method for describing
a human action feature sequence based on a conditional random field model [22].
3. Methodology
3.1 Technology and Theoretical Basis of Motion Recognition
Tennis is extremely demanding, not only in terms of technique but also in terms of
physical condition [23]. Because of its complicated technical actions and various other reasons, it is easy
to make mistakes in the learning process. Getting started is especially difficult for beginners,
and incorrect technical movements may even cause sports injuries
such as muscle strain. If wrong actions are not corrected in the early
stage, students will easily form wrong technical habits, which will affect their
mastery and improvement of technique in the next stage [24]. Therefore, in the process of teaching, teachers should diagnose and analyze students’
wrong actions in time and propose corrections so as to
improve the quality of tennis teaching.
Video-based human motion recognition is a fundamental topic in computer vision research.
A DL model is a nonlinear network model with many hidden layers. Through training
on large-scale raw data, the network learns the features that best
represent the data and then uses them to predict or classify samples. The DL model architecture
is shown in Fig. 1. DL technology has developed rapidly and performs strongly in fields such as computer
vision and natural language processing [25].
Fig. 1. DL model architecture diagram.
A CNN extracts features hierarchically, from the lower layers of the network to the
higher ones. In the lower layers, it learns simple features such as edges,
lines, curves, and colors. As the depth increases, it gradually learns more
complex features, such as shapes, object parts, and ultimately complete objects. This
bottom-up feature extraction is the strength of a CNN, as it allows
the network to automatically find the features most useful for the recognition task.
A CNN is also reasonably robust to changes in image scale, rotation,
and flipping, particularly when trained with corresponding data augmentation, which helps it
perform well in many tasks, especially image recognition. Compared with the traditional approach of extracting features manually, a CNN
automatically learns richer and more abstract features directly from the
data. An NN with enough hidden units can approximate any continuous nonlinear function to arbitrary
precision, which matters because many modeling problems are highly nonlinear. With the continuous
development of DL technology, CNNs have been widely used by researchers, and their effectiveness
has been verified in many network models. A CNN is locally connected:
each neuron is connected only to a small region of the previous layer rather than to all of its neurons. Generally, a CNN consists of three
parts: convolution layers for extracting features, pooling layers for reducing the
size of the feature maps, and fully connected layers. A CNN is characterized by its
convolution operation, and the convolution process is shown in Fig. 2.
Fig. 2. Convolution process.
For each pixel block of an image, the pixel values are multiplied element-wise with the corresponding
convolution kernel weights and summed to yield one output value for that block. The kernel
is then shifted to the next pixel block, and the computation is repeated until
the whole image has been covered. This series of operations
constitutes one convolution pass over the image.
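The sliding-window computation described above can be illustrated with a minimal sketch. It assumes a single-channel image stored as a list of lists, no padding, and a hypothetical 2×2 kernel chosen only for illustration; it is not the implementation used in this study.

```python
def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (both lists of lists) without padding."""
    n_rows, n_cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for top in range(0, n_rows - k_rows + 1, stride):
        row = []
        for left in range(0, n_cols - k_cols + 1, stride):
            # Multiply the pixel block element-wise with the kernel and sum.
            acc = 0.0
            for i in range(k_rows):
                for j in range(k_cols):
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


# Hypothetical 4x4 image and 2x2 kernel (illustrative values only).
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [0, 1, 2, 3],
]
kernel = [
    [1, 0],
    [0, -1],
]
feature_map = conv2d(image, kernel)  # 3x3 feature map
```

With a 4×4 input and a 2×2 kernel at stride 1, the kernel visits 3×3 positions, so the feature map is smaller than the input unless padding is added.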
3.2 Human Motion Recognition Method
Human tracking uses sensors, algorithms, and computer
vision techniques to identify and track the position and posture of the human body
in space in real time. Establishing a motion model of the human body makes it possible
to track and recognize the body. For example, a 3D human model can be used to
simulate human motion, and algorithms can be used to fit it to actual human motion data.
By comparing the motion models over several consecutive frames, the correspondence
between human bodies or joint points can be determined. There are many matching strategies,
such as location-based matching, material- and color-based matching, and speed-based
matching. The feature extraction method and the recognition algorithm are the two most
important parts of the recognition process. A diagram of human motion recognition
is shown in Fig. 3.
Fig. 3. Diagram of human motion recognition.
After features have been chosen to represent human movements, the recognition of
human movements becomes a pattern classification problem. Classifiers can be divided
into linear and nonlinear classifiers according to the form of their decision
boundary. Nonlinear classifiers are more difficult to train, so a large number
of human motion classification methods use a linear classifier.
In human motion recognition, data modalities are generally divided into
three categories: video data, depth images, and skeletal motion sequences. Different
algorithms or models are designed according to the data modality of the recognition
task. The core idea of the method in this study is to extract the whole human
body contour, which carries the motion features, the overall structure, and the external
shape of the human body. We make use of these three characteristics in the model.
Finally, motion recognition is completed by the constructed model.
3.3 Construction of Model
In the skeletal motion sequence modality, each sample represents a moving individual by a human
skeleton with 25 joints, and the change of the 3D joint positions over
time represents the human movement. The coordinates of these joint
points are normalized so that the coordinate data are not affected by scale. A
batch normalization layer was used to optimize the model and enhance the generalization
ability of the network. It normalizes scattered activations to a common distribution, accelerates
the convergence of the loss function, and helps to mitigate vanishing gradients so that
gradients propagate through the network.
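The preprocessing mentioned in the introduction (conversion from Cartesian to cylindrical coordinates and time normalization of the skeletal sequence) can be sketched as follows. This is only an illustration under stated assumptions: the z-axis is taken as the cylinder axis, time normalization is done by linear interpolation to a fixed number of frames, and the function names and the toy one-joint sequence are hypothetical.

```python
import math


def to_cylindrical(joint):
    """Convert one joint from Cartesian (x, y, z) to cylindrical (r, theta, z).

    Assumption: z is the cylinder axis.
    """
    x, y, z = joint
    return (math.hypot(x, y), math.atan2(y, x), z)


def resample_sequence(frames, target_len):
    """Linearly interpolate a sequence of frames to `target_len` frames.

    Each frame is a list of (x, y, z) joints. Interpolation is done in
    Cartesian space to avoid angle wrap-around problems.
    """
    if target_len < 2 or len(frames) < 2:
        raise ValueError("need at least two frames on both sides")
    last = len(frames) - 1
    out = []
    for t in range(target_len):
        pos = t * last / (target_len - 1)  # fractional source index
        lo = int(pos)
        hi = min(lo + 1, last)
        w = pos - lo
        out.append([
            tuple((1 - w) * a + w * b for a, b in zip(joint_lo, joint_hi))
            for joint_lo, joint_hi in zip(frames[lo], frames[hi])
        ])
    return out


# Toy sequence: 3 frames, 1 joint per frame (the study uses 25 joints).
sequence = [[(1.0, 0.0, 0.0)], [(0.0, 1.0, 0.0)], [(-1.0, 0.0, 0.0)]]
resampled = resample_sequence(sequence, target_len=5)
cylindrical = [[to_cylindrical(j) for j in frame] for frame in resampled]
```

Resampling every sequence to the same length lets sequences of different durations be fed to a network with a fixed input size.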
We collected tennis training action videos and extracted tennis training action features.
For a 2D input of size N×N, suppose the convolution kernel is k×k, the stride
is s, and the zero padding is p. If the output size
after convolution is m×m, m is calculated as follows:

$$m = \left\lfloor \frac{N - k + 2p}{s} \right\rfloor + 1 \tag{1}$$

⌊⋅⌋ means rounding down. When appropriate convolution parameters
are set, the size after convolution can be kept unchanged; not every convolution
has to reduce the input dimension, although reduction is needed in most cases. In general,
however, convolutional networks use pooling layers for downsampling to achieve
the goal of feature fusion and smoothing.
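As a quick numerical check of the output-size relation m = ⌊(N − k + 2p)/s⌋ + 1, the following sketch evaluates it for a few parameter settings. The function name and the example sizes are illustrative assumptions, not values from this study.

```python
def conv_output_size(n, k, s=1, p=0):
    """Output side length for an n x n input, k x k kernel, stride s, padding p."""
    if n + 2 * p < k:
        raise ValueError("kernel is larger than the padded input")
    return (n - k + 2 * p) // s + 1  # integer division rounds down


same_size = conv_output_size(32, 3, s=1, p=1)   # padding keeps the size at 32
halved = conv_output_size(32, 3, s=2, p=1)      # stride 2 gives 16
pooled = conv_output_size(32, 2, s=2, p=0)      # 2x2 pooling, stride 2, gives 16
```

The first case shows that a 3×3 kernel with padding 1 and stride 1 preserves the input size, while the other two show the usual halving from strided convolution or pooling.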
Fig. 4. ReLU function diagram.
The ReLU function is shown in Fig. 4. When the input value is negative, its output is 0. When the input is positive, the
output is the input. Its expression is as follows:

$$f(x) = \max(0, x) \tag{2}$$

A sigmoid function maps the input to the interval (0,1) so that the information
transmitted through the network does not diverge. Its expression is as follows:

$$f(x) = \frac{1}{1 + e^{-x}} \tag{3}$$

The tanh function maps the input to the interval (-1,1), and its output is centered at
0. The function expression is as follows:

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{4}$$
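The three activation functions can be written directly in code. The sketch below uses only the Python standard library; the sigmoid is written in a numerically stable form, which is an implementation choice rather than something stated in this study.

```python
import math


def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return max(0.0, x)


def sigmoid(x):
    """Map x to (0, 1); the two branches avoid overflow in exp()."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def tanh(x):
    """Map x to (-1, 1) using the exponential definition."""
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))


samples = [-2.0, 0.0, 3.0]
activations = [(relu(x), sigmoid(x), tanh(x)) for x in samples]
```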
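Suppose that in a 2D CNN, the input to layer $l$ is a tensor of size $C^l \times H^l \times W^l$,
where $C^l$ is the number of input channels of layer $l$. The size of a
single convolution kernel is $C^l \times h^l \times w^l$. Then, for
a convolutional layer with $C^{l+1}$ hidden neurons, the output of the corresponding
position is as follows:

$$y_{d,\, i^l,\, j^l} = \sigma\left( \sum_{c=1}^{C^l} \sum_{i=1}^{h^l} \sum_{j=1}^{w^l} p_{d,c,i,j} \, x_{c,\, i^l + i - 1,\, j^l + j - 1} + b_d \right) \tag{5}$$

$d$ is the neuron number in layer $l$ ($1 \le d \le C^{l+1}$), and $i^l$ and $j^l$ represent the location
information. For unit stride and no padding, the constraints are Eqs. (6) and (7):

$$1 \le i^l \le H^l - h^l + 1 \tag{6}$$

$$1 \le j^l \le W^l - w^l + 1 \tag{7}$$

$p$ is the convolution kernel parameter, $b$ is the bias parameter in the convolution,
and $\sigma(\cdot)$ is the activation function.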
Given an image $P$ of size $M \times N$, the image matrix can be regarded as a one-dimensional
vector in row-major order, and a one-dimensional label vector is defined:

$$A = (A_1, A_2, \ldots, A_{M \times N}) \tag{8}$$

The value range of each element $A_i$ in $A$ is {0,1}. The image segmentation
effect can be evaluated by calculating the cost function of the labeling $A$.
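Let $p_\lambda$ be the probability density function, $\lambda = [\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_M]$ be the $M$ parameter vectors of $p_\lambda$, and $X = [x_t,\ t = 1, 2, 3, \ldots, T]$ be the effective features
of tennis training action videos. $d$ is the feature dimension after dimension reduction.
The model includes $K$ Gaussian unit parameter sets:

$$\lambda = \{w_i, u_i, \Sigma_i\}, \quad i = 1, 2, 3, \ldots, K \tag{9}$$

Its Gaussian mixture model is:

$$p_\lambda(x_t) = \sum_{i=1}^{K} w_i \, p_i(x_t) \tag{10}$$

$w_i$, $u_i$, and $\Sigma_i$ are the mixture weight, mean vector, and covariance
matrix, respectively, and $p_i(x_t)$ is the density of the $i$th Gaussian unit evaluated at
$x_t$. According to the Bayesian equation, the probability
that $x_t$ is assigned to the $i$th Gaussian unit is:

$$\gamma_t(i) = \frac{w_i \, p_i(x_t)}{\sum_{j=1}^{K} w_j \, p_j(x_t)} \tag{11}$$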
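Then, treating the mixture weights as free parameters and assuming diagonal covariance matrices, the gradients of the log-likelihood $\log p_\lambda(x_t)$ with respect to $\lambda = \{w_i, u_i, \Sigma_i\}$, $i = 1, 2, 3, \ldots, K$, are expressed as:

$$\frac{\partial \log p_\lambda(x_t)}{\partial w_i} = \frac{p_i(x_t)}{p_\lambda(x_t)} \tag{12}$$

$$\frac{\partial \log p_\lambda(x_t)}{\partial u_{ki}} = \frac{w_i \, p_i(x_t)}{p_\lambda(x_t)} \cdot \frac{x_{kt} - u_{ki}}{\sigma_{ki}^{2}} \tag{13}$$

$$\frac{\partial \log p_\lambda(x_t)}{\partial \sigma_{ki}} = \frac{w_i \, p_i(x_t)}{p_\lambda(x_t)} \left[ \frac{(x_{kt} - u_{ki})^{2}}{\sigma_{ki}^{3}} - \frac{1}{\sigma_{ki}} \right] \tag{14}$$

where $x_{kt}$ and $u_{ki}$ are the $k$th components of $x_t$ and $u_i$, and $\sigma_{ki}$ represents the $k$th standard deviation in the covariance matrix
$\Sigma_i$.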
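The assignment probability of a feature vector to each Gaussian unit (Bayes' rule applied to the mixture) can be computed as in the sketch below. It assumes diagonal covariance matrices given as per-dimension standard deviations; the two-component mixture and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def gaussian_diag(x, mean, std):
    """Density of a Gaussian with diagonal covariance, evaluated at x."""
    log_p = 0.0
    for xk, uk, sk in zip(x, mean, std):
        log_p += (-0.5 * math.log(2 * math.pi) - math.log(sk)
                  - 0.5 * ((xk - uk) / sk) ** 2)
    return math.exp(log_p)


def responsibilities(x, weights, means, stds):
    """Posterior probability that x belongs to each Gaussian unit."""
    joint = [w * gaussian_diag(x, u, s)
             for w, u, s in zip(weights, means, stds)]
    total = sum(joint)  # mixture density at x
    return [j / total for j in joint]


# Hypothetical 2-component mixture in 2 dimensions.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
stds = [[1.0, 1.0], [1.0, 1.0]]

near_first = responsibilities([0.0, 0.0], weights, means, stds)
midpoint = responsibilities([2.0, 2.0], weights, means, stds)
```

A point at the mean of the first unit is assigned almost entirely to that unit, while a point halfway between two equally weighted units is shared equally between them.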
The identification process essentially selects, from the set of models, the model that
best describes the observed signal. We first perform feature extraction on the input
action sequence to obtain the corresponding observation feature sequence $O$.
We then calculate the probability $P(S \mid O, \theta)$
for each model $\theta$ in the model set $\Theta$. Finally, the category of the action
is determined according to the following maximum likelihood equation:

$$\theta^{*} = \arg\max_{\theta \in \Theta} P(S \mid O, \theta) \tag{15}$$
We calculate the conditional probability P(S|O) of the label
sequence S given the observation sequence O and use the forward-backward dynamic
programming algorithm to find the labeling with the highest probability, which is taken
as the recognition result of this step. In this study, after 3D
pooling, both the spatial size and the temporal size of the feature map are reduced,
which greatly reduces the amount of computation in the subsequent network. The max
pooling operation effectively reduces the number of parameters and the computational
complexity of the network. This enables the network to process input data
faster, improving runtime efficiency.
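To illustrate how 3D max pooling shrinks both the temporal and the spatial size of a feature map, the following sketch pools a small T×H×W volume with non-overlapping 2×2×2 windows. The window size and the toy volume are assumptions made for illustration and are not taken from the network in this study.

```python
def max_pool3d(volume, size=2):
    """Non-overlapping max pooling over a T x H x W nested list."""
    t_len, h_len, w_len = len(volume), len(volume[0]), len(volume[0][0])
    pooled = []
    for t in range(0, t_len - size + 1, size):
        plane = []
        for h in range(0, h_len - size + 1, size):
            row = []
            for w in range(0, w_len - size + 1, size):
                # Keep the largest value inside the size^3 window.
                row.append(max(
                    volume[t + dt][h + dh][w + dw]
                    for dt in range(size)
                    for dh in range(size)
                    for dw in range(size)
                ))
            plane.append(row)
        pooled.append(plane)
    return pooled


# Toy 4 x 4 x 4 volume whose values encode their (t, h, w) position.
volume = [[[t * 100 + h * 10 + w for w in range(4)]
           for h in range(4)]
          for t in range(4)]
pooled = max_pool3d(volume)  # shape 2 x 2 x 2
```

Here 64 input values are reduced to 8, so every later layer processes one eighth of the data.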
4. Result Analysis and Discussion
In order to analyze the recognition of tennis training actions using feature dimension reduction
and the Gaussian mixture model, this study used 8,564 tennis training video samples
with complex backgrounds and a wide viewing angle. A total of 5,000 samples were selected as the
training set, and the remaining 3,564 samples were used as the test set. Random cropping was used in
this study, with the cropping scheme designed to ensure that the
final cropped image contains human body information.
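One possible way to realize a random crop that always contains the human body is to restrict the crop origin so that the window covers a body bounding box. The sketch below is only an assumption about how such a constraint could be implemented: the bounding box, image size, crop size, and function name are all hypothetical, and the study does not state how the constraint was enforced.

```python
import random


def random_crop_origin(img_w, img_h, crop_w, crop_h, body_box, rng=random):
    """Pick a random (left, top) so the crop fully contains body_box.

    body_box is (x0, y0, x1, y1) in pixel coordinates (hypothetical input,
    e.g. from a person detector).
    """
    x0, y0, x1, y1 = body_box
    # The window must start at or before the box and end at or after it,
    # while staying inside the image.
    left_min, left_max = max(0, x1 - crop_w), min(x0, img_w - crop_w)
    top_min, top_max = max(0, y1 - crop_h), min(y0, img_h - crop_h)
    if left_min > left_max or top_min > top_max:
        raise ValueError("crop window cannot contain the body box")
    return rng.randint(left_min, left_max), rng.randint(top_min, top_max)


rng = random.Random(0)  # fixed seed for reproducibility
body_box = (100, 50, 180, 200)
left, top = random_crop_origin(320, 240, 160, 160, body_box, rng)
```

Sampling the origin from the feasible interval keeps the crop random while guaranteeing that the body region is never cut off.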
At the same time, in order to enhance the network's generalization ability, prevent
overfitting during training, and reduce the network's
sensitivity to noise, we also flipped, rotated, and scaled 10% of the cropped
images. Before the formal training of the model, the parameters of the model were
initialized. In the convolution and batch normalization layers, the convolution kernel
parameters and batch normalization parameters were all initialized with random
numbers with a mean value of 0.02 and a standard deviation of 1, and the offset parameter
was set to 0. The training of the model is shown in Fig. 5.
Human movement must be described by corresponding characteristic parameters,
and different movements may emphasize different parameters, so the features
used should be as effective and appropriate as possible for
expressing complex movement characteristics. Because the dynamic ranges of
different feature dimensions differ greatly, this study normalized the data of each dimension
to a unit vector with a modulus of 1, which is convenient for subsequent calculation.
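Normalizing a vector to unit modulus can be sketched as below. This is one reading of the normalization step, assuming L2 normalization of a list of values; the function name and the zero-vector handling are illustrative choices.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale vec so that its Euclidean norm (modulus) is 1."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm < eps:
        # A zero vector has no direction; return it unchanged as zeros.
        return [0.0 for _ in vec]
    return [v / norm for v in vec]


unit = l2_normalize([3.0, 4.0])  # norm of the input is 5
unit_norm = math.sqrt(sum(v * v for v in unit))
```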
After passing through the classifier, features are mapped to a vector of classification
scores. This vector represents the probability distribution of the samples corresponding
to the motion features across all motion categories. The index position corresponding
to the maximum probability value is then selected as the predicted classification
result of the model. The classification accuracy of different algorithms is shown
in Fig. 6.
Fig. 6. Classification accuracy of different algorithms.
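The final prediction step, in which classification scores are turned into a probability distribution and the index of the maximum probability is taken as the predicted class, can be sketched as follows. The use of softmax and the score values are assumptions made for illustration; the class names are the action types listed in Table 1.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(scores, labels):
    """Return the label with the highest probability and that probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


labels = ["Serve", "Throw a ball", "Strike a ball", "Swing action"]
scores = [1.2, 0.3, 3.1, 0.5]  # hypothetical classifier outputs
label, prob = predict(scores, labels)
```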
In image or object recognition tasks, skeleton information usually refers
to the internal structural information of an object, which reflects its overall
shape and structural characteristics. Depth information refers to the
distance between an object and the camera, which reflects the three-dimensional
shape of the object. When the skeleton information is inaccurate,
large shape changes and unstable structure
can interfere with the recognition algorithm and lower the recognition
rate. Depth information provides a more stable shape description, so using
it can reduce the interference caused by skeleton instability and
improve recognition accuracy. The tennis training action recognition results of this
method and a previous method are shown in Table 1.
Table 1. Accuracy of tennis training action recognition.
| Action type | Training sample: LPMR + Single-stream method | Training sample: Methods of this study | Test sample: LPMR + Single-stream method | Test sample: Methods of this study |
| --- | --- | --- | --- | --- |
| Serve | 89.54% | 93.54% | 82.34% | 93.64% |
| Throw a ball | 82.31% | 94.16% | 85.67% | 94.12% |
| Strike a ball | 85.34% | 94.87% | 88.29% | 94.25% |
| Swing action | 88.63% | 93.95% | 80.15% | 93.47% |
All the action videos in this database were preprocessed, decomposed into images, and stored in folders.
The decomposed action videos were divided into non-overlapping 32-frame clips to form the training
set, and the cropped 32-frame clips were randomly
selected as the input of the network during training. Fig. 7 shows the errors of different algorithms on the training set, Fig. 8 shows the errors of different algorithms on the test set, and Fig. 9 shows the time taken by different algorithms. Using the MATLAB platform, the
efficiency of different methods in recognizing tennis training actions was tested, with
efficiency evaluated by running time. The computing
times of the different feature dimensionality reduction methods are shown in Table 2.
Table 2. Dimension reduction time of tennis training methods.
| Action type | Training sample: LPMR + Single-stream method | Training sample: Methods of this study | Test sample: LPMR + Single-stream method | Test sample: Methods of this study |
| --- | --- | --- | --- | --- |
| Serve | 6.54 | 5.14 | 8.14 | 5.75 |
| Throw a ball | 8.14 | 4.53 | 6.89 | 4.01 |
| Strike a ball | 7.83 | 5.47 | 7.87 | 3.78 |
| Swing action | 6.71 | 4.61 | 8.72 | 5.40 |
Fig. 7. Errors of different algorithms on training set.
Fig. 8. Errors of different algorithms on the test set.
Fig. 9. Response time of different algorithms.
In order to further verify the performance of this model, four groups of experiments
were conducted on the NTU-RGB+D dataset and UTD-MHAD dataset, including three groups
of control experiments and one group of basic experiments. The experimental results
and model performance were evaluated and analyzed according to the experimental evaluation
indexes. Table 3 shows the recognition accuracy of different models on the NTU-RGB+D dataset. Table 4 shows the recognition accuracy of different models on the UTD-MHAD dataset.
Table 3. Recognition accuracy of different models on NTU-RGB+D dataset.
| Model | NTU-RGB+D |
| --- | --- |
| Single_stream | 73.69% |
| Multi_stream+WF | 69.84% |
| OR+Multi_stream+AF | 79.65% |
| Methods of this study | 95.34% |
Table 4. Recognition accuracy of different models on UTD-MHAD dataset.
| Model | UTD-MHAD |
| --- | --- |
| Single_stream | 77.32% |
| Multi_stream+WF | 70.56% |
| OR+Multi_stream+AF | 82.37% |
| Methods of this study | 94.12% |
It can be seen that the recognition accuracy of this model was good on the NTU-RGB+D
and UTD-MHAD datasets. The recognition accuracy of this model on the NTU-RGB+D dataset
can reach 95.34%. The recognition accuracy of this model on the UTD-MHAD dataset can
reach 94.12%. Compared with the previous model, the accuracy of this model was improved,
which also verified the superiority of this model.
5. Conclusion
Based on AI technology, this study constructed a method for recognizing wrong tennis
training actions. The original NN was deepened, a batch
normalization layer was added after each convolution layer, and the network structure
was redesigned. Compared with other methods, the proposed method
showed good recognition accuracy on the NTU-RGB+D dataset (95.34%)
and the UTD-MHAD dataset (94.12%). The method proposed in this study was superior to
some other methods and comparable to those with a high recognition rate. This
result verifies the superiority of the model in this study.
The model proposed in this study can provide some technical support for the recognition
of wrong tennis training actions, improve the tennis teaching effect, and improve
students’ learning level. It lays a good foundation for further research. However,
in practical applications, there may be complex scenes that are difficult to identify.
Therefore, the model needs to be sensitive enough to subtle movements. At the same
time, the features extracted from the model need to have sufficient discriminability.
REFERENCES
[1] Yi Y, Zheng Z, Lin M. Realistic action recognition with salient foreground trajectories. Expert Systems with Applications, 2017, 75(JUN.): 44-55.

[2] Han Y, Yang Y, Wu F, et al. Compact and Discriminative Descriptor Inference Using Multi-Cues. IEEE Transactions on Image Processing, 2015, 24(12): 5114-5126.

[3] Wen Z, Wang C, Xiao B, et al. Human action recognition using weighted pooling. IET Computer Vision, 2014, 8(6): 579-587.

[4] Feng L, Zhao Y, Zhao W, et al. A comparative review of graph convolutional networks for human skeleton-based action recognition. Artificial Intelligence Review, 2022, 55(5): 4275-4305.

[5] Rodrigues A, Pereira A S, Rui M, et al. Using Artificial Intelligence for Pattern Recognition in a Sports Context. Sensors, 2020, 20(11): 3040.

[6] Zhang S, Gao C, Jing Z, et al. Discriminative Part Selection for Human Action Recognition. IEEE Transactions on Multimedia, 2017, 20(99): 769-780.

[7] Wang B, Yu L, Xiao W, et al. Position and locality constrained soft coding for human action recognition. Journal of Electronic Imaging, 2013, 22(4): 041118.

[8] Wu D. Online position recognition and correction method for sports athletes. Cognitive Systems Research, 2018, 52(DEC.): 174-181.

[9] Yong D, Yun F, Liang W. Representation Learning of Temporal Dynamics for Skeleton-Based Action Recognition. IEEE Transactions on Image Processing, 2016, 25(7): 3010-3022.

[10] Ma S, Bargal S A, Zhang J, et al. Do Less and Achieve More: Training CNNs for Action Recognition Utilizing Action Images from the Web. Pattern Recognition, 2015, 68: 334-345.

[11] Niu L, Li W, Xu D. Exploiting Privileged Information from Web Data for Action and Event Recognition. International Journal of Computer Vision, 2016, 118(2): 130-150.

[12] Hu B, Yuan J, Wu Y. Discriminative Action States Discovery for Online Action Recognition. IEEE Signal Processing Letters, 2016, 23(10): 1374-1378.

[13] Chen C, Jafari R, Kehtarnavaz N. Improving Human Action Recognition Using Fusion of Depth Camera and Inertial Sensors. IEEE Transactions on Human-Machine Systems, 2015, 45(1): 51-61.

[14] Liu L, Shao L, Li X, et al. Learning Spatio-Temporal Representations for Action Recognition: A Genetic Programming Approach. IEEE Transactions on Cybernetics, 2015, 46(1): 158-170.

[15] Yu K, Yun F. Max-Margin Heterogeneous Information Machine for RGB-D Action Recognition. International Journal of Computer Vision, 2017, 123(3): 350-371.

[16] Marlon, F, Alcantara, et al. Real-time action recognition using a multilayer descriptor with variable size. Journal of Electronic Imaging, 2016, 25(1): 13020-13020.

[17] Nazir S, Yousaf M H, Nebel J C, et al. A Bag of Expression framework for improved human action recognition. Pattern Recognition Letters, 2018, 103(FEB.1): 39-45.

[18] Liu Y, Dong H, Wang L. Trampoline Motion Decomposition Method Based on Deep Learning Image Recognition. Scientific Programming, 2021, 2021(9): 1-8.

[19] Ramezani M, Yaghmaee F. A review on human action analysis in videos for retrieval applications. Artificial Intelligence Review, 2016, 46(4): 485-514.

[20] Xu W, Miao Z, Yu J, et al. Action Recognition and Localization with Spatial and Temporal Contexts. Neurocomputing, 2019, 333(MAR.14): 351-363.

[21] Yong B, Zhang G, Chen H, et al. Intelligent monitor system based on cloud and convolutional neural networks. Journal of Supercomputing, 2017, 73(7): 3260-3276.

[22] Lin B, Fang B, Yang W, et al. Human Action Recognition Based on Spatio-temporal Three-Dimensional Scattering Transform Descriptor and An Improved VLAD Feature Encoding Algorithm. Neurocomputing, 2018, 348(JUL.5): 145-157.

[23] Lemieux N, Noumeir R. A Hierarchical Learning Approach for Human Action Recognition. Sensors, 2020, 20(17): 4946.

[24] Wang T, Li J, Wu H N, et al. ResLNet: deep residual LSTM network with longer input for action recognition. Frontiers of Computer Science, 2022, 16(6): 1-9.

[25] Merler M, Mac K, Joshi D, et al. Automatic Curation of Sports Highlights Using Multimodal Excitement Features. IEEE Transactions on Multimedia, 2019, 21(5): 1147-1160.

Yuandong Li was born in Xinxiang, Henan, China, in 1989. From 2007 to 2011, he
studied at Henan Agricultural University and received his bachelor's degree in 2011. He
received his master's degree from Wuhan Sport University, China. He now works at Guangxi
Vocational College of Water Resources and Electric Power and is studying in the Faculty
of Social Sciences and Liberal Arts, UCSI University, Kuala Lumpur. His research
interests include physical health and promotion, scientific training methods and guidance.
Qiong Wang was born in Zhoukou, Henan, China, in 1989. From 2007 to 2011, she studied
at Zhoukou Normal University and received her bachelor's degree in 2011. From 2011
to 2014, she studied at Guilin University of Electronic Technology and received her
master's degree in 2014. She now works at Guangxi Vocational College of Water Resources
and Electric Power. Her research interests include basic mathematics and information
and computing science.