
  1. (School of Design and Fashion, Zhejiang University of Science and Technology, Hangzhou, 310023, China)
  2. (School of Art, Tourism College of Zhejiang, Hangzhou, 310023, China)



Keywords: Mixture density network, action feature screening, intelligent choreography, LSTM, conversion rules, feature matching, continuous sequence, mixed component

1. Introduction

Traditional choreography refers to the arrangement of dance movements by professional dancers following a musical rhythm. This method of artistic creation has changed with the vigorous development of modern intelligent technology [1,2]. In recent years, virtual characters have emerged in computer games, advertising, movies, and other fields. Animating their dance movements requires animators to constantly adjust bone positions keyframe by keyframe, which is very time-consuming and laborious [3]. Intelligent choreography algorithms have therefore come into being: such an algorithm trains a model through deep learning and intelligently generates dance actions that meet expectations. In addition, music of different styles can be matched with corresponding dance clips, significantly improving choreography efficiency and saving stage costs [4]. The gating mechanism of the long short-term memory (LSTM) network allows it to learn the characteristics of human dance movements and obtain the transformation rules of various poses. This experiment proposes an innovative dance action generation model combining the LSTM network with a mixture density network (MDN) and introduces an action screening strategy to enhance the continuity of the generated dance actions. Finally, the matching of music features of different styles with dance movements was studied. The constructed model has reference significance for the difficulties faced by traditional choreography [5].

2. Related Work

Many scholars have contributed to the study of human motion feature recognition, and the existing literature covers a variety of intelligent algorithms. Setiawan et al. simulated the human skeleton after optimizing a graph convolutional neural network with intelligent algorithms. The applicability and superiority of the method were verified on large datasets, and the method could provide more information for human motion feature recognition, giving it good practical significance [6]. Qin et al. proposed a knowledge-guided and data-driven method for few-shot human action recognition. The method treats the samples in a dataset as the data-driven component and uses the bidirectional encoder representations from transformers (BERT) mechanism to construct and extract features from the time series in a video. Experimental verification on three kinds of datasets showed that the proposed model achieved an improvement rate of more than 10%, offering a reference for future research on human action recognition [7]. Nguyen et al. observed that human action recognition is challenged by several factors, such as lighting, background, and speed. They proposed a human activity recognition method based on multiple perspectives, which effectively integrates information from multi-view images and forms a multi-branch network while extracting view-specific characteristics. The final simulation results revealed a model accuracy of 99.56%, with good applicability [8]. Song et al. further improved action recognition based on the human body skeleton. Addressing incomplete human skeletons, they proposed a richly activated graph convolutional network. The algorithm models the human skeleton to reduce the sensitivity of the gesture recognition model to nonstandard bones while improving recognition robustness, and experiments verified its good performance [9]. Estevam et al. examined zero-shot action recognition over the Cartesian product of all possible scene-object compositions. The method makes full use of objects and scenes to semantically match the objects, the scene, and the action in a video; it is easy to implement and showed good application performance [10]. Dhiman et al. reported that skeleton-based action recognition in space and time faces feature extraction challenges because it is vulnerable to viewpoint variations. They proposed a model driven by a part-wise spatio-temporal attention convolutional neural network. The network can visualize the dynamics of human action and, to some extent, helps overcome difficulties such as view mutation. Experiments on datasets verified the performance advantages of the algorithm [11].

A mixture density network is a computer-intelligent algorithm that, given an input feature, learns and outputs all the distribution parameters of a general mixture distribution. In recent years, the mixture density network algorithm has developed continuously, and besides human action recognition, the MDN has been studied and applied widely in many scientific fields. Shi and Shen proposed a network clustering algorithm based on an artificial immune network and density peaks [12]. The algorithm could detect abnormal network flows and malicious attacks in the Internet environment, and the final experimental results confirmed it as an anomaly detection algorithm that achieves high detection accuracy [12]. Li et al. modeled the downlink energy efficiency of a cellular heterogeneous network with ultra-dense small cells [13]. They constrained the density of the ultra-dense small base stations and the fraction of hybrid backhaul links. The method verified the better convergence and calculation accuracy of the algorithm, which is of great significance for green wireless communication in future society [13]. Chacon introduced two density-based clustering methods, mixture model clustering and modal clustering, and used the advantages and characteristics of the two to conduct hybrid modeling. The proposed method improved the calculation speed while ensuring experimental accuracy [14]. Wang et al. proposed a local density definition method to remedy the incomplete data processing and inaccurate models caused by ignoring some local information in graph convolutional networks [15]. The method trains the final model by inputting detailed node information, and the final results showed that the model performs well in classification tasks [15]. Wang J et al. introduced a mixed neural network PID controller to regulate engine speed and pressure and control the hydraulic power system of an unmanned walking platform more accurately [16]. The new algorithm operated stably and could meet the requirements of parameter regulation to a certain extent [16].

In conclusion, research on human motion recognition algorithms and mixture density network algorithms has progressed considerably in the respective fields. A mixture density network can use the output of a neural network to parameterize the distribution of multiple mixed components, which is of research significance for human motion recognition. In this experiment, a choreography algorithm combining the LSTM network and the MDN is proposed to achieve intelligent choreography for different styles of music. On top of this algorithm, an action feature screening strategy is used to enhance the continuity of the generated actions. Finally, simulation experiments were conducted to verify the performance of the model.

3. Construction of Choreography Model based on Mixture Density Network Algorithm and Action Feature Screening Strategy

3.1 Research on Motion Feature Screening Model based on MDN

Classic motion data include motion capture data and keyframe-based motion data. Both types of action data rely on a large amount of manual processing, so they involve high costs and are difficult to produce [17]. More virtual characters are emerging with the development of computer animation technology, and the demand for real human motion data is also increasing; relying only on traditional motion capture and manual production cannot meet current needs. Therefore, deep learning algorithms have been applied to motion generation. Deep neural networks have a strong learning ability, which can overcome the data acquisition limitations of traditional machine learning algorithms to a certain extent. The experimental action generation model therefore uses a sequence generation model based on deep learning [18]. Deep neural networks can discover the hidden structural features in the data and adapt their internal parameters, but they have limitations in processing continuous sequence samples. Dance movement samples are continuous sequences, so the ideal experimental effect cannot rely on deep neural networks alone. LSTM and MDN can predict and generate action sequences of unknown length, so the experiment combines these two algorithms to process the action data. The LSTM is a recurrent neural network with a chain structure composed of a memory unit, an input gate, a forget gate, and an output gate; the gate control can solve general sequence problems to a certain extent. Fig. 1 presents the basic structure.

Fig. 1. Basic structure of LSTM.
../../Resources/ieie/IEIESPC.2024.13.5.523/fig1.png
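
To make the gate structure in Fig. 1 concrete, the following is a minimal NumPy sketch of one LSTM time step; it implements the forget gate, input gate, and output gate of Eqs. (1)-(3) given below. All array names and shapes are illustrative, not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    """One LSTM time step following Eqs. (1)-(3).

    x_t: input at time t, shape (d,); h_prev, c_prev: previous hidden and cell
    states, shape (n,); each W_* has shape (n, n + d) and acts on [h_prev, x_t].
    """
    z = np.concatenate([h_prev, x_t])      # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)           # forget gate, Eq. (1)
    i_t = sigmoid(W_i @ z + b_i)           # input gate, Eq. (2)
    c_tilde = np.tanh(W_C @ z + b_C)       # candidate cell state, Eq. (2)
    c_t = f_t * c_prev + i_t * c_tilde     # updated cell state
    o_t = sigmoid(W_o @ z + b_o)           # output gate, Eq. (3)
    h_t = o_t * np.tanh(c_t)               # hidden output, Eq. (3)
    return h_t, c_t
```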

A sigmoid neural network layer and a pointwise multiplication operation constitute an LSTM gate. The neural network layer outputs a number between 0 and 1 to indicate the amount of information passing: 0 means no information is retained, and 1 means all information is passed. The forget gate decides whether to retain or discard state information, and its function expression is as follows.

(1)
$ f_{t}=\gamma \left(W_{f}\cdot \left[h_{t-1},x_{t}\right]+b_{f}\right) $

Eq. (1) reflects the retention degree of the forget gate for the information at the previous moment, where $W_{f}$ is the weight of the forget gate; $h_{t-1}$ is the output at the previous time; $x_{t}$ is the input value at the current time; $\gamma $ is the sigmoid neural network layer. The value obtained is converted to a number between 0 and 1 through the activation function, which determines the retention degree. This number is then multiplied by the unit state $C_{t-1}$ at the previous time to obtain the proportion of information retained from the previous time. The input gate determines how much of the new state calculated from the current input is stored in the cell state, and its function can be expressed as follows.

(2)
$ \begin{array}{l} i_{t}=\gamma \left(W_{i}\cdot \left[h_{t-1},x_{t}\right]+b_{i}\right)\\ \widetilde{C_{t}}=\tanh \left(W_{C}\cdot \left[h_{t-1},x_{t}\right]+b_{C}\right) \end{array} $

where $i_{t}$ represents the value to be updated, and $\widetilde{C_{t}}$ is the new candidate value vector calculated by the tanh layer; this vector is added to the cell state. The weight matrices for the input gate and candidate values are $W_{i}$ and $W_{C}$, respectively. The input at the current time, the unit state, and the output at the previous time jointly determine the output at the current time. The formula can be expressed as follows:

(3)
$ \begin{array}{l} o_{t}=\gamma \left(W_{o}\cdot \left[h_{t-1},x_{t}\right]+b_{o}\right)\\ h_{t}=o_{t}*\tanh \left(C_{t}\right) \end{array} $

where $h_{t}$ represents the output information. The gating mechanism enables the LSTM to learn the dance movement characteristics of the human body and obtain the constraint relationships between bones and the transformation rules of various action postures. The LSTM can generate dance movements of arbitrary length, especially when the input and target output data are discrete. Dance data, however, are continuous rather than discrete, and in this case the output of the LSTM has no controlled probability distribution. Therefore, an MDN is introduced to refine the generation of dance movements. The MDN parameterizes the distribution of multiple mixed components using the output of the neural network, so the probability density of each dimension of the overall network output is a mixture rather than a single distribution. After the MDN is applied to the LSTM, the output distribution is influenced and constrained by the current input and the previous historical inputs. The linear combination of the mixing components constitutes the probability density of the target data, and its functional expression is as follows:

(4)
$ p\left(t_{a}\left| x\right.\right)=\sum _{i=1}^{m}\alpha _{i}\left(x\right)\varphi _{i}\left(t_{a}\left| x\right.\right) $

Eq. (4) expresses the probability density of the target vector $t_{a}$ given the input $x$, where $m$ is the number of mixing components, $\alpha _{i}$ is the mixing coefficient of the $i^{\mathrm{th}}$ mixing component given $x$, and $\varphi _{i}$ is the conditional density of the $i^{\mathrm{th}}$ kernel for the target vector $t_{a}$. The Gaussian kernel function is expressed as

(5)
$ \varphi _{i}\left(t\left| x\right.\right)=\frac{1}{\left(2\pi \right)^{\frac{c}{2}}\sigma _{i}\left(x\right)^{c}}e^{-\frac{\left\| t-\mu _{i}\left(x\right)\right\| ^{2}}{2\sigma _{i}\left(x\right)^{2}}} $

where $c$ is the dimension of the model output data, and $\mu _{i}$ and $\sigma _{i}$ are the mean and variance used to parameterize each mixture component, respectively. Thus, the number of MDN output variables is $m\left(c+2\right)$, and its tensor can be expressed as

(6)
$ z=\left[z_{1}^{\alpha },\ldots ,z_{m}^{\alpha },z_{m+1}^{\mu },\ldots ,z_{m\left(c+1\right)}^{\mu },z_{m\left(c+1\right)+1}^{\sigma },\ldots ,z_{m\left(c+2\right)}^{\sigma }\right] $

Eq. (6) covers all the parameters required to construct the mixture density network, where the number of mixed components $m$ is arbitrary. In the experiment, when dance movement data are used to train the MDN model, the movements are represented by the spatial coordinates of each skeletal joint. Hence, the trained MDN model can predict the probability distribution of each joint position at the next moment and then generate dance movements.
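
As an illustration of Eqs. (4)-(6), the following Python sketch splits a network output vector of length $m\left(c+2\right)$ into mixture parameters and evaluates the mixture density. The parameterization (softmax for the weights, exponential for the standard deviations) is one common MDN choice assumed here, not specified by the paper.

```python
import numpy as np

def split_mdn_params(z, m, c):
    """Split the m*(c+2) network outputs of Eq. (6) into mixture parameters."""
    alpha = np.exp(z[:m] - z[:m].max())
    alpha /= alpha.sum()                      # mixing coefficients via softmax
    mu = z[m:m + m * c].reshape(m, c)         # component means
    sigma = np.exp(z[m + m * c:])             # positive standard deviations
    return alpha, mu, sigma

def mdn_density(t, alpha, mu, sigma):
    """Mixture density p(t|x) of Eq. (4) with the Gaussian kernels of Eq. (5)."""
    c = len(t)
    norm = (2 * np.pi) ** (c / 2) * sigma ** c
    sq = np.sum((t - mu) ** 2, axis=1)        # ||t - mu_i||^2 per component
    phi = np.exp(-sq / (2 * sigma ** 2)) / norm
    return float(np.dot(alpha, phi))

# Example: m = 3 components in c = 2 dimensions -> 3 * (2 + 2) = 12 outputs.
z = np.random.default_rng(0).normal(size=12)
alpha, mu, sigma = split_mdn_params(z, m=3, c=2)
print(mdn_density(np.zeros(2), alpha, mu, sigma))
```

Training minimizes the negative logarithm of this density over the training poses, which is how the network outputs become valid mixture parameters.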

3.2 Construction of Choreography Model Integrating Mixture Density Network and Dance Movement Features

Computer-automated music choreography has been studied for some time; its main purpose is to minimize human intervention in the choreography process using computer-intelligent technology [19,20]. Three main problems must be solved to achieve the ideal experimental purpose: acquiring real dance movements intelligently and efficiently; selecting music features and action features with high correlation, so that these features can better express the characteristics of the music and action data; and establishing the mapping relationship between music features and action features. For the first problem, the mixture density network algorithm described above can be used to obtain and generate dance movements. Fig. 2 shows the structure of the MDN-based dance movement generation model. The MDN model includes two structures, a neural network and a mixture density model, and the neural network is used to predict dance movements. The back-end MDN uses the network output as the parameter vector to parameterize the model, so the mean, variance, and weight of each mixture component can be determined.

Fig. 2. Structure of the MDN-based dance action generation model.
../../Resources/ieie/IEIESPC.2024.13.5.523/fig2.png
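
A minimal sketch of the generation step implied by Fig. 2 follows: given the mixture parameters produced by the network, a component is drawn according to its weight and the next skeleton pose is sampled from that Gaussian. The function name and the isotropic-Gaussian assumption are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_next_pose(alpha, mu, sigma):
    """Draw the next skeleton pose from the mixture the network parameterizes.

    alpha: (m,) mixing weights summing to 1; mu: (m, c) component means;
    sigma: (m,) isotropic standard deviations.
    """
    k = rng.choice(len(alpha), p=alpha)    # pick a mixture component by weight
    return mu[k] + sigma[k] * rng.standard_normal(mu.shape[1])
```

Feeding each sampled pose back to the network as the next input rolls out a movement sequence of arbitrary length.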

The characteristics of the music and the matching of dance movements should both be considered when extracting music features, as different styles of music have different features. The constant Q transform (CQT) is a spectrum analysis algorithm suited to music signals; its geometrically spaced frequency bins are consistent with musical pitch, so it is widely used in music signal analysis. The Q factor in the CQT is a constant, and its function expression is as follows.

(7)
$ Q=\frac{f_{k}}{f_{k+1}-f_{k}}=\frac{1}{2^{1/b}-1} $

where $f_{k}$ stands for the center frequency of the $k^{th}$ semitone above the initial semitone, and $b$ is the number of semitones into which an octave is divided, often 12 or 24. The frequency amplitude of the $k^{th}$ semitone can be obtained after the CQT of a music signal of finite length, and its function expression is

(8)
$ X\left(k\right)=\frac{1}{N_{k}}\sum _{n=0}^{N_{k}-1}x\left(n\right)w_{N_{k}}\left(n\right)e^{-j\frac{2\pi Qn}{N_{k}}} $

where $x\left(n\right)$ is the music signal, and $w_{N_{k}}\left(n\right)$ is a window function of length $N_{k}$. The window size $N_{k}$ can be expressed as

(9)
$ N_{k}=\left\lfloor Q\frac{f_{s}}{f_{k}}\right\rfloor $

where $f_{s}$ is the sampling frequency of the input audio signal. According to Eq. (9), the window size is inversely related to the center frequency $f_{k}$. Music feature extraction aims to match the action characteristics of dance. Realizing intelligent choreography requires accurately analyzing the correlation between music and movement segments, finding the motion clips that best match the music characteristics, and matching music of different styles with corresponding dance movements. Therefore, the extraction of music features and movement features is a critical link, and features can be extracted from rhythm and intensity. Action features include low-level features, such as motion speed, acceleration, direction of motion, and action morphology, as well as high-level features, such as emotion and style. Among the low-level features, the bone velocity feature can be expressed as:

(10)
$ v_{i}^{Arm}=\sum _{f=1}^{L_{Motion}-1}\frac{\left\| p_{f+1}^{Arm}-p_{f}^{Arm}\right\| }{L_{Motion}-1} $

Eq. (10) expresses the average speed of the arm, where $L_{Motion}$ is the length of the action clip $N_{i}$; $f$ is the frame number in the action clip $N_{i}$; $p_{f}^{Arm}$ is the key position of the arm in frame $f$. Eq. (10) can likewise compute the average speed of the bones of other joints in the human body. In intelligent choreography, the model can match action segments with different speeds by changing the speed and value range of local bones of the human body to realize the choreography of dances with different movement characteristics. Dance movements also have spatial characteristics that affect the synthesis effect. The spatial measurement of the action segment $N_{i}$ can be expressed as Eq. (11):

(11)
$ e_{i}=\sum _{f=1}^{L_{Motion}-1}\frac{\sqrt{\left(x_{f+1}^{Root}-x_{f}^{Root}\right)^{2}+\left(y_{f+1}^{Root}-y_{f}^{Root}\right)^{2}}}{L_{Motion}-1} $

where $f$ represents the frame number in the action clip; $L_{Motion}$ is the length of the action clip; $x_{f}^{Root}$ and $y_{f}^{Root}$ are the $x$ and $y$ coordinates of the root node in frame $f$, respectively. In the intelligent choreography of the model, the spatial characteristics of the movements can be set to match action segments of different spatial extents, diversifying the choreography. The rhythm matching of music and dance movements can be realized using Eq. (12):

(12)
$ \begin{array}{l} \hat{s}=\max _{s,f_{0}}\sum _{f}\frac{F_{R}^{Music}\left(f\right)\cdot F_{R}^{Motion}\left(s\cdot f+f_{0}\right)}{F_{R}^{Music}\left(f\right)+F_{R}^{Motion}\left(s\cdot f+f_{0}\right)},\\ f_{0}\in \left[0,L_{motion}-s\cdot L_{music}\right],\; s\in \left[0.9,1.1\right] \end{array} $

where $L_{music}$ and $L_{motion}$ are the lengths of $M_{i}$ and $N_{i}$, respectively; $f_{0}$ is the translation offset, and $s$ is the scaling coefficient. After Eq. (12) is maximized, the first $s\cdot L_{music}$ frames of the translated action clip form the action segment with the highest matching degree. The connectability of consecutive dance movements must then be analyzed to keep the synthesized dance looking true and natural. A window of $k$ frames is intercepted at the end of the previous action clip and the start of the next action segment, and the sum of the distances of the $k$ frame pairs in the window is obtained:

(13)
$ D\left(f_{i},f_{j}\right)=\sum _{l=0}^{k-1}diff\left(f_{i-k+1+l},f_{j+l}\right) $

where $f_{i}$ is the last frame of the previous action segment, and $f_{j}$ is the first frame of the next action segment. A threshold $\varepsilon $ is set on this basis: when Eq. (13) is less than $\varepsilon $, $f_{i}$ and $f_{j}$ are similar enough to be connected. Intensity matching can then be performed between the target music sequence and each candidate connectable action sequence to obtain the action sequence with the highest matching degree to the music sequence. The matching formula can be expressed as

(14)
$ \hat{D}=\sum _{f=1}^{L_{music}}\sqrt{\frac{F_{1}^{Music}\left(f\right)}{\sum _{k=1}^{L_{music}}F_{1}^{Music}\left(k\right)}\cdot \frac{F_{1}^{Motion}\left(f\right)}{\sum _{k=1}^{L_{motion}}F_{1}^{Motion}\left(k\right)}} $

Eq. (14) represents the intensity matching formula of music segment $M_{i}$ and action segment $N_{i}$, where $L_{music}$ is the length of the music segment and $L_{motion}$ is the length of the action segment. Thus, all three problems mentioned above can be solved, and intelligent choreography based on computer algorithms can be realized. Fig. 3 shows the overall choreography process based on the mixture density network and dance movement characteristics.

Fig. 3. Overall flow of music choreography based on MDN.
../../Resources/ieie/IEIESPC.2024.13.5.523/fig3.png
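
To tie Eqs. (12)-(14) together, the following Python sketch implements the rhythm-matching search, the connectability test, and the intensity matching. The per-frame feature arrays are assumed precomputed (rhythm features could, for example, come from an onset-strength curve over a CQT spectrogram); all names, the grid over $s$, and the use of Euclidean distance for $diff(\cdot ,\cdot )$ are assumptions, not the paper's exact procedure.

```python
import numpy as np

def rhythm_match(FR_music, FR_motion, s_grid=np.linspace(0.9, 1.1, 21)):
    """Grid search over scale s and offset f0 maximizing the score of Eq. (12)."""
    Lm, Ln = len(FR_music), len(FR_motion)
    f = np.arange(Lm)
    best = (-np.inf, None, None)
    for s in s_grid:
        span = int(np.ceil(s * Lm))
        for f0 in range(0, max(Ln - span, 0)):
            idx = np.minimum((s * f + f0).astype(int), Ln - 1)
            num = FR_music * FR_motion[idx]
            den = FR_music + FR_motion[idx] + 1e-8   # avoid division by zero
            score = float(np.sum(num / den))
            if score > best[0]:
                best = (score, s, f0)
    return best  # (score, s_hat, f0_hat)

def connectable(clip_a, clip_b, k, eps):
    """Eq. (13): distance between the last k frames of clip_a and first k of clip_b."""
    d = sum(np.linalg.norm(clip_a[len(clip_a) - k + l] - clip_b[l]) for l in range(k))
    return d < eps

def intensity_match(FI_music, FI_motion):
    """Eq. (14): overlap of the normalized intensity profiles of music and motion."""
    p = np.asarray(FI_music, float); p /= p.sum()
    q = np.asarray(FI_motion, float); q /= q.sum()
    L = min(len(p), len(q))
    return float(np.sum(np.sqrt(p[:L] * q[:L])))
```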

4. Application Effect Analysis of Choreography Model based on Mixture Density Network and Motion Feature Screening Strategy

4.1 Performance Analysis of Action Feature Screening Model based on Mixture Density Network

Three kinds of dance datasets were chosen to validate the synthesis performance of the mixture density network: otaku dance, street dance, and modern dance. The three types of dance have different music styles and dance movement characteristics, including different movement speeds. Fig. 4 presents relatively short street dance movements generated automatically by the algorithm after different training times; the movements generated at different times have distinct characteristics. At two minutes, the model could not generate qualified actions, and only blurred human contours formed by joints at random positions were visible. Between the second and ninth hours, the model could generate relatively stable and richer hip-hop movements. When the training time was increased to 22 h, the generated street dance movements gradually lost coherence because of overfitting.

Fig. 4. Hip-hop dance action generation results at different times.
../../Resources/ieie/IEIESPC.2024.13.5.523/fig4.png

Fig. 5 shows the model loss during training, which explains the main reason for the above behavior. The loss trends of the validation and training sets are inconsistent because the choreography of dance movements is not unique; rather, it is diverse, while the essence of the model is to find regularity, and the two are at odds. Although the dance types were classified at the beginning of the experiment, there is no guarantee that the randomly selected validation and training sets follow the same regularities. Therefore, the validation loss did not change significantly early in training, but overfitting appeared as training continued.

Fig. 5. Loss value of the model.
../../Resources/ieie/IEIESPC.2024.13.5.523/fig5.png
Table 1. Screening results based on action continuity.

| Action data | Minimum threshold | Maximum threshold | Total fragments | Data size | Total frames | Percentage of data retained |
| Original action data | / | / | 127 | 178.8 M | 114302 | / |
| After filtering 1 | 30 | 60 | 350 | 149.6 M | 98572 | 83.8% |
| After filtering 2 | 20 | 60 | 463 | 104.8 M | 66931 | 58.7% |
| After filtering 3 | 20 | 30 | 841 | 131.1 M | 83743 | 73.2% |

The above situation can be improved by screening the MDN-generated dance movements based on the connectability of the dance movements, considering the similarity between two adjacent actions. The absolute value of the velocity and the first-order difference of each joint between adjacent frames are calculated, and thresholds are set: if the speed difference is below the threshold, the frames meet the coherence requirements of the dance action. Table 1 lists the action data retained under different speed-difference thresholds. The data in the table show that when the maximum threshold is large and the minimum threshold is small, the screening conditions are relatively loose and more dance movements are retained; conversely, strict screening conditions retain fewer movements. The screening therefore changes the visual effect of the dance movements, and the improved method is effective.
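
The screening step can be sketched as follows, assuming the generated poses are given as per-frame joint coordinates. The threshold semantics here (a maximum allowed speed change and a minimum clip length) are an illustrative reading of Table 1, not the paper's exact rule.

```python
import numpy as np

def screen_clips(poses, max_speed_change, min_clip_frames):
    """Split generated frames into coherent clips (the idea behind Table 1).

    poses: array of shape (T, J, 3) with joint positions per frame. The
    sequence is cut wherever the largest per-joint speed change between
    adjacent frames exceeds max_speed_change; clips shorter than
    min_clip_frames are dropped. Parameter names are illustrative.
    """
    vel = np.linalg.norm(np.diff(poses, axis=0), axis=2)   # (T-1, J) joint speeds
    jump = np.abs(np.diff(vel, axis=0)).max(axis=1)        # (T-2,) max speed change
    clips, start = [], 0
    for t, j in enumerate(jump, start=2):                  # frame index of the jump
        if j > max_speed_change:                           # incoherent transition
            if t - start >= min_clip_frames:
                clips.append(poses[start:t])
            start = t
    if len(poses) - start >= min_clip_frames:
        clips.append(poses[start:])
    return clips
```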

4.2 Application Effect Analysis of Choreography Model based on Mixture Density Network and Motion Feature Screening Strategy

An intelligent choreography algorithm based on the MDN and motion screening was constructed in the experiment, and the final choreography results were evaluated to verify the effectiveness of the algorithm and the quality of the synthesized dance. Fig. 6 shows the extraction results for the spatial features of dance movements. As shown in Fig. 6(a), the first movement segment has a small spatial metric value, indicating that it is weak in space, while the second is the opposite. Fig. 6(b) shows the paths of the two action segments. The motion trajectory range of action clip 1 is smaller than that of action clip 2, suggesting that action segment 1 is weaker in space than action segment 2, which verifies that the model is effective in spatial feature extraction.

Fig. 6. Spatial Measurement Results of Dancing Movements.
../../Resources/ieie/IEIESPC.2024.13.5.523/fig6.png

In the experiment, randomly selected test users scored the presentation of the three generated choreography styles along three dimensions: the matching degree of music and dance, the continuity of the dance movements, and the authenticity of the dance movements, as shown in Fig. 7. The results showed that the average matching degree, continuity, and authenticity of the three dances were all above four points, indicating that the test users found the three choreography results synthesized by the model appropriate. It also shows that the choreography model constructed in the experiment is practical. Hip-hop had the highest score among the three dance styles, mainly because its movement rhythm is clear and its movement range is larger than the other two, so the overall impression is good. The comparatively poor performance of otaku dance is because its movements are more diversified, which is not conducive to model training and learning.

Fig. 7. Comparison of the Scores of Three Dances.
../../Resources/ieie/IEIESPC.2024.13.5.523/fig7.png

The experiment uses one-way ANOVA on the different user evaluation results to determine whether the difference in user scores is caused by the feature-matching algorithm. This method can effectively determine whether adding feature matching significantly impacts the matching degree of music and dance. Table 2 compares the manual scoring results with and without feature matching along with the one-way analysis of variance. For the inter-group differences, the sum of squares between groups was 28.014 with 1 degree of freedom, giving a mean square of 28.014. The F-value was 58.931, and the P-value was 1.198${\times}$10$^{-10}$, far below the significance level (usually 0.05), indicating significant differences between the groups. For the intra-group differences, the sum of squares within groups was 30.423 with 64 degrees of freedom, giving a mean square of 0.474. The total sum of squares was 58.438, with 65 degrees of freedom. In scoring the dance segments, the average matching degree obtained with the feature-matching algorithm was 3.87, with a standard deviation of 0.26, while the average matching degree of random matching was 2.53, with a standard deviation of 0.68. The data in Table 2 showed P < 0.05, indicating that the two groups differed significantly. The feature-matching algorithm therefore affects the matching degree of the choreography results, confirming the effectiveness of the model.
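
The ANOVA in Table 2 can be reproduced in outline with scipy.stats.f_oneway. Since the between-group df is 1 and the within-group df is 64, there were two groups and 66 scores in total, i.e., 33 per group if the groups were balanced; the synthetic scores below only mimic the reported means and standard deviations, as the raw per-user scores are not given.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(1)

# Synthetic scores: 33 per group is inferred from df_within = 64 (66 obs., 2 groups);
# the means/SDs follow Table 2.
feature_matched = rng.normal(3.87, 0.26, size=33)
random_matched = rng.normal(2.53, 0.68, size=33)

F, p = f_oneway(feature_matched, random_matched)   # one-way ANOVA with two groups
print(f"F = {F:.3f}, p = {p:.3e}")                 # p < 0.05 -> significant difference
```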

Following the same approach, the effects of the hierarchical feature-matching and local feature-matching algorithms on the choreography matching degree were compared (Table 3). The two feature-matching algorithms for music and dance showed a significant difference: the F-value was 6.287, with a corresponding P-value of 0.015, below the 0.05 significance level. Between groups, the source of difference explained 2.562 units of the total variation, with 1 degree of freedom corresponding to the single factor (the feature-matching algorithm). Within groups, 26.062 units of the total variation were explained, with 64 degrees of freedom; the total variation was 28.624 units, with 65 degrees of freedom, corresponding to 66 observations in all.

From the manual scoring of the dance segments, the average score using the hierarchical feature extraction algorithm was 4.14, with a standard deviation of 0.44, while the local feature extraction algorithm scored an average of 3.87 with a standard deviation of 0.49. At the 0.05 significance level, the P-value of 0.015 indicates a significant difference between the two groups. Therefore, the hierarchical feature-matching algorithm is more suitable for the choreography model than the local feature-matching algorithm. To analyze the application effect of the proposed choreography model based on the mixture density network and action feature screening (labeled Model A), the experiment compares it with a rule-based choreography model (Model B), a single-layer perceptron-based choreography model (Model C), and a template-based choreography model (Model D). Table 4 lists the application effects of the different models.

In Table 4, Model A performed well in matching degree, coherence, and authenticity, scoring 4.32, 4.28, and 4.16, respectively. Model B scored slightly lower at 4.12, 3.98, and 3.92, and the scores of Models C and D were also slightly lower than those of Model A. In summary, the choreography model based on the mixture density network and action feature screening shows higher matching degree, coherence, and authenticity than the other models.

Table 2. One-way ANOVA results for the music-dance matching degree with the added feature-matching algorithm.

| Source of difference | Sum of squares | df | Mean square | F | P-value | F-crit |
| Between groups | 28.014 | 1.000 | 28.014 | 58.931 | 1.198×10⁻¹⁰ | 3.991 |
| Within group | 30.423 | 64.000 | 0.474 | / | / | / |
| Total | 58.438 | 65.000 | / | / | / | / |

| Dance segment (manual scoring) | Mean | Standard deviation |
| With feature-matching algorithm | 3.87 | 0.26 |
| Random matching | 2.53 | 0.68 |

Table 3. One-way ANOVA results for the music-dance matching degree of the two feature-matching algorithms.

| Source of difference | Sum of squares | df | Mean square | F | P-value | F-crit |
| Between groups | 2.562 | 1.000 | 2.562 | 6.287 | 0.015 | 3.991 |
| Within group | 26.062 | 64.000 | 0.408 | / | / | / |
| Total | 28.624 | 65.000 | / | / | / | / |

| Dance segment (manual scoring) | Mean | Standard deviation |
| Hierarchical feature extraction | 4.14 | 0.44 |
| Local feature extraction | 3.87 | 0.49 |

Table 4. Evaluation of the application effects of different models.

| Model tag | Matching degree | Coherence | Authenticity |
| A | 4.32 | 4.28 | 4.16 |
| B | 4.12 | 3.98 | 3.92 |
| C | 4.05 | 3.97 | 3.88 |
| D | 4.01 | 3.92 | 3.75 |

5. Conclusion

Based on the MDN, this paper proposes an algorithm model that can intelligently choreograph music of different styles. The model generates dance movements through the MDN. The experiment first constructs a dataset of music and dance actions and classifies it to provide a database for the learning and training of the action generation model. The performance of the mixture density network was then verified. The experimental results showed that the model has difficulty finding the relevant rules because of the diversity of dance movements. Therefore, the actions generated by the model were screened, and the screened dance actions improved the visual effect. The choreography effect of the model was evaluated by having test users score three dimensions: the matching degree of music and dance, and the continuity and authenticity of the dance actions. The average score of the three types of dances in all three dimensions was above four points, suggesting that users are satisfied with the overall choreography effect, so the model is practical. In addition, the experimental results verify the applicability of the model in spatial feature extraction and feature matching. Therefore, the choreography model constructed in the experiment can produce better intelligent choreography results and, to a certain extent, achieves the integration of art and technology. At the same time, it can reduce the cost of traditional choreography and support the creative inspiration of artists. This study achieved certain results in computer music choreography, but limitations remain. First, further research is needed on action characteristics, especially the analysis of high-level action features. Second, the evaluation of the matching degree between music and action is not comprehensive enough, so more abstract features will be added in future studies. Finally, the methods for evaluating the effectiveness of music choreography still need improvement, and more objective quantitative indicators need to be explored. Future research will strengthen the analysis of action features, comprehensively evaluate the matching degree between music and action, and explore more objective quantitative indicators to achieve better results in this field.

REFERENCES

1 
Ma P, Jiang B, Lu Z, Li N, Jiang Z. Cybersecurity Named Entity Recognition Using Bidirectional Long Short-Term Memory with Conditional Random Fields. Tsinghua Science and Technology, 2021, 26(3): 259-265.DOI
2 
An P, Wang Z, Zhang C. Ensemble unsupervised autoencoders and Gaussian mixture model for cyberattack detection. Information Processing & Management: Libraries and Information Retrieval Systems and Communication Networks: An International Journal, 2022, 59(2): 102844-102857.DOI
3 
Laura L L, Lorena A G. Fitting a Gaussian Mixture Model Through the Gini Index. International Journal of Applied Mathematics and Computer Science, 2021, 31(3): 487-500.DOI
4 
Sui H, Zhu H, Wu J, Luo B, Taccheo S, Zou X. Modeling pulse propagation in fiber optical parametric amplifier by a long short-term memory network. Optik: Zeitschrift fur Licht- und Elektronenoptik: Journal for Light-and Electronoptic, 2022, 260: 169125-169133.DOI
5 
Patil M S, Charuku B, Ren J. Long Short-term Memory Neural Network-based System Identification and Augmented Predictive Control of Piezoelectric Actuators for Precise Trajectory Tracking. IFAC-PapersOnLine, 2021, 54(20): 38-45.DOI
6 
Setiawan F, Yahya B N, Chun S J, Lee S L. Sequential inter-hop graph convolution neural network (SIhGCN) for skeleton-based human action recognition. Expert Systems with Applications, 2022, 195: 116566.DOI
7 
Qin Y, Liu B. KDM: A knowledge-guided and data-driven method for few-shot video action recognition. Neurocomputing, 2022, 510: 69-78.DOI
8 
Nguyen H T, Nguyen T O. Attention-based network for effective action recognition from multi-view video. Procedia Computer Science, 2021, 192: 971-980.DOI
9 
Song Y F, Zhang Z, Shan C, Wang L. Richly Activated Graph Convolutional Network for Robust Skeleton-Based Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 2021, 31(5): 1915-1925.DOI
10 
Estevam V, Pedrini H, Menotti D. Zero-Shot Action Recognition from Diverse Object-Scene Compositions. Neurocomputing, 2021, 439: 159-175.URL
11 
Dhiman C, Vishwakarma D K, Agarwal P. Part-wise Spatio-temporal Attention Driven CNN-based 3D Human Action Recognition. ACM Transactions on Multimedia Computing, Communications, and Applications, 2021, 17(3): 86.1-86.24.DOI
12 
Shi Y, Shen H. Anomaly Detection for Network Flow Using Immune Network and Density Peak. International Journal of Network Security, 2020, 22(2): 337-346.URL
13 
Li B, Sun X, Li C, Xue K, Zhang X. Downlink energy efficiency modeling and optimization with backhaul awareness and interference price in ultra cellular HetNet. The Journal of China Universities of Posts and Telecommunications, 2019, 26(2): 71-85.DOI
14 
Chacon JE. Mixture model modal clustering. Springer Berlin Heidelberg, 2019, 13(2): 379-404.DOI
15 
Wang H, Dong L, Fan T, Sun M. A local density optimization method based on a graph convolutional network. Frontiers of Information Technology and Electronic Engineering, 2020, (12): 1795-1803.DOI
16 
Wang J, Liu Y, Jin Y, Zhang Y. Control of Hydraulic Power System by Mixed Neural Network PID in Unmanned Walking Platform. Journal of Beijing Institute of Technology, 2020, 29(3): 273-282.DOI
17 
Podoprosvetov A.V., Alisejchik A.P., Orlov I.A. Comparison of action recognition from video and IMUs. Procedia Computer Science, 2021, 186: 242-249.DOI
18 
Yang H, Gu Y, Zhu J, Zhang X. PGCN-TCA: Pseudo Graph Convolutional Network with Temporal and Channel-Wise Attention for Skeleton-Based Action Recognition. IEEE Access, 2020, (8): 10040-10047.DOI
19 
Ahmad Z, Khan N. Human Action Recognition Using Deep Multilevel Multimodal (M-2) Fusion of Depth and Inertial Sensors. IEEE Sensors Journal, 2020, 20(3): 1445-1455.DOI
20 
Jaouedi N, Boujnah N, Bouhlel M.S. A new hybrid deep learning model for human action recognition. Journal of King Saud University - Computer and Information Sciences, 2020, 32(4): 447-453.DOI
Hanwen Li
../../Resources/ieie/IEIESPC.2024.13.5.523/au1.png

Hanwen Li, Zhejiang University of Science and Technology. In recent years, I have mainly taught courses such as dance plays, dance basics, and dance art appreciation. I participated in the compilation of the "Korean Dance Tutorial" for the third phase of the "211 Project" key discipline construction project at Central University for Nationalities. As a leading dancer, I participated in the large-scale dance epic "Above Heaven and Earth" and won five major awards in the 5th Chinese Dance "Lotus Award" for dance poetry and performance, as well as the first prize in the Beijing Dance Competition. The works I guided have won many times: the first prize of the National College Student Art Festival and the title of "Excellent Instructor"; the first prize of Zhejiang University College Student Art Performance and the title of "Excellent Instructor".

Ben Jin
../../Resources/ieie/IEIESPC.2024.13.5.523/au2.png

Ben Jin, Tourism College of Zhejiang, China. In recent years, I have mainly taught courses such as Chinese ethnic folk dance and appreciation of dance art. I have won the Excellent Performance Award in the Dance Competition of the TV Dance Competition, and have guided works that have won the First Prize in Zhejiang Province College Student Art Performance and the title of "Excellent Guidance Teacher" multiple times.