3.1 Music Signal Preprocessing and Feature Parameter Extraction
With the progress of information and multimedia technology, music digitization has been widely adopted across media such as radio broadcasting and digital storage. A music signal is short-term stationary: its characteristics remain approximately unchanged over short intervals. Therefore, when studying the overall characteristics of the signal, it is necessary to focus on the characteristics of each short segment. First, the music signal is divided into frames. To ensure a smooth transition between consecutive frames, adjacent frames overlap by 1/3 to 1/2 of the frame length during framing. The framing process is shown in Fig. 1. The number of frames that a music signal can be divided into is given by Eq. (1).
In Eq. (1), $N_{x}$ is the total length of the music signal, $N_{0}$ is the overlapping length between frames, and ${b_{i}}^{l}$ is the length of one frame. The high-frequency part of the music signal carries relatively little energy, so it is enhanced by filtering; Eq. (2) shows the implementation. In Eq. (2), $y\left(n\right)$ is the output signal after the enhancement processing, $x\left(n\right)$ is the input signal, and $\mu$ is an enhancement factor with a value close to 1.
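Under the usual overlapped-framing convention, Eq. (1) and Eq. (2) can be sketched as follows, writing $N$ for the frame length:
$$f_{n}=\left\lfloor \frac{N_{x}-N_{0}}{N-N_{0}}\right\rfloor ,\qquad y\left(n\right)=x\left(n\right)-\mu \,x\left(n-1\right).$$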
After framing, the signal was windowed to increase the continuity between frames, reduce the edge effect, and reduce spectral leakage. The process is shown in Eq. (3). In Eq. (3), $s_{w}\left(n\right)$ is the signal after windowing, and $w\left(n\right)$ is the window function. This research used the Hamming window as the window function, as shown in Eq. (4).
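As a concrete illustration of this preprocessing chain, the following is a minimal NumPy sketch; the frame length of 1024 samples, the 1/2 overlap, and $\mu =0.97$ are assumed example values (the text only constrains the overlap to 1/3-1/2 of the frame length and $\mu$ to be close to 1).

```python
import numpy as np

def preprocess(signal, frame_len=1024, overlap=0.5, mu=0.97):
    """Pre-emphasize, frame with overlap, and Hamming-window a music signal.

    frame_len, overlap, and mu are illustrative values; the text only
    requires an overlap of 1/3 to 1/2 of the frame length and mu close to 1.
    """
    signal = np.asarray(signal, dtype=float)

    # Signal enhancement (Eq. (2)): y(n) = x(n) - mu * x(n-1).
    emphasized = np.append(signal[0], signal[1:] - mu * signal[:-1])

    # Overlapped framing; the frame count follows the convention of Eq. (1).
    overlap_len = int(frame_len * overlap)
    hop = frame_len - overlap_len
    n_frames = (len(emphasized) - overlap_len) // hop
    frames = np.stack([emphasized[i * hop: i * hop + frame_len]
                       for i in range(n_frames)])

    # Hamming windowing (Eqs. (3)-(4)): s_w(n) = s(n) * w(n), with
    # w(n) = 0.54 - 0.46 cos(2*pi*n / (frame_len - 1)).
    return frames * np.hamming(frame_len)
```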
The selection of music features determines the performance of the recognition system to some extent: good features improve both the accuracy and the speed of music signal classification. People's perception of music quality is mainly characterized by pitch, timbre, and rhythm, each of which can be abstracted into a feature vector. Most abstract timbre features are short-term features, while abstract pitch and rhythm features are mostly long-term features. From the perspective of the transform domain, short-term features are divided into time-domain (TD) features, frequency-domain (FD) features, and cepstrum-domain features.
The frequency-domain features (FDF) of music are characteristic parameters obtained by first applying the Fourier transform to the music signal and then processing it in the FD. Common FD features include the spectral centroid, spectral energy, spectral bandwidth, spectral sub-band energy, spectral flux, and spectral sub-band flux. The spectral centroid measures the center of the spectrum: the larger its value, the more high-frequency components the signal contains. It is calculated with Eq. (5).
In Eq. (5), $F\left(\omega \right)$ is the Fourier transform of a frame signal, and $l$ and $h$ are the minimum and maximum frequencies of the sub-band, respectively. The spectral energy is found with Eq. (6). The spectral bandwidth is the spectral-energy-weighted distance from the spectral centroid and mainly measures the FD range of the music signal, as shown in Eq. (7).
The spectral flux reflects the dynamic characteristics of the spectrum: it is the sum of the squared differences between corresponding FD points of two adjacent frames and thus measures the total spectral change. It is calculated with Eq. (8).
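In standard form (a sketch consistent with the verbal definitions above, with $F_{t}\left(\omega \right)$ the spectrum of frame $t$ and the sums taken over the sub-band $\left[l,h\right]$), Eqs. (5)-(8) can be written as
$$C=\frac{\sum _{\omega =l}^{h}\omega \left| F\left(\omega \right)\right| }{\sum _{\omega =l}^{h}\left| F\left(\omega \right)\right| },\qquad E=\sum _{\omega =l}^{h}\left| F\left(\omega \right)\right| ^{2},$$
$$B=\sqrt{\frac{\sum _{\omega =l}^{h}\left(\omega -C\right)^{2}\left| F\left(\omega \right)\right| ^{2}}{\sum _{\omega =l}^{h}\left| F\left(\omega \right)\right| ^{2}}},\qquad F_{flux}=\sum _{\omega }\left(\left| F_{t}\left(\omega \right)\right| -\left| F_{t-1}\left(\omega \right)\right| \right)^{2}.$$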
The mel-frequency cepstral coefficients (MFCCs) are cepstrum parameters extracted in the mel-scale FD. Eq. (9) shows the relationship between the mel frequency and the linear frequency. In Eq. (9), $f$ is the linear frequency. After framing, signal enhancement, and windowing, the mel-frequency coefficients (MPCs) of the music signal were extracted. Compared with the MFCC, the MPC does not require a discrete cosine transform; the result is output directly after the logarithm is computed. The extraction of the MPC is shown in Fig. 2.
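The mel scale that Eq. (9) refers to is conventionally related to the linear frequency $f$ (in Hz) by
$$f_{\mathrm{mel}}=2595\log _{10}\left(1+\frac{f}{700}\right),$$
which spaces the filter bands to match human pitch perception.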
The spectrum was decomposed into several sub-bands by a mel band-pass filter bank, and the natural logarithm of each sub-band output was then computed to obtain the MPC parameters. For better unsupervised learning, the restricted Boltzmann machine (RBM) was introduced on top of the MPC parameters, and the maximum-likelihood function was used to optimize the selected feature parameters. The MPC feature vector is the input layer of the RBM, and the network parameters are updated continuously during training. The update method is shown in Eq. (10). In Eq. (10), $\Delta w$ is the update of the weight matrix between the visible layer and the hidden layer, $\Delta a$ and $\Delta b$ are the updates of the bias vectors of the visible layer and the hidden layer, respectively, and $p\left(h\left| s\right.\right)$ is the hidden-layer probability distribution when the visible units are set to a specific training sample $s$.
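As an illustration of the update quantities in Eq. (10), the following is a minimal contrastive-divergence (CD-1) sketch in NumPy; the CD-1 rule, the binary units, and the learning rate are assumptions for illustration, since only $\Delta w$, $\Delta a$, and $\Delta b$ are specified above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.01):
    """One CD-1 step for a binary RBM on a batch of visible vectors v0.

    W couples the visible and hidden layers; a and b are the visible and
    hidden bias vectors, matching the quantities updated in Eq. (10).
    """
    # Positive phase: p(h|v0) for the training sample (p(h|s) in the text).
    ph0 = sigmoid(v0 @ W + b)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)

    # Negative phase: one Gibbs step down to the visible layer and back up.
    pv1 = sigmoid(h0 @ W.T + a)
    ph1 = sigmoid(pv1 @ W + b)

    # Gradient estimates: data statistics minus reconstruction statistics.
    dW = (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
    da = (v0 - pv1).mean(axis=0)
    db = (ph0 - ph1).mean(axis=0)
    return W + lr * dW, a + lr * da, b + lr * db
```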
Emotion is an attribute of music, and emotional characteristics are an important feature of music signals. The vectors related to emotion can be divided into TD feature vectors and FD feature vectors, which include the mean, variance, intensity, maximum, center of gravity, bandwidth, roll-off, and flux. To judge the impact of the features on the music emotions, this study adopted the sequential floating forward selection method to obtain the features that affect the emotional model. Because emotional features are subjective and one-sided, the classification effect is not ideal when they are used alone in classifiers. For this reason, mixed feature vectors were used in this study. The feature vector of frame $t$ can be expressed as Eq. (11). In Eq. (11), $M$ is the order of the MPC feature vector.
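Assuming Eq. (11) concatenates the two feature groups, the frame-$t$ vector can be sketched as
$$o_{t}=\left[c_{1},\ldots ,c_{M},e_{1},\ldots ,e_{12}\right]^{T},$$
where $c_{1},\ldots ,c_{M}$ are the MPC coefficients and $e_{1},\ldots ,e_{12}$ are the 12 TD and FD emotion features described below.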
The emotion feature vector adopts 12 TD and FD features. The two types of features are fused to obtain the final combined music-signal feature; the fusion process is shown in Fig. 3. In summary, the input music signal was preprocessed by framing, signal enhancement, and windowing, and a feature combination fusing the MPC features with the emotional features was extracted, which provided the basis for constructing the classification model.
Fig. 1. Diagram of overlapping frame division processing.
Fig. 2. MPC feature vector extraction process.
Fig. 3. Flow of fusing the MPC features and emotional features into a combined feature.
3.2 Music Classification based on DBN-HMM
Music classification plays a very important role in music signal retrieval: users who like listening to piano music are not interested in every style of music. To enable retrieval according to users' interests, the pieces were classified according to the fused features of piano music, which facilitates efficient management and fast retrieval. An HMM contains two stochastic processes, the observed states and the hidden states, and uses parameters to represent the statistical properties of the stochastic process. Because the HMM reflects the essential properties of sound, it is widely used in speech signal processing. It can be expressed as the quintuple in Eq. (12).
In Eq. (12), $\Omega _{x}$ is the state set, $\Omega _{O}$ is the observation value set, $A$ is the transition probability matrix indicating the probability of transitioning from the state $q_{i}$ at the current moment to the state $q_{j}$ at the next moment, $B$ is the output probability matrix indicating the probability of the observed value $o_{M}$ when the state is $q_{i}$, and $\pi$ is the initial state distribution.
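Collecting these five elements, the quintuple of Eq. (12) is
$$\lambda =\left(\Omega _{x},\Omega _{O},A,B,\pi \right).$$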
Fig. 4. Diagram of DBN network structure.
The model obtains the relevant parameters of the HMM by calculating the probability of the observation sequence and the probability of each state, and it thus classifies the sample data. However, the HMM depends on the state labels of the training data, and in actual training much of the raw data lacks labels, which degrades the recognition performance. To avoid this defect, the HMM and the DBN were combined to classify music signals. Fig. 4 shows the structure of the DBN.
The network of a DBN is a highly complex directed acyclic graph composed of stacked RBMs and is a hierarchical unsupervised learning model. A DBN can effectively utilize data with missing labels, its deep network structure enhances the ability to model signal features, and it provides more accurate observation probabilities. The joint probability distribution represents the relationship between the visible layer and the hidden layers; Eq. (13) displays the calculation method.
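In the standard DBN formulation, this joint distribution over the visible vector $v$ and the hidden layers $h^{1},\ldots ,h^{l}$ can be sketched as
$$P\left(v,h^{1},\ldots ,h^{l}\right)=P\left(v\left| h^{1}\right.\right)\left(\prod _{k=1}^{l-2}P\left(h^{k}\left| h^{k+1}\right.\right)\right)P\left(h^{l-1},h^{l}\right),$$
where the top two layers form an undirected RBM and each lower layer is generated top-down.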
In Eq. (13), $l$ is the number of hidden layers of the DBN. The RBMs provide good initial parameter values through layer-by-layer pretraining, and the network was then fine-tuned with traditional learning algorithms. The batch gradient descent method was used for network tuning, and the overall loss function over the samples is shown in Eq. (14). In Eq. (14), ${W_{ij}}^{\left(l\right)}$ is the connection weight coefficient; $l$ is the number of hidden layers; $i$ and $j$ are the node indices in the current and next hidden layers, respectively; ${b_{i}}^{l}$ is the offset (bias) of the node; and $h_{W,b}\left(x^{\left(i\right)}\right)$ is the reconstructed network output for sample $x^{\left(i\right)}$. The partial derivatives with respect to the weight coefficients and biases are shown in Eq. (15).
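Under the usual batch-gradient formulation with weight decay, a loss consistent with these symbols, and the corresponding updates, can be sketched as
$$J\left(W,b\right)=\frac{1}{m}\sum _{i=1}^{m}\frac{1}{2}\left\| h_{W,b}\left(x^{\left(i\right)}\right)-y^{\left(i\right)}\right\| ^{2}+\frac{\lambda }{2}\sum _{l}\sum _{i,j}\left({W_{ij}}^{\left(l\right)}\right)^{2},$$
$${W_{ij}}^{\left(l\right)}\leftarrow {W_{ij}}^{\left(l\right)}-\alpha \frac{\partial J}{\partial {W_{ij}}^{\left(l\right)}},\qquad {b_{i}}^{\left(l\right)}\leftarrow {b_{i}}^{\left(l\right)}-\alpha \frac{\partial J}{\partial {b_{i}}^{\left(l\right)}},$$
where $m$ is the number of samples, $y^{\left(i\right)}$ the target for sample $x^{\left(i\right)}$, $\lambda$ a weight-decay coefficient, and $\alpha$ the learning rate; these four symbols are introduced here for illustration.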
Fig. 5. Basic structure of DBN+HMM model.
The DBN-HMM model was trained and classified by estimating the posterior probabilities of the HMM states; its basic structure is shown in Fig. 5. To improve the loss function, the gradient descent method was used to minimize the reconstruction mean square error, and the objective function was taken as the cross-entropy between the reference state label and the predicted state distribution, as in Eq. (16).
In Eq. (16), $s$ is the current state, and $y\left(s\right)$ is the predicted state distribution. The outputs of the DBN output-layer nodes are the inputs of the HMM, and the posterior probabilities of the HMM states are calculated with the Softmax regression model. The output state distribution is expressed as Eq. (17). In Eq. (17), $P\left(s\right)$ is the prior probability of state $s$ appearing in the training data. The gradient of the objective function with respect to the activation probability is shown in Eq. (18). In Eq. (18), $F_{CE}$ is the objective function, $a_{n}\left(s\right)$ is the activation probability, and $\delta _{s,s_{n}}$ is the Kronecker delta function, which satisfies Eq. (19).
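These definitions match the standard hybrid formulation, under which Eqs. (16)-(19) can be sketched as follows, with $n$ indexing frames and $s_{n}$ the reference state of frame $n$:
$$F_{CE}=-\sum _{n}\log y_{n}\left(s_{n}\right),\qquad p\left(x_{n}\left| s\right.\right)\propto \frac{y_{n}\left(s\right)}{P\left(s\right)},$$
$$\frac{\partial F_{CE}}{\partial a_{n}\left(s\right)}=y_{n}\left(s\right)-\delta _{s,s_{n}},\qquad \delta _{s,s_{n}}=\begin{cases}1, & s=s_{n}\\ 0, & s\neq s_{n}\end{cases}.$$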
It was found that the recognition performance of the model still needed further improvement. During HMM training, the Baum-Welch algorithm is sensitive to the randomly selected initial matrix parameters: different initializations produce large differences in the result, which can become trapped in a local optimum and affect the accuracy of the model's classification and recognition. To address this sensitivity to the random initialization, the global search capability of a genetic algorithm (GA) was used to optimize the initial matrix parameters of the HMM.
However, the traditional GA easily produces ``super individuals'' during evolution, which hamper the subsequent evolution. For this reason, the mutation operator was improved: a chaos operator was used to carry out the mutation operation and improve the quality of the evolution result. Mutation contributes to the generation of new individuals, and the mapping model obtained through the chaotic mapping operator is shown in Eq. (20). In Eq. (20), $x$ is the neuron's internal state, $k$ is a damping factor of the nerve membrane, and $g\left(x\right)$ is the nonlinear self-feedback. Starting from Gaussian mutation, the Gaussian normal distribution function was replaced with the chaotic mapping function; the improved mutation operator is shown in Eq. (21). In Eq. (21), $s$ is the mutation scale, and $g$ is the annealing factor. Based on these operations, the GA was used to initialize the parameters of the HMM, and the DBN was then used to classify the style of a piano music signal. Fig. 6 displays the classification flow.
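To make the overall scheme concrete, the following is a minimal sketch of a GA that searches for an initial HMM transition matrix with a chaos-based mutation; the logistic map, the uniform crossover, and the placeholder fitness function are illustrative assumptions, not the exact operators of Eqs. (20) and (21).

```python
import numpy as np

rng = np.random.default_rng(1)

def chaotic_sequence(x, steps=10):
    """Logistic map x <- 4x(1-x): an illustrative stand-in chaos operator."""
    for _ in range(steps):
        x = 4.0 * x * (1.0 - x)
    return x

def normalize_rows(m):
    """Project a matrix back onto valid (row-stochastic) HMM parameters."""
    m = np.abs(m)
    return m / m.sum(axis=1, keepdims=True)

def ga_init_hmm(fitness, n_states, pop_size=20, generations=50, p_mut=0.1):
    """Search for a good initial HMM transition matrix A with a GA whose
    mutation perturbs genes through the chaotic map instead of Gaussian noise."""
    pop = [normalize_rows(rng.random((n_states, n_states)))
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = np.array([fitness(A) for A in pop])
        # Keep the fitter half of the population as parents.
        order = np.argsort(-scores)
        parents = [pop[i] for i in order[:pop_size // 2]]
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = rng.choice(len(parents), 2, replace=False)
            mask = rng.random((n_states, n_states)) < 0.5  # uniform crossover
            child = np.where(mask, parents[p1], parents[p2])
            if rng.random() < p_mut:  # chaotic mutation of one random gene
                i, j = rng.integers(n_states, size=2)
                child[i, j] = chaotic_sequence(rng.random())
            children.append(normalize_rows(child))
        pop = parents + children
    return max(pop, key=fitness)

# Placeholder fitness (illustrative only): prefer diagonal-dominant matrices.
best_A = ga_init_hmm(lambda A: float(np.trace(A)), n_states=4)
```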
Fig. 6. Classification process of piano-music style-classification model based on DBN-GA-HMM.