
  1. ( School of Education, Shandong Women’s University, Jinan, Shandong 250300, China lsnliu81@outlook.com)
  2. ( School of Economics, Shandong Women’s University, Jinan, Shandong 250300, China xwangwx@outlook.com)



Keywords: Accuracy, Conditional random field, Main melody extraction, Vocal information

1. Introduction

The continuous advances in computers and technology have greatly facilitated the creation and distribution of music [1]. At the same time, the proliferation of music content has created an increasing need for more efficient methods to manage and utilize this vast musical landscape [2]. Polyphonic music encompasses compositions in which two or more notes sound simultaneously [3]. Such music is composed of multiple instrumental sounds and melodies superimposed upon each other, and the sequence carrying the sung melody is referred to as the vocal main melody. While listeners with healthy hearing can accurately identify the main melody amidst the accompaniment, this task is quite challenging for computers. Main melody extraction algorithms are designed to extract the pitch sequence of the vocals from polyphonic music, which has extensive applications in music retrieval, classification, and recognition [4]. Li et al. [5] proposed an approach for melodic trajectory extraction and barcode encoding. The approach automatically extracted melodic trajectories from musical instrument digital interface (MIDI) files and generated similar classical music, and their experiments showed 94.7% accuracy in extracting melodic trajectories. Gao et al. [6] developed a high-resolution network (HRNet)-based method for extracting vocal melodies, effectively reducing accompaniment interference. Zhang et al. [7] created musical features through a constant-Q transform and employed an extreme learning machine to determine melodic pitches; their method extracted melodies with high accuracy and speed across three datasets. Kum et al. [8] introduced a joint detection and classification network that performs singing voice detection and pitch estimation simultaneously.

In this study, a main melody extraction algorithm based on the conditional random field (CRF) was designed for processing vocal information. The algorithm exploits the advantages of convolutional neural networks (CNN) in feature extraction and the reliability of the CRF in capturing temporal relationships, effectively integrating multiple features to improve the accuracy of main melody extraction. The experimental results confirmed the effectiveness of the method and its potential applications in practical vocal information processing. This algorithm offers a novel approach to main melody extraction, provides a theoretical reference for the application of CNN and CRF to audio processing, and offers a reference for research on music retrieval and related topics.

2. Melody Extraction for Vocal Information

2.1 Vocal Information Features

(1) Mel-frequency cepstral coefficient (MFCC)

The MFCC is a feature commonly used in speech recognition [9]. Because it mimics the frequency perception of the human ear, it is well suited to vocal information processing [10]. When extracting the MFCC, a pre-emphasis is first applied to the original audio signal:

(1)
$H\left(z\right)=1-\alpha z^{-1}$,

i.e., $y\left(n\right)=x\left(n\right)-\alpha x\left(n-1\right)$, where $n$ is the time point and $\alpha $ is the pre-emphasis factor.

The Hamming window is then used:

(2)
$w\left(n\right)=0.54-0.46\cos \left(\frac{2\pi n}{N}\right),\quad 0\leq n\leq N$,

where the window function length is $L=N+1$.

The preprocessed sequence $x\left(n\right)$ is then transformed to the frequency domain by a fast Fourier transform (FFT):

(3)
$X\left(k\right)=\sum _{n=0}^{N-1}x\left(n\right)e^{-j\frac{2\pi }{N}nk},\quad k=0,1,2,\cdots ,N-1$,

where $N$ is the frame length and $X\left(k\right)$ is the complex spectrum of $N$ points. The output energy of each Mel filter is then calculated:

(4)
$S\left(m\right)=\sum _{k=0}^{N-1}E\left(k\right)H_{m}\left(k\right)$,

where $E\left(k\right)$, $H_{m}\left(k\right)$, and $k$ are the spectral line energy, the frequency-domain response of the $m^{\mathrm{th}}$ Mel filter, and the index of the spectral line, respectively. The discrete cosine transform (DCT) is applied to $S\left(m\right)$ to obtain the MFCC features:

(5)
$MFCC\left(n\right)=\sqrt{\frac{2}{M}}\sum _{m=1}^{M}\log S\left(m\right)\cos \left[\frac{\pi n\left(2m-1\right)}{2M}\right]$,

where $M$ is the number of Mel filters, $m$ indexes the $m^{\mathrm{th}}$ filter, and $n$ is the order of the cepstral coefficient.
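As an illustration of the pipeline in Eqs. (1)-(5), the following Python sketch computes MFCC features with NumPy and SciPy, using librosa only to build the Mel filter bank. The frame length, hop size, filter count, and number of coefficients are illustrative assumptions rather than values taken from this paper.

# A minimal MFCC sketch following Eqs. (1)-(5); parameter values are assumptions.
import numpy as np
import librosa
from scipy.fftpack import dct

def mfcc_features(x, sr, frame_len=1024, hop=512, n_mels=40, n_mfcc=13, alpha=0.97):
    # Eq. (1): pre-emphasis, y(n) = x(n) - alpha * x(n-1)
    x = np.append(x[0], x[1:] - alpha * x[:-1])

    # Eq. (2): framing and Hamming windowing
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))
    frames = np.array([x[s:s + frame_len] * window
                       for s in range(0, len(x) - frame_len + 1, hop)])

    # Eq. (3): FFT of each frame; keep the spectral line energies E(k)
    spectrum = np.fft.rfft(frames, n=frame_len)
    energy = np.abs(spectrum) ** 2

    # Eq. (4): energy of each Mel filter, S(m) = sum_k E(k) H_m(k)
    mel_fb = librosa.filters.mel(sr=sr, n_fft=frame_len, n_mels=n_mels)
    log_mel = np.log(energy @ mel_fb.T + 1e-10)

    # Eq. (5): DCT of the log Mel energies gives the MFCCs
    # (the orthonormal DCT absorbs the sqrt(2/M) scaling)
    return dct(log_mel, type=2, axis=1, norm='ortho')[:, :n_mfcc]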

(2) Chroma feature

The chroma feature is based on equal temperament [11]. In vocal information, pitch describes how high or low a voice sounds, and equal temperament can be used to calculate the pitch of a note, so chroma features describe the pitch variations in a melody well. The extracted chroma feature is a 12-dimensional vector. Let $v_{j}$ denote the $j^{\mathrm{th}}$ audio frame in the sequence, and let $c_{l}^{j}$, $l\in \left\{0,1,\cdots ,11\right\}$, denote its $l^{\mathrm{th}}$ chroma value. First, the center frequency $f_{l}$ of the $l^{\mathrm{th}}$ pitch class (semitone class) is calculated as follows:

(6)
$ f_{l}=f_{rep}\cdot 2^{\frac{l+1}{12}} $

where $f_{rep}$ refers to the fundamental frequency of the signal. The $l^{\mathrm{th}}$ chroma value $c_{l}^{j}$ is obtained by normalizing the energy $chroma_{l}$ of the $l^{\mathrm{th}}$ pitch class, which is in turn calculated from the spectral peaks of the frame:

(7)
$c_{l}^{j}=\frac{chroma_{l}}{\max \left(chroma_{l}\right)}$,
(8)
$chroma_{l}=\sum _{i=1}^{N}w\left(l,f_{i}\right)\cdot A_{i}^{2}$,
(9)
$w\left(l,f_{i}\right)=\left\{\begin{array}{l} \cos ^{2}\left(\frac{\pi }{2}\cdot \frac{d}{0.5\times \left(l+1\right)}\right),\; \left| d\right| \leq 0.5\times \left(l+1\right)\\ 0,\; \left| d\right| >0.5\times \left(l+1\right) \end{array}\right.$,
(10)
$d=12\times \log _{2}\frac{f_{l}}{f_{i}}$,

where $A_{i}$ refers to the amplitude at the $i^{\mathrm{th}}$ spectral peak, and $w\left(l,f_{i}\right)$ is the weight of the signal component with frequency $f_{i}$ in the $l^{\mathrm{th}}$ pitch class.
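The chroma computation in Eqs. (6)-(10) can be sketched as follows, assuming the spectral peak frequencies and amplitudes of one frame have already been detected; the reference frequency f_ref and the peak-picking step are assumptions, not details specified in this paper.

# A minimal sketch of the 12-dimensional chroma vector of one frame.
import numpy as np

def chroma_vector(peak_freqs, peak_amps, f_ref=27.5):
    chroma = np.zeros(12)
    for l in range(12):
        # Eq. (6): center frequency of the l-th pitch class
        f_l = f_ref * 2.0 ** ((l + 1) / 12.0)
        band = 0.5 * (l + 1)
        for f_i, a_i in zip(peak_freqs, peak_amps):
            # Eq. (10): semitone distance between the class center and the peak
            d = 12.0 * np.log2(f_l / f_i)
            # Eq. (9): cosine-squared weight inside the tolerance band
            w = np.cos(0.5 * np.pi * d / band) ** 2 if abs(d) <= band else 0.0
            # Eq. (8): accumulate the weighted peak energy
            chroma[l] += w * a_i ** 2
    # Eq. (7): normalize by the maximum pitch-class energy
    return chroma / (chroma.max() + 1e-10)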

2.2 Conditional Random Field (CRF)

The CRF has a wide range of applications in sequence annotation [12]. Therefore, it can also be used in pitch sequence annotation. Suppose the input observation sequence is $X=\left(X_{1},X_{2},\cdots ,X_{n}\right)$ and the output label sequence is $Y=\left(Y_{1},Y_{2},\cdots ,Y_{n}\right)$. The conditional probability distribution defined by the CRF can be written as follows:

(12)
$P\left(y|X\right)=\frac{e^{score\left(X,y\right)}}{\sum _{\hat{y}}e^{score\left(X,\hat{y}\right)}}$,
(13)
$score\left(X,y\right)=\sum _{i}A_{{y_{i}},{y_{i+1}}}+\sum _{i}B_{i,{y_{i}}}$,

where $A_{{y_{i}},{y_{i+1}}}$ is an element of the transfer feature matrix, denoting the transfer score from label $y_{i}$ to label $y_{i+1}$, and $B_{i,{y_{i}}}$ is an element of the emission matrix, denoting the score of the $i^{\mathrm{th}}$ position taking label $y_{i}$. The final labeling result is

(14)
$\tilde{y}=\underset{y\in Y}{\arg \max }\,P\left(y|X\right)$.
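To make Eqs. (12)-(14) concrete, the toy NumPy sketch below enumerates all label sequences of a tiny problem to compute the path scores, the normalized probabilities, and the highest-scoring sequence. The brute-force enumeration and the random matrices are for illustration only; in practice, the normalization term and the argmax are computed with dynamic programming.

# Toy illustration of the CRF score and conditional probability.
import itertools
import numpy as np

K, T = 3, 4                      # number of labels, sequence length
rng = np.random.default_rng(0)
A = rng.normal(size=(K, K))      # transfer scores A[y_i, y_{i+1}]
B = rng.normal(size=(T, K))      # emission scores B[i, y_i]

def score(y):
    # Eq. (13): sum of transfer and emission scores along the path
    return sum(A[y[i], y[i + 1]] for i in range(T - 1)) + sum(B[i, y[i]] for i in range(T))

paths = list(itertools.product(range(K), repeat=T))
Z = sum(np.exp(score(y)) for y in paths)          # normalization term in Eq. (12)
probs = {y: np.exp(score(y)) / Z for y in paths}  # P(y | X)
best = max(paths, key=lambda y: probs[y])         # Eq. (14): argmax over y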

2.3 Main Melody Extraction Algorithm Design

A main melody extraction algorithm based on CRF was designed to process vocal information using MFCC and chroma features as inputs. The CRF was used to obtain the best melodic line. A convolutional neural network (CNN) was added before the CRF to extract deeper melodic features and improve the extraction of the main melodic line. The CNN adopts a six-layer convolutional structure, with dilated (atrous) convolutions [13] in the first two layers. The first layer uses 32 dilated convolution kernels (5${\times}$5) with a dilation rate of (1,10), while the second layer uses 64 dilated convolution kernels (3${\times}$5) with a dilation rate of (12,1). The integration and compression of the features are achieved through a fully connected layer after the CNN layers, which serves as the input to the CRF layer.
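A minimal Keras sketch of a CNN front end of this kind is shown below. The first two layers follow the kernel sizes and dilation rates given above; the input shape, the kernel sizes of layers 3-6, and the number of output pitch classes are illustrative assumptions rather than the exact configuration of this paper.

# A sketch of the dilated-convolution CNN feeding per-frame scores to the CRF.
import tensorflow as tf
from tensorflow.keras import layers

def build_cnn(input_shape=(360, 128, 1), num_pitches=61):
    inputs = tf.keras.Input(shape=input_shape)   # (time, frequency, 1) spectrogram excerpt
    # Layer 1: 32 dilated 5x5 kernels, dilation rate (1, 10)
    x = layers.Conv2D(32, (5, 5), dilation_rate=(1, 10), padding='same', activation='relu')(inputs)
    # Layer 2: 64 dilated 3x5 kernels, dilation rate (12, 1)
    x = layers.Conv2D(64, (3, 5), dilation_rate=(12, 1), padding='same', activation='relu')(x)
    # Layers 3-6: ordinary convolutions (kernel sizes assumed)
    for filters in (64, 64, 128, 128):
        x = layers.Conv2D(filters, (3, 3), padding='same', activation='relu')(x)
    x = layers.Dropout(0.5)(x)
    # Fully connected layer integrating and compressing the features; the output
    # serves as per-frame emission scores for the CRF layer
    x = layers.Reshape((input_shape[0], -1))(x)
    x = layers.Dense(num_pitches)(x)
    return tf.keras.Model(inputs, x)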

In the main melody extraction, the CRF is used to calculate the pitch score at time $t$ and then output the melodic sequence. In this task,

(15)
$score\left(y|A,B\right)=B_{1,{y_{1}}}+\sum _{t=2}^{T}\left(A_{{y_{t-1}},{y_{t}}}+B_{t,{y_{t}}}\right)$,

where $A$ is the transfer feature matrix, and $A_{{y_{t-1}},{y_{t}}}$ represents the probability of pitch $y_{t-1}$ at time $t-1$ transferring to pitch $y_{t}$ at time $t$. $B$ represents the emission probability matrix, and $B_{t,{y_{t}}}$ is the probability that the pitch at time $t$ is $y_{t}$.

The CRF result was decoded using the Viterbi algorithm [14]. Let $\omega _{t}\left(f\right)$ denote the maximum score over all paths ending at pitch $f$ at time $t$, $f=1,2,\cdots ,K$, and let $\delta _{t+1}\left(f\right)$ denote the pitch at time $t$ that achieves this maximum for pitch $f$ at time $t+1$, i.e., the backtracking pointer. The Viterbi decoding process is expressed below.

The relevant parameters are initialized:

(16)
$\omega _{1}\left(f\right)=A_{{y_{0}},f}+B_{f,1}$,
(17)
$\delta _{1}\left(f\right)=y_{0}$,

where $A_{{y_{0}},f}$ is the probability of the initial pitch $y_{0}$ transferring to pitch $f$, and $B_{f,1}$ is the probability of the pitch being $f$ at time 1.

After recursion using the above equation, the following equations can be obtained:

(18)
$\omega _{t+1}\left(f\right)=\max _{1\leq j\leq K}\left\{\omega _{t}\left(j\right)+A_{j,f}+B_{f,t+1}\right\}$, $f=1,2,\cdots ,K$,
(19)
$\delta _{t+1}\left(f\right)=\underset{1\leq j\leq K}{\arg \max }\left\{\omega _{t}\left(j\right)+A_{j,f}+B_{f,t+1}\right\}$, $f=1,2,\cdots ,K$.

The optimal solution is obtained after backtracking:

(20)
$y'_{T}=\underset{1\leq j\leq K}{\arg \,\max }\omega _{T}\left(j\right)$,
(21)
$y'_{t}=\delta _{t+1}\left(y'_{t+1}\right)$.

The main melodic pitch sequence output by the CNN-CRF algorithm is $y'=\left(y'_{1},y'_{2},\cdots ,y'_{T}\right)$.
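The Viterbi decoding in Eqs. (16)-(21) can be sketched in NumPy as follows, assuming a transfer score matrix A whose first row holds the scores from the initial state y_0, and per-frame emission scores B produced by the network; both are assumptions about how the learned parameters are laid out.

# A minimal Viterbi decoder for the CRF output.
import numpy as np

def viterbi_decode(A, B):
    """A: (K+1, K) transfer scores, row 0 is the initial state y_0.
       B: (T, K) emission scores; returns the decoded pitch index sequence y'."""
    T, K = B.shape
    omega = np.zeros((T, K))             # omega_t(f): best score ending at pitch f
    delta = np.zeros((T, K), dtype=int)  # delta_t(f): backtracking pointers

    # Eqs. (16)-(17): initialization from the initial state y_0
    omega[0] = A[0] + B[0]
    delta[0] = 0

    # Eqs. (18)-(19): recursion over time
    for t in range(1, T):
        scores = omega[t - 1][:, None] + A[1:] + B[t][None, :]   # (K, K)
        omega[t] = scores.max(axis=0)
        delta[t] = scores.argmax(axis=0)

    # Eqs. (20)-(21): backtracking to recover the optimal pitch sequence
    y = np.zeros(T, dtype=int)
    y[-1] = omega[-1].argmax()
    for t in range(T - 2, -1, -1):
        y[t] = delta[t + 1][y[t + 1]]
    return y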

3. Results and Analysis

3.1 Experimental Setup

The experiment was performed using MATLAB 2019 and Python 3.6, and the algorithm was implemented with the TensorFlow framework. The CNN-CRF model was trained with the Adam optimizer, with a learning rate of 0.001, a batch size of 20, and a maximum of 100 iterations. Overfitting was mitigated by adding a dropout layer with a rate of 0.5 after the output, and the rectified linear unit (ReLU) was used as the activation function. In the experiment, the larger MIR1K dataset [15] was used as the training and validation set to help the model learn more data patterns and features. The independent ADC2004 and MIREX05 datasets, which were not encountered by the model during training and validation, were used as test sets to evaluate the model performance objectively. Table 1 lists the three datasets.

(1) MIR1K consists of 1,000 clips extracted from 110 songs performed by 19 amateur singers, with each clip lasting 4-13 s.

(2) ADC2004 includes 20 pop music clips, including jazz and R&B, each lasting approximately 20 s.

(3) MIREX05 includes 13 clips of pop and pure music, each lasting 24-39 s.

Table 1. Experimental Data Set.

Data set category    Name
Training set         70% of MIR1K
Validation set       30% of MIR1K
Test set             ADC2004, MIREX05
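A hedged sketch of the training configuration described above is given below; `model` stands for the CNN-CRF network and `X_train`, `y_train`, `X_val`, and `y_val` stand for the 70%/30% MIR1K split. The cross-entropy loss shown here is an assumption made for illustration; the actual objective trains the CRF layer end to end.

# Illustrative training setup with the hyperparameters from Section 3.1.
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)   # Adam, learning rate 0.001
model.compile(optimizer=optimizer,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          batch_size=20,     # batch size given in Section 3.1
          epochs=100)        # maximum number of iterations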

3.2 Evaluation Indicators

The algorithms were evaluated using the methodology described in [16], as shown in Table 2.

Table 2. Evaluation Indicators.
../../Resources/ieie/IEIESPC.2024.13.4.322/tb2.png
Table 3. Explanation of the Parameters in Table 2.

                                  Ground Truth
                          Melodic    Without melody    Total
Test      With melody        TP            FP            DV
results   Without melody     FN            TN            DU
          Total              GV            GU            TO

Table 3 lists the explanations of the parameters in Table 2.

According to Table 3, the following are defined.

TP: there is a melody, and it is correctly detected;

TN: there is no melody, and none is detected;

FP: there is no melody, but a melody is detected;

FN: there is a melody, but it is not detected;

DV: total frames detected as having a melody;

DU: total frames detected as having no melody;

GV: frames with a melody in the Ground Truth;

GU: frames without a melody in the Ground Truth;

TO: total number of frames.

In RPA and RCA, $c$ refers to the pitch, and $ch$ refers to the tone level.
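Because Table 2 is reproduced as an image, the indicators are sketched below following the standard frame-level melody-extraction definitions cited in [16]; this is an assumed reconstruction, with ground-truth and estimated pitches given per frame in Hz and 0 denoting an unvoiced frame.

# Assumed frame-level evaluation indicators (VR, VFA, RPA, RCA, OA).
import numpy as np

def evaluate(ref_pitch, est_pitch, tol_cents=50.0):
    ref_voiced = ref_pitch > 0
    est_voiced = est_pitch > 0
    GV, GU, TO = ref_voiced.sum(), (~ref_voiced).sum(), len(ref_pitch)

    # Pitch error in cents, only where both frames are voiced
    both = ref_voiced & est_voiced
    cents = np.zeros(TO)
    cents[both] = 1200.0 * np.abs(np.log2(est_pitch[both] / ref_pitch[both]))
    chroma_cents = np.zeros(TO)
    chroma_cents[both] = np.abs((cents[both] + 600.0) % 1200.0 - 600.0)  # fold octaves

    VR = (ref_voiced & est_voiced).sum() / GV               # voicing recall
    VFA = (~ref_voiced & est_voiced).sum() / GU              # voicing false alarm
    RPA = (both & (cents <= tol_cents)).sum() / GV           # raw pitch accuracy
    RCA = (both & (chroma_cents <= tol_cents)).sum() / GV    # raw chroma accuracy
    correct_voiced = both & (cents <= tol_cents)
    OA = (correct_voiced.sum() + (~ref_voiced & ~est_voiced).sum()) / TO  # overall accuracy
    return VR, VFA, RPA, RCA, OA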

3.3 Analysis of Results

Under the same conditions, the effects of the features selected in this paper on the main melody extraction results were analyzed, as shown in Fig. 1.

Fig. 1. Effect of the features on the main theme extraction results (ADC2004 dataset).
../../Resources/ieie/IEIESPC.2024.13.4.322/fig1.png

The main melody extraction algorithm achieved an OA and VFA of 85.74% and 7.43%, respectively, when using only the MFCC as the input feature for the CNN-CRF algorithm (Fig. 1). The OA and VFA were 83.62% and 7.26%, respectively, when using only the chroma. Thus, with the chroma alone, the OA decreased while the VFA dropped slightly; moreover, the RCA, RPA, and VR improved, because the chroma is a pitch-related feature. When the MFCC and chroma were used together as inputs for the CNN-CRF algorithm, the algorithm achieved an OA of 86.72%, a 0.98% improvement over using the MFCC alone and a 3.1% improvement over using the chroma alone. Moreover, the VFA was 6.84%, a 0.59% reduction compared to using the MFCC alone and a 0.42% reduction compared to using the chroma alone. The RCA remained high at 85.76%, slightly lower than when using the chroma alone, while both the RPA and VR improved.

Fig. 2 presents the results of the MIREX05 dataset.
Fig. 2. Effect of the features on main melody extraction results (MIREX05 dataset).
../../Resources/ieie/IEIESPC.2024.13.4.322/fig2.png

When only MFCC was used, the algorithm achieved an OA and VFA of 83.56% and 8.14%, respectively. On the other hand, when only chroma was used, the OA was slightly lower at 82.34%, but the VFA was higher at 9.33%. Hence, the MFCC was more accurate in extracting the main melody in the MIREX05 dataset. This difference in performance may be attributed to the MIREX05 dataset containing both pop and pure music. MFCC was more sensitive to vocals, making it better at distinguishing pitch in vocal information.

Similar to the ADC2004 dataset, chroma performed better regarding RCA with a value of 84.06%. When both MFCC and chroma were used, the CNN-CRF algorithm achieved an OA and VFA of 85.21% and 11.16%, respectively, on the MIREX05 dataset. In addition, it demonstrated impressive performance on RCA, RPA, and VR with values of 83.91%, 82.56%, and 86.33%, respectively. Overall, including MFCC and chroma as input features improved the algorithm OA. Although the VFA showed a slight increase, the enhanced performance on RCA, RPA, and VR underscored the valuable contribution of vocal information features to the main melody extraction in this study.

The effectiveness of the CNN-CRF algorithm was evaluated by a comparison with other algorithms as follows:

(1) SegNet [17]: a model based on the encoder-decoder structure that locates melodic frequencies through pooling indices;

(2) frequency-temporal attention network (FTANet) [18]: a neural network with a frequency-temporal attention structure that analyzes temporal patterns with a temporal attention module and selects salient frequency bands with a frequency attention module.

Table 4 lists the results of the two datasets.

The bold data in Table 4 are the optimal values. According to Table 4, on the ADC2004 dataset, the SegNet algorithm exhibited the highest VR at 88.24%. On the other hand, its VFA and OA were 19.93% and 82.72%, respectively. The FTANet algorithm showed a significantly lower VFA, only 10.53%, and achieved an OA of 84.99%. In contrast, the CNN-CRF algorithm achieved the lowest VFA, 6.84%, which was 13.09% lower than the SegNet algorithm and 3.69% lower than the FTANet algorithm. Its OA was the highest at 86.72%, showing 4% improvement over the SegNet algorithm and 1.73% improvement over the FTANet algorithm.

On the MIREX05 dataset, the CNN-CRF algorithm achieved the best values across all indicators. It achieved a VFA of 11.16%, which was 8.81% lower than the SegNet algorithm and 9.45% lower than the FTANet algorithm. Furthermore, it attained an OA of 85.21%, 8.92% higher than the SegNet algorithm and 7.01% higher than the FTANet algorithm. These results demonstrate the advantage of the CNN-CRF algorithm in extracting main melodies.

In summary, the CNN-CRF algorithm uses the MFCC and chroma as features to capture the human voice in vocal information effectively. Compared to the ADC2004 dataset, the MIREX05 dataset features a greater abundance of popular music, making the improvements in the results more pronounced. The addition of the chroma significantly improved the RCA of the algorithm, resulting in higher OA and lower VFA.

Table 4. Comparisons with Other Algorithms (%).

Dataset    Algorithm    VR      RPA     RCA     VFA     OA
ADC2004    SegNet       88.24   83.47   85.31   19.93   82.72
           FTANet       85.97   84.41   84.77   10.53   84.99
           Ours         86.86   85.43   85.76    6.84   86.72
MIREX05    SegNet       71.54   65.01   66.15   19.97   76.29
           FTANet       78.74   71.87   72.75   20.61   78.20
           Ours         86.33   82.56   83.91   11.16   85.21

4. Conclusion

This paper presented a CNN-CRF method that uses MFCC and chroma features to process vocal music information, specifically to extract the main melody. The experimental results on two datasets showed that the selected features significantly improved the accuracy of main melody extraction in vocal music, with the best results obtained when the MFCC and chroma were used together. The algorithm achieved an OA of 86.72% and 85.21% on the ADC2004 and MIREX05 datasets, respectively, and consistently outperformed the comparison algorithms: it surpassed the SegNet and FTANet algorithms on the ADC2004 dataset in all indicators except the VR and on the MIREX05 dataset in all indicators, highlighting its effectiveness in main melody extraction. This approach has promising applications in real-world vocal music information processing. Nevertheless, the method also has some limitations: the CNN model can be optimized further, the music theory analysis is not sufficiently in-depth, and the experimental dataset is relatively limited. Future work will explore further improvements to the CNN-CRF algorithm, conduct a more thorough analysis of its performance, consider deeper levels of vocal information to enhance the reliability of melody extraction, and validate the algorithm on larger and more complex music libraries.

REFERENCES

1 
C. Roig, L. J. Tardón, I. Barbancho, and A. M. Barbancho, ``A Non-Homogeneous Beat-Based Harmony Markov Model,'' Knowledge-Based Systems, Vol. 142, pp. 85-94, 2018.DOI
2 
D. Bisharad, and R. H. Laskar, ``Music genre recognition using convolutional recurrent neural network architecture,'' Expert Systems, Vol. 36, No. 4, pp. e12429.1-e12429.13, 2019.DOI
3 
J. Abeßer, ``Informing Piano Multi-Pitch Estimation with Inferred Local Polyphony Based on Convolutional Neural Networks,'' Electronics, Vol. 10, No. 7, pp. 1-18, 2021.DOI
4 
D. Martin-Gutierrez, G. Hernandez Penaloza, A. Belmonte-Hernandez, and F. Alvarez Garcia, ``A Multimodal End-To-End Deep Learning Architecture for Music Popularity Prediction,'' IEEE Access, Vol. 8, pp. 39361-39374, 2020.DOI
5 
S. Li, S. Jang, and Y. Sung, ``Melody Extraction and Encoding Method for Generating Healthcare Music Automatically,'' Electronics, Vol. 8, No. 11, pp. 1-15, 2019.DOI
6 
Y. Gao, X. Zhang, and W. Li, ``Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation,'' Electronics, Vol. 10, No. 3, pp. 1-14, 2021.DOI
7 
W. Zhang, Q. Zhang, S. Bi, S. Fang, and J. Dai, ``Efficient Melody Extraction Based on an Extreme Learning Machine,'' Applied Sciences, Vol. 10, No. 7, pp. 1-15, 2020.DOI
8 
S. Kum, and J. Nam, ``Joint Detection and Classification of Singing Voice Melody Using Convolutional Recurrent Neural Networks,'' Applied Sciences, Vol. 9, No. 7, pp. 1-17, 2019.DOI
9 
D. Prabakaran, and S. Sriuppili, ``Speech Processing: MFCC Based Feature Extraction Techniques- An Investigation,'' Journal of Physics: Conference Series, Vol. 1717, pp. 1-8, 2021.DOI
10 
K. Balachandra, N. Kumar, T. Shukla, Swati, and S. Kumar, ``Music Genre Classification for Indian Music Genres,'' International Journal for Research in Applied Science & Engineering Technology, Vol. 9, pp. 1756-1762, 2021.DOI
11 
S. Rajesh, and N. J. Nalini, ``Combined Evidence of MFCC and CRP Features Using Machine Learning Algorithms for Singer Identification,'' International Journal of Pattern Recognition and Artificial Intelligence, Vol. 35, No. 1, pp. 2158001.1-2158001.21, 2020.DOI
12 
K. Paripremkul, and O. Sornil, ``Segmenting Words in Thai Language Using Minimum Text Units and Conditional Random Field,'' Journal of Advances in Information Technology, Vol. 2021, No. 2, pp. 135-141, 2021.DOI
13 
G. Celik, and M. F. Talu, ``A new 3D MRI segmentation method based on Generative Adversarial Network and Atrous Convolution,'' Biomedical Signal Processing and Control, Vol. 71, pp. 1-10, 2022.DOI
14 
T. A. Nguyen, and J. Lee, ``Modified Viterbi Algorithm with Feedback Using a Two-Dimensional 3-Way Generalized Partial Response Target for Bit-Patterned Media Recording Systems,'' Applied Sciences, Vol. 11, No. 2, pp. 1-23, 2021.DOI
15 
P. Rengaswamy, K. Reddy, K. S. Rao, and P. Dasgupta, ``Robust f0 extraction from monophonic signals using adaptive sub-band filtering,'' Speech Communication, Vol. 116, pp. 77-85, 2020.DOI
16 
University of Illinois at Urbana-Champaign. Audio Melody Extraction[EB/OL].URL
17 
T. H. Hsieh, L. Su, and Y. H. Yang, ``A streamlined encoder/decoder architecture for melody extraction,'' in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 156-160, 2019.DOI
18 
S. Yu, X. Sun, Y. Yu, and W. Li, ``Frequency-temporal attention network for singing melody extraction,'' in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 251-255, 2021.DOI

Author

Shengnan Liu
../../Resources/ieie/IEIESPC.2024.13.4.322/au1.png

Shengnan Liu was born in Kaifeng, Henan, P.R. China, in 1981. She received her master's degree from Henan University, P.R. China. She currently works in the School of Education at Shandong Women's University. Her research interests include Chinese national vocal music, music education, the singing method of Chinese national vocal music, and the communication of music culture.

Xu Wang
../../Resources/ieie/IEIESPC.2024.13.4.322/au2.png

Xu Wang was born in Anshan, Liaoning, P.R. China. He graduated from the City University of Macau with a doctorate in data science and joined the School of Economics of Shandong Women's University in April 2022. His current research focuses on Bayesian networks, intelligence optimization, and financial big data.