3.1 Experimental Setup
The experiments were performed using MATLAB 2019 and Python 3.6, and the algorithm was implemented on the TensorFlow framework. The CNN-CRF model was trained with the Adam optimizer, using a learning rate of 0.001, a batch size of 20, and a maximum of 100 iterations. Overfitting was mitigated by adding a dropout layer with a ratio of 0.5 after the output, and the rectified linear unit (ReLU) was used as the activation function (see the configuration sketch after Table 1). In the experiment, the large MIR1K dataset [15] was used as the training and validation set to help the model learn more data patterns and features. The independent ADC2004 and MIREX05 datasets, which the model did not encounter during training or validation, were used as the test sets to evaluate the model performance objectively. Table 1 lists the three datasets.
(1) MIR1K includes 110 songs performed by 19 amateur singers, with each clip lasting 4-13 s.
(2) ADC2004 includes 20 pop music clips, including jazz and R&B, each lasting approximately 20 s.
(3) MIREX05 includes 13 clips of pop and pure (instrumental) music, each lasting 24-39 s.
Table 1. Experimental Data Set.
Data set category | Name
Training set | 70% of MIR1K
Validation set | 30% of MIR1K
Test set | ADC2004, MIREX05
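For reference, the training setup above can be expressed as a minimal TensorFlow sketch. The network topology, input shape, and number of output classes below are illustrative assumptions (the paper specifies only the optimizer, learning rate, batch size, iteration count, dropout ratio, and activation function), and the CRF layer is omitted for brevity.

import tensorflow as tf

NUM_CLASSES = 61            # hypothetical number of pitch classes
INPUT_SHAPE = (128, 32, 1)  # hypothetical time-frequency input patch

# Minimal CNN sketch: ReLU activations and a dropout layer with ratio
# 0.5, matching the hyperparameters reported in Sec. 3.1.
model = tf.keras.Sequential([
    tf.keras.Input(shape=INPUT_SHAPE),
    tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu"),
    tf.keras.layers.MaxPooling2D(2),
    tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dropout(0.5),   # dropout ratio 0.5, as in the paper
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # lr = 0.001
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

# Training with batch size 20 and a maximum of 100 iterations (epochs),
# assuming hypothetical arrays x_train, y_train, x_val, y_val:
# model.fit(x_train, y_train, batch_size=20, epochs=100,
#           validation_data=(x_val, y_val))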
3.2 Evaluation Indicators
The algorithms were evaluated based on the methodology described elsewhere [16], as shown in Table 2.
Table 2. Evaluation Indicators.
Table 3. Explanation of the Parameters in Table 2.
Test results \ Ground truth | With melody | Without melody | Total
With melody | TP | FP | DV
Without melody | FN | TN | DU
Total | GV | GU | TO
Table 3 lists the explanations of the parameters in Table 2. According to Table 3, the parameters are defined as follows.
TP: a melody is present and is correctly detected;
TN: no melody is present, and this is correctly detected;
FP: no melody is present, but one is incorrectly detected;
FN: a melody is present, but it is not detected;
DV: the total detected as having a melody (TP + FP);
DU: the total detected as having no melody (FN + TN);
GV: the total with a melody in the ground truth (TP + FN);
GU: the total without a melody in the ground truth (FP + TN);
TO: the overall total.
In the RPA and RCA, $c$ refers to the pitch, and $ch$ refers to the chroma (pitch class).
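Since Table 2 follows the standard frame-based melody-extraction methodology of [16], the indicators can be summarized as follows. This is a sketch assuming the conventional formulations: the half-semitone tolerance and the symbol $TP_{pitch}$ (voiced frames whose pitch is also correct) are conventions assumed here rather than values taken from the paper.

$$\mathrm{VR} = \frac{TP}{GV}, \qquad \mathrm{VFA} = \frac{FP}{GU}, \qquad \mathrm{OA} = \frac{TP_{pitch} + TN}{TO},$$

$$\mathrm{RPA} = \frac{\#\{\text{voiced frames}: |c_{est} - c_{gt}| \le 0.5\ \text{semitone}\}}{GV}, \qquad \mathrm{RCA} = \frac{\#\{\text{voiced frames}: |ch_{est} - ch_{gt}| \le 0.5\ \text{semitone}\}}{GV}.$$

Under these definitions, the RCA differs from the RPA only in that octave errors are forgiven by mapping both pitches to their chroma before comparison.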
3.3 Analysis of Results
Under the same conditions, the effects of the features selected in this paper on the
main melody extraction results were analyzed, as shown in Fig. 1.
Fig. 1. Effect of the features on the main melody extraction results (ADC2004 dataset).
The main melody extraction algorithm achieved an OA and VFA of 85.74% and 7.43%, respectively, when using only the MFCC as the input feature for the CNN-CRF algorithm (Fig. 1). When the chroma was used exclusively as the input feature, the OA and VFA were 83.62% and 7.26%, respectively. Thus, using the chroma alone lowered the OA, although the VFA also decreased slightly. On the other hand, the RCA, RPA, and VR improved when using the chroma, because the chroma is a pitch-related feature. When the MFCC and chroma were used simultaneously as inputs to the CNN-CRF algorithm, the OA reached 86.72%, a 0.98% improvement over using the MFCC alone and a 3.10% improvement over using the chroma alone. Moreover, the VFA fell to 6.84%, a 0.59% reduction compared to using the MFCC alone and a 0.42% reduction compared to using the chroma alone. The RCA remained high at 85.76%, slightly lower than when using the chroma alone; nevertheless, both the RPA and VR improved.
Fig. 2 presents the results for the MIREX05 dataset.
Fig. 2. Effect of the features on main melody extraction results (MIREX05 dataset).
When only MFCC was used, the algorithm achieved an OA and VFA of 83.56% and 8.14%,
respectively. On the other hand, when only chroma was used, the OA was slightly lower
at 82.34%, but the VFA was higher at 9.33%. Hence, the MFCC was more accurate in extracting
the main melody in the MIREX05 dataset. This performance difference may be attributed to the MIREX05 dataset containing both pop and pure (instrumental) music: the MFCC was more sensitive to vocals, making it better at distinguishing pitch in the vocal content.
Similar to the ADC2004 dataset, chroma performed better regarding RCA with a value
of 84.06%. When both MFCC and chroma were used, the CNN-CRF algorithm achieved an
OA and VFA of 85.21% and 11.16%, respectively, on the MIREX05 dataset. In addition, it demonstrated strong performance on the RCA, RPA, and VR, with values of 83.91%, 82.56%, and 86.33%, respectively. Overall, including the MFCC and chroma as input features improved the algorithm's OA. Although the VFA showed a slight increase, the enhanced performance on the RCA, RPA, and VR underscores the valuable contribution of vocal information features to the main melody extraction in this study.
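For illustration, the joint input representation can be built by stacking the two feature streams frame by frame. The following is a minimal sketch using librosa; the number of MFCC coefficients, the hop length, and the file path are illustrative assumptions, not parameters reported in the paper.

import librosa
import numpy as np

# Load a hypothetical audio clip at its native sampling rate.
y, sr = librosa.load("clip.wav", sr=None)

# Extract MFCC and chroma with a shared hop length so that both
# feature matrices have the same number of frames.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=512)
chroma = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=512)

# Stack along the feature dimension to form the joint input.
features = np.concatenate([mfcc, chroma], axis=0)  # shape: (20 + 12, n_frames)
print(features.shape)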
The effectiveness of the CNN-CRF algorithm was evaluated by a comparison with other
algorithms as follows:
(1) SegNet [17]: a model based on the encoder-decoder structure that locates melodic frequencies
through pooling indices;
(2) frequency-temporal attention network (FTANet) [18]: a neural network based on a time-frequency attention structure that analyzes temporal patterns using the time attention module and selects the relevant frequency bands using the frequency attention module.
Table 4 lists the results of the two datasets.
The bold data in Table 4 are the optimal values. According to Table 4, on the ADC2004 dataset, the SegNet algorithm exhibited the highest VR at 88.24%; however, its VFA and OA, at 19.93% and 82.72%, respectively, were the worst of the three algorithms. The FTANet
algorithm showed a significantly lower VFA, only 10.53%, and achieved an OA of 84.99%.
In contrast, the CNN-CRF algorithm achieved the lowest VFA, 6.84%, which was 13.09% lower than that of the SegNet algorithm and 3.69% lower than that of the FTANet algorithm. Its OA was the highest at 86.72%, a 4.00% improvement over the SegNet algorithm and a 1.73% improvement over the FTANet algorithm.
On the MIREX05 dataset, the CNN-CRF algorithm achieved the best values across all
indicators. It achieved a VFA of 11.16%, which was 8.81% lower than the SegNet algorithm
and 9.45% lower than the FTANet algorithm. Furthermore, it attained an OA of 85.21%,
8.92% higher than the SegNet algorithm and 7.01% higher than the FTANet algorithm.
These results demonstrate the advantage of the CNN-CRF algorithm in extracting main
melodies.
In summary, the CNN-CRF algorithm uses the MFCC and chroma as features to capture the human voice in the audio effectively. Compared to the ADC2004 dataset, the MIREX05 dataset contains a greater abundance of popular music, making the improvements in the results more pronounced. The addition of the chroma significantly improved the RCA of the algorithm, contributing to a higher OA and, on the ADC2004 dataset, a lower VFA.
Table 4. Comparisons with Other Algorithms (%).
Dataset | Algorithm | VR | RPA | RCA | VFA | OA
ADC2004 | SegNet | 88.24 | 83.47 | 85.31 | 19.93 | 82.72
ADC2004 | FTANet | 85.97 | 84.41 | 84.77 | 10.53 | 84.99
ADC2004 | Ours | 86.86 | 85.43 | 85.76 | 6.84 | 86.72
MIREX05 | SegNet | 71.54 | 65.01 | 66.15 | 19.97 | 76.29
MIREX05 | FTANet | 78.74 | 71.87 | 72.75 | 20.61 | 78.20
MIREX05 | Ours | 86.33 | 82.56 | 83.91 | 11.16 | 85.21