1. Introduction
People’s interest in music has grown as society has developed. The piano is an instrument with a rich timbre; it appears in a great many musical works, is widely used in concerts and music festivals, and is deeply loved by listeners. With the development of computer technology, music has become increasingly digitalized, and automatic music transcription (AMT) [1] has been studied in greater depth. AMT refers to using computers to convert the notes in a musical signal into a score [2]. Through AMT, raw audio signals can be converted into symbols that are easier for humans to understand [3], which is the basis for analyzing many kinds of music [4]. This conversion helps people better appreciate music and reduces the burden of manual notation. Furthermore, improving the accuracy of automatic transcription and labeling audio with AMT can also make music search more effective [5], so AMT plays an important role. As technology has developed, many methods have been applied to AMT [6]. Cheuk et al. [7] used two U-nets: the first transcribed the spectrogram into a posteriorgram, and the second converted the posteriorgram back into a spectrogram to achieve AMT. Experiments on three datasets showed that this method was more accurate in note-level transcription. Skoki et al. [8] examined AMT for the sopela and built a pitch prediction model by combining two machine learning algorithms with frequency features, achieving promising transcription accuracy. Kawashima [9] improved automatic transcription accuracy by using convolutional neural networks (CNN) as post-processing before low-rank non-negative matrix factorization and assessed the effectiveness of this method by simulation. Beltran [10] examined the influence of timbre on monophonic transcription using deep saliency models. The experimental results showed that the model was also effective for the polyphonic transcription of non-piano instruments, e.g., the F1 value for low-pitched instruments reached 0.9516. Steiner et al. [11] designed a method based on an echo state network and reported 1.8% and 1.4% improvements in note detection compared to the bidirectional long short-term memory (LSTM) network and the CNN, respectively. Wei et al. [12] proposed a semi-automatic method for automatic drum transcription using audio-to-MIDI (musical instrument digital interface) alignment and demonstrated its effectiveness through experiments. Nakamura et al. [13] designed several Bayesian Markovian score models for transcribing musical rhythms and found through experiments that the method offered good transcription accuracy and computational efficiency. In AMT, transcribing a single note is relatively simple, but the piano is a polyphonic instrument on which multiple notes sound simultaneously, which makes automatic transcription difficult.
The automatic transcription of polyphonic audio remains challenging, and there have been relatively few studies on the automatic transcription of piano audio. Therefore, this paper investigates an automatic transcription algorithm based on the polyphonic characteristics of piano audio to improve transcription performance. Three features were extracted from the piano audio: the short-time Fourier transform (STFT), the constant-Q transform (CQT), and the variable-Q transform (VQT). A CNN was combined with a bidirectional gated recurrent unit (BiGRU) to detect the note start points and the fundamental tone of the polyphone, improving the reliability of automatic transcription. The effectiveness of the method was demonstrated through experiments on the MAPS dataset, providing a new method for AMT. This method can also be applied to the AMT of various instruments and styles of music, offering reliable support for music information retrieval and analysis.
2. Automatic Transcription Algorithm for Piano
2.1 The Polyphonic Characteristics of Piano Audio
In music, the most fundamental unit is the musical note. A musical note refers to
a symbol used to represent different pitches. Each note can be marked with an English
letter, called the "musical alphabet".
The distance between two notes of the same name is called an octave. According to
the twelve-tone equal temperament, an octave is separated into twelve equal parts;
each is called a semitone. Using an octave as an example, Table 1 lists the correspondence between the musical alphabet, syllable name, and numbered
musical notation.
When a single piano key is struck, the lowest-frequency sinusoidal component of the note's signal is called the fundamental tone, and its frequency is called the fundamental frequency. Because different pitches have different fundamental tones, recognizing the fundamental tone identifies the pitch.
On the piano, each semitone corresponds to one key. The fundamental frequency of the pitch produced by the $n$-th of the 88 piano keys is
$$f_{n}=f_{0}\times 2^{\left(n-49\right)/12},$$
where $f_{0}$ stands for the standard pitch of 440 Hz (key 49, note A4). The fundamental frequencies of a standard piano range from 27.5 Hz to 4186.01 Hz.
In the field of digital music, MIDI is used to represent tones. For a standard piano, the relationship between the MIDI value $m$ and the fundamental frequency $f$ is
$$m=69+12\log_{2}\left(\frac{f}{f_{0}}\right).$$
The MIDI numbers corresponding to the 88 piano keys (A0-C8) are 21-108.
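As an illustration of these two relations, the short Python sketch below (not part of the original study; the function names are ours) enumerates all 88 keys, computes each fundamental frequency and MIDI number, and checks the 27.5 Hz to 4186.01 Hz range quoted above.

```python
import math

F0 = 440.0  # standard pitch (key 49, A4), in Hz

def key_to_frequency(n: int) -> float:
    """Fundamental frequency of the n-th piano key (1..88)."""
    return F0 * 2 ** ((n - 49) / 12)

def frequency_to_midi(f: float) -> int:
    """MIDI number of the pitch with fundamental frequency f."""
    return round(69 + 12 * math.log2(f / F0))

freqs = [key_to_frequency(n) for n in range(1, 89)]
midis = [frequency_to_midi(f) for f in freqs]

print(f"{freqs[0]:.2f} Hz .. {freqs[-1]:.2f} Hz")  # 27.50 Hz .. 4186.01 Hz
print(midis[0], midis[-1])                         # 21 108
```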
The polyphonic characteristic of piano audio refers to the sound produced when multiple notes are played on the piano at the same time. Even though the individual notes may start and end at different moments, their sounds overlap, interfere, and resonate, producing a unique timbre. The resulting signal therefore contains the fundamental-frequency information of several notes at once, which makes automatic piano transcription challenging.
Table 1. Piano Notes in an Octave.

| Musical alphabet | C | D | E | F | G | A | B |
| Syllable name | do | re | mi | fa | so | la | ti |
| Numbered musical notation | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
2.2 Analysis of Piano Audio Characteristics
Making piano audio computable requires an analysis of its characteristics. For audio signals, frequency-domain analysis is more effective than time-domain analysis, and the following methods are commonly used.
(1) STFT
The STFT [14] analyzes the time-frequency distribution of local segments of a signal to capture how the signal changes over time. The corresponding calculation formula is
$$STFT\left(m,\omega \right)=\sum_{n=-\infty }^{+\infty }f\left(n\right)w\left(n-m\right)e^{-j\omega n},$$
where $f\left(n\right)$ represents the time-domain signal, $w\left(n\right)$ stands for the window function, and $w\left(n-m\right)$ is the sliding window.
Because of the windowing, the STFT focuses on the local information of the signal and can therefore extract time-frequency information well.
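A minimal sketch of computing an STFT magnitude spectrogram for a piano recording is shown below; it relies on the librosa library, and the file path, frame length, and hop size are illustrative choices rather than values reported in the paper.

```python
import numpy as np
import librosa

# Load a piano recording (the path is a placeholder).
y, sr = librosa.load("piano.wav", sr=None, mono=True)

# Short-time Fourier transform: windowed frames -> complex spectrum.
# n_fft and hop_length are example values, not the paper's settings.
stft = librosa.stft(y, n_fft=2048, hop_length=512, window="hann")

# Magnitude spectrogram used as a time-frequency feature.
stft_mag = np.abs(stft)
print(stft_mag.shape)  # (1 + n_fft // 2, number of frames)
```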
(2) CQT
The STFT uses a fixed window length, and its linearly spaced frequency bins can lead to errors in fundamental frequency recognition. The CQT instead distributes the frequency bins exponentially [15], and the corresponding formula is
$$X^{CQT}\left(k\right)=\frac{1}{N_{k}}\sum_{n=0}^{N_{k}-1}f\left(n\right)w_{N_{k}}\left(n\right)e^{-j2\pi Qn/N_{k}},\quad N_{k}=\left\lceil \frac{Qf_{s}}{f_{k}}\right\rceil ,\quad Q=\frac{1}{2^{1/b}-1},\quad f_{k}=f_{min}\cdot 2^{k/b},$$
where
$f\left(n\right)$: signal sequence,
$k$: frequency point index of the CQT spectrum,
$w_{{N_{k}}}\left(n\right)$: a window function with a length of $N_{k}$,
$Q$: the constant factor of the CQT,
$f_{s}$: sampling frequency,
$f_{k}$: the central frequency of the $k$-th spectral line in the CQT spectrum,
$f_{min}$: the lowest frequency,
$b$: number of frequency points within each octave.
(3) VQT
The VQT introduces a parameter $\gamma$ [16] to enhance the time resolution of the time-frequency representation. Its central frequency distribution is the same as that of the CQT, and the bandwidth $B_{k}$ of frequency band $k$ is related to the central frequency by
$$B_{k}=\alpha f_{k}+\gamma ,$$
where $\alpha$ is a constant determined by the number of bins per octave $b$. When $\gamma =0$, the VQT reduces to the CQT; when $\gamma >0$, the VQT retains a frequency resolution comparable to the CQT while its time resolution at low frequencies is improved significantly. Therefore, the spectral characteristics obtained from the signal by the VQT are better.
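The sketch below, again using librosa, shows one plausible way to extract CQT and VQT features over the 88-key piano range; the bins-per-octave and gamma settings are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import librosa

y, sr = librosa.load("piano.wav", sr=None, mono=True)  # placeholder path

fmin = librosa.note_to_hz("A0")  # 27.5 Hz, lowest piano key
n_bins = 88                      # one bin per semitone over the piano range
bins_per_octave = 12

# Constant-Q transform: geometrically spaced frequency bins.
cqt = np.abs(librosa.cqt(y, sr=sr, fmin=fmin, n_bins=n_bins,
                         bins_per_octave=bins_per_octave))

# Variable-Q transform: gamma > 0 widens low-frequency bands,
# improving time resolution there; gamma = 0 would reproduce the CQT.
vqt = np.abs(librosa.vqt(y, sr=sr, fmin=fmin, n_bins=n_bins,
                         bins_per_octave=bins_per_octave, gamma=20.0))

print(cqt.shape, vqt.shape)  # (88, frames) each
```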
2.3 CNN-based Automatic Transcription Algorithm
CNN is an important component in deep learning that performs well in image recognition
and other fields [17]. In automatic piano transcription, there are mainly three tasks that need to be accomplished:
(1) detection of the start point of musical notes,
(2) detection of the endpoint of musical notes,
(3) detection of the fundamental tone of the polyphone.
The start and end points of piano audio have distinct amplitude changes in the spectrogram.
CNN is suitable for extracting spatial structural features and has good generalization
ability. Therefore, this study analyzed the automatic transcription of piano audio
based on CNN.
A CNN uses convolutional layers as its core and extracts features through convolution operations; rich features can be obtained with multiple convolutional kernels. The pooling layer down-samples and discards unimportant information to reduce the computational complexity. In the activation layer, nonlinear functions are used, and $relu$ in particular alleviates the problem of gradient disappearance. The commonly used functions are
$$sigmoid\left(x\right)=\frac{1}{1+e^{-x}},\quad tanh\left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}},\quad relu\left(x\right)=\max \left(0,x\right).$$
$sigmoid$ and $tanh$ are often used for fully connected layers, while $relu$ is often used for convolutional layers.
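For completeness, a small NumPy illustration of the three activation functions just listed (our own helper definitions, not code from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```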
This paper combines the gated recurrent unit (GRU) with the CNN to capture the dependencies between preceding and following frames of the sequence [18]. The GRU simplifies the long short-term memory network and has a simpler structure and higher efficiency than the LSTM. The GRU trains the network mainly through a reset gate and an update gate. The update gate is computed as
$$z_{t}=\sigma \left(W_{z}x_{t}+U_{z}h_{t-1}\right),$$
where
$x_{t}$: the current input vector,
$W_{z}$ and $U_{z}$: weight matrices,
$h_{t-1}$: the hidden state from the previous moment,
$\sigma$: the $sigmoid$ function.
The reset gate is computed as
$$r_{t}=\sigma \left(W_{r}x_{t}+U_{r}h_{t-1}\right),$$
where $W_{r}$ and $U_{r}$ are the weight matrices of the reset gate. The current memory (candidate hidden state) is $\tilde{h}_{t}=\tanh \left[W_{h}x_{t}+U_{h}\left(r_{t}\odot h_{t-1}\right)\right]$. The hidden state at the current moment is $h_{t}=z_{t}\odot h_{t-1}+\left(1-z_{t}\right)\odot \tilde{h}_{t}$.
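A minimal NumPy sketch of a single GRU step following the equations above (bias terms omitted and weights randomly initialized purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step following the update/reset-gate equations."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev))  # candidate hidden state
    return z_t * h_prev + (1.0 - z_t) * h_cand        # new hidden state

rng = np.random.default_rng(0)
d_in, d_hid = 88, 64                 # e.g., an 88-dimensional spectral frame
x_t = rng.standard_normal(d_in)
h_prev = np.zeros(d_hid)
weights = [rng.standard_normal((d_hid, d_in)) if i % 2 == 0
           else rng.standard_normal((d_hid, d_hid)) for i in range(6)]
h_t = gru_step(x_t, h_prev, *weights)
print(h_t.shape)  # (64,)
```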
A bidirectional GRU (BiGRU) was used to make the extracted piano audio feature information
more accurate. BiGRU includes a forward GRU and a backward GRU, which can model the
input data from both forward and backward directions.
The CNN-BiGRU algorithm was obtained by combining CNN and BiGRU and applied to the
automatic transcription of piano audio, as shown in Fig. 1.
According to Fig. 1, the STFT, CQT, or VQT features extracted from the piano audio signal are used as input to the CNN-BiGRU algorithm to detect the note start and end points as well as the fundamental tone of the polyphone. When the STFT is used as input, only the first segment, 512 dimensions long, is kept because the STFT spectrum is relatively long.
The three CNN-BiGRU models share the same structure and use four convolutional layers. The pooling layers use mean pooling with a stride of 2. All layers except the output layer use the ReLU function, and the output layer uses the sigmoid function. The models differ only in their output layers: for detecting the start and end points of notes, the output layer has a single node, representing the probability that the input representation contains a note start or end point; for detecting the fundamental tone of the polyphone, the output layer has 88 nodes, representing the independent probability of each note being played in the input audio representation.
Fig. 1. Piano automatic transcription algorithm based on CNN-BiGRU.
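The paper does not give full layer dimensions, so the PyTorch sketch below is only one plausible reading of the described architecture (four convolutional layers, mean pooling with stride 2, ReLU activations, a BiGRU, and a sigmoid output with 88 nodes for polyphonic fundamental tone detection); channel counts and hidden sizes are our assumptions.

```python
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    """CNN front-end + bidirectional GRU with a sigmoid output per node."""

    def __init__(self, n_bins: int = 88, out_nodes: int = 88, hidden: int = 128):
        super().__init__()
        # Four convolutional layers over (frequency, time), each followed by
        # ReLU and mean pooling with stride 2 along the frequency axis.
        chans = [1, 16, 32, 64, 64]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.AvgPool2d(kernel_size=(2, 1), stride=(2, 1))]
        self.cnn = nn.Sequential(*layers)
        feat_dim = chans[-1] * (n_bins // 16)  # frequency axis halved 4 times
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_nodes)  # 1 node for onset/offset, 88 for pitch

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_bins, n_frames)
        x = self.cnn(spec)                              # (batch, C, n_bins // 16, n_frames)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # frame-wise feature vectors
        x, _ = self.bigru(x)                            # (batch, n_frames, 2 * hidden)
        return torch.sigmoid(self.out(x))               # per-frame probabilities

model = CNNBiGRU(out_nodes=88)
probs = model(torch.randn(2, 1, 88, 100))  # e.g., VQT input with 100 frames
print(probs.shape)                         # torch.Size([2, 100, 88])
```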
3. Results and Analysis
3.1 Experimental Dataset
The experimental data come from the MAPS dataset [19]. Each audio file is accompanied by a ground-truth file that annotates the start time, end time, and MIDI number of every note. An example is as follows:
[0.336977, 0.510340): 69
[0.518616, 0.622360): 72
[0.635011, 0.738756): 76
[0.751406, 0.855150): 77
[0.867801, 1.098953): 69
[1.105229, 1.379819): 72
[1.392470, 1.666060): 74
[1.678711, 1.952301): 76
[1.964952, 2.346750): 64
...
The values in brackets indicate the start and end times of the note, and the number at the end of the line is the MIDI number of the pitch; e.g., "69" at the end of the first line means that the MIDI number is 69, which is the note A4.
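A small Python sketch of how such annotation lines could be parsed into (onset, offset, MIDI) tuples; it targets the bracketed example format shown above rather than the raw MAPS file layout.

```python
import re
from typing import List, Tuple

LINE = re.compile(r"\[([\d.]+),\s*([\d.]+)\):\s*(\d+)")

def parse_annotations(text: str) -> List[Tuple[float, float, int]]:
    """Parse lines like '[0.336977, 0.510340): 69' into (onset, offset, midi)."""
    notes = []
    for match in LINE.finditer(text):
        onset, offset, midi = match.groups()
        notes.append((float(onset), float(offset), int(midi)))
    return notes

example = """[0.336977, 0.510340): 69
[0.518616, 0.622360): 72"""
print(parse_annotations(example))
# [(0.336977, 0.51034, 69), (0.518616, 0.62236, 72)]
```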
The MAPS dataset includes nine directories, each containing 30 pieces of piano music.
There are seven directories of synthesized audio. ENSTDKCl (Cl) and ENSTDKAm (Am)
are recordings of real piano performances. In this article, Cl and Am served as the
test set. There are two combinations for selecting the training set:
① The synthesized audio of the first seven directories + the first 30 seconds of Cl
and Am;
② Only the synthesized audio of the first seven directories.
3.2 Evaluation Indicators
The performance evaluation of the algorithm was based on the confusion matrix (Table 2).
(1) Precision: the proportion of correctly detected notes to all detected notes, $P=TP/\left(TP+FP\right)$;
(2) Recall rate: the proportion of correctly detected notes to the total number of
notes, $R=TP/\left(TP+FN\right)$;
(3) F-measure: the result considering both precision (P) and recall rate (R), $F1=\left(2\times
P\times R\right)/\left(P+R\right)$.
Table 2. Confusion Matrix.

| Confusion matrix | Real value: Positive | Real value: Negative |
| Detection value: Positive | True Positive (TP) | False Positive (FP) |
| Detection value: Negative | False Negative (FN) | True Negative (TN) |
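A short helper illustrating these three indicators from note-level TP/FP/FN counts (our own function with made-up example counts, shown only to make the definitions concrete):

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F-measure from note-level counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical example: 812 correct detections, 187 false detections, 115 misses.
print(prf(812, 187, 115))  # approximately (0.813, 0.876, 0.843)
```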
3.3 Result Analysis
First, the synthesized audio of the first seven directories and the first 30 seconds
of Cl and Am were used as a training set to compare the effect of STFT/CQT/VQT as
an input on the detection performance of note start point. In addition, the CNN-BiGRU
algorithm was compared with the CNN and CNN-GRU algorithms. Table 3 lists the comparison results.
According to Table 3, when using STFT as input, the P values of these algorithms were approximately 75%,
the R values were around 60%, and the F1-measures were below 70%. On the other hand,
when CQT was used as input, the performance of these algorithms was improved to some
extent. For example, the F1-measure of the CNN-BiGRU algorithm was improved by 9.97%
compared to using STFT as input. The comparison of different algorithms showed that
the F1-measure of the CNN-BiGRU algorithm was the highest. Finally, the P value of
the CNN-BiGRU algorithm was 81.26% when using VQT as the input, the R-value was 87.64%,
and the F1-measure was 84.33%, all the highest, demonstrating the effectiveness of
VQT and CNN-BiGRU in detecting the start point of notes.
The performance of different features and algorithms was compared in terms of polyphonic
fundamental tone detection, and the results are displayed in Table 4.
According to Table 4, the F1 values of these algorithms were all below 90% when using the STFT as input.
The CNN algorithm performed the worst in polyphonic fundamental tone detection, with
low P and R values and the lowest F1-measure, only 80.89%. The F1-measure of the CNN-BiGRU
algorithm was 89.25%, which was 8.36% higher than the CNN algorithm. When CQT was
used as the input, these algorithms exhibited improved performance in polyphonic fundamental
tone detection than when STFT was used. The F1-measure of the CNN-BiGRU algorithm
reached 95.88%, 6.63% higher than when STFT was used. Finally, the P and R values
and F1-measures of these algorithms were above 90% when VQT was used as input. The
F1-measure of the CNN-BiGRU algorithm was 97.25%, which was 1.37% higher than when
CQT was used. The results in Table 4 showed that, among the three features (STFT, CQT, and VQT), the VQT was the most effective input for polyphonic fundamental tone detection, and that the CNN-BiGRU algorithm performed better than the CNN and CNN-GRU algorithms in this task.
Detection of the note end points behaved similarly to detection of the start points, and the VQT combined with the CNN-BiGRU again showed the best performance. Finally, the impact of the two training sets on automatic piano transcription was compared. Fig. 2 presents the transcription results under the two training sets, using the VQT as the input and the CNN-BiGRU as the algorithm.
According to Fig. 2, in the automatic transcription of piano audio, the performance of the algorithm
in detecting the note start and end points was not as good as in detecting polyphonic
fundamental tone. The various indicators of the CNN-BiGRU algorithm in detecting polyphonic
fundamental tones reached over 90%, while the indicators for note start and end point
detection were below 90%. In piano audio, some notes may be played with low intensity,
which can cause missed detections and result in poor performance in detecting note
start points.
A comparison of the training sets shows that, with training set ②, the CNN-BiGRU algorithm did not perform as well as with training set ① in detecting note start and end points or polyphonic fundamental tones. Taking polyphonic fundamental tone detection as an example, compared with training set ①, the P value with training set ② decreased by 1.73% (to 95.43%), the R value decreased by 3.18% (to 94.16%), and the F1-measure decreased by 2.46% (to 94.79%). Training set ② contained only synthesized audio and lacked real piano recordings; the algorithm was therefore insufficiently trained and performed worse on the test set.
Fig. 2. Comparison of the automatic transcription results for piano audio.
Table 3. Influence of Different Input Features on the Detection of Note Start Point.

| Input feature | Algorithm | P value/% | R value/% | F1-measure/% |
| STFT | CNN | 75.12 | 57.64 | 65.23 |
| STFT | CNN-GRU | 76.77 | 59.89 | 67.29 |
| STFT | CNN-BiGRU | 77.79 | 60.12 | 67.82 |
| CQT | CNN | 75.61 | 71.27 | 73.38 |
| CQT | CNN-GRU | 77.49 | 74.15 | 75.78 |
| CQT | CNN-BiGRU | 79.12 | 76.51 | 77.79 |
| VQT | CNN | 76.25 | 83.55 | 79.73 |
| VQT | CNN-GRU | 79.33 | 85.16 | 82.14 |
| VQT | CNN-BiGRU | 81.26 | 87.64 | 84.33 |
Table 4. Impact of Different Features as Input on the Effectiveness of Polyphonic Fundamental Tone Detection.

| Input feature | Algorithm | P value/% | R value/% | F1-measure/% |
| STFT | CNN | 81.67 | 80.12 | 80.89 |
| STFT | CNN-GRU | 85.32 | 82.07 | 83.66 |
| STFT | CNN-BiGRU | 89.86 | 88.64 | 89.25 |
| CQT | CNN | 88.77 | 87.64 | 88.20 |
| CQT | CNN-GRU | 91.67 | 92.13 | 91.90 |
| CQT | CNN-BiGRU | 95.64 | 96.12 | 95.88 |
| VQT | CNN | 90.07 | 91.22 | 90.64 |
| VQT | CNN-GRU | 93.21 | 91.36 | 92.28 |
| VQT | CNN-BiGRU | 97.16 | 97.34 | 97.25 |
|