1. Introduction
People’s interest in music has grown as society has developed. The piano is an instrument with a rich timbre; it appears in a great many musical works, is widely used in concerts and music festivals, and is deeply loved by listeners. With the development of computer technology, music has become increasingly digitalized, and automatic music transcription (AMT) [1] has been studied in greater depth. AMT refers to using computers to convert the notes in a musical signal into a score [2]. Through AMT, raw audio signals can be converted into symbols that are easier for humans to understand [3], which is the basis for analyzing many kinds of music [4]. This conversion helps people better appreciate music and reduces the burden of manual notation. Furthermore, improving the accuracy of automatic transcription and labeling audio with AMT can also make music search more effective [5], so AMT plays an important role. As technology has developed, many methods have been applied to AMT [6]. Cheuk et al. [7] used two U-nets: the first transcribed the spectrogram into a posteriorgram, and the second converted the posteriorgram back into a spectrogram to achieve AMT. Experiments on three datasets showed that this method was more accurate in note-level transcription. Skoki et al. [8] examined AMT for the sopela and built a pitch prediction model by combining two machine learning algorithms with frequency features, achieving promising transcription accuracy. Kawashima [9] improved automatic transcription accuracy by using convolutional neural networks (CNN) as post-processing before low-rank non-negative matrix factorization and assessed the effectiveness of this method by simulation. Beltran [10] examined the influence of timbre on monophonic transcription using deep saliency models. The experimental results showed that the model was also effective for the polyphonic transcription of non-piano instruments, e.g., the F1 value for low-pitched instruments reached 0.9516. Steiner et al. [11] designed a method based on an echo state network and reported 1.8% and 1.4% improvements in note detection compared to the bidirectional long short-term memory (LSTM) network and the CNN, respectively. Wei et al. [12] proposed a semi-automatic method for automatic drum transcription using audio-to-MIDI (musical instrument digital interface) alignment and demonstrated its effectiveness through experiments. Nakamura et al. [13] designed several Bayesian Markovian score models for transcribing musical rhythms and found through experiments that the method offered good transcription accuracy and computational efficiency. In AMT, transcribing a single note is relatively simple, but the piano is a polyphonic instrument on which multiple notes sound simultaneously, which makes automatic transcription difficult.
The automatic transcription of polyphonic audio remains challenging, and there have been relatively few studies on the automatic transcription of piano audio. Therefore, this paper investigates an automatic transcription algorithm based on the polyphonic characteristics of piano audio to improve transcription performance. Three features were extracted from the piano audio: the short-time Fourier transform (STFT), the constant-Q transform (CQT), and the variable-Q transform (VQT). A CNN was combined with a bidirectional gated recurrent unit (BiGRU) to detect the note start points and the fundamental tone of the polyphone, improving the reliability of automatic transcription. The effectiveness of the method was demonstrated through experiments on the MAPS dataset, providing a new method for AMT. This method can also be applied to the AMT of various instruments and styles of music, offering reliable support for music information retrieval and analysis.
2. Automatic Transcription Algorithm for Piano
2.1 The Polyphonic Characteristics of Piano Audio
In music, the most fundamental unit is the musical note. A musical note refers to
a symbol used to represent different pitches. Each note can be marked with an English
letter, called the "musical alphabet".
The distance between two notes of the same name is called an octave. According to
the twelve-tone equal temperament, an octave is separated into twelve equal parts;
each is called a semitone. Using an octave as an example, Table 1 lists the correspondence between the musical alphabet, syllable name, and numbered
musical notation.
When a single piano key is struck, the lowest-frequency sinusoidal component of the note's signal is called the fundamental tone, and its frequency is called the fundamental frequency. Because different pitches have different fundamental tones, recognizing the fundamental tone identifies the pitch.
On the piano, each semitone corresponds to one key. The fundamental frequency of the pitch produced by the $n$-th of the 88 piano keys is
$$f_{n}=f_{0}\times 2^{\left(n-49\right)/12},$$
where $f_{0}$ stands for the standard pitch of 440 Hz (key 49, note A4). The fundamental frequencies of a standard piano range from 27.5 Hz to 4186.01 Hz.
In the field of digital music, MIDI is used to represent tones. For a standard piano, the relationship between the MIDI value $m$ and the fundamental frequency $f$ is
$$m=69+12\log_{2}\left(\frac{f}{f_{0}}\right).$$
The MIDI numbers corresponding to the 88 piano keys (A0-C8) are 21-108.
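As an illustration of these two relations, the short Python sketch below (not part of the original study; the function names are ours) enumerates all 88 keys, computes each fundamental frequency and MIDI number, and checks the 27.5 Hz to 4186.01 Hz range quoted above.

```python
import math

F0 = 440.0  # standard pitch (key 49, A4), in Hz

def key_to_frequency(n: int) -> float:
    """Fundamental frequency of the n-th piano key (1..88)."""
    return F0 * 2 ** ((n - 49) / 12)

def frequency_to_midi(f: float) -> int:
    """MIDI number of the pitch with fundamental frequency f."""
    return round(69 + 12 * math.log2(f / F0))

freqs = [key_to_frequency(n) for n in range(1, 89)]
midis = [frequency_to_midi(f) for f in freqs]

print(f"{freqs[0]:.2f} Hz .. {freqs[-1]:.2f} Hz")  # 27.50 Hz .. 4186.01 Hz
print(midis[0], midis[-1])                         # 21 108
```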
The polyphonic characteristic of piano audio refers to the sound produced when multiple notes are played on the piano at the same time. Even though the individual notes may start and end at different moments, their sounds overlap, interfere, and resonate, producing a unique timbre. The resulting signal therefore contains the fundamental-frequency information of several notes at once, which makes automatic piano transcription challenging.
Table 1. Piano Notes in an Octave.

| Musical alphabet | C | D | E | F | G | A | B |
| Syllable name | do | re | mi | fa | so | la | ti |
| Numbered musical notation | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
2.2 Analysis of Piano Audio Characteristics
Making piano audio computable requires an analysis of its characteristics. For audio signals, frequency-domain analysis is more effective than time-domain analysis, and the following methods are commonly used.
(1) STFT
The STFT [14] analyzes the time-frequency distribution of local segments of a signal to capture how the signal changes over time. The corresponding calculation formula is
$$STFT\left(m,\omega \right)=\sum_{n=-\infty }^{+\infty }f\left(n\right)w\left(n-m\right)e^{-j\omega n},$$
where $f\left(n\right)$ represents the time-domain signal, $w\left(n\right)$ stands for the window function, and $w\left(n-m\right)$ is the sliding window.
Because of the windowing, the STFT focuses on the local information of the signal and can therefore extract time-frequency information well.
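A minimal sketch of computing an STFT magnitude spectrogram for a piano recording is shown below; it relies on the librosa library, and the file path, frame length, and hop size are illustrative choices rather than values reported in the paper.

```python
import numpy as np
import librosa

# Load a piano recording (the path is a placeholder).
y, sr = librosa.load("piano.wav", sr=None, mono=True)

# Short-time Fourier transform: windowed frames -> complex spectrum.
# n_fft and hop_length are example values, not the paper's settings.
stft = librosa.stft(y, n_fft=2048, hop_length=512, window="hann")

# Magnitude spectrogram used as a time-frequency feature.
stft_mag = np.abs(stft)
print(stft_mag.shape)  # (1 + n_fft // 2, number of frames)
```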
(2) CQT
The STFT uses a fixed window length, and its linearly spaced frequency bins can lead to errors in fundamental frequency recognition. The CQT instead distributes the frequency bins exponentially [15], and the corresponding formula is
$$X^{CQT}\left(k\right)=\frac{1}{N_{k}}\sum_{n=0}^{N_{k}-1}f\left(n\right)w_{N_{k}}\left(n\right)e^{-j2\pi Qn/N_{k}},\quad N_{k}=\left\lceil \frac{Qf_{s}}{f_{k}}\right\rceil ,\quad Q=\frac{1}{2^{1/b}-1},\quad f_{k}=f_{min}\cdot 2^{k/b},$$
where
$f\left(n\right)$: signal sequence,
$k$: frequency point index of the CQT spectrum,
$w_{{N_{k}}}\left(n\right)$: a window function with a length of $N_{k}$,
$Q$: the constant factor of the CQT,
$f_{s}$: sampling frequency,
$f_{k}$: the central frequency of the $k$-th spectral line in the CQT spectrum,
$f_{min}$: the lowest frequency,
$b$: number of frequency points within each octave.
(3) VQT
The VQT introduces a parameter $\gamma$ [16] to enhance the time resolution of the time-frequency representation. Its central frequency distribution is the same as that of the CQT, and the bandwidth $B_{k}$ of frequency band $k$ is related to the central frequency by
$$B_{k}=\alpha f_{k}+\gamma ,$$
where $\alpha$ is a constant determined by the number of bins per octave $b$. When $\gamma =0$, the VQT reduces to the CQT; when $\gamma >0$, the VQT retains a frequency resolution comparable to the CQT while its time resolution at low frequencies is improved significantly. Therefore, the spectral characteristics obtained from the signal by the VQT are better.
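The sketch below, again using librosa, shows one plausible way to extract CQT and VQT features over the 88-key piano range; the bins-per-octave and gamma settings are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
import librosa

y, sr = librosa.load("piano.wav", sr=None, mono=True)  # placeholder path

fmin = librosa.note_to_hz("A0")  # 27.5 Hz, lowest piano key
n_bins = 88                      # one bin per semitone over the piano range
bins_per_octave = 12

# Constant-Q transform: geometrically spaced frequency bins.
cqt = np.abs(librosa.cqt(y, sr=sr, fmin=fmin, n_bins=n_bins,
                         bins_per_octave=bins_per_octave))

# Variable-Q transform: gamma > 0 widens low-frequency bands,
# improving time resolution there; gamma = 0 would reproduce the CQT.
vqt = np.abs(librosa.vqt(y, sr=sr, fmin=fmin, n_bins=n_bins,
                         bins_per_octave=bins_per_octave, gamma=20.0))

print(cqt.shape, vqt.shape)  # (88, frames) each
```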
2.3 CNN-based Automatic Transcription Algorithm
CNN is an important component in deep learning that performs well in image recognition
and other fields [17]. In automatic piano transcription, there are mainly three tasks that need to be accomplished:
(1) detection of the start point of musical notes,
(2) detection of the endpoint of musical notes,
(3) detection of the fundamental tone of the polyphone.
The start and end points of piano audio have distinct amplitude changes in the spectrogram.
CNN is suitable for extracting spatial structural features and has good generalization
ability. Therefore, this study analyzed the automatic transcription of piano audio
based on CNN.
A CNN uses convolutional layers as its core and extracts features through convolution operations; rich features can be obtained with multiple convolutional kernels. The pooling layer down-samples and discards unimportant information to reduce the computational complexity. In the activation layer, nonlinear functions are used, and $relu$ in particular alleviates the problem of gradient disappearance. The commonly used functions are
$$sigmoid\left(x\right)=\frac{1}{1+e^{-x}},\quad tanh\left(x\right)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}},\quad relu\left(x\right)=\max \left(0,x\right).$$
$sigmoid$ and $tanh$ are often used for fully connected layers, while $relu$ is often used for convolutional layers.
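For completeness, a small NumPy illustration of the three activation functions just listed (our own helper definitions, not code from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(sigmoid(x), tanh(x), relu(x))
```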
This paper combines the gated recurrent unit (GRU) with the CNN to capture the dependencies between preceding and following frames of the sequence [18]. The GRU simplifies the long short-term memory network and has a simpler structure and higher efficiency than the LSTM. The GRU trains the network mainly through a reset gate and an update gate. The update gate is computed as
$$z_{t}=\sigma \left(W_{z}x_{t}+U_{z}h_{t-1}\right),$$
where
$x_{t}$: the current input vector,
$W_{z}$ and $U_{z}$: weight matrices,
$h_{t-1}$: the hidden state from the previous moment,
$\sigma$: the $sigmoid$ function.
The reset gate is computed as
$$r_{t}=\sigma \left(W_{r}x_{t}+U_{r}h_{t-1}\right),$$
where $W_{r}$ and $U_{r}$ are the weight matrices of the reset gate. The current memory (candidate hidden state) is $\tilde{h}_{t}=\tanh \left[W_{h}x_{t}+U_{h}\left(r_{t}\odot h_{t-1}\right)\right]$. The hidden state at the current moment is $h_{t}=z_{t}\odot h_{t-1}+\left(1-z_{t}\right)\odot \tilde{h}_{t}$.
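A minimal NumPy sketch of a single GRU step following the equations above (bias terms omitted and weights randomly initialized purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step following the update/reset-gate equations."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev)             # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev)             # reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r_t * h_prev))  # candidate hidden state
    return z_t * h_prev + (1.0 - z_t) * h_cand        # new hidden state

rng = np.random.default_rng(0)
d_in, d_hid = 88, 64                 # e.g., an 88-dimensional spectral frame
x_t = rng.standard_normal(d_in)
h_prev = np.zeros(d_hid)
weights = [rng.standard_normal((d_hid, d_in)) if i % 2 == 0
           else rng.standard_normal((d_hid, d_hid)) for i in range(6)]
h_t = gru_step(x_t, h_prev, *weights)
print(h_t.shape)  # (64,)
```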
A bidirectional GRU (BiGRU) was used to make the extracted piano audio feature information
more accurate. BiGRU includes a forward GRU and a backward GRU, which can model the
input data from both forward and backward directions.
The CNN-BiGRU algorithm was obtained by combining CNN and BiGRU and applied to the
automatic transcription of piano audio, as shown in Fig. 1.
According to Fig. 1, the STFT, CQT, or VQT features extracted from the piano audio signal are used as input to the CNN-BiGRU algorithm to detect the note start and end points as well as the fundamental tone of the polyphone. When the STFT is used as input, only the first segment, 512 dimensions long, is kept because the STFT spectrum is relatively long.
The three CNN-BiGRU models share the same structure and use four convolutional layers. The pooling layers use mean pooling with a stride of 2. All layers except the output layer use the ReLU function, and the output layer uses the sigmoid function. The models differ only in their output layers: for detecting the start and end points of notes, the output layer has a single node, representing the probability that the input representation contains a note start or end point; for detecting the fundamental tone of the polyphone, the output layer has 88 nodes, representing the independent probability of each note being played in the input audio representation.
Fig. 1. Piano automatic transcription algorithm based on CNN-BiGRU.
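The paper does not give full layer dimensions, so the PyTorch sketch below is only one plausible reading of the described architecture (four convolutional layers, mean pooling with stride 2, ReLU activations, a BiGRU, and a sigmoid output with 88 nodes for polyphonic fundamental tone detection); channel counts and hidden sizes are our assumptions.

```python
import torch
import torch.nn as nn

class CNNBiGRU(nn.Module):
    """CNN front-end + bidirectional GRU with a sigmoid output per node."""

    def __init__(self, n_bins: int = 88, out_nodes: int = 88, hidden: int = 128):
        super().__init__()
        # Four convolutional layers over (frequency, time), each followed by
        # ReLU and mean pooling with stride 2 along the frequency axis.
        chans = [1, 16, 32, 64, 64]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.ReLU(),
                       nn.AvgPool2d(kernel_size=(2, 1), stride=(2, 1))]
        self.cnn = nn.Sequential(*layers)
        feat_dim = chans[-1] * (n_bins // 16)  # frequency axis halved 4 times
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, out_nodes)  # 1 node for onset/offset, 88 for pitch

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, n_bins, n_frames)
        x = self.cnn(spec)                              # (batch, C, n_bins // 16, n_frames)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)  # frame-wise feature vectors
        x, _ = self.bigru(x)                            # (batch, n_frames, 2 * hidden)
        return torch.sigmoid(self.out(x))               # per-frame probabilities

model = CNNBiGRU(out_nodes=88)
probs = model(torch.randn(2, 1, 88, 100))  # e.g., VQT input with 100 frames
print(probs.shape)                         # torch.Size([2, 100, 88])
```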
3. Results and Analysis
3.1 Experimental Dataset
The experimental data come from the MAPS dataset [19]. Each audio file is accompanied by a ground-truth file that annotates the start time, end time, and MIDI number of every note. An example is as follows:
[0.336977, 0.510340): 69
[0.518616, 0.622360): 72
[0.635011, 0.738756): 76
[0.751406, 0.855150): 77
[0.867801, 1.098953): 69
[1.105229, 1.379819): 72
[1.392470, 1.666060): 74
[1.678711, 1.952301): 76
[1.964952, 2.346750): 64
...
The values in brackets indicate the start and end times of the note, and the number at the end of the line is the MIDI number of the pitch; e.g., "69" at the end of the first line means that the MIDI number is 69, which is the note A4.
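A small Python sketch of how such annotation lines could be parsed into (onset, offset, MIDI) tuples; it targets the bracketed example format shown above rather than the raw MAPS file layout.

```python
import re
from typing import List, Tuple

LINE = re.compile(r"\[([\d.]+),\s*([\d.]+)\):\s*(\d+)")

def parse_annotations(text: str) -> List[Tuple[float, float, int]]:
    """Parse lines like '[0.336977, 0.510340): 69' into (onset, offset, midi)."""
    notes = []
    for match in LINE.finditer(text):
        onset, offset, midi = match.groups()
        notes.append((float(onset), float(offset), int(midi)))
    return notes

example = """[0.336977, 0.510340): 69
[0.518616, 0.622360): 72"""
print(parse_annotations(example))
# [(0.336977, 0.51034, 69), (0.518616, 0.62236, 72)]
```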
The MAPS dataset includes nine directories, each containing 30 pieces of piano music.
There are seven directories of synthesized audio. ENSTDKCl (Cl) and ENSTDKAm (Am)
are recordings of real piano performances. In this article, Cl and Am served as the
test set. There are two combinations for selecting the training set:
① The synthesized audio of the first seven directories + the first 30 seconds of Cl
and Am;
② Only the synthesized audio of the first seven directories.
3.2 Evaluation Indicators
The performance evaluation of the algorithm was based on the confusion matrix (Table 2).
(1) Precision: the proportion of correctly detected notes to all detected notes, $P=TP/\left(TP+FP\right)$;
(2) Recall rate: the proportion of correctly detected notes to the total number of
notes, $R=TP/\left(TP+FN\right)$;
(3) F-measure: the result considering both precision (P) and recall rate (R), $F1=\left(2\times
P\times R\right)/\left(P+R\right)$.
Table 2. Confusion Matrix.

| Confusion matrix | Real value: Positive | Real value: Negative |
| Detection value: Positive | True Positive (TP) | False Positive (FP) |
| Detection value: Negative | False Negative (FN) | True Negative (TN) |
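A short helper illustrating these three indicators from note-level TP/FP/FN counts (our own function with made-up example counts, shown only to make the definitions concrete):

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall, and F-measure from note-level counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical example: 812 correct detections, 187 false detections, 115 misses.
print(prf(812, 187, 115))  # approximately (0.813, 0.876, 0.843)
```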
3.3 Result Analysis
First, the synthesized audio of the first seven directories and the first 30 seconds
of Cl and Am were used as a training set to compare the effect of STFT/CQT/VQT as
an input on the detection performance of note start point. In addition, the CNN-BiGRU
algorithm was compared with the CNN and CNN-GRU algorithms. Table 3 lists the comparison results.
According to Table 3, when using STFT as input, the P values of these algorithms were approximately 75%,
the R values were around 60%, and the F1-measures were below 70%. On the other hand,
when CQT was used as input, the performance of these algorithms was improved to some
extent. For example, the F1-measure of the CNN-BiGRU algorithm was improved by 9.97%
compared to using STFT as input. The comparison of different algorithms showed that
the F1-measure of the CNN-BiGRU algorithm was the highest. Finally, the P value of
the CNN-BiGRU algorithm was 81.26% when using VQT as the input, the R-value was 87.64%,
and the F1-measure was 84.33%, all the highest, demonstrating the effectiveness of
VQT and CNN-BiGRU in detecting the start point of notes.
The performance of different features and algorithms was compared in terms of polyphonic
fundamental tone detection, and the results are displayed in Table 4.
According to Table 4, the F1 values of these algorithms were all below 90% when using the STFT as input.
The CNN algorithm performed the worst in polyphonic fundamental tone detection, with
low P and R values and the lowest F1-measure, only 80.89%. The F1-measure of the CNN-BiGRU
algorithm was 89.25%, which was 8.36% higher than the CNN algorithm. When CQT was
used as the input, these algorithms exhibited improved performance in polyphonic fundamental
tone detection than when STFT was used. The F1-measure of the CNN-BiGRU algorithm
reached 95.88%, 6.63% higher than when STFT was used. Finally, the P and R values
and F1-measures of these algorithms were above 90% when VQT was used as input. The
F1-measure of the CNN-BiGRU algorithm was 97.25%, which was 1.37% higher than when
CQT was used. The results in Table 4 showed that, among the three features (STFT, CQT, and VQT), the VQT was the most effective input for polyphonic fundamental tone detection, and that the CNN-BiGRU algorithm performed better than the CNN and CNN-GRU algorithms in this task.
Detection of the note end points behaved similarly to detection of the start points, and the VQT combined with the CNN-BiGRU again showed the best performance. Finally, the impact of the two training sets on automatic piano transcription was compared. Fig. 2 presents the transcription results under the two training sets, using the VQT as the input and the CNN-BiGRU as the algorithm.
According to Fig. 2, in the automatic transcription of piano audio, the performance of the algorithm
in detecting the note start and end points was not as good as in detecting polyphonic
fundamental tone. The various indicators of the CNN-BiGRU algorithm in detecting polyphonic
fundamental tones reached over 90%, while the indicators for note start and end point
detection were below 90%. In piano audio, some notes may be played with low intensity,
which can cause missed detections and result in poor performance in detecting note
start points.
A comparison of the training sets shows that, with training set ②, the CNN-BiGRU algorithm did not perform as well as with training set ① in detecting note start and end points or polyphonic fundamental tones. Taking polyphonic fundamental tone detection as an example, compared with training set ①, the P value with training set ② decreased by 1.73% (to 95.43%), the R value decreased by 3.18% (to 94.16%), and the F1-measure decreased by 2.46% (to 94.79%). Training set ② contained only synthesized audio and lacked real piano recordings; the algorithm was therefore insufficiently trained and performed worse on the test set.
Fig. 2. Comparison of the automatic transcription results for piano audio.
Table 3. Influence of Different Input Features on the Detection of Note Start Point.

| Input feature | Algorithm | P value/% | R value/% | F1-measure/% |
| STFT | CNN | 75.12 | 57.64 | 65.23 |
| STFT | CNN-GRU | 76.77 | 59.89 | 67.29 |
| STFT | CNN-BiGRU | 77.79 | 60.12 | 67.82 |
| CQT | CNN | 75.61 | 71.27 | 73.38 |
| CQT | CNN-GRU | 77.49 | 74.15 | 75.78 |
| CQT | CNN-BiGRU | 79.12 | 76.51 | 77.79 |
| VQT | CNN | 76.25 | 83.55 | 79.73 |
| VQT | CNN-GRU | 79.33 | 85.16 | 82.14 |
| VQT | CNN-BiGRU | 81.26 | 87.64 | 84.33 |
Table 4. Impact of Different Features as Input on the Effectiveness of Polyphonic Fundamental Tone Detection.

| Input feature | Algorithm | P value/% | R value/% | F1-measure/% |
| STFT | CNN | 81.67 | 80.12 | 80.89 |
| STFT | CNN-GRU | 85.32 | 82.07 | 83.66 |
| STFT | CNN-BiGRU | 89.86 | 88.64 | 89.25 |
| CQT | CNN | 88.77 | 87.64 | 88.20 |
| CQT | CNN-GRU | 91.67 | 92.13 | 91.90 |
| CQT | CNN-BiGRU | 95.64 | 96.12 | 95.88 |
| VQT | CNN | 90.07 | 91.22 | 90.64 |
| VQT | CNN-GRU | 93.21 | 91.36 | 92.28 |
| VQT | CNN-BiGRU | 97.16 | 97.34 | 97.25 |
|