Automatic Assessment of Piano Performances Using Timbre and Pitch Features
Abstract
1. Introduction
2. Related Work
3. Methodology
3.1. Timbre-Based Evaluation
3.1.1. Timbre-Based WaveNet Approach
3.1.2. Timbre-Based MLNet Approach
3.1.3. Timbre-Based CNN Approach
3.1.4. Timbre-Based CNN Transformers Approach
3.2. Pitch-Based Evaluation
3.2.1. Pitch-Based CNN Approach
3.2.2. Pitch-Based CNN Transformers Approach
4. Experiments
4.1. Experiment Configuration
4.2. Songs for Kids’ Performance Description
4.3. Experiment Setup and System Implementation
4.4. Experiment Results
4.4.1. Timbre-Based Results
4.4.2. Pitch-Based Results
4.4.3. The Best Comparative Results
4.4.4. The Performance Competition Results
4.5. Analysis
4.5.1. Confusion Matrix
4.5.2. Loss Rate
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Table A1. Network configuration of the timbre-based WaveNet model (Section 3.1.1).

Layer | Kernel Size | Stride | Number of Filters | Output Shape |
---|---|---|---|---|
Input (Raw data +MFCC) | - | - | - | (1, 510, 1) |
1D Conv1 | (1, 3) | (1, 1) | 32 | (1, 510, 32) |
1D Conv2 | (1, 3) | (1, 1) | 32 | (1, 508, 32) |
MaxPool1 | (2, 2) | (1, 1) | 32 | (1, 254, 32) |
1D Conv3 | (1, 5) | (1, 1) | 64 | (1, 250, 64) |
1D Conv4 | (1, 5) | (1, 1) | 64 | (1, 246, 64) |
MaxPool2 | (2, 2) | (1, 1) | 64 | (1, 123, 64) |
Flatten | - | - | - | (1, 7872) |
FC1 | - | - | 256 | (1, 256) |
FC2 | - | - | 3 | (1, 3) |
SoftMax | - | - | - | (1, 3) |
Output (#3Classes) | - | - | - | (1, 3) |
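For concreteness, here is a minimal Keras sketch of the Table A1 stack. The ReLU activations and the same/valid padding pattern are assumptions inferred from the listed output shapes rather than stated in the table.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_table_a1(num_classes: int = 3) -> tf.keras.Model:
    """Sketch of Table A1; comments give the resulting shapes."""
    inp = layers.Input(shape=(510, 1))                                 # raw data + MFCC
    x = layers.Conv1D(32, 3, padding="same", activation="relu")(inp)   # (510, 32)
    x = layers.Conv1D(32, 3, padding="valid", activation="relu")(x)    # (508, 32)
    x = layers.MaxPooling1D(pool_size=2)(x)                            # (254, 32)
    x = layers.Conv1D(64, 5, padding="valid", activation="relu")(x)    # (250, 64)
    x = layers.Conv1D(64, 5, padding="valid", activation="relu")(x)    # (246, 64)
    x = layers.MaxPooling1D(pool_size=2)(x)                            # (123, 64)
    x = layers.Flatten()(x)                                            # (7872,)
    x = layers.Dense(256, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)

build_table_a1().summary()  # layer shapes should match the table above
```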
Table A2. Network configuration of the timbre-based MLNet model (Section 3.1.2).

Layer | Kernel Size | Stride | Number of Filters | Output Shape |
---|---|---|---|---|
Input (Raw data +3D MFCC) | - | - | - | (1, 63, 1149, 1) |
2D Conv1 | (4, 4) | (2, 2) | 32 | (1, 63, 1149, 32) |
2D Conv2 | (4, 4) | (2, 2) | 32 | (1, 30, 573, 32) |
MaxPool1 | (2, 2) | (2, 2) | 32 | (1, 15, 286, 32) |
Flatten | - | - | - | (1, 137280) |
FC1 | - | - | 512 | (1, 512) |
FC2 | - | - | 64 | (1, 64) |
FC3 | - | - | 3 | (1, 3) |
SoftMax | - | - | - | (1, 3) |
Output (#3Classes) | - | - | - | (1, 3) |
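A corresponding sketch for this 2-D stack. The stride of (2, 2) listed for Conv1 does not reproduce its listed (63, 1149, 32) output, so the sketch uses stride 1 with "same" padding there; that choice, and the ReLU activations, are assumptions. The remaining shapes then match the table, including the 15 × 286 × 32 = 137280-unit flatten.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_table_a2(num_classes: int = 3) -> tf.keras.Model:
    inp = layers.Input(shape=(63, 1149, 1))                                      # 3D MFCC
    x = layers.Conv2D(32, 4, strides=1, padding="same", activation="relu")(inp)  # (63, 1149, 32)
    x = layers.Conv2D(32, 4, strides=2, padding="valid", activation="relu")(x)   # (30, 573, 32)
    x = layers.MaxPooling2D(pool_size=2, strides=2)(x)                           # (15, 286, 32)
    x = layers.Flatten()(x)                                                      # (137280,)
    x = layers.Dense(512, activation="relu")(x)
    x = layers.Dense(64, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```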
Table A3. Network configuration of the timbre-based CNN model (Section 3.1.3).

Layer | Kernel Size | Stride | Number of Filters | Output Shape |
---|---|---|---|---|
Input (Raw data + MFCC) | - | - | - | (1, 160, 1) |
1D Conv1 | (3, 3) | (2, 2) | 64 | (1, 160, 64) |
MaxPool1 | (2, 2) | (2, 2) | 64 | (1, 80, 64) |
Flatten | - | - | - | (1, 5120) |
FC1 | - | - | 100 | (1, 100) |
FC2 | - | - | 3 | (1, 3) |
SoftMax | - | - | - | (1, 3) |
Output (#3Classes) | - | - | - | (1, 3) |
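The same caveat applies here: the listed stride of (2, 2) would not give a (160, 64) Conv1 output, so this sketch assumes stride 1 with "same" padding, which also reproduces the 80 × 64 = 5120-unit flatten.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_table_a3(num_classes: int = 3) -> tf.keras.Model:
    inp = layers.Input(shape=(160, 1))                                 # raw data + MFCC
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(inp)   # (160, 64)
    x = layers.MaxPooling1D(pool_size=2)(x)                            # (80, 64)
    x = layers.Flatten()(x)                                            # (5120,)
    x = layers.Dense(100, activation="relu")(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```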
Table A4. Network configuration of the timbre-based CNN Transformers model (Section 3.1.4).

Layer | Kernel Size | Stride | Number of Filters | Output Shape |
---|---|---|---|---|
Input (Raw data + Chroma_stft) | - | - | - | (1, 201, 1) |
Layer Normalization1 | - | - | - | (1, 201, 1) |
Multi-Head Attention (heads = 4, head size = 256) | - | - | - | (1, 201, 1) |
Dropout1 (0.25) | - | - | - | (1, 201, 1) |
Layer Normalization2 | - | - | - | (1, 201, 1) |
1D Conv1, ReLU | (1, 1) | (1, 1) | 4 | (1, 201, 4) |
Dropout2 (0.25) | - | - | - | (1, 201, 4) |
1D Conv2, ReLU | (1, 1) | (1, 1) | 1 | (1, 201, 1) |
Global Average Pooling | - | - | - | (1, 201) |
FC1 | - | - | 128 | (1, 128) |
Dropout3 (0.4) | - | - | - | (1, 128) |
FC2 | - | - | 3 | (1, 3) |
SoftMax | - | - | - | (1, 3) |
Output (#3Classes) | - | - | - | (1, 3) |
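This layer list closely follows the Keras time-series transformer recipe, so a sketch along those lines is given below; the residual Add connections and the layer-norm epsilon are assumptions not spelled out in the table. The same builder covers Table A6, which differs only in feeding raw pitch instead of chroma_stft.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def transformer_encoder(x, num_heads=4, head_size=256, ff_dim=4, dropout=0.25):
    # Attention sub-block (LayerNorm -> MHA -> Dropout) with a residual add
    a = layers.LayerNormalization(epsilon=1e-6)(x)
    a = layers.MultiHeadAttention(num_heads=num_heads, key_dim=head_size,
                                  dropout=dropout)(a, a)
    a = layers.Dropout(dropout)(a)
    res = layers.Add()([a, x])
    # Feed-forward sub-block built from 1x1 convolutions
    f = layers.LayerNormalization(epsilon=1e-6)(res)
    f = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(f)
    f = layers.Dropout(dropout)(f)
    f = layers.Conv1D(filters=x.shape[-1], kernel_size=1, activation="relu")(f)  # back to 1 channel
    return layers.Add()([f, res])

def build_transformer_classifier(seq_len=201, num_classes=3):
    inp = layers.Input(shape=(seq_len, 1))   # chroma_stft (Table A4) or pitch (Table A6)
    x = transformer_encoder(inp)
    # channels_first pooling averages over the trailing axis, flattening (201, 1) -> (201,)
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dropout(0.4)(x)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```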
Table A5. Network configuration of the pitch-based CNN model (Section 3.2.1).

Layer | Kernel Size | Stride | Number of Filters | Output Shape |
---|---|---|---|---|
Pitch Input (Pitch raw data) | - | - | - | (1, 160, 1) |
1D Conv1 | (1, 3) | (1, 1) | 64 | (1, 160, 64) |
Batch Normalization1 | - | - | - | (1, 160, 64) |
Rectified Linear Unit (ReLU) 1 | - | - | - | (1, 160, 64) |
1D Conv2 | (1, 3) | (1, 1) | 64 | (1, 158, 64) |
Batch Normalization2 | - | - | - | (1, 158, 64) |
Rectified Linear Unit (ReLU) 2 | - | - | - | (1, 158, 64) |
1D Conv3 | (1, 3) | (1, 1) | 64 | (1, 156, 64) |
Batch Normalization3 | - | - | - | (1, 156, 64) |
Rectified Linear Unit (ReLU) 3 | - | - | - | (1, 156, 64) |
Average Pooling | (1, 2) | (2, 2) | - | (1, 78, 64) |
Flatten | - | - | - | (1, 4992) |
FC1 | - | - | 3 | (1, 3) |
SoftMax | - | - | - | (1, 3) |
Output (#3Classes) | - | - | - | (1, 3) |
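A sketch of this pitch stack follows; the "valid" padding on the last two convolutions and the flatten before FC1 are inferred from the shape arithmetic (78 × 64 = 4992).

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_table_a5(num_classes: int = 3) -> tf.keras.Model:
    inp = layers.Input(shape=(160, 1))                        # raw pitch sequence
    x = layers.Conv1D(64, 3, padding="same")(inp)             # (160, 64)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv1D(64, 3, padding="valid")(x)              # (158, 64)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.Conv1D(64, 3, padding="valid")(x)              # (156, 64)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.AveragePooling1D(pool_size=2, strides=2)(x)    # (78, 64)
    x = layers.Flatten()(x)                                   # (4992,)
    out = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inp, out)
```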
Table A6. Network configuration of the pitch-based CNN Transformers model (Section 3.2.2).

Layer | Kernel Size | Stride | Number of Filters | Output Shape |
---|---|---|---|---|
Pitch Input (Pitch raw data) | - | - | - | (1, 201, 1) |
Layer Normalization1 | - | - | - | (1, 201, 1) |
Multi-Head Attention (heads = 4, head size = 256) | - | - | - | (1, 201, 1) |
Dropout1 (0.25) | - | - | - | (1, 201, 1) |
Layer Normalization2 | - | - | - | (1, 201, 1) |
1D Conv1, ReLU | (1, 1) | (1, 1) | 4 | (1, 201, 4) |
Dropout2 (0.25) | - | - | - | (1, 201, 4) |
1D Conv2, ReLU | (1, 1) | (1, 1) | 1 | (1, 201, 1) |
Global Average Pooling | - | - | - | (1, 201) |
FC1 | - | - | 128 | (1, 128) |
Dropout3 (0.4) | - | - | - | (1, 128) |
FC2 | - | - | 3 | (1, 3) |
SoftMax | - | - | - | (1, 3) |
Output (#3Classes) | - | - | - | (1, 3) |
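Since this table repeats the Table A4 encoder with a pitch input, the builder sketched earlier can be reused directly; the instantiation below is illustrative.

```python
# Both transformer models share the sketch from Table A4; only the feature
# fed into the (201, 1) input differs.
timbre_tf = build_transformer_classifier(seq_len=201)  # chroma_stft frames
pitch_tf = build_transformer_classifier(seq_len=201)   # raw pitch sequence
pitch_tf.summary()
```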
References
- Hosken, D. An Introduction to Music Technology, 2nd ed.; Taylor & Francis: New York, NY, USA, 2014; pp. 4–46.
- Campayo-Muñoz, E.; Cabedo-Mas, A.; Hargreaves, D. Intrapersonal skills and music performance in elementary piano students in Spanish conservatories: Three case studies. Int. J. Music Educ. 2020, 38, 93–112.
- Chandrasekaran, B.; Kraus, N. Music, noise-exclusion, and learning. Music Percept. 2010, 27, 297–306.
- Li, W. Analysis of piano performance characteristics by deep learning and artificial intelligence and its application in piano teaching. Front. Psychol. 2022, 12, 5962.
- Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
- Wang, D.; Zhang, M.; Li, Z.; Li, J.; Fu, M.; Cui, Y.; Chen, X. Modulation format recognition and OSNR estimation using CNN-based deep learning. IEEE Photon. Technol. Lett. 2017, 29, 1667–1670.
- Yang, C.; Zhang, X.; Song, Z. CNN Meets Transformer for Tracking. Sensors 2022, 22, 3210.
- Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2020; pp. 213–229.
- Shuo, C.; Xiao, C. The construction of internet+ piano intelligent network teaching system model. J. Intell. Fuzzy Syst. 2019, 37, 5819–5827.
- Chiang, P.Y.; Sun, C.H. Oncall piano sensei: Portable AR piano training system. In Proceedings of the 3rd ACM Symposium on Spatial User Interaction (SUI), Los Angeles, CA, USA, 8–9 August 2015; p. 134.
- Sun, C.H.; Chiang, P.Y. Mr. Piano: A portable piano tutoring system. In Proceedings of the 2018 IEEE XXV International Conference on Electronics, Electrical Engineering and Computing (INTERCON), Lima, Peru, 8–10 August 2018; pp. 1–4.
- Giraldo, S.; Ortega, A.; Perez, A.; Ramirez, R.; Waddell, G.; Williamon, A. Automatic assessment of violin performance using dynamic time warping classification. In Proceedings of the 2018 26th Signal Processing and Communications Applications Conference (SIU), Izmir, Turkey, 2–5 May 2018; pp. 1–3.
- Liu, M.; Huang, J. Piano playing teaching system based on artificial intelligence–design and research. J. Intell. Fuzzy Syst. 2021, 40, 3525–3533.
- Phanichraksaphong, V.; Tsai, W.H. Automatic evaluation of piano performances for STEAM education. Appl. Sci. 2021, 11, 11783.
- Sharma, A.K.; Aggarwal, G.; Bhardwaj, S.; Chakrabarti, P.; Chakrabarti, T.; Abawajy, J.H.; Mahdin, H. Classification of Indian classical music with time-series matching deep learning approach. IEEE Access 2021, 9, 102041–102052.
- Li, B. On identity authentication technology of distance education system based on voiceprint recognition. In Proceedings of the 30th Chinese Control Conference (CCC 2011), Yantai, China, 22–24 July 2011; pp. 5718–5721.
- Belman, A.K.; Paul, T.; Wang, L.; Iyengar, S.S.; Śniatała, P.; Jin, Z.; Roning, J. Authentication by mapping keystrokes to music: The melody of typing. In Proceedings of the 2020 International Conference on Artificial Intelligence and Signal Processing (AISP), Andhra Pradesh, India, 10–12 January 2020; pp. 1–6.
- McAdams, S. Musical timbre perception. In The Psychology of Music, 3rd ed.; Elsevier: Amsterdam, The Netherlands, 2013; pp. 35–67.
- Jiam, N.T.; Deroche, M.L.; Jiradejvong, P.; Limb, C.J. A randomized controlled crossover study of the impact of online music training on pitch and timbre perception in cochlear implant users. J. Assoc. Res. Otolaryngol. 2019, 20, 247–262.
- Verma, P.; Chafe, C. A generative model for raw audio using transformer architectures. In Proceedings of the 2021 24th International Conference on Digital Audio Effects (DAFx), Copenhagen, Denmark, 8–10 September 2021; pp. 230–237.
- Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499.
- Tran, V.T.; Tsai, W.H. Acoustic-based emergency vehicle detection using convolutional neural networks. IEEE Access 2020, 8, 75702–75713.
- Fonseca, E.; Pons Puig, J.; Favory, X.; Font Corbera, F.; Bogdanov, D.; Ferraro, A.; Serra, X. Freesound Datasets: A platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 23–27 October 2017; pp. 486–493.
- Boddapati, V.; Petef, A.; Rasmusson, J.; Lundberg, L. Classifying environmental sounds using image recognition networks. Procedia Comput. Sci. 2017, 112, 2048–2056.
- McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference (SciPy 2015), Austin, TX, USA, 6–12 July 2015; pp. 18–25.
- Chachada, S.; Kuo, C.C.J. Environmental sound recognition: A survey. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA), Kaohsiung, Taiwan, 29 October–1 November 2013.
- Salamon, J.; Bello, J.P. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 2017, 24, 279–283.
- Piczak, K.J. Environmental sound classification with convolutional neural networks. In Proceedings of the 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), Boston, MA, USA, 17–20 September 2015; pp. 1–6.
- Lee, J.; Kim, T.; Park, J.; Nam, J. Raw waveform-based audio classification using sample-level CNN architectures. arXiv 2017, arXiv:1712.00866.
- Thomas, S.; Ganapathy, S.; Saon, G.; Soltau, H. Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; pp. 2519–2523.
- Abdel-Hamid, O.; Mohamed, A.R.; Jiang, H.; Deng, L.; Penn, G.; Yu, D. Convolutional neural networks for speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1533–1545.
- Siripibal, N.; Supratid, S.; Sudprasert, C. A comparative study of object recognition techniques: Softmax, linear and quadratic discriminant analysis based on convolutional neural network feature extraction. In Proceedings of the 2019 International Conference on Management Science and Industrial Engineering, Phuket, Thailand, 24–26 May 2019; pp. 209–214.
- Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object detection with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232.
- Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 6836–6846.
- Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. 2022, 54, 1–41.
- Yu, H.M.; Tsai, W.H.; Wang, H.M. A query-by-singing system for retrieving karaoke music. IEEE Trans. Multimed. 2008, 10, 1626–1637.
- Piszczalski, M.; Galler, B.A. Predicting musical pitch from component frequency ratios. J. Acoust. Soc. Am. 1979, 66, 710–720.
- Su, H.; Zhang, H.; Zhang, X.; Gao, G. Convolutional neural network for robust pitch determination. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 579–583.
- Guo, J.; Han, K.; Wu, H.; Tang, Y.; Chen, X.; Wang, Y.; Xu, C. CMT: Convolutional neural networks meet vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 12175–12185.
- Zhang, W.; Lei, W.; Xu, X.; Xing, X. Improved music genre classification with convolutional neural networks. In Proceedings of the 17th Annual Conference of the International Speech Communication Association (INTERSPEECH 2016), San Francisco, CA, USA, 8–12 September 2016; pp. 3304–3308.
- Sarkar, R.; Choudhury, S.; Dutta, S.; Roy, A.; Saha, S.K. Recognition of emotion in music based on deep convolutional neural network. Multimed. Tools Appl. 2020, 79, 765–783.
- Singh, Y.; Biswas, A. Robustness of musical features on deep learning models for music genre classification. Expert Syst. Appl. 2022, 199, 116879.
Songs for Kids Dataset | #Training | #Testing |
---|---|---|
Alphabet Song | 281 | 70 |
Amazing Grace | 281 | 70 |
Au Clair De La Lune | 281 | 70 |
Beautiful Brown Eye | 281 | 70 |
Brahms Lullaby | 281 | 70 |
Can Can | 281 | 70 |
Clementine | 281 | 70 |
Deck The Hall | 281 | 70 |
Hot Cross Buns | 281 | 70 |
Jingle Bells | 281 | 70 |
Lavender Blue | 281 | 70 |
London Bridge | 281 | 70 |
Mary Had a Little Lamb | 281 | 70 |
Ode to Joy | 281 | 70 |
Oh Susanna | 281 | 70 |
Skip to My Lou | 281 | 70 |
The Cuckoo | 281 | 70 |
This Old Man | 281 | 70 |
Total | 5058 | 1260 |
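Each song contributes 351 clips, split 281/70 (roughly 80/20). A minimal sketch of such a per-song split is shown below; the use of scikit-learn, the stratification over the three performance classes, and the random seed are assumptions.

```python
from sklearn.model_selection import train_test_split

def split_song(clips, labels, n_test=70, seed=42):
    """Hold out n_test clips of one song, stratified over the 3 class labels."""
    x_train, x_test, y_train, y_test = train_test_split(
        clips, labels, test_size=n_test, stratify=labels, random_state=seed)
    return x_train, x_test, y_train, y_test
```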
Dataset | Model | Accuracy | Precision | Recall | F1-Score |
---|---|---|---|---|---|
Alphabet Song | Timbre-based WaveNet | 64.78% | 44.17% | 64.81% | 52.42% |
Alphabet Song | Timbre-based MLNet | 76.06% | 87.68% | 74.53% | 75.89% |
Alphabet Song | Timbre-based CNN Original | 88.73% | 90.49% | 87.72% | 88.79% |
Alphabet Song | Timbre-based CNN Transformer | 98.59% | 98.80% | 98.48% | 98.61% |
Amazing Grace | Timbre-based WaveNet | 64.79% | 44.17% | 64.81% | 52.42% |
Amazing Grace | Timbre-based MLNet | 73.23% | 83.53% | 68.98% | 70.83% |
Amazing Grace | Timbre-based CNN Original | 80.28% | 90.47% | 72.82% | 75.92% |
Amazing Grace | Timbre-based CNN Transformer | 95.77% | 96.11% | 95.45% | 95.70% |
Au Clair De La Lune | Timbre-based WaveNet | 63.38% | 84.24% | 56.94% | 55.96% |
Au Clair De La Lune | Timbre-based MLNet | 67.60% | 73.42% | 70.21% | 66.38% |
Au Clair De La Lune | Timbre-based CNN Original | 87.32% | 95.06% | 84.44% | 88.42% |
Au Clair De La Lune | Timbre-based CNN Transformer | 97.18% | 97.31% | 97.25% | 97.24% |
Beautiful Brown Eye | Timbre-based WaveNet | 54.93% | 46.41% | 47.22% | 40.77% |
Beautiful Brown Eye | Timbre-based MLNet | 66.20% | 72.63% | 67.19% | 65.43% |
Beautiful Brown Eye | Timbre-based CNN Original | 84.51% | 85.43% | 84.56% | 84.10% |
Beautiful Brown Eye | Timbre-based CNN Transformer | 92.96% | 94.79% | 92.42% | 93.06% |
Brahms Lullaby | Timbre-based WaveNet | 60.56% | 80.22% | 56.48% | 55.83% |
Brahms Lullaby | Timbre-based MLNet | 74.65% | 76.62% | 71.77% | 72.57% |
Brahms Lullaby | Timbre-based CNN Original | 77.46% | 86.67% | 74.95% | 80.09% |
Brahms Lullaby | Timbre-based CNN Transformer | 91.55% | 92.93% | 90.90% | 91.39% |
Can Can | Timbre-based WaveNet | 66.20% | 79.91% | 72.17% | 63.71% |
Can Can | Timbre-based MLNet | 76.06% | 76.92% | 77.63% | 76.48% |
Can Can | Timbre-based CNN Original | 78.87% | 88.06% | 71.44% | 75.41% |
Can Can | Timbre-based CNN Transformer | 91.55% | 93.93% | 90.90% | 91.61% |
Clementine | Timbre-based WaveNet | 59.15% | 50.00% | 51.38% | 45.64% |
Clementine | Timbre-based MLNet | 73.24% | 74.96% | 73.24% | 71.84% |
Clementine | Timbre-based CNN Original | 77.46% | 86.67% | 74.95% | 80.09% |
Clementine | Timbre-based CNN Transformer | 95.77% | 96.00% | 96.29% | 95.91% |
Deck The Hall | Timbre-based WaveNet | 61.97% | 78.07% | 59.13% | 56.56% |
Deck The Hall | Timbre-based MLNet | 69.01% | 68.47% | 69.49% | 67.57% |
Deck The Hall | Timbre-based CNN Original | 85.92% | 90.98% | 72.30% | 79.35% |
Deck The Hall | Timbre-based CNN Transformer | 98.59% | 98.80% | 98.48% | 98.61% |
Hot Cross Buns | Timbre-based WaveNet | 54.93% | 49.18% | 47.22% | 41.08% |
Hot Cross Buns | Timbre-based MLNet | 77.46% | 88.14% | 77.77% | 76.12% |
Hot Cross Buns | Timbre-based CNN Original | 78.87% | 82.50% | 76.91% | 79.22% |
Hot Cross Buns | Timbre-based CNN Transformer | 95.77% | 96.03% | 95.45% | 95.50% |
Jingle Bells | Timbre-based WaveNet | 66.20% | 81.85% | 59.27% | 54.73% |
Jingle Bells | Timbre-based MLNet | 71.83% | 53.06% | 63.88% | 56.67% |
Jingle Bells | Timbre-based CNN Original | 77.46% | 78.16% | 76.91% | 75.60% |
Jingle Bells | Timbre-based CNN Transformer | 91.55% | 92.07% | 92.03% | 91.65% |
Lavender Blue | Timbre-based WaveNet | 66.20% | 51.57% | 58.33% | 52.14% |
Lavender Blue | Timbre-based MLNet | 69.01% | 71.79% | 68.32% | 68.46% |
Lavender Blue | Timbre-based CNN Original | 81.69% | 86.69% | 79.57% | 79.78% |
Lavender Blue | Timbre-based CNN Transformer | 94.37% | 94.66% | 94.78% | 94.53% |
London Bridge | Timbre-based WaveNet | 54.93% | 82.51% | 50.46% | 47.55% |
London Bridge | Timbre-based MLNet | 77.46% | 87.91% | 75.47% | 77.77% |
London Bridge | Timbre-based CNN Original | 84.51% | 89.66% | 84.15% | 84.82% |
London Bridge | Timbre-based CNN Transformer | 95.77% | 96.00% | 96.29% | 95.91% |
Mary Had a Little Lamb | Timbre-based WaveNet | 55.09% | 84.24% | 61.11% | 59.20% |
Mary Had a Little Lamb | Timbre-based MLNet | 69.01% | 73.10% | 71.83% | 68.42% |
Mary Had a Little Lamb | Timbre-based CNN Original | 81.69% | 87.22% | 75.29% | 76.44% |
Mary Had a Little Lamb | Timbre-based CNN Transformer | 97.18% | 97.70% | 96.96% | 97.22% |
Ode To Joy | Timbre-based WaveNet | 57.75% | 83.05% | 55.09% | 49.92% |
Ode To Joy | Timbre-based MLNet | 77.46% | 84.91% | 75.47% | 77.77% |
Ode To Joy | Timbre-based CNN Original | 88.73% | 89.13% | 89.43% | 88.52% |
Ode To Joy | Timbre-based CNN Transformer | 90.14% | 93.13% | 89.39% | 90.40% |
Oh Susanna | Timbre-based WaveNet | 64.79% | 44.17% | 64.81% | 52.42% |
Oh Susanna | Timbre-based MLNet | 74.65% | 83.43% | 72.70% | 75.00% |
Oh Susanna | Timbre-based CNN Original | 84.51% | 89.24% | 82.19% | 85.40% |
Oh Susanna | Timbre-based CNN Transformer | 97.18% | 97.70% | 96.96% | 97.22% |
Skip To My Lou | Timbre-based WaveNet | 61.97% | 46.66% | 63.21% | 50.56% |
Skip To My Lou | Timbre-based MLNet | 74.65% | 81.92% | 73.86% | 74.84% |
Skip To My Lou | Timbre-based CNN Original | 88.73% | 83.71% | 82.00% | 86.99% |
Skip To My Lou | Timbre-based CNN Transformer | 90.14% | 90.14% | 89.39% | 89.55% |
The Cuckoo | Timbre-based WaveNet | 57.75% | 79.53% | 54.98% | 51.51% |
The Cuckoo | Timbre-based MLNet | 77.46% | 84.01% | 74.53% | 76.14% |
The Cuckoo | Timbre-based CNN Original | 77.46% | 78.16% | 76.91% | 75.60% |
The Cuckoo | Timbre-based CNN Transformer | 94.37% | 94.51% | 94.21% | 94.34% |
This Old Man | Timbre-based WaveNet | 63.38% | 45.33% | 57.47% | 45.64% |
This Old Man | Timbre-based MLNet | 74.65% | 83.43% | 72.70% | 75.00% |
This Old Man | Timbre-based CNN Original | 84.51% | 92.12% | 67.93% | 77.81% |
This Old Man | Timbre-based CNN Transformer | 94.37% | 95.69% | 93.93% | 94.48% |
Model | Timbre-Based WaveNet | Timbre-Based MLNet | Timbre-Based CNN | Timbre-Based CNN Transformers |
---|---|---|---|---|
Average accuracy | 61.04% | 73.32% | 82.71% | 94.60% |
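For reference, here is a sketch of computing the four reported metrics from softmax outputs with scikit-learn; macro averaging over the three classes is an assumption, since the averaging scheme is not stated in this excerpt.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def evaluate(y_true, y_prob):
    """Accuracy plus averaged precision/recall/F1 from softmax outputs."""
    y_pred = np.argmax(y_prob, axis=1)   # (N, 3) class probabilities -> class ids
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0)
    return acc, prec, rec, f1
```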
Per-song results for the pitch-based CNN (left metric block) and the pitch-based CNN Transformers (right metric block).

Dataset | CNN Accuracy | CNN Precision | CNN Recall | CNN F1-Score | Transformer Accuracy | Transformer Precision | Transformer Recall | Transformer F1-Score |
---|---|---|---|---|---|---|---|---|
Alphabet Song | 92.96% | 94.79% | 92.42% | 92.90% | 100.00% | 100.00% | 100.00% | 100.00% |
Amazing Grace | 97.18% | 97.22% | 97.53% | 97.27% | 98.59% | 98.55% | 98.48% | 98.48% |
Au Clair De La Lune | 92.96% | 93.28% | 93.55% | 93.19% | 95.77% | 96.00% | 96.29% | 95.91% |
Beautiful Brown Eye | 95.77% | 96.00% | 95.45% | 95.43% | 97.18% | 97.70% | 96.96% | 97.22% |
Brahms Lullaby | 92.96% | 93.83% | 93.83% | 93.20% | 95.77% | 96.66% | 95.45% | 95.80% |
Can Can | 100.00% | 100.00% | 100.00% | 100.00% | 95.77% | 96.00% | 96.29% | 95.91% |
Clementine | 98.59% | 98.55% | 98.48% | 98.48% | 97.18% | 97.22% | 96.96% | 96.96% |
Deck The Hall | 98.59% | 98.81% | 98.48% | 98.62% | 98.59% | 98.55% | 98.76% | 98.63% |
Hot Cross Buns | 92.96% | 94.79% | 92.42% | 92.90% | 100.00% | 100.00% | 100.00% | 100.00% |
Jingle Bells | 100.00% | 100.00% | 100.00% | 100.00% | 97.18% | 97.22% | 97.53% | 97.26% |
Lavender Blue | 100.00% | 100.00% | 100.00% | 100.00% | 97.18% | 97.70% | 96.96% | 97.22% |
London Bridge | 97.18% | 97.70% | 96.97% | 97.22% | 98.59% | 98.55% | 98.48% | 98.61% |
Mary Had a Little Lamb | 98.59% | 98.55% | 98.77% | 98.63% | 98.59% | 98.55% | 98.48% | 98.48% |
Ode To Joy | 98.59% | 98.55% | 98.77% | 98.63% | 97.18% | 97.70% | 96.96% | 97.22% |
Oh Susanna | 97.18% | 97.70% | 96.97% | 97.22% | 97.18% | 97.22% | 97.53% | 97.26% |
Skip To My Lou | 94.37% | 94.87% | 94.22% | 94.15% | 97.18% | 97.70% | 96.96% | 97.22% |
The Cuckoo | 97.18% | 97.70% | 96.97% | 97.22% | 94.37% | 94.87% | 95.06% | 94.55% |
This Old Man | 98.59% | 98.81% | 98.48% | 98.62% | 98.59% | 98.80% | 98.48% | 98.61% |
Model | Pitch-Based CNN | Pitch-Based CNN Transformers |
---|---|---|
Average accuracy | 96.87% | 97.50% |
Model | Average Accuracy |
---|---|
Timbre-based WaveNet | 61.04% |
Timbre-based MLNet | 73.32% |
Timbre-based CNN | 82.71% |
Timbre-based CNN Transformers | 94.60% |
Pitch-based CNN | 96.87% |
Pitch-based CNN Transformers | 97.50% |
Model Combination | Timbre-Based CNN & Pitch-Based CNN | Timbre-Based CNN Transformers & Pitch-Based CNN Transformers |
---|---|---|
Average accuracy (mean of the two constituent models' averages) | 89.79% | 96.05% |
Work | Feature | Model/Method | Accuracy |
---|---|---|---|
Zhang, W. et al. [40], 2016 | STFT | CNN | 87.40% |
Sarkar, R. et al. [41], 2020 | MFCC | CNN | 67.71% |
Singh, Y. et al. [42], 2022 | MFCC | Xception | 89.02% |
Ours | MFCC | Timbre-based CNN | 82.71% |
Ours | Pitch raw data | Pitch-based CNN | 96.87% |
Ours | Chroma_stft | Timbre-based CNN Transformers | 94.60% |
Ours | Pitch raw data | Pitch-based CNN Transformers | 97.50% |
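The comparison names three kinds of inputs: MFCC, chroma_stft, and raw pitch. A sketch of extracting all three with librosa [25] follows; the sample rate, n_mfcc, and the YIN pitch-search range are assumptions.

```python
import librosa

def extract_features(path, sr=22050):
    y, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)      # timbre-based CNN input
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)        # timbre-based transformer input
    f0 = librosa.yin(y, fmin=librosa.note_to_hz("C2"),
                     fmax=librosa.note_to_hz("C7"), sr=sr)  # raw pitch track
    return mfcc, chroma, f0
```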
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).