Assessment of Pepper Robot’s Speech Recognition System through the Lens of Machine Learning
Abstract
1. Introduction
2. Literature Review
3. Methodology
3.1. Experimental Setup
3.2. Pepper Robot Function
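The section body is not reproduced in this outline. As a hedged illustration of how audio is typically captured on Pepper, the sketch below uses NAOqi's `ALAudioRecorder` service through the Python SDK; the robot address, output path, sample rate, and channel mask are assumptions, not the paper's settings.

```python
# Sketch only: record audio on Pepper via NAOqi's ALAudioRecorder
# (NAOqi Python SDK). All concrete values below are assumptions.
from naoqi import ALProxy

PEPPER_IP = "192.168.1.10"   # hypothetical robot address
recorder = ALProxy("ALAudioRecorder", PEPPER_IP, 9559)

# Record a 16 kHz WAV from the front microphone to the robot's disk.
# Channel mask order is (left, right, front, rear).
recorder.startMicrophonesRecording("/home/nao/recordings/sample.wav",
                                   "wav", 16000, (0, 0, 1, 0))
# ... capture speech here ...
recorder.stopMicrophonesRecording()
```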
3.3. Transfer of Recordings to the Local System
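A recording saved on the robot then has to reach the local machine. A minimal sketch assuming SFTP with paramiko; the hostname, credentials, and paths are placeholders, not the paper's setup.

```python
# Sketch only: copy a recording from the robot over SFTP.
# Hostname, username, password, and paths are placeholders.
import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("192.168.1.10", username="nao", password="***")

sftp = client.open_sftp()
sftp.get("/home/nao/recordings/sample.wav", "sample.wav")  # remote -> local
sftp.close()
client.close()
```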
3.4. Feature Generation
3.4.1. Mel-Frequency Cepstral Coefficients (MFCCs)
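Thirteen coefficients (MFCC1 to MFCC13) appear as columns in the feature tables at the end of this document. A minimal extraction sketch, assuming librosa; the paper's frame length and hop size are not given in this outline.

```python
import librosa
import numpy as np

# Load at the native sample rate and compute 13 MFCCs per frame.
y, sr = librosa.load("sample.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
print(mfcc.shape, np.mean(mfcc, axis=1))             # one mean per coefficient
```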
3.4.2. Pitch
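A frame-level fundamental-frequency sketch using librosa's YIN implementation; whether the paper used this particular implementation, and its search range, are assumptions.

```python
import librosa

# Estimate F0 per frame with YIN; the 65-2093 Hz range is illustrative.
y, sr = librosa.load("sample.wav", sr=None)
f0 = librosa.yin(y, fmin=65.0, fmax=2093.0, sr=sr)   # Hz, one value per frame
```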
3.4.3. Spectral Centroid
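A matching sketch for the spectral centroid, again assuming librosa.

```python
import librosa

# Frame-wise spectral centroid (Hz): the "centre of mass" of each frame's spectrum.
y, sr = librosa.load("sample.wav", sr=None)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)   # shape: (1, n_frames)
```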
3.4.4. Spectral Flatness
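And for spectral flatness, which needs only the waveform, not the sample rate.

```python
import librosa

# Frame-wise spectral flatness in (0, 1]: near 1 is noise-like, near 0 is tonal.
y, _ = librosa.load("sample.wav", sr=None)
flatness = librosa.feature.spectral_flatness(y=y)   # shape: (1, n_frames)
```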
3.4.5. Time Domain Features
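The two time-domain columns in the feature tables are `zcr` and `Energy`. A sketch assuming librosa; whether the paper computes energy from RMS or from summed squared amplitudes is not stated in this outline.

```python
import librosa

y, _ = librosa.load("sample.wav", sr=None)
zcr = librosa.feature.zero_crossing_rate(y)   # sign changes per frame, in [0, 1]
energy = librosa.feature.rms(y=y) ** 2        # short-time energy as squared RMS
```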
3.5. Pre-Processing of Dataset
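The roughly zero-centred values in the appendix tables are consistent with per-column standardization. A sketch assuming pandas and scikit-learn's StandardScaler; random numbers stand in for the real per-frame feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Column names copied from the appendix tables; random stand-in data.
cols = [f"MFCC{i}" for i in range(1, 14)] + [
    "Pitch", "Spectral_Centroid", "Spectral_Flatness", "Energy", "zcr"]
df = pd.DataFrame(np.random.randn(100, len(cols)), columns=cols)

X = StandardScaler().fit_transform(df)   # zero mean, unit variance per column
```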
3.6. Application of Machine Learning
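Three per-cluster tables appear in the appendix (n = 44,406, 12,122, and 3,808 rows), which points to a three-cluster solution. A minimal k-means sketch with scikit-learn; the paper's exact estimator, initialization, and hyperparameters are not shown in this outline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.randn(300, 18)                 # stand-in for the scaled features
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(silhouette_score(X, kmeans.labels_))   # one sanity check on the cluster count
```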
3.7. Integration with Whisper: Performance Evaluation Using WER, MER, WIL, and CER
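A sketch of the evaluation step, assuming the openai-whisper package for transcription and jiwer for the four metrics; the model size and the metric library are assumptions.

```python
import whisper
import jiwer

model = whisper.load_model("base")                    # model size is an assumption
hypothesis = model.transcribe("sample.wav")["text"]
reference = "the sentence that was actually spoken"   # ground-truth transcript

print("WER:", jiwer.wer(reference, hypothesis))   # word error rate
print("MER:", jiwer.mer(reference, hypothesis))   # match error rate
print("WIL:", jiwer.wil(reference, hypothesis))   # word information lost
print("CER:", jiwer.cer(reference, hypothesis))   # character error rate
```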
3.8. Missing Value Imputation
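The imputation strategy itself is not visible in this outline; mean imputation with scikit-learn is shown below purely as one plausible approach.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 4.0],
              [np.nan, 6.0]])                                # toy matrix with gaps
X_imputed = SimpleImputer(strategy="mean").fit_transform(X)  # NaN -> column mean
```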
3.9. Selection of Best Cluster
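Section 4.4 selects the best cluster against a WER threshold. A hypothetical pandas sketch; the column names, toy data, and the 0.3 cut-off are illustrative only.

```python
import pandas as pd

# Toy stand-in: one row per recording, with its cluster label and Whisper WER.
records = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1, 2],
    "WER":     [0.60, 0.20, 0.10, 0.05, 0.40, 0.90],
})

THRESHOLD = 0.3                                  # hypothetical WER cut-off
good = records[records["WER"] <= THRESHOLD]
print(good["cluster"].value_counts().idxmax())   # cluster with most records under the cut-off
```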
4. Results
4.1. Feature Generation from Audio Recordings, Pre-Processing and Machine Learning
4.2. Visualization of Clusters with Features
4.3. Integration of Whisper to Evaluate Pepper’s Speech Recognition System
4.4. Selection of Best Cluster Depending on WER Threshold
4.5. Visualization of Best Records in Cluster 1
5. Discussion
6. Conclusions, Implications, Limitations, and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Conflicts of Interest
References
Summary statistics of the 18 audio features (n = 44,406):

| Statistic | MFCC1 | MFCC2 | MFCC3 | MFCC4 | MFCC5 | MFCC6 | MFCC7 | MFCC8 | MFCC9 | MFCC10 | MFCC11 | MFCC12 | MFCC13 | Pitch | Spectral_Centroid | Spectral_Flatness | Energy | zcr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 | 44,406.0 |
| mean | −0.34 | 0.42 | −0.32 | 0.27 | −0.24 | 0.23 | −0.03 | 0.1 | −0.0 | −0.12 | 0.2 | −0.17 | 0.22 | −0.06 | −0.18 | −0.12 | −0.12 | −0.18 |
| std | 0.7 | 0.53 | 0.66 | 0.68 | 0.65 | 0.67 | 0.79 | 0.83 | 0.9 | 0.88 | 0.87 | 0.87 | 0.81 | 0.98 | 0.5 | 0.47 | 0.2 | 0.53 |
| min | −1.87 | −2.28 | −5.62 | −3.34 | −4.41 | −3.42 | −5.27 | −5.68 | −6.41 | −5.11 | −4.91 | −4.48 | −3.81 | −1.21 | −1.98 | −0.95 | −0.32 | −1.83 |
| 25% | −0.83 | 0.08 | −0.61 | −0.16 | −0.64 | −0.15 | −0.42 | −0.33 | −0.49 | −0.61 | −0.32 | −0.7 | −0.23 | −0.89 | −0.45 | −0.45 | −0.23 | −0.45 |
| 50% | −0.57 | 0.45 | −0.24 | 0.22 | −0.26 | 0.29 | 0.04 | 0.13 | 0.02 | −0.16 | 0.15 | −0.2 | 0.28 | −0.37 | −0.25 | −0.13 | −0.2 | −0.23 |
| 75% | −0.07 | 0.77 | 0.11 | 0.62 | 0.17 | 0.68 | 0.46 | 0.58 | 0.51 | 0.32 | 0.65 | 0.32 | 0.74 | 0.59 | 0.01 | 0.14 | −0.08 | 0.04 |
| max | 5.23 | 3.34 | 1.96 | 4.06 | 3.60 | 2.96 | 4.00 | 5.26 | 4.77 | 5.52 | 6.02 | 5.40 | 7.07 | 2.57 | 2.95 | 7.76 | 2.32 | 3.87 |
Summary statistics of the 18 audio features (n = 12,122):

| Statistic | MFCC1 | MFCC2 | MFCC3 | MFCC4 | MFCC5 | MFCC6 | MFCC7 | MFCC8 | MFCC9 | MFCC10 | MFCC11 | MFCC12 | MFCC13 | Pitch | Spectral_Centroid | Spectral_Flatness | Energy | zcr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 | 12,122.00 |
| mean | 0.98 | −1.06 | 1.17 | −1.25 | 1.23 | −1.19 | 0.46 | −0.68 | 0.15 | 0.34 | −0.74 | 0.75 | −1.02 | 0.24 | −0.24 | −0.28 | 0.46 | −0.17 |
| std | 1.12 | 1.06 | 1.05 | 0.96 | 1.01 | 0.97 | 1.27 | 1.08 | 1.18 | 1.24 | 1.08 | 1.08 | 0.94 | 1.05 | 0.73 | 0.60 | 2.13 | 0.64 |
| min | −1.16 | −4.21 | −3.70 | −4.45 | −4.62 | −6.60 | −7.85 | −6.72 | −6.05 | −6.21 | −6.61 | −4.50 | −5.23 | −1.21 | −2.38 | −0.96 | −0.30 | −1.76 |
| 25% | 0.07 | −1.50 | 0.66 | −1.80 | 0.96 | −1.44 | 0.12 | −1.11 | −0.42 | −0.29 | −1.39 | 0.15 | −1.53 | −0.73 | −0.64 | −0.76 | −0.21 | −0.54 |
| 50% | 0.69 | −1.17 | 1.43 | −1.45 | 1.40 | −1.03 | 0.81 | −0.48 | 0.14 | 0.38 | −0.83 | 0.73 | −0.94 | 0.17 | −0.30 | −0.35 | −0.09 | −0.23 |
| 75% | 1.61 | −0.59 | 1.79 | −0.84 | 1.75 | −0.71 | 1.19 | −0.03 | 0.74 | 0.98 | −0.21 | 1.33 | −0.45 | 1.21 | 0.05 | 0.05 | 0.33 | 0.08 |
| max | 8.72 | 3.15 | 4.38 | 3.65 | 5.13 | 1.95 | 6.05 | 3.36 | 6.56 | 6.82 | 5.56 | 5.46 | 4.20 | 2.57 | 2.91 | 10.58 | 35.54 | 4.11 |
Summary statistics of the 18 audio features (n = 3,808):

| Statistic | MFCC1 | MFCC2 | MFCC3 | MFCC4 | MFCC5 | MFCC6 | MFCC7 | MFCC8 | MFCC9 | MFCC10 | MFCC11 | MFCC12 | MFCC13 | Pitch | Spectral_Centroid | Spectral_Flatness | Energy | zcr |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 | 3808.00 |
| mean | 0.89 | −1.55 | −0.00 | 0.84 | −1.10 | 1.10 | −1.14 | 0.94 | −0.48 | 0.37 | 0.05 | −0.37 | 0.64 | −0.03 | 2.88 | 2.24 | −0.10 | 2.60 |
| std | 0.93 | 1.00 | 1.30 | 1.05 | 0.97 | 1.14 | 1.25 | 1.27 | 1.30 | 1.07 | 1.04 | 1.02 | 1.04 | 0.98 | 1.51 | 2.59 | 0.30 | 2.02 |
| min | −1.08 | −5.09 | −5.67 | −2.00 | −4.90 | −3.19 | −8.44 | −4.79 | −4.18 | −4.17 | −3.60 | −4.44 | −3.77 | −1.21 | −0.45 | −0.94 | −0.31 | −1.54 |
| 25% | 0.20 | −2.17 | −0.70 | 0.11 | −1.75 | 0.42 | −2.04 | 0.09 | −1.40 | −0.35 | −0.64 | −1.03 | −0.01 | −0.89 | 1.81 | 1.11 | −0.21 | 1.24 |
| 50% | 0.74 | −1.34 | 0.17 | 0.77 | −1.20 | 1.21 | −1.18 | 0.92 | −0.48 | 0.37 | 0.02 | −0.40 | 0.71 | −0.36 | 2.50 | 1.83 | −0.18 | 2.12 |
| 75% | 1.40 | −0.82 | 0.82 | 1.56 | −0.47 | 1.93 | −0.26 | 1.81 | 0.39 | 1.08 | 0.70 | 0.27 | 1.34 | 0.82 | 3.62 | 2.83 | −0.09 | 3.44 |
| max | 5.49 | 0.55 | 3.20 | 4.07 | 2.48 | 4.21 | 3.82 | 4.90 | 4.68 | 4.84 | 4.48 | 3.44 | 3.50 | 2.57 | 9.77 | 58.17 | 6.64 | 14.53 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).