Arabic Mispronunciation Recognition System Using LSTM Network
Abstract
1. Introduction
- A two-stage diagnostic system for recognizing the mispronunciation of Arabic letters using MFCC features and an LSTM model was implemented.
- To the best of our knowledge, this is the first attempt to recognize both the mispronunciation and the gender of the speaker through two-level classification (a schematic sketch of such a cascade follows this list).
- The paper provides the first benchmark for mispronunciation prediction covering both native and non-native Arabic speakers.
- Grid search was utilized to identify the optimum hyperparameters of the proposed framework.
- An empirical analysis was conducted to investigate the impact of different speech features on the mispronunciation recognition system.
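One way to picture the two-level design is a cascade in which a gender classifier routes each utterance to a gender-specific mispronunciation model. The Python sketch below illustrates only this one plausible reading; the function and model names are hypothetical and do not reproduce the authors' confirmed pipeline.

```python
# Hypothetical two-level cascade: one plausible reading of the two-level
# classification described above, not the authors' confirmed pipeline.

def classify_utterance(features, gender_model, mispronunciation_models):
    """Stage 1 predicts the speaker's gender; stage 2 applies the matching
    gender-specific model to decide whether the letter is mispronounced."""
    gender = gender_model.predict(features)                      # e.g., "male" or "female"
    is_mispronounced = mispronunciation_models[gender].predict(features)
    return gender, is_mispronounced
```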
2. Literature Review
3. Methodology
3.1. Speech Corpus
3.2. Speech Preprocessing
3.3. Proposed Framework
3.3.1. Recurrent Neural Network (RNN)
3.3.2. Long Short-Term Memory (LSTM)
3.3.3. Feature Extraction
- Zero-crossing rate (ZCR): This feature counts the number of times the sound signal transitions from positive to negative and vice versa, providing insight into the temporal characteristics of the signal. The ZCR is a useful indicator of the dominant frequency component of the signal and therefore an important determinant when appraising its acoustic properties [9,11,16]. (A minimal extraction sketch in Python follows this feature list.)
- Mel-frequency cepstral coefficients (MFCCs): MFCCs are widely used in speaker and emotion recognition because they provide a high-level representation of human auditory perception, making them a critical component in the assessment of acoustic properties [17,18,19,20]. Computing MFCCs involves a psychoacoustically motivated filter bank, followed by logarithmic compression and a discrete cosine transform (DCT). The MFCCs are calculated based on the following formula [21,22]:
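In the standard DCT-based formulation (a minimal reconstruction; the exact notation of [21,22] may differ), the $n$-th coefficient is obtained from the log energies $E_k$ of the $K$ Mel filter-bank channels:

$$\mathrm{MFCC}_n = \sum_{k=1}^{K} \log\left(E_k\right)\cos\!\left[\frac{\pi n}{K}\left(k - \frac{1}{2}\right)\right], \qquad n = 1, 2, \ldots, L,$$

where $L$ is the number of cepstral coefficients retained.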
- ∆² MFCC: Another important feature in speech processing is the second-order (delta-delta) derivative of the MFCCs, obtained by differencing the first-order delta coefficients over time; it captures the rate of change (acceleration) of the spectral envelope.
- Linear prediction cepstral coefficients (LPCCs): These coefficients are derived from the impulse response of the linear prediction coefficient (LPC) model, which mimics the vocal tract of the human speech production system, and they provide a more resilient, noise-tolerant speech feature than the LPCs themselves. LPC analysis evaluates the speech signal by approximating the formants, removing their effect from the signal, and estimating the intensity and frequency of the remaining residual [24,25].
- Gammatone-frequency cepstral coefficients (GFCCs): Derived from gammatone filter banks, they offer a more precise representation of speech features and are less affected by noise and distortion compared to MFCCs. Their integration into speech processing algorithms has resulted in better performance and accuracy in tasks such as speaker recognition, speech emotion recognition, and music information retrieval [26].
- Log-frequency cepstral coefficients (LFCCs): These are computed in much the same way as MFCCs, but the filter bank is spaced linearly along the frequency axis rather than on the Mel scale [25].
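As a concrete illustration, the following minimal Python sketch extracts several of these features with librosa. The file name, sampling rate, and coefficient counts are assumptions, and GFCC, LFCC, and LPCC computation is omitted because librosa does not provide it directly.

```python
import numpy as np
import librosa

# Load one recorded utterance (file name and 16 kHz sampling rate are assumptions).
signal, sr = librosa.load("letter_utterance.wav", sr=16000)

# Zero-crossing rate per analysis frame (default 2048-sample frames, 512-sample hop).
zcr = librosa.feature.zero_crossing_rate(signal)

# 13 Mel-frequency cepstral coefficients per frame (same default framing as above).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

# Second-order (delta-delta) MFCCs, i.e., the acceleration coefficients.
mfcc_delta2 = librosa.feature.delta(mfcc, order=2)

# Linear prediction coefficients (order 12), the starting point for LPCC computation.
lpc = librosa.lpc(signal, order=12)

# Frame-level feature matrix (time steps x features) suitable as LSTM input.
features = np.vstack([zcr, mfcc, mfcc_delta2]).T
print(features.shape, lpc.shape)
```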
3.3.4. Grid Search
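As a sketch of how such a search can be organized, the snippet below enumerates an exhaustive grid over the tuned hyperparameters and keeps the best-scoring configuration. The candidate values and the build_and_evaluate helper are hypothetical placeholders, not the authors' actual search space or training code.

```python
from itertools import product

# Hypothetical candidate values; the paper's actual search space is not reproduced here.
grid = {
    "lstm_units": [64, 128, 179, 256],
    "dropout": [0.1, 0.2, 0.3],
    "batch_size": [16, 32, 64],
    "epochs": [50, 100, 190],
}

def build_and_evaluate(lstm_units, dropout, batch_size, epochs):
    # Placeholder: in the real pipeline this would train the LSTM with these
    # settings and return validation accuracy; here it only returns a dummy score.
    return 0.0

best_score, best_params = -1.0, None
for values in product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = build_and_evaluate(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```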
4. Experimental Setup
4.1. Evaluation Metrics
- True Positive (TP) is a correctly predicted positive pronunciation instance.
- False Positive (FP) is an incorrectly predicted positive pronunciation instance (an actual negative classified as positive).
- True Negative (TN) is a correctly predicted negative pronunciation instance.
- False Negative (FN) is an incorrectly predicted negative pronunciation instance (an actual positive classified as negative).
These four counts define the evaluation metrics given below.
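From these counts, the metrics reported in the results tables are computed as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}.$$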
4.2. Platform
5. Experimental Results
5.1. Ablation Study
5.1.1. Optimum Model Hyperparameters
Number of LSTM Cells in Each Layer
Dropout
Batch Size
Number of Epochs
5.1.2. Optimum Speech Features
5.2. Overall Performance of the System
6. Concluding Remarks
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Calik, S.S.; Kucukmanisa, A.; Kilimci, Z.H. An ensemble-based framework for mispronunciation detection of Arabic phonemes. arXiv 2023, arXiv:2301.01378.
2. Fu, P.; Liu, D.; Yang, H. LAS-Transformer: An Enhanced Transformer Based on the Local Attention Mechanism for Speech Recognition. Information 2022, 13, 250.
3. Ye, W.; Mao, S.; Soong, F.; Wu, W.; Xia, Y.; Tien, J.; Wu, Z. An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 6827–6831.
4. Li, K.; Qian, X.; Meng, H. Mispronunciation Detection and Diagnosis in L2 English Speech Using Multidistribution Deep Neural Networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 193–207.
5. Shahin, M.; Ahmed, B. Anomaly detection based pronunciation verification approach using speech attribute features. Speech Commun. 2019, 111, 29–43.
6. Arafa, M.N.M.; Elbarougy, R.; Ewees, A.A.; Behery, G.M. A Dataset for Speech Recognition to Support Arabic Phoneme Pronunciation. Int. J. Image Graph. Signal Process. 2018, 10, 31–38.
7. Shareef, S.; Al-Irhayim, Y. Comparison between Features Extraction Techniques for Impairments Arabic Speech. Al-Rafidain Eng. J. 2022, 27, 190–197.
8. Keerio, A.; Mitra, B.K.; Birch, P.; Young, R.; Chatwin, C. On preprocessing of speech signals. World Acad. Sci. Eng. Technol. 2009, 35, 818–824.
9. Ibrahim, Y.A.; Odiketa, J.C.; Ibiyemi, T.S. Preprocessing technique in automatic speech recognition for human computer interaction: An overview. Ann. Comput. Sci. Ser. 2017, 15, 186–191.
10. Kaur, M.; Mohta, A. A Review of Deep Learning with Recurrent Neural Network. In Proceedings of the 2019 International Conference on Smart Systems and Inventive Technology (ICSSIT), Tirunelveli, India, 27–29 November 2019; pp. 460–465.
11. Hassan, A.; Shahin, I.; Alsabek, M.B. COVID-19 Detection System using Recurrent Neural Networks. In Proceedings of the 2020 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI), Sharjah, United Arab Emirates, 3–5 November 2020.
12. Nassif, A.B.; Shahin, I.; Attili, I.; Azzeh, M.; Shaalan, K. Speech Recognition Using Deep Neural Networks: A Systematic Review. IEEE Access 2019, 7, 19143–19165.
13. Shewalkar, A.; Nyavanandi, D.; Ludwig, S.A. Performance Evaluation of Deep Neural Networks Applied to Speech Recognition: RNN, LSTM and GRU. J. Artif. Intell. Soft Comput. Res. 2019, 9, 235–245.
14. Amberkar, A.; Awasarmol, P.; Deshmukh, G.; Dave, P. Speech Recognition using Recurrent Neural Networks. In Proceedings of the 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), Coimbatore, India, 1–3 March 2018; pp. 1–4.
15. Geiger, J.T.; Zhang, Z.; Weninger, F.; Schuller, B.; Rigoll, G. Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling. In Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech 2014), Singapore, 14–18 September 2014; pp. 631–635.
16. Kos, M.; Kačič, Z.; Vlaj, D. Acoustic classification and segmentation using modified spectral roll-off and variance-based features. Digit. Signal Process. 2013, 23, 659–674.
17. Shahin, I.; Nassif, A.B.; Bahutair, M. Emirati-accented speaker identification in each of neutral and shouted talking environments. Int. J. Speech Technol. 2018, 21, 265–278.
18. Shahin, I. Novel third-order hidden Markov models for speaker identification in shouted talking environments. Eng. Appl. Artif. Intell. 2014, 35, 316–323.
19. Shahin, I. Using emotions to identify speakers. In Proceedings of the 5th International Workshop on Signal Processing and Its Applications (WoSPA 2008), Sharjah, United Arab Emirates, 18–20 March 2008.
20. Shahin, I. Identifying Speakers Using Their Emotion Cues. Int. J. Speech Technol. 2011, 14, 89–98.
21. Shahin, I.; Nassif, A.B.; Hamsa, S. Novel cascaded Gaussian mixture model-deep neural network classifier for speaker identification in emotional talking environments. Neural Comput. Appl. 2020, 32, 2575–2587.
22. Alsabek, M.B.; Shahin, I.; Hassan, A. Studying the Similarity of COVID-19 Sounds based on Correlation Analysis of MFCC. In Proceedings of the 2020 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI), Sharjah, United Arab Emirates, 3–5 November 2020.
23. Ranjan, R.; Thakur, A. Analysis of feature extraction techniques for speech recognition system. Int. J. Innov. Technol. Explor. Eng. 2019, 8, 197–200.
24. Kinnunen, T.; Li, H. An overview of text-independent speaker recognition: From features to supervectors. Speech Commun. 2010, 52, 12–40.
25. Atrey, P.K.; Maddage, N.C.; Kankanhalli, M.S. Audio based event detection for multimedia surveillance. In Proceedings of the 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toulouse, France, 14–19 May 2006; Volume 5, pp. 813–816.
26. Ayoub, B.; Jamal, K.; Arsalane, Z. Gammatone frequency cepstral coefficients for speaker identification over VoIP networks. In Proceedings of the 2016 International Conference on Information Technology for Organizations Development (IT4OD), Fez, Morocco, 30 March–1 April 2016.
27. Liashchynskyi, P.; Liashchynskyi, P. Grid Search, Random Search, Genetic Algorithm: A Big Comparison for NAS. arXiv 2019, arXiv:1912.06059.
28. Sokolova, M.; Japkowicz, N.; Szpakowicz, S. Beyond accuracy, F-score and ROC: A family of discriminant measures for performance evaluation. In Proceedings of the 19th Australian Joint Conference on Artificial Intelligence, Hobart, Australia, 4–8 December 2006; pp. 24–29.
29. Bahador, M.; Ahmed, W. The Accuracy of the LSTM Model for Predicting the S&P 500 Index and the Difference between Prediction and Backtesting. Bachelor’s Thesis, KTH Royal Institute of Technology, Stockholm, Sweden, 2018; p. 37.
30. Azzouni, A.; Pujolle, G. A long short-term memory recurrent neural network framework for network traffic matrix prediction. arXiv 2017, arXiv:1705.05690.
No. | Arabic Letter | Phonetic Symbol |
---|---|---|
1 | س | /s/ |
2 | ر | /r/ |
3 | ق | /q/ |
4 | ج | /ʒ/ |
5 | ك | /k/ |
6 | خ | /x/ |
7 | غ | /ɣ/ |
8 | ض | /d/ |
9 | ح | /ḥ/ |
10 | ص | /Ṣ/ |
11 | ط | /ŧ/ |
12 | ظ | /∂/ |
13 | ذ | /ð/ |
Work | Classification Algorithm | Data Utilized | Performance Metrics | Results |
---|---|---|---|---|
Ye et al. [3] | Acoustic, Phonetic, and Linguistic Data Embedding | L2-ARCTIC database | Detection Accuracy, Diagnosis Error Rate, F-Measure | Accuracy: 9.93% DER: 10.13% F-measure: 6.17% |
Li et al. [4] | Acoustic-Graphemic Phonemic Model (AGPM) Using Multi-Distribution Deep Neural Networks (MD-DNNs) | Not specified | Phone Error Rate (PER), False Rejection Rate (FRR), False Acceptance Rate (FAR), Diagnostic Error Rate (DER) | PER: 11.1%, FRR: 4.6%, FAR: 30.5%, DER: 13.5% |
Shahin and Ahmed [5] | One-Class SVM, DNN Speech Attribute Detectors | WSJ0 and TIMIT standard datasets | False-Acceptance Rate, False-Rejection Rate | Lowered FAR and FRR by 26% and 39% compared to the GOP technique |
Arafa et al. [6] | Random Forest (RF) | 89 students’ Arabic phoneme utterances | Accuracy | 85.02% |
Shareef and Al-Irhayim [7] | LSTM and CNN-LSTM | Not specified | Classification Accuracy | LSTM: 93%, CNN-LSTM: 91% |
Letters | Word Initial | Word Medial | Word Final |
---|---|---|---|
/ʒ/ج | /ʒbl/ جبل | /nʒm/ نجم | /zwʒ/ زوج |
/ḥ/ح | /ḥbl/ حبل | /lḥm/ لحم | /mlḥ/ ملح |
/x/خ | /xjmh/ خيمة | /nxl/ نخل | /jdwx/ يدوخ |
/ð/ذ | /ðʡb/ ذئب | /ʡðhb/ اذهب | /mnð/ منذ |
/r/ر | /rml/ رمل | /brmjl/ برميل | /ʡmr/ أمر |
/s/س | /sjf/ سيف | /nsf/ نسف | /jlbs/ يلبس |
/Ṣ/ص | /Ṣjf/ صيف | /bṢl/ بصل | /lṢ/ لص |
/d/ض | /db/ ضب | /mdʡ/ مضى | /nbd/ نبض |
/ŧ/ط | /ŧjb/ طيب | /mnŧʡd/ منطاد | /hbwŧ/ هبوط |
/∂/ظ | /∂l/ ظل | /m∂lh/ مظلة | /mlfw∂/ ملفوظ |
/ɣ/غ | /ɣʡbh/ غابة | /bbɣʡʡ/ ببغاء | /blɣ/ بلغ |
/q/ق | /qlm/ قلم | /lqmh/ لقمة | /jtfq/ يتفق |
/k/ك | /khf/ كهف | /mkʡn/ مكان | /djk/ ديك |
Parameters | Learning Rate | Epochs | LSTM Layers | LSTM Units | Fully Connected Layers | Fully Connected Units | Dropout |
---|---|---|---|---|---|---|---|
Value |  | 190 | 2 | 179, 179 | 2 | 64, 1 | 0.1 |
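The following is a minimal Keras sketch of an architecture consistent with these settings; the framework, activation functions, optimizer, dropout placement, and input feature dimension are assumptions not stated in this extract.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

N_FEATURES = 27  # assumed per-frame feature dimension (e.g., ZCR + 13 MFCC + 13 delta-delta)

model = models.Sequential([
    layers.Input(shape=(None, N_FEATURES)),   # variable-length frame sequences
    # Two stacked LSTM layers with 179 units each, as listed in the table.
    layers.LSTM(179, return_sequences=True),
    layers.Dropout(0.1),
    layers.LSTM(179),
    layers.Dropout(0.1),
    # Two fully connected layers with 64 and 1 units; the sigmoid output gives the
    # binary correct/mispronounced decision (activation choices are assumptions).
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```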
Sample | Precision | Recall | F1 Score | Accuracy |
---|---|---|---|---|
Letter ر /r/ | 0.8255 | 0.7759 | 0.7964 | 0.7567 |
Letter ظ /∂/ | 0.8164 | 0.7880 | 0.8019 | 0.7717 |
Letter ض /d/ | 0.8161 | 0.7873 | 0.8014 | 0.7709 |
Letter ذ /ð/ | 0.8110 | 0.7834 | 0.7970 | 0.7633 |
Letter غ /ɣ/ | 0.8162 | 0.7808 | 0.7981 | 0.7652 |
Letter ك /k/ | 0.8165 | 0.7874 | 0.8017 | 0.7714 |
Letter ق /q/ | 0.8297 | 0.8090 | 0.8192 | 0.8114 |
Letter س/s/ | 0.8127 | 0.7877 | 0.8000 | 0.7864 |
Letter ص /Ṣ/ | 0.8188 | 0.8125 | 0.8156 | 0.8130 |
Letter خ/x/ | 0.8188 | 0.8089 | 0.8138 | 0.8098 |
Letter ج /ʒ/ | 0.8233 | 0.8226 | 0.8235 | 0.8225 |
Letter ط /ŧ/ | 0.8235 | 0.8097 | 0.8165 | 0.8109 |
Letter ح /ḥ/ | 0.8240 | 0.8222 | 0.8231 | 0.8222 |
Sample | Precision | Recall | F1 Score | Accuracy |
---|---|---|---|---|
Letter ر /r/ | 0.8369 | 0.7582 | 0.7756 | 0.7549 |
Letter ظ /∂/ | 0.8458 | 0.8050 | 0.8249 | 0.8027 |
Letter ض /d/ | 0.8459 | 0.8023 | 0.8235 | 0.8004 |
Letter ذ /ð/ | 0.8100 | 0.7679 | 0.7884 | 0.7876 |
Letter غ /ɣ/ | 0.8362 | 0.8245 | 0.8303 | 0.8258 |
Letter ك /k/ | 0.8255 | 0.8239 | 0.8247 | 0.8240 |
Letter ق /q/ | 0.8243 | 0.8172 | 0.8207 | 0.8176 |
Letter س/s/ | 0.8232 | 0.8189 | 0.8210 | 0.8181 |
Letter ص /Ṣ/ | 0.8568 | 0.8067 | 0.8310 | 0.8351 |
Letter خ/x/ | 0.8561 | 0.8193 | 0.8373 | 0.8460 |
Letter ج /ʒ/ | 0.8568 | 0.8574 | 0.8571 | 0.8539 |
Letter ط /ŧ/ | 0.8442 | 0.8404 | 0.8423 | 0.8410 |
Letter ح /ḥ/ | 0.8429 | 0.8431 | 0.8430 | 0.8422 |
Sample | Precision | Recall | F1 Score | Accuracy |
---|---|---|---|---|
Letter ر /r/ | 0.9061 | 0.8699 | 0.8876 | 0.8653 |
Letter ظ /∂/ | 0.9131 | 0.9157 | 0.9144 | 0.9102 |
Letter ض /d/ | 0.9137 | 0.9115 | 0.9126 | 0.9071 |
Letter ذ /ð/ | 0.9159 | 0.9140 | 0.9149 | 0.9112 |
Letter غ /ɣ/ | 0.9188 | 0.9141 | 0.9164 | 0.9137 |
Letter ك /k/ | 0.9200 | 0.9151 | 0.9175 | 0.9157 |
Letter ق /q/ | 0.9191 | 0.8978 | 0.9083 | 0.8996 |
Letter س/s/ | 0.9186 | 0.8841 | 0.9010 | 0.8873 |
Letter ص /Ṣ/ | 0.9196 | 0.9179 | 0.9187 | 0.9179 |
Letter خ/x/ | 0.9178 | 0.9137 | 0.9157 | 0.9125 |
Letter ج /ʒ/ | 0.9200 | 0.9182 | 0.9191 | 0.9184 |
Letter ط /ŧ/ | 0.9199 | 0.9056 | 0.9127 | 0.9075 |
Letter ح /ḥ/ | 0.9188 | 0.9178 | 0.9183 | 0.9171 |
Sample | Precision | Recall | F1 Score | Accuracy |
---|---|---|---|---|
Letter ر /r/ | 0.7113 | 0.6926 | 0.7018 | 0.7105 |
Letter ظ /∂/ | 0.7756 | 0.7704 | 0.7730 | 0.7733 |
Letter ض /d/ | 0.7756 | 0.7708 | 0.7732 | 0.7637 |
Letter ذ /ð/ | 0.7744 | 0.7607 | 0.7725 | 0.7727 |
Letter غ /ɣ/ | 0.7743 | 0.7669 | 0.7606 | 0.7778 |
Letter ك /k/ | 0.7772 | 0.7266 | 0.7469 | 0.7765 |
Letter ق /q/ | 0.7471 | 0.7386 | 0.7428 | 0.7394 |
Letter س/s/ | 0.8251 | 0.8110 | 0.8180 | 0.8104 |
Letter ص /Ṣ/ | 0.8280 | 0.8246 | 0.8263 | 0.8234 |
Letter خ/x/ | 0.8288 | 0.8189 | 0.8238 | 0.8148 |
Letter ج /ʒ/ | 0.7607 | 0.7578 | 0.7592 | 0.7582 |
Letter ط /ŧ/ | 0.8029 | 0.7975 | 0.7902 | 0.8180 |
Letter ح /ḥ/ | 0.7747 | 0.7723 | 0.7735 | 0.7726 |
Technique | Average Accuracy (%) |
---|---|
Creating a separate model for each gender | 81.52
Without creating a separate model for each gender | 83.77
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).