Research on Speech Emotion Recognition Method Based A-CapsNet
Abstract
1. Introduction
2. Our Database
3. Preprocessing and Feature Extraction
3.1. Preprocessing
3.2. Feature Extraction
4. The Proposed Novel Model A-CapsNet
4.1. Network Structure
4.2. Fundamentals
4.2.1. Forward Propagation
4.2.2. The Dynamic Routing Algorithm
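The routing procedure referred to in this subsection is presumably the routing-by-agreement algorithm introduced with the original capsule network. As a point of reference only, a minimal NumPy sketch of that standard procedure is given below; the tensor shapes, the number of routing iterations (three), and the helper name `squash` are illustrative assumptions rather than details taken from this paper.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    # Non-linear "squashing": keeps the vector orientation, maps its length into [0, 1).
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iterations=3):
    """Route prediction vectors u_hat [num_in, num_out, dim_out] to output capsules."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                              # routing logits, all zero at start
    for _ in range(num_iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)     # coupling coefficients (softmax over output capsules)
        s = (c[..., None] * u_hat).sum(axis=0)                   # weighted sum of predictions -> [num_out, dim_out]
        v = squash(s)                                            # output capsule vectors
        b = b + (u_hat * v[None, ...]).sum(axis=-1)              # agreement update (dot product of prediction and output)
    return v
```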
4.2.3. The Margin Loss of the Model A-CapsNet
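Likewise, a minimal sketch of the standard capsule-network margin loss is shown below for reference; the margins m+ = 0.9 and m− = 0.1 and the down-weighting factor λ = 0.5 are the commonly used defaults and are assumed here rather than quoted from this paper.

```python
import numpy as np

def margin_loss(v_len, t, m_plus=0.9, m_minus=0.1, lam=0.5):
    """v_len: [batch, num_classes] capsule lengths; t: one-hot emotion labels of the same shape."""
    present = t * np.maximum(0.0, m_plus - v_len) ** 2                 # penalty when the true-class capsule is too short
    absent = lam * (1 - t) * np.maximum(0.0, v_len - m_minus) ** 2     # penalty when a wrong-class capsule is too long
    return np.sum(present + absent, axis=-1).mean()                    # sum over classes, average over the batch
```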
5. Experimental Results of the Proposed Model A-CapsNet
5.1. Experimental Settings
5.2. Data Division Methods
5.3. Experimental Analysis
- The performance of the model A-CapsNet improves as the SNR increases. Under the EIRD data division method, the average recognition accuracy at SNRs of −10 dB and 10 dB is 57.68% and 99.26%, respectively, as illustrated in Figure 9. This is an increase of more than 40 percentage points, a substantial improvement indicating that the SNR has a strong influence on the performance of the model.
- On the same database, the performance of the proposed model varies with the data division method, but no single method consistently outperforms the others, as illustrated in Figure 10. For example, at an SNR of −10 dB the model A-CapsNet performs best under the data division method EIRD, whereas at an SNR of 10 dB it performs best under EDRD. As the SNR increases, the average accuracy improves regardless of the data division method, which is a desirable result. At an SNR of 10 dB, the four data division methods differ only marginally in performance.
- Figure 11 shows the results of the four data division methods over repeated validation runs on the Multi-SNR-5 EMODB database. The proposed model A-CapsNet achieves an accuracy above 88.06% in every case, a notable result, and it performs best under the data division method EIRD. At the same time, its performance fluctuates markedly across validation runs, suggesting that the data partitioning strategy can adversely affect performance. It is therefore important to assess the robustness and generalization of the model with several data partitioning methods. Since no consistent conclusion emerges as to which partitioning method yields the best performance, the model should be evaluated by integrating the results of all four methods; a sketch of the two splitting schemes involved is given after this list.
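To make the evaluation protocol above concrete, the sketch below contrasts the two families of splitting schemes implied by the "Folds" column of the results tables below: 5-fold cross-validation (as in EDCV/EICV) versus five repeated random train/test divisions (as in EDRD/EIRD). The 80/20 split ratio, the feature matrix `X` and label vector `y`, and the use of scikit-learn are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def evaluate(train_fn, score_fn, X, y):
    """Return per-run accuracies for 5-fold CV and for 5 repeated random divisions."""
    cv_scores, rd_scores = [], []

    # 5-fold cross-validation: every sample appears in the test set exactly once.
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
        model = train_fn(X[train_idx], y[train_idx])
        cv_scores.append(score_fn(model, X[test_idx], y[test_idx]))

    # 5 repeated random divisions: independent random train/test splits.
    for seed in range(5):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.2, random_state=seed, stratify=y)
        model = train_fn(X_tr, y_tr)
        rd_scores.append(score_fn(model, X_te, y_te))

    # Summarize each protocol as mean and dispersion over the five runs.
    return {name: (np.mean(s), np.std(s)) for name, s in
            [("5-fold CV", cv_scores), ("5x random division", rd_scores)]}
```

Reporting both protocols, as the paper does, separates the variability caused by the splitting scheme from the variability of the model itself.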
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
| Database | Emotion | SNR | Noises from NOISEX-92 | Total Samples |
|---|---|---|---|---|
| EMODB | Anger/A, boredom/B, fear/F, disgust/D, happiness/H, neutral/N, and sadness/S | −10 dB | babble, white, buccaneer1, buccaneer2, destroyerengine, destroyerops, f16, volvo, factory1, factory2, hfchannel, leopard, m109, pink, machinegun | 8025 |
| | | −5 dB | | 8025 |
| | | 0 dB | | 8025 |
| | | 5 dB | | 8025 |
| | | 10 dB | | 8025 |
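The noisy corpus summarized above is built by mixing clean EMODB utterances with NOISEX-92 noises at fixed signal-to-noise ratios. A minimal sketch of additive mixing at a target SNR is given below; the function name, the NumPy implementation, and the noise looping are illustrative assumptions, and the paper's exact mixing procedure may differ.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Additively mix a noise clip into a clean utterance at a target SNR (in dB)."""
    # Loop / trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    p_speech = np.mean(speech ** 2)   # average speech power
    p_noise = np.mean(noise ** 2)     # average noise power

    # Scale the noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```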
Accuracy (%) of A-CapsNet at SNR = −10 dB:

| Data Division | Folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Avg ± std |
|---|---|---|---|---|---|---|---|
| EDCV | K_folds = 5 | 56.76 | 56.95 | 56.95 | 56.64 | 57.13 | 56.89 ± 0.03 |
| EICV | K_folds = 5 | 57.26 | 58.69 | 58.13 | 56.88 | 56.01 | 57.39 ± 0.88 |
| EDRD | 5 times | 56.95 | 56.64 | 57.01 | 58.32 | 58.50 | 57.48 ± 0.59 |
| EIRD | 5 times | 58.07 | 57.82 | 55.58 | 59.25 | 57.69 | 57.68 ± 1.41 |

Accuracy (%) of A-CapsNet at SNR = −5 dB:

| Data Division | Folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Avg ± std |
|---|---|---|---|---|---|---|---|
| EDCV | K_folds = 5 | 74.14 | 73.02 | 67.66 | 71.84 | 71.84 | 71.70 ± 4.81 |
| EICV | K_folds = 5 | 73.33 | 70.84 | 70.59 | 71.59 | 69.35 | 71.14 ± 1.72 |
| EDRD | 5 times | 73.08 | 73.96 | 71.78 | 73.15 | 73.27 | 73.05 ± 0.50 |
| EIRD | 5 times | 72.15 | 72.83 | 72.40 | 73.64 | 72.06 | 72.62 ± 0.33 |

Accuracy (%) of A-CapsNet at SNR = 0 dB:

| Data Division | Folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Avg ± std |
|---|---|---|---|---|---|---|---|
| EDCV | K_folds = 5 | 86.54 | 89.72 | 88.22 | 89.60 | 87.35 | 88.29 ± 1.54 |
| EICV | K_folds = 5 | 87.60 | 87.73 | 87.29 | 90.03 | 85.86 | 87.70 ± 1.80 |
| EDRD | 5 times | 86.79 | 86.92 | 88.29 | 87.23 | 86.04 | 87.05 ± 0.53 |
| EIRD | 5 times | 87.91 | 87.98 | 88.29 | 88.10 | 88.10 | 88.08 ± 0.02 |

Accuracy (%) of A-CapsNet at SNR = 5 dB:

| Data Division | Folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Avg ± std |
|---|---|---|---|---|---|---|---|
| EDCV | K_folds = 5 | 96.07 | 96.76 | 96.32 | 96.51 | 96.20 | 96.37 ± 0.06 |
| EICV | K_folds = 5 | 97.32 | 97.01 | 97.26 | 95.83 | 97.57 | 97.00 ± 0.37 |
| EDRD | 5 times | 96.14 | 96.07 | 94.83 | 95.89 | 95.08 | 95.60 ± 0.29 |
| EIRD | 5 times | 96.14 | 95.76 | 96.88 | 96.45 | 95.95 | 96.24 ± 0.16 |

Accuracy (%) of A-CapsNet at SNR = 10 dB:

| Data Division | Folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Avg ± std |
|---|---|---|---|---|---|---|---|
| EDCV | K_folds = 5 | 99.44 | 98.88 | 99.07 | 99.31 | 99.44 | 99.23 ± 0.05 |
| EICV | K_folds = 5 | 98.69 | 99.38 | 99.31 | 98.82 | 99.50 | 99.14 ± 0.10 |
| EDRD | 5 times | 98.94 | 99.56 | 99.50 | 99.25 | 99.56 | 99.36 ± 0.06 |
| EIRD | 5 times | 98.75 | 99.38 | 99.44 | 99.31 | 99.44 | 99.26 ± 0.07 |

Accuracy (%) of A-CapsNet on the Multi-SNR-5 data:

| Data Division | Folds | Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5 | Avg ± std |
|---|---|---|---|---|---|---|---|
| EDCV | K_folds = 5 | 89.32 | 89.89 | 88.90 | 88.87 | 88.05 | 89.01 ± 0.36 |
| EICV | K_folds = 5 | 89.37 | 89.46 | 89.26 | 89.50 | 88.91 | 89.30 ± 0.04 |
| EDRD | 5 times | 90.12 | 89.43 | 89.63 | 89.05 | 89.01 | 89.45 ± 0.17 |
| EIRD | 5 times | 89.05 | 90.36 | 89.46 | 89.53 | 88.85 | 89.45 ± 0.28 |
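As a small worked check of how the Avg column above is obtained, the snippet below averages the five per-fold accuracies of one row (EDCV at SNR = −10 dB); the values are copied from the table and the choice of row is arbitrary.

```python
import numpy as np

# Per-fold accuracies (%) for EDCV at SNR = -10 dB, copied from the table above.
edcv_minus10 = np.array([56.76, 56.95, 56.95, 56.64, 57.13])

print(round(float(edcv_minus10.mean()), 2))   # 56.89, matching the reported average
```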
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).