Modeling Speech Emotion Recognition via Attention-Oriented Parallel CNN Encoders
Abstract
1. Introduction
- Improvement in terms of model complexity
- Low-level (paralinguistic) feature representation
- Improvement in terms of model generalization
- Handling of speech utterances of varying lengths
- A novel SER methodology that outperforms baseline models in terms of accuracy
2. Literature Overview
3. Proposed Model
3.1. CNN-Based Feature Encoders
3.1.1. Parallel CNN Encoder of MFCC
- Deep CNN (adding more layers)
- Greater stride or average pooling
- Applying dilated convolutions
- Performing depth-wise convolutions (a code sketch combining these options follows this list)
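As an illustration of how these receptive-field options can be combined, here is a minimal Keras-TensorFlow sketch (the toolkit listed in Section 4.2) of a parallel CNN encoder over MFCC features. The branch layout, filter counts, kernel sizes, and input shape are illustrative assumptions, not the authors' exact architecture.

```python
# Hypothetical parallel CNN encoder over MFCC features; all layer
# hyperparameters are illustrative assumptions, not the paper's values.
import tensorflow as tf
from tensorflow.keras import layers

def parallel_mfcc_encoder(n_frames=300, n_mfcc=40):
    inputs = tf.keras.Input(shape=(n_frames, n_mfcc))

    # Branch 1: a deeper stack of plain convolutions (more layers).
    b1 = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)
    b1 = layers.Conv1D(64, 3, padding="same", activation="relu")(b1)

    # Branch 2: dilated convolution, enlarging the receptive field
    # without extra layers or loss of time resolution.
    b2 = layers.Conv1D(64, 3, padding="same", dilation_rate=4,
                       activation="relu")(inputs)

    # Branch 3: depth-wise separable convolution with a larger stride,
    # upsampled afterwards so the branches can be concatenated.
    b3 = layers.SeparableConv1D(64, 3, strides=2, padding="same",
                                activation="relu")(inputs)
    b3 = layers.UpSampling1D(2)(b3)

    merged = layers.Concatenate()([b1, b2, b3])
    pooled = layers.GlobalAveragePooling1D()(merged)
    return tf.keras.Model(inputs, pooled, name="parallel_mfcc_encoder")

model = parallel_mfcc_encoder()
model.summary()  # three parallel branches, one pooled feature vector
```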
3.1.2. Paralinguistic Feature Encoder for Waveform (PFE)
3.1.3. Fully Convolutional Network Encoder of Spectrogram
3.2. Attention Mechanisms
3.2.1. Attention Mechanism “1”
3.2.2. Attention Mechanism “2”
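The outline does not reproduce the formulations of the two mechanisms, so purely as a point of reference, here is a generic additive attention-pooling layer of the kind commonly used in SER. It is a sketch only and may differ from either mechanism in the paper.

```python
# Generic additive attention pooling over time steps; a sketch only,
# not the paper's attention mechanisms "1" or "2".
import tensorflow as tf
from tensorflow.keras import layers

class AttentionPooling(layers.Layer):
    """Collapse (batch, time, features) to (batch, features)
    using learned per-frame weights."""

    def build(self, input_shape):
        d = int(input_shape[-1])
        self.w = self.add_weight(name="w", shape=(d, 1),
                                 initializer="glorot_uniform")

    def call(self, h):
        scores = tf.matmul(tf.tanh(h), self.w)   # (batch, time, 1)
        alpha = tf.nn.softmax(scores, axis=1)    # weights over time
        return tf.reduce_sum(alpha * h, axis=1)  # weighted sum

# Usage: pool a (batch, time, features) encoder output into one vector.
x = tf.random.normal((2, 300, 64))
print(AttentionPooling()(x).shape)  # (2, 64)
```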
4. Experimental Results and Discussion
4.1. Datasets
4.1.1. IEMOCAP: Interactive Emotional Dyadic Motion Capture Database
4.1.2. EMO-DB: Berlin Emotional Database
4.2. Software and Hardware Configuration
4.3. Model Performance and Comparisons
- BLSTM-FCN two-layer attention [38]: Attention-aware BLSTM-RNN and FCN networks that learn temporal and spatial representations, respectively, to predict emotions.
- Deep CNN 2D [30]: A deep CNN model that derives discriminative features using rectangular kernels of varying shapes and sizes, along with max pooling over rectangular neighborhoods.
- ATFNN [49]: An Attentive Time-Frequency Neural Network that learns discriminative time-frequency speech emotion features for SER.
- SFE [50]: A statistical feature extraction method designed to derive features that distinguish different emotions.
5. Conclusions and Future Scope
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Zhang, Y.; Du, J.; Wang, Z.; Zhang, J.; Tu, Y. Attention Based Fully Convolutional Network for Speech Emotion Recognition. In Proceedings of the 2018 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Honolulu, HI, USA, 12–15 November 2018; pp. 1771–1775.
2. Zhang, S.; Zhang, S.; Huang, T.; Gao, W. Speech Emotion Recognition Using Deep Convolutional Neural Network and Discriminant Temporal Pyramid Matching. IEEE Trans. Multimed. 2018, 20, 1576–1590.
3. Liu, Z.-T.; Wu, M.; Cao, W.-H.; Mao, J.-W.; Xu, J.-P.; Tan, G.-Z. Speech emotion recognition based on feature selection and extreme learning machine decision tree. Neurocomputing 2018, 273, 271–280.
4. Schuller, B.; Rigoll, G.; Lang, M. Hidden Markov Model based speech emotion recognition. In Proceedings of the International Conference on Multimedia & Expo, Baltimore, MD, USA, 6–9 July 2003.
5. Nwe, T.L.; Foo, S.W.; De Silva, L.C. Classification of stress in speech using linear and nonlinear features. In Proceedings of the 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '03), Hong Kong, China, 6–10 April 2003; Volume 2, p. II-9.
6. Koolagudi, S.G.; Murthy, Y.V.S.; Bhaskar, S.P. Choice of a classifier, based on properties of a dataset: Case study-speech emotion recognition. Int. J. Speech Technol. 2018, 21, 167–183.
7. Henríquez, P.; Alonso, J.B.; Ferrer, M.A.; Travieso, C.M.; Orozco-Arroyave, J.R. Nonlinear dynamics characterization of emotional speech. Neurocomputing 2014, 132, 126–135.
8. Milton, A.; Roy, S.S.; Selvi, S.T. SVM scheme for speech emotion recognition using MFCC feature. Int. J. Comput. Appl. 2013, 69, 34–39.
9. Wani, T.M.; Gunawan, T.S.; Qadri, S.A.A.; Kartiwi, M.; Ambikairajah, E. A Comprehensive Review of Speech Emotion Recognition Systems. IEEE Access 2021, 9, 47795–47814.
10. An, X.D.; Ruan, Z. Speech Emotion Recognition algorithm based on deep learning algorithm fusion of temporal and spatial features. J. Phys. Conf. Ser. 2021, 1861, 012064.
11. Zhang, Z.; Wu, B.; Schuller, B. Attention-augmented End-to-end Multi-task Learning for Emotion Prediction from Speech. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6705–6709.
12. Zou, H.; Si, Y.; Chen, C.; Rajan, D.; Chng, E.S. Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022.
13. Zhang, H.; Gou, R.; Shang, J.; Shen, F.; Wu, Y.; Dai, G. Pre-trained Deep Convolution Neural Network Model with Attention for Speech Emotion Recognition. Front. Physiol. 2021, 12, 643202.
14. Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204.
15. Khorram, S.; Aldeneh, Z.; Dimitriadis, D.; McInnis, M.; Provost, E.M. Capturing Long-term Temporal Dependencies with Convolutional Networks for Continuous Emotion Recognition. arXiv 2017, arXiv:1708.07050.
16. Cummins, N.; Amiriparian, S.; Hagerer, G.; Batliner, A.; Steidl, S.; Schuller, B.W. An Image-based Deep Spectrum Feature Representation for the Recognition of Emotional Speech. In Proceedings of the 25th ACM International Conference on Multimedia, Mountain View, CA, USA, 23–27 October 2017; pp. 478–484.
17. Lech, M.; Stolar, M.; Best, C.; Bolia, R. Real-Time Speech Emotion Recognition Using a Pre-trained Image Classification Network: Effects of Bandwidth Reduction and Companding. Front. Comput. Sci. 2020, 2, 14.
18. Zhao, J.; Mao, X.; Chen, L. Speech emotion recognition using deep 1D & 2D CNN LSTM networks. Biomed. Signal Process. Control 2019, 47, 312–323.
19. Li, J.; Zhang, X.; Huang, L.; Li, F.; Duan, S.; Sun, Y. Speech Emotion Recognition Using a Dual-Channel Complementary Spectrogram and the CNN-SSAE Neutral Network. Appl. Sci. 2022, 12, 9518.
20. Tripathi, S.; Kumar, A.; Ramesh, A.; Singh, C.; Yenigalla, P. Deep Learning based Emotion Recognition System Using Speech Features and Transcriptions. arXiv 2019, arXiv:1906.05681.
21. Atmaja, B.T.; Sasou, A. Sentiment Analysis and Emotion Recognition from Speech Using Universal Speech Representations. Sensors 2022, 22, 6369.
22. Kerkeni, L.; Serrestou, Y.; Raoof, K.; Mbarki, M.; Mahjoub, M.A.; Cleder, C. Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO. Speech Commun. 2019, 114, 22–35.
23. Gong, Y.; Chung, Y.-A.; Glass, J. AST: Audio Spectrogram Transformer. arXiv 2021, arXiv:2104.01778.
24. Guo, S.; Feng, L.; Feng, Z.B.; Li, Y.H.; Wang, Y.; Liu, S.L.; Qiao, H. Multi-view laplacian least squares for human emotion recognition. Neurocomputing 2019, 370, 78–87.
25. Kutlimuratov, A.; Abdusalomov, A.; Whangbo, T.K. Evolving Hierarchical and Tag Information via the Deeply Enhanced Weighted Non-Negative Matrix Factorization of Rating Predictions. Symmetry 2020, 12, 1930.
26. Fahad, M.S.; Deepak, A.; Pradhan, G.; Yadav, J. DNN-HMM-Based Speaker-Adaptive Emotion Recognition Using MFCC and Epoch-Based Features. Circuits Syst. Signal Process. 2021, 40, 466–489.
27. Mustaqeem; Kwon, S. CLSTM: Deep Feature-Based Speech Emotion Recognition Using the Hierarchical ConvLSTM Network. Mathematics 2020, 8, 2133.
28. Vryzas, N.; Vrysis, L.; Matsiola, M.; Kotsakis, R.; Dimoulas, C.; Kalliris, G. Continuous Speech Emotion Recognition with Convolutional Neural Networks. J. Audio Eng. Soc. 2020, 68, 14–24.
29. Shrestha, L.; Dubey, S.; Olimov, F.; Rafique, M.A.; Jeon, M. 3D Convolutional with Attention for Action Recognition. arXiv 2022, arXiv:2206.02203.
30. Badshah, A.M.; Rahim, N.; Ullah, N.; Ahmad, J.; Muhammad, K.; Lee, M.Y.; Kwon, S.; Baik, S.W. Deep features-based speech emotion recognition for smart affective services. Multimed. Tools Appl. 2019, 78, 5571–5589.
31. Zhu, L.; Chen, L.; Zhao, D.; Zhou, J.; Zhang, W. Emotion Recognition from Chinese Speech for Smart Affective Services Using a Combination of SVM and DBN. Sensors 2017, 17, 1694.
32. Liu, B.; Qin, H.; Gong, Y.; Ge, W.; Xia, M.; Shi, L. EERA-ASR: An Energy-Efficient Reconfigurable Architecture for Automatic Speech Recognition with Hybrid DNN and Approximate Computing. IEEE Access 2018, 6, 52227–52237.
33. Mustaqeem; Kwon, S. Att-Net: Enhanced emotion recognition system using lightweight self-attention module. Appl. Soft Comput. 2021, 102, 107101.
34. Alex, S.B.; Mary, L.; Babu, B.P. Attention and Feature Selection for Automatic Speech Emotion Recognition Using Utterance and Syllable-Level Prosodic Features. Circuits Syst. Signal Process. 2020, 39, 5681–5709.
35. Chorowski, J.K.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems 28 (NIPS 2015), Montreal, QC, Canada, 7–12 December 2015; pp. 577–585.
36. Abdusalomov, A.; Baratov, N.; Kutlimuratov, A.; Whangbo, T.K. An Improvement of the Fire Detection and Classification Method Using YOLOv3 for Surveillance Systems. Sensors 2021, 21, 6519.
37. Li, P.; Song, Y.; McLoughlin, I.; Guo, W.; Dai, L. An attention pooling based representation learning method for speech emotion recognition. In Proceedings of Interspeech 2018, Hyderabad, India, 2–6 September 2018; pp. 3087–3091.
38. Zhao, Z.; Bao, Z.; Zhao, Y.; Zhang, Z.; Cummins, N.; Ren, Z.; Schuller, B. Exploring Deep Spectrum Representations via Attention-Based Recurrent and Convolutional Neural Networks for Speech Emotion Recognition. IEEE Access 2019, 7, 97515–97525.
39. Araújo, A.F.; Norris, W.; Sim, J. Computing Receptive Fields of Convolutional Neural Networks. Distill 2019, 4, e21.
40. Wang, C.; Sun, H.; Zhao, R.; Cao, X. Research on Bearing Fault Diagnosis Method Based on an Adaptive Anti-Noise Network under Long Time Series. Sensors 2020, 20, 7031.
41. Hsu, S.-M.; Chen, S.-H.; Huang, T.-R. Personal Resilience Can Be Well Estimated from Heart Rate Variability and Paralinguistic Features during Human–Robot Conversations. Sensors 2021, 21, 5844.
42. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231.
43. Aggarwal, A.; Srivastava, A.; Agarwal, A.; Chahal, N.; Singh, D.; Alnuaim, A.A.; Alhadlaq, A.; Lee, H.-N. Two-Way Feature Extraction for Speech Emotion Recognition Using Deep Learning. Sensors 2022, 22, 2378.
44. Mocanu, B.; Tapu, R.; Zaharia, T. Utterance Level Feature Aggregation with Deep Metric Learning for Speech Emotion Recognition. Sensors 2021, 21, 4233.
45. Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of Interspeech 2017, Stockholm, Sweden, 20–24 August 2017; pp. 1089–1093.
46. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25 (NIPS 2012), Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.
47. Busso, C.; Bulut, M.; Lee, C.-C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359.
48. Burkhardt, F.; Paeschke, A.; Rolfes, A.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, 4–8 September 2005.
49. Lu, C.; Zheng, W.; Lian, H.; Zong, Y.; Tang, C.; Li, S.; Zhao, Y. Speech Emotion Recognition via an Attentive Time-Frequency Neural Network. arXiv 2022, arXiv:2210.12430.
50. Abdulmohsin, H.A.; Wahab, H.B.A.; Hossen, A.M.J.A. A new proposed statistical feature extraction method in speech emotion recognition. Comput. Electr. Eng. 2021, 93, 107172.
51. Ilyosov, A.; Kutlimuratov, A.; Whangbo, T.-K. Deep-Sequence-Aware Candidate Generation for e-Learning System. Processes 2021, 9, 1454.
52. Kutlimuratov, A.; Abdusalomov, A.B.; Oteniyazov, R.; Mirzakhalilov, S.; Whangbo, T.K. Modeling and applying implicit dormant features for recommendation via clustering and deep factorization. Sensors 2022, 22, 8224.
53. Abdusalomov, A.B.; Mukhiddinov, M.; Kutlimuratov, A.; Whangbo, T.K. Improved Real-Time Fire Warning System Based on Advanced Technologies for Visually Impaired People. Sensors 2022, 22, 7305.
| Category | Component | Specification |
|---|---|---|
| Software | Programming tools | Python, Pandas, Keras-TensorFlow |
| Software | OS | Windows 10 |
| Hardware | CPU | AMD Ryzen Threadripper 1900X, 8 cores, 3.80 GHz |
| Hardware | GPU | Titan Xp, 32 GB |
| Hardware | RAM | 128 GB |
| Models | EMO-DB WA (%) | EMO-DB UA (%) | IEMOCAP WA (%) | IEMOCAP UA (%) |
|---|---|---|---|---|
| BLSTM-FCN two-layer attention | - | - | 68.1 | 67 |
| Deep CNN 2D | 69.2 | 67.9 | 71.2 | 70.6 |
| SFE | 71.1 | 68.4 | - | - |
| ATFNN | - | - | 72.66 | 64.48 |
| Proposed model | 71.8 | 70.9 | 72.4 | 71.1 |
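The table above reports weighted accuracy (WA, the overall fraction of correctly classified utterances) and unweighted accuracy (UA, the mean of per-class recalls, which is insensitive to class imbalance). For reference, here is a small NumPy sketch of these standard definitions with made-up labels; it is not the paper's evaluation code.

```python
# Standard WA/UA definitions used in SER evaluation; labels are made up.
import numpy as np

def wa_ua(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    wa = np.mean(y_true == y_pred)                 # overall accuracy
    recalls = [np.mean(y_pred[y_true == c] == c)   # per-class recall
               for c in np.unique(y_true)]
    return wa, float(np.mean(recalls))

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(wa_ua(y_true, y_pred))  # (0.833..., 0.888...)
```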
| Models | EMO-DB Average Accuracy (%) | IEMOCAP Average Accuracy (%) |
|---|---|---|
| BLSTM-FCN two-layer attention | 86.54 | - |
| Deep CNN 2D | 82.20 | 78.61 |
| SFE | 80.32 | - |
| ATFNN | 87.5 | 89.1 |
| Proposed model | 89.76 | 91.18 |