Cross-Corpus Training Strategy for Speech Emotion Recognition Using Self-Supervised Representations
Round 1
Reviewer 1 Report
The paper deals with many aspects related to cross-corpus deep machine learning techniques for solving problems related to automatic human emotion recognition based on acoustic features of the speaker's speech. Many corpora (EmoDB, RAVDESS, CREMA-D and IEMOCAP) were chosen for the experiments. The topic is undoubtedly relevant. On the basis of comparative analyses, it can be argued that qualitative indicators still depend on the input data, which in turn depend on the individual characteristics of the person who reacts emotionally to some life process or situation. However, in my humble opinion, there are shortcomings that need to be corrected.
Shortcomings:
1) First of all, it is striking that the first part of the article suggests that human speech is one of the simplest and most effective means of communication. However, it is well known that speech cannot always be analysed and recognised effectively (even with noise reduction) when there is a lot of background noise. This problem can be partially solved by processing physiological signals or by video or audiovisual analysis. Indeed, RAVDESS, one of the corpora used by the authors, is itself audiovisual. This gap should therefore be addressed by adding a description of, e.g., the top results on the RAVDESS corpus for not only the audio modality but also video (https://paperswithcode.com/sota/facial-expression-recognition-on-ravdess) and audiovisual (https://paperswithcode.com/sota/facial-emotion-recognition-on-ravdess) modalities, with different evaluation metrics. In addition, such a description would expand the references to previous papers from the global research community (2021-22) that are consistently presented at audio-oriented conferences or conferences related to audio through multimodal data processing (CVPR, INTERSPEECH, ICASSP, and others), or published in first-quartile journals.
2) Why were these four corpora used and not others? It would be useful to mention the other available corpora before presenting the chosen ones and to explain why they were not suitable for this study.
3) The experiments section should be expanded to include more details. Were any data augmentation techniques used?
4) Were any learning rate schedules or variations (cosine annealing, etc.) used? If not, why not?
5) The style of the article needs minor revision for spelling and punctuation errors, but overall the article is easy to read.
It seems to me that all the suggested additions will only improve this article. The article still needs work and expansion.
Moderate editing of English language required.
Author Response
Hello,
Thanks for your feedback. The response to your review is in the attached PDF file.
Author Response File: Author Response.pdf
Reviewer 2 Report
The aim of the manuscript is to present new strategies for training algorithms for Speech Emotion Recognition. The authors have decided to apply a cross-corpus training strategy to existing datasets from the literature. The manuscript is well presented and clear to understand; minor changes are required:
- The introduction should cite more references to support your hypothesis;
- The Section "Previous Work" should be enhanced with some accuracy values that can be used as indicators for comparison with the results of your work; how do you quantitatively define positive and negative outcomes (rows 67-68)?
- Table 1 must be described in more detail, including measurement units where necessary.
- Section 5 "Experiments" could be tricky for the reader. I recommend dividing the section into three subsections that clearly describe the differences among the three experiments.
Minor typos are present in the manuscript.
Author Response
Hello,
Thanks for your feedback. The response to your review is in the attached PDF file.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors provided comprehensive answers to the questions and expanded the article. The content is valuable to the scientific community and therefore worthy of publication. Recommended!
Minor editing of English language required.