Improving Audio Classification Method by Combining Self-Supervision with Knowledge Distillation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper presents a novel framework that integrates self-supervision with knowledge distillation to enhance audio classification performance. The paper makes a novel contribution to audio signal processing and classification. Many thanks to the authors for presenting comprehensive results in the paper. However, the authors still need to improve the paper's quality. Here are some comments.
(a) The keywords are not alphabetically organized as per the Journal's requirements.
(b) The Introduction section is too long, as it includes both background information and related works. The authors are requested to break this section into two sections, namely (i) Introduction and (ii) Related Works.
(c) The authors have comprehensively compared their results with other related methods in Tables 10-12. However, the authors are requested to explain those works in detail in the Related Works section. The performance comparison is based on only one parameter (as mentioned in the tables). Please include more performance parameters in these tables to fairly judge the superiority of the authors' contributions.
(d) The authors can merge Sections 2 and 3, as these two sections are related. The authors are requested to provide more details on the model presented in Fig. 1, as this is the main contribution of this work.
(e) The authors are requested to improve the quality of Fig. 1, Fig. 2, and Fig. 6, as the labels are hardly visible.
(f) Equations (3), (6), (7), (9), and (10) are very difficult to comprehend. The authors shall improve the presentation of these equations as per the Journal's template.
(g) On page 10, second paragraph, the authors consider four scenarios, namely train, machine, ship, and pigeon. Are these scenarios arbitrarily chosen? No justification for considering these scenarios is present in the paper.
(h) This paper contains many results. The Conclusion section is too short compared to this extensive work. This section must be more comprehensive and correlate with the results.
Comments on the Quality of English Language
The quality of the English language is acceptable. The paper needs a couple of comprehensive editing passes.
Author Response
Comments 1: The keywords are not alphabetically organized as per the Journal's requirements.
Response 1: Thank you for pointing this out. We agree with this comment. In this revision, the keywords have been arranged in alphabetical order in accordance with the journal's requirements.
Comments 2: The Introduction section is too long, as it includes both background information and related works. The authors are requested to break this section into two sections, namely (i) Introduction and (ii) Related Works.
Response 2: Thank you for pointing this out. In response to the Introduction being too long, we have split it into two sections: (i) Introduction, which now covers the background, and (ii) Related Works.
Comments 3: The authors have comprehensively compared their results with other related methods in Tables 10-12. However, the authors are requested to explain those works in detail in the Related Works section. The performance comparison is based on only one parameter (as mentioned in the tables). Please include more performance parameters in these tables to fairly judge the superiority of the authors' contributions.
Response 3: Thank you for your comment. We agree with this viewpoint. Audio classification studies conventionally compare methods by accuracy, which reflects classification effectiveness and facilitates comparison with other methods. To make the comparison more comprehensive, we have added each model's parameter count and the dataset used for pre-training to the tables, and we elaborate on the contribution of our method as fully as possible in the method comparison.
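For reference, the following is a minimal sketch (assuming a generic PyTorch classifier and test loader, not the paper's actual evaluation script) of the two quantities now reported for each method, accuracy and parameter count:

```python
# Hedged sketch: the two comparison quantities, for a generic model/loader.
import torch

def count_parameters(model: torch.nn.Module) -> float:
    # Trainable parameter count, e.g. reported in millions.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def accuracy(model: torch.nn.Module, loader, device: str = "cpu") -> float:
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        preds = model(x.to(device)).argmax(dim=-1)
        correct += (preds == y.to(device)).sum().item()
        total += y.numel()
    return correct / total
```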
Comments 4: The authors can merge Sections 2 and 3, as these two sections are related. The authors are requested to provide more details on the model presented in Fig. 1, as this is the main contribution of this work.
Response 4: Thank you for pointing this out. As Sections 2 and 3 are closely related, we have merged them. For the method flow presented in Figure 1, we have added an appendix that provides a more detailed explanation.
Comments 5: The authors are requested to improve the quality of Fig. 1, Fig. 2, and Fig. 6, as the labels are hardly visible.
Response 5: Thank you for pointing this out. We agree with this comment. We have improved the quality of Figures 1, 2, and 6 to ensure that the labels are clearly visible.
Comments 6: Equations (3), (6), (7), (9), and (10) are very difficult to comprehend. The authors shall improve the presentation of these equations as per the Journal's template.
Response 6: Thank you for pointing this out. We agree with this comment. We have reformatted equations (3), (6), (7), (9), and (10) in accordance with the journal's template.
Comments 7: On page 10, second paragraph, the authors consider four scenarios, namely train, machine, ship, and pigeon. Are these scenarios arbitrarily chosen? No justification for considering these scenarios is present in the paper.
Response 7: Thank you for pointing this out. To visually compare the differences in HOG features across the two-dimensional spectrograms of different categories, we randomly selected four scenarios: train, machine, ship, and pigeon. As shown in Figure 7, the rows from top to bottom show the video images, the one-dimensional raw audio data, the two-dimensional audio spectrograms, and the HOG feature maps extracted from the spectrograms.
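To make the visualization concrete, here is a minimal sketch (not the authors' released code) of extracting a HOG feature map from a 2-D spectrogram, as in the bottom row of Figure 7. The file name, STFT settings, and HOG parameters are illustrative assumptions; the sketch relies on the librosa and scikit-image libraries.

```python
# Hedged sketch: HOG features of a log-magnitude spectrogram (Figure 7 style).
# "train_scene.wav" and all parameters below are hypothetical.
import numpy as np
import librosa
from skimage.feature import hog

y, sr = librosa.load("train_scene.wav", sr=16000)          # 1-D raw audio
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))  # 2-D spectrogram
spec_db = librosa.amplitude_to_db(spec, ref=np.max)

# Normalize to [0, 1] so HOG gradients are on a comparable scale per clip.
img = (spec_db - spec_db.min()) / (spec_db.max() - spec_db.min() + 1e-8)

features, hog_image = hog(
    img,
    orientations=9,
    pixels_per_cell=(8, 8),
    cells_per_block=(2, 2),
    visualize=True,  # hog_image corresponds to the HOG row of Figure 7
)
```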
Comments 8: This paper contains many results. The Conclusion section is too short compared to this extensive work. This section must be more comprehensive and correlate with the results.
Response 8: Thank you for pointing this out. We agree with this comment. We have expanded the Conclusion section so that it correlates more closely with the results presented in this article.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The progress in audio classification is vast, and many machine learning variants can be studied. The approach of combining self-supervision tasks based on mask characteristics is interesting, but the resulting improvements from this 'hard' labor remain somewhat limited. Then again, this can only be concluded by testing it.
In Figure 7, the visual scenes are not relevant and should be replaced by, e.g., the time signals of the waveforms. In addition, similar colors should be used in the four spectrograms to enhance reader understanding.
Detailed information about the models used is not given. This makes it very difficult to reproduce this work. An appendix should be provided where these details are explained (dimensions and details of the perceptron layers, learning performance, hardware and training times, ...). How are the iterations (iterative training of multi-SSL and knowledge distillation) conducted?
Line 486: Please explain better why you state that audio-visual multimodal joint modelling holds promise.
Comments on the Quality of English Language
The manuscript is clearly written, although the abstract clearly needs revision.
Author Response
Comments 1: The progress in audio classification is vast, and many machine learning variants can be studied. The approach of combining self-supervision tasks based on mask characteristics is interesting, but the resulting improvements from this 'hard' labor remain somewhat limited. Then again, this can only be concluded by testing it.
Response 1: Thank you for pointing this out. We agree with this comment. In future optimization, we will draw on current developments in self-supervision to pursue more efficient and concise approaches.
Comments 2: In Figure 7, the visual scenes are not relevant and should be replaced by, e.g., the time signals of the waveforms. In addition, similar colors should be used in the four spectrograms to enhance reader understanding.
Response 2: Thank you for pointing this out. We agree with this comment. We have added the waveform signals to Figure 7. We retained the image information because combining audio and visual elements presents the audio categories more concretely.
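As a usage illustration (a minimal sketch under assumed settings, not the exact figure script), one way to render a waveform panel above a spectrogram panel with a fixed colormap and dB range, so that all four categories share the same color scale:

```python
# Hedged sketch of one Figure 7 column: waveform over spectrogram.
# "ship_scene.wav", the colormap, and the dB limits are assumptions.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("ship_scene.wav", sr=16000)
S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

fig, (ax_wave, ax_spec) = plt.subplots(2, 1, figsize=(6, 4))
librosa.display.waveshow(y, sr=sr, ax=ax_wave)  # time-domain waveform
img = librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz",
                               cmap="magma", vmin=-80, vmax=0, ax=ax_spec)
fig.colorbar(img, ax=ax_spec, format="%+2.0f dB")  # one scale for all panels
plt.tight_layout()
plt.show()
```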
Comments 3: Detailed information about the models used is not given. This makes it very difficult to reproduce this work. An appendix should be provided where these details are explained (dimensions and details of the perceptron layers, learning performance, hardware and training times, ...). How are the iterations (iterative training of multi-SSL and knowledge distillation) conducted?
Response 3: Thank you for your comment. We agree with this viewpoint. To describe the algorithm in more detail, we have added an appendix that specifies the network architecture and its dimensions. We have also summarized the data preprocessing, hardware, and training times used in the experiments. In addition, to better demonstrate how multi-dimensional self-supervision and knowledge distillation interact across training iterations, we have added pseudocode describing them.
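To show the shape of those iterations, here is a minimal sketch of one training step combining the supervised, distillation, and masked-reconstruction objectives. It assumes a hypothetical student interface that returns logits plus per-dimension mask reconstructions; the loss weights alpha and beta and the temperature T are illustrative, not the paper's values.

```python
# Hedged sketch (not the paper's pseudocode): one iteration of joint
# multi-SSL + knowledge-distillation training in PyTorch.
import torch
import torch.nn.functional as F

def train_step(student, teacher, batch, optimizer, alpha=0.5, beta=0.3, T=4.0):
    x, labels = batch
    optimizer.zero_grad()

    with torch.no_grad():
        t_logits = teacher(x)  # frozen teacher provides soft targets

    # Hypothetical interface: the student also returns reconstructions of
    # masked spectrogram regions and the corresponding targets, one pair
    # per masking dimension (e.g. time, frequency).
    s_logits, ssl_outputs, ssl_targets = student(x, apply_masks=True)

    # Knowledge distillation on temperature-softened logits.
    kd_loss = F.kl_div(
        F.log_softmax(s_logits / T, dim=-1),
        F.softmax(t_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    ce_loss = F.cross_entropy(s_logits, labels)

    # Multi-dimensional self-supervision: one reconstruction term per mask.
    ssl_loss = sum(F.mse_loss(o, t) for o, t in zip(ssl_outputs, ssl_targets))

    loss = ce_loss + alpha * kd_loss + beta * ssl_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```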
Comments 4: Line 486: Please explain better why you state that audio-visual multimodal joint modelling holds promise.
Response 4: Thank you for pointing this out. Based on the weight-initialization experiment in Table 7 of this article, it can be inferred to some extent that pre-training on image-modality data can also promote classification performance. From a correlation perspective, the correlation between ImageNet and the current audio dataset is not strong, yet pre-training on it still improves effectiveness. If we additionally exploit the video content that accompanies the audio itself, we believe the results may be even better; this is also the direction we intend to continue exploring.
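To make the weight-initialization idea concrete, here is a minimal sketch of starting an audio classifier from ImageNet-pretrained weights; the ResNet-18 backbone and the 50-class head are assumptions for illustration, not the architecture used in the paper (which is described in the appendix):

```python
# Hedged sketch: initialize a spectrogram classifier from ImageNet weights.
# The backbone (ResNet-18) and class count (50) are hypothetical choices.
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)

# Spectrograms are single-channel: average conv1's RGB kernels into one.
w = model.conv1.weight.data.mean(dim=1, keepdim=True)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.conv1.weight.data.copy_(w)

# Swap the 1000-class ImageNet head for an audio-class head.
model.fc = nn.Linear(model.fc.in_features, 50)
```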
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
No further comments. The authors have implemented all the corrections that I suggested in the review report. No further actions are required.