LIME-Based Data Selection Method for SAR Images Generation Using GAN
Round 1
Reviewer 1 Report
This paper investigates the use of GAN (generative adversarial networks) to generate synthetic SAR images. It discusses the Clever Hans phenomenon, where high accuracy can be achieved in deep neural network classification based on unreliable/unstable features. The paper was interesting to read and was generally well-motivated.
There was one statement on lines 206-210 (page 7) where it is stated that three classes of images were discarded to avoid the problem of unbalanced sample distribution in the training of the GAN. Definitely more explanation is needed for this statement and it raises the question of whether this discarding of these classes is justified and whether there is an automatic method to determine whether certain classes should be discarded in the proposed training method.
There were also several minor English errors listed below:
- Line 10 - "SAR image owns the most representative" should read "SAR image possessed the most representative"
- Line 34 - "This limitationeen" should read "This limitation can be"
- Line 137 - "according the spurious" should read "according to the spurious"
If the above issues can be addressed then the manuscript can be considered for publication.
Author Response
Dear editor and reviewer,
First of all, we appreciate the opportunity to revise this paper. Thank you for your valuable suggestions, which helped us enhance the quality of this manuscript. Sentences/phrases that have been modified or added are marked in blue in the revised paper according to your comments. In the following, we provide a point-by-point response to the reviewers’ comments.
Q1. There was one statement on lines 206-210 (page 7) where it is stated that three classes of images were discarded to avoid the problem of unbalanced sample distribution in the training of the GAN. Definitely more explanation is needed for this statement and it raises the question of whether this discarding of these classes is justified and whether there is an automatic method to determine whether certain classes should be discarded in the proposed training method.
A1. We gratefully appreciate your valuable questions. We respond to your questions as follows:
In comparison to optical images, SAR images have small extra-class differences and low resolution. Therefore, it is difficult to perform more detailed processing. It probably would be better to use the cases in which the CNN showed “Clever Hans” decisions to guide the GAN’s training instead of discarding them entirely. However, this study aims to improve the quality of GANs by alleviating the "Clever Hans" phenomenon in DNNs instead of proposing a complicated and detailed data selection method. Besides, creating a model that automatically uses the visualization results to guide the training of networks will be our future work.
We also declare this limitation in Section 5 Discussion as follows:
This study aims to improve the quality of GANs by alleviating the "Clever Hans" phenomenon in DNNs instead of proposing a complicated and detailed data selection method. Therefore, the samples in which the CNN implements "Clever Hans"-type decision strategy are discarded entirely. However, it would probably be better to use these samples to guide GAN’s training. So creating a model that automatically uses the visualization results to guide the training of networks will be our future work.
Q2. Line 10- "SAR image owns the most representative" should read "SAR image possessed the most representative"
A2. "owns" has been corrected to "possessed".
Q3. Line 34 - "This limitationeen" should read "This limitation can be"
A3. This sentence has been rewritten as "This limitation can be …".
Q4. Line 137 - "according the spurious" should read "according to the spurious"
A4. "according" has been corrected to "according to".
We proofread the manuscript carefully again and corrected some other mistakes labeled with blue color.
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors present an approach to improve artificial SAR data generation using Local Interpretable Model-agnostic Explanation (LIME) for improved data selection for the training of Generative Adversarial Networks (GAN). The topic is relevant and of increasing interest to the remote sensing community since it addresses the common problem of a lack of large labeled datasets by supporting methods of training data generation.
The manuscript is concise, well-written and well-structured. It is very much appreciated that the authors submitted a manuscript in an already quite mature state. The introduction provides a good background for the rest of the paper and introduces the reader to other studies in the field. Although structure and reasoning are well presented, the results section could be expanded and a discussion section needs to be added.
Below, I list my main comments followed by minor points that mostly refer to aspects of style or language.
Parts taken from the manuscript are italicized.
- A general question: why do you completely discard cases in which the CNN showed “Clever Hans” decisions? Would it not be better to create a model that does not produce these cases at all (and support a GAN that can handle these as well)? Just because one classifier model showed this kind of behavior does not mean that any type of (CNN) classifier will draw these “wrong” conclusions from the data. Would it not make more sense to use these apparently challenging or “fooling” examples specifically to improve the model performance or specifically guide the training rather than removing them from the data entirely?
- The results section could be expanded and should go into more detail. It should also be more clearly separated from the methods descriptions. Apart from the visual comparison of LIME images, do you maybe see any way to provide quantitative metrics or statistical indicators in the analysis (e.g. multiple metrics comparing the resulting GANs)?
- l. 224: the aim of the study to explore LIME rather than develop a better GAN should actually be clarified at the very beginning of the manuscript (i.e. in the introduction). It should not be necessary to clarify this in the results section.
- A dedicated discussion section is missing entirely, even though there are a lot of aspects to discuss in some detail here, e.g. how LIME performs (or may perform) in comparison to CAM-based approaches, what are the limitations or failure modes of the approach, what other strategies except for data selection could be employed based on LIME (e.g. active learning) etc.
Language and style
- l. 30: “Worse, even if we collect…”
- l. 35: there is a typo in this line.
- l. 58: “…Hans was a horse capable…”
- l. 59: please avoid contractions like “can’t”.
- l. 63: “…that match people’s cognition,…”
- l. 70: “Multiple methods were proposed…”
- l. 79: “…of the input, thereby, determining the…”
- l. 90: do you mean alleviate instead of “ally” here?
- l. 107: “…has difficulty capturing the local….”
- l. 137: “…SAR images according to the…”
- l. 144: “A CNN is trained on the MSTAR dataset. All the SAR images are converted to…”
- l. 204: “…SAR images in which the positive…”
Miscellaneous
- ll. 55ff: I think it would be better to merge these two sentences. The subsequent explanation of the “Clever Hans” phenomenon could be improved as well.
- There are many cases of missing spaces before brackets and parentheses throughout the manuscript, e.g. ll. 71, 72, 75, 80, 82.
- l. 78: please remove the superfluous period here.
- p. 4, first line after equation (3): it is not clear what is supposed to be conveyed with the statement “…ensure the explanation contains prefer fewer features.”
- l. 137: what do you mean by “no practical significance to generate SAR images”?
- l. 159: The definition of MSTAR (full name) should be mentioned on the first occurrence of the abbreviation.
Author Response
Dear editor and reviewer,
First of all, we are grateful for the opportunity to revise this paper. Thank you for your valuable suggestions, which helped us enhance the quality of this manuscript. Sentences/phrases that have been modified or added are marked in blue in the revised paper according to your comments. In the following, we provide a point-by-point response to the reviewers’ comments.
Q1. A general question: why do you completely discard cases in which the CNN showed “Clever Hans” decisions? Would it not be better to create a model that does not produce these cases at all (and support a GAN that can handle these as well)? Just because one classifier model showed this kind of behavior does not mean that any type of (CNN) classifier will draw these “wrong” conclusions from the data. Would it not make more sense to use these apparently challenging or “fooling” examples specifically to improve the model performance or specifically guide the training rather than removing them from the data entirely?
A1. We gratefully appreciate your valuable questions. We respond to your questions as follows:
In comparison to optical images, SAR images have small extra-class differences and low resolution. Therefore, it is difficult to perform more detailed processing. It probably would be better to use the cases in which the CNN showed “Clever Hans” decisions to guide the GAN’s training instead of discarding them entirely. However, this study aims to improve the quality of GANs by alleviating the "Clever Hans" phenomenon in DNNs instead of proposing a complicated and detailed data selection method. We also declare this limitation in Section 5. Discussion as follows:
This study aims to improve the quality of GANs by alleviating the "Clever Hans" phenomenon in DNNs instead of proposing a complicated and detailed data selection method. Therefore, the samples in which the CNN implements "Clever Hans"-type decision strategy are discarded entirely. However, it would probably be better to use these samples to guide GAN’s training. So creating a model that automatically uses the visualization results to guide the training of networks will be our future work.
Q2. The results section could be expanded and should go into more detail. It should also be more clearly separated from the methods descriptions. Apart from the visual comparison of LIME images, do you maybe see any way to provide quantitative metrics or statistical indicators in the analysis (e.g. multiple metrics comparing the resulting GANs)?
A2. We quite agree with the reviewer that more quantitative metrics or statistical indicators are necessary and we lack clear explanation. Thus, we provided more metrics to measure our method in Section 4.3. In this section, we provided the analysis of GANs from the aspects of visual quality, independence, authenticity and diversity to prove the effectiveness of our method.
We first provide visual quality comparison results to qualitatively analyze the effectiveness of our method. The result has been added as Figure 7 in the revised manuscript. We have added Section 4.3.1 as follows:
In this part, the generated images are subjectively evaluated through human vision. The comparison of the visual quality of two types of generated images and real images is shown in Figure 7. Compared with the images generated by C-DCGAN trained on the MSTAR dataset, the images generated by C-DCGAN trained on the MSTAR-LIME dataset have clearer contour features, more obvious scattering features, and finer texture. This indicates that the process of data selection can improve the quality of generated images.
Then we provide the results of MI value, GAN-test and GAN-train to quantitatively analyze the effectiveness of our method. The results of MI value and GAN-train have been added as Table 2 and Table 4. We have added Section 4.3.2 and Section 4.3.4 as follows:
In this experiment, the mutual information (MI) value is employed to measure the independence between the generated images and the corresponding real images. The MI value is a measure of the degree of interdependence between random variables. The final result is the average of the MI values of 100 images. The comparison results of the two types of generated images and the real images are shown in Table 2. The smaller the interdependence between two images, the smaller their MI value, i.e. the more independent the two images are. It is evident from Table 2 that the MI values of the images generated by the C-DCGAN trained on the MSTAR-LIME dataset are significantly smaller than the MI values of the images generated by the C-DCGAN trained on the MSTAR dataset. In other words, the images generated by our method are more independent of the real images than the images generated by C-DCGAN without data selection. Therefore, it is more reasonable to apply the images generated by our method to the SAR target recognition task.
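The MI computation described above can be illustrated with a minimal sketch. It assumes equally sized grayscale images and estimates MI from a joint intensity histogram; the bin count, the pairing of generated and real images, and the function names are illustrative assumptions, not details taken from the manuscript.

```python
import numpy as np

def mutual_information(img_a, img_b, bins=32):
    """Estimate MI (in nats) between two equally sized grayscale images.

    Histogram-based estimate; the bin count is an assumed parameter.
    """
    a = np.asarray(img_a, dtype=np.float64).ravel()
    b = np.asarray(img_b, dtype=np.float64).ravel()
    # Joint distribution of pixel intensities at the same positions.
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab = joint / joint.sum()
    # Marginals as column and row vectors so that p_a @ p_b is the outer product.
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    nz = p_ab > 0  # skip empty cells to avoid log(0)
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))

def mean_mi(generated, real, bins=32):
    """Average MI over paired (generated, real) images."""
    return float(np.mean([mutual_information(g, r, bins)
                          for g, r in zip(generated, real)]))
```

Under this estimate, an image compared with itself gives a high MI value, while two unrelated images give a value close to zero, which matches the reading that a smaller MI indicates greater independence.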
In this experiment, the GAN-train score is employed to evaluate the diversity of the generated images. The GAN-train score is calculated by using the generated samples to train the classifier, and then testing the classifier on the real samples. The higher the GAN-train score, the more diverse the generated images. The CNN presented in Figure 4 is utilized to conduct the image classification task for the GAN-train measure. The result of GAN-train is shown in Table 4. According to Table 4, the GAN-train performance of our proposed method is better than that of the generation method without data selection. Therefore, our method can generate more diverse images than the generation method without data selection.
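The GAN-train protocol (train on generated samples, test on real samples) can be sketched as follows. To keep the example self-contained, a nearest-centroid classifier stands in for the CNN of Figure 4; that substitution and all function names are assumptions made purely for illustration.

```python
import numpy as np

def fit_centroids(images, labels):
    """Fit a nearest-centroid classifier (stand-in for the CNN)."""
    X = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
    y = np.asarray(labels)
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def predict(classes, centroids, images):
    """Assign each image to the class of its closest centroid."""
    X = np.asarray(images, dtype=np.float64).reshape(len(images), -1)
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

def gan_train_score(gen_images, gen_labels, real_images, real_labels):
    """Accuracy on real images of a classifier trained only on generated ones."""
    classes, centroids = fit_centroids(gen_images, gen_labels)
    pred = predict(classes, centroids, real_images)
    return float(np.mean(pred == np.asarray(real_labels)))
```

The score is simply classification accuracy on the real set, so it is high only when the generated images cover the class-relevant variation present in the real data.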
We also discuss the experiment results from qualitative and quantitative analysis in Section 5. Discussion, as follows:
In our study, we provide qualitative and quantitative analysis to prove the effectiveness of the proposed method. In qualitative analysis, firstly, the comparison of visualization results of LIME and CAM methods demonstrates that LIME can reflect the contributions to the decision comprehensively and clearly. Secondly, we provide an intuitive comparison of generated images to subjectively prove the improvement of the quality of generated images. In quantitative analysis, MI value, GAN-test score and GAN-train score are calculated to prove the improvement of independence, authenticity and diversity of images generated by our method.
Q3. l. 224: the aim of the study to explore LIME rather than develop a better GAN should actually be clarified at the very beginning of the manuscript (i.e. in the introduction). It should not be necessary to clarify this in the results section.
A3. We quite agree with the reviewer that we should clarify the aim of our study at the very beginning of the manuscript. Thus, we clarify in the Introduction (l.101) that we aim to explore the effectiveness of LIME method in alleviating the "Clever Hans" phenomenon in DNNs rather than obtain a better GAN.
Note that, our aim is to explore the effectiveness of LIME method in alleviating the "Clever Hans" phenomenon in DNNs, but not to obtain a GAN with superior performance by optimizing the architecture and the loss function of the GAN. Despite this, the GAN employed in this study still generates images of high quality.
Q4. A dedicated discussion section is missing entirely, even though there are a lot of aspects to discuss in some detail here, e.g. how LIME performs (or may perform) in comparison to CAM-based approaches, what are the limitations or failure modes of the approach, what other strategies except for data selection could be employed based on LIME (e.g. active learning) etc.
A4. We quite agree with the reviewer that we should add a discussion section. We have added Section 5 Discussion. In this section, we firstly discuss the experiment results from qualitative and quantitative analysis. Then, we discuss the future research directions. Section 5 Discussion is as follows:
In our study, we provide qualitative and quantitative analysis to prove the effectiveness of the proposed method. In qualitative analysis, firstly, the comparison of visualization results of LIME and CAM methods demonstrates that LIME can reflect the contributions to the decision comprehensively and clearly. Secondly, we provide an intuitive comparison of generated images to subjectively prove the improvement of the quality of generated images. In quantitative analysis, MI value, GAN-test score and GAN-train score are calculated to prove the improvement of independence, authenticity and diversity of images generated by our method.
This study aims to improve the quality of GANs by alleviating the "Clever Hans" phenomenon in DNNs instead of proposing a complicated and detailed data selection method. Therefore, the samples in which the CNN implements "Clever Hans"-type decision strategy are discarded entirely. However, it would probably be better to use these samples to guide GAN’s training. So creating a model that automatically uses the visualization results to guide the training of networks will be our future work.
We gratefully appreciate your valuable questions. We provide a point-to-point response to your questions as follows:
Firstly, LIME aims at detecting positive and negative contribution pixels, while CAM methods aim at providing a highlighted region that precisely matches the target in original image. In addition, SAR images are quite different from optical images. SAR images have low resolution and usually contain a number of interference spots. In this case, the highlighted area in the heatmaps generated by CAM methods designed for optical images excessively covers the target or even deviates from the target. In contrast, LIME method can delineate a region that precisely covers the target. Therefore, we can clearly observe the "Clever Hans" phenomenon in the classification task by using LIME method. This is the reason why we select the dataset according to the LIME results.
Secondly, the limitation of this method is mainly the manual process of data selection. It is difficult to perform more detailed processing due to small extra-class differences and low resolution of SAR images. It probably would be better to use the cases in which the CNN showed “Clever Hans” decisions to guide the GAN’s training instead of discarding them entirely. Creating a model that automatically uses the visualization results to guide the training of networks will be our future work.
Thirdly, in our study, we alleviate the "Clever Hans" phenomenon in DNNs by discarding the cases in which the CNN showed “Clever Hans” decisions. Besides, using the visualization results to directly guide the training of networks instead of removing them from the dataset will be our future research direction. We also declare it in Section 5 as follows:
However, it would probably be better to use these samples to guide GAN’s training. So creating a model that automatically uses the visualization results to guide the training of networks will be our future work.
Q5. l. 30: “Worse, even if we collect…”
A5. "even" has been corrected to "even if".
Q6. l. 35: there is a typo in this line.
A6. "limitationeen" has been corrected to "limitation can be".
Q7. l. 58: “…Hans was a horse capable…”
A7. "is" has been corrected to "was".
Q8. l. 59: please avoid contractions like “can’t”.
A8. "can’t" has been rewritten to "can not".
Q9. l. 63: “…that match people’s cognition,…”
A9. "matches" has been rewritten to "match".
Q10. l. 70: “Multiple methods were proposed…”
A10. "lots of" has been corrected to "Multiple".
Q11. l. 79: “…of the input, thereby, determining the…”
A11. "determines" has been corrected to "determining".
Q12. l. 90: do you mean alleviate instead of “ally” here?
A12. "ally" has been corrected to "alleviate".
Q13. l. 107: “…has difficulty capturing the local….”
A13. "is difficult to capture" has been corrected to "has difficulty capturing".
Q14. l. 137: “…SAR images according to the…”
A14. "according" has been corrected to "according to".
Q15. l. 144: “A CNN is trained on the MSTAR dataset. All the SAR images are converted to…”
A15. "with" has been corrected to "on"; "convered" has been corrected to "converted".
Q16. l. 204: “…SAR images in which the positive…”
A16. "that" has been corrected to "in which".
Q17. ll. 55ff: I think it would be better to merge these two sentences. The subsequent explanation of the “Clever Hans” phenomenon could be improved as well.
A17. Firstly, we quite agree with the reviewer that merging "Although there are numerous novel GANs that can improve the quality of generated SAR images, Clever Hans, a serious problem on networks' interpretation, is seldom considered. " would be better. We have merged these two sentences as follows:
Although there are numerous novel GANs that can improve the quality of generated SAR images, Clever Hans, a serious problem on networks' interpretation, is seldom considered.
Secondly, we have rewritten the explanation of the "Clever Hans" phenomenon as follows:
Hans was a horse that conquered the audience with the extraordinary ability of mathematical calculations. In fact, Hans did not carry out a real reasoning process but just relied on unconscious cues from the trainers and the observers, such as facial expressions, postures, etc., to give the correct answer.
Q18. There are many cases of missing spaces before brackets and parentheses throughout the manuscript, e.g. ll. 71, 72, 75, 80, 82.
A18. The spaces before brackets and parentheses have been added throughout the manuscript.
Q19. l. 78: please remove the superfluous period here.
A19. The superfluous period has been removed.
Q20. p. 4, first line after equation (3): it is not clear what is supposed to be conveyed with the statement “…ensure the explanation contains prefer fewer features.”
A20. As not every explanation model g ∈ G may be simple enough to be interpretable, we let Ω(g) be a measure of complexity (as opposed to interpretability) of the explanation g. For example, for decision trees Ω(g) may be the depth of the tree, while for linear models Ω(g) may be the number of non-zero weights. To make the description clearer, we have rewritten the sentence as:
The model complexity is kept low to ensure the interpretability of g, e.g. a shallow depth for decision trees and a small number of non-zero weights for linear models.
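For a linear explanation model, the complexity measure described above (written Ω(g) in the LIME formulation) can be illustrated with a small sketch. It assumes the common choice in which an explanation is acceptable only if it uses at most K non-zero weights; the value of K, the tolerance, and the function names are illustrative assumptions.

```python
import numpy as np

def count_nonzero_weights(weights, tol=1e-12):
    """Number of features the linear explanation actually uses."""
    return int(np.sum(np.abs(np.asarray(weights, dtype=np.float64)) > tol))

def omega(weights, max_features):
    """Complexity penalty: 0 within the feature budget, infinite beyond it."""
    if count_nonzero_weights(weights) <= max_features:
        return 0.0
    return float("inf")
```

With a budget of two features, `omega([0.5, 0.0, -0.2], 2)` is 0.0, whereas an explanation with three non-zero weights receives an infinite penalty and is therefore excluded.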
Q21. l. 137: what do you mean by “no practical significance to generate SAR images”?
A21. Practical significance means the potential application of the generated images to target recognition. It can be reflected from many quantitative metrics, such as authenticity, diversity, etc. To make the description clearer, we have rewritten the sentence as:
The SAR images generated by GANs according to the spurious connection between the real SAR images and the corresponding labels can not be applied to SAR target recognition due to lack of authenticity and reliability.
Q22. l. 159: The definition of MSTAR (full name) should be mentioned on the first occurrence of the abbreviation.
A22. The full name of MSTAR has been mentioned on the first occurrence (l.100) of the abbreviation.
We proofread the manuscript carefully again and corrected some other mistakes labeled with blue color.
Author Response File: Author Response.pdf
Round 2
Reviewer 2 Report
The revised version has been improved over the first manuscript version. The authors have addressed most of the comments, however, the discussion section needs more work. The current version of the discussion is too brief and does not reflect sufficiently on the results and insights of the work. It is a mixture of study goal description and summary but does not really discuss results and implications. Please address at least the following topics in more detail:
- How does LIME perform in comparison to CAM-based approaches?
- What are the limitations or failure modes of the technique?
- What other strategies except for data selection could be employed based on LIME (e.g. active learning) etc.?
Author Response
Dear editor and reviewer,
First of all, we appreciate the opportunity to revise this paper. Thank you for your valuable suggestions, which helped us enhance the quality of this manuscript. Sentences/phrases that have been modified or added are marked in blue in the revised paper according to your comments. In the following, we provide a point-by-point response to the reviewers’ comments.
We quite agree that the following questions should be discussed in Section 5 Discussion. Therefore, we respond to the three questions in this section.
Q1. How does LIME perform in comparison to CAM-based approaches?
A1. We respond to your question as follows:
LIME aims at detecting positive and negative contribution pixels, while CAM methods aim at providing a highlighted region that precisely matches the target in original image. In addition, SAR images are quite different from optical images. SAR images have low resolution and usually contain a number of interference spots. In this case, the highlighted area in the heatmaps generated by CAM methods designed for optical images excessively covers the target or even deviates from the target. In contrast, LIME method can delineate a region that precisely covers the target. Therefore, we can clearly observe the "Clever Hans" phenomenon in the classification task by using LIME method. This is the reason why we select the dataset according to the LIME results.
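One way to picture how positive and negative contribution pixels relate to the target is the sketch below. It is not the authors' procedure, since their data selection is manual; it assumes a per-pixel LIME weight map and a binary target mask are available, and the overlap threshold and function names are hypothetical.

```python
import numpy as np

def positive_overlap(lime_weights, target_mask):
    """Fraction of the positive LIME weight that falls inside the target mask."""
    w = np.asarray(lime_weights, dtype=np.float64)
    mask = np.asarray(target_mask, dtype=bool)
    pos = np.clip(w, 0.0, None)  # keep only positive contributions
    total = pos.sum()
    if total == 0:
        return 0.0  # no positive evidence anywhere in the image
    return float(pos[mask].sum() / total)

def keep_sample(lime_weights, target_mask, min_overlap=0.5):
    """Keep a sample when most positive evidence lies on the target.

    min_overlap is a hypothetical threshold, not a value from the manuscript.
    """
    return positive_overlap(lime_weights, target_mask) >= min_overlap
```

A sample whose positive contribution lies mostly on background clutter would be flagged as a candidate "Clever Hans" case under this criterion.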
Q2. What are the limitations or failure modes of the technique?
A2. We respond to your question as follows:
The limitation of this method is mainly the manual process of data selection. It is difficult to perform more detailed processing due to small extra-class differences and low resolution of SAR images. It probably would be better to use the cases in which the CNN showed “Clever Hans” decisions to guide the GAN’s training instead of discarding them entirely. Creating a model that automatically uses the visualization results to guide the training of networks will be our future work.
Q3. What other strategies except for data selection could be employed based on LIME (e.g. active learning) etc.?
A3. We respond to your question as follows:
We can also carry out some feature engineering work based on the result of LIME, such as removing some misleading features to focus on the generation of SAR target. Besides, the result of LIME can guide us to select the best model, i.e. we can create a model that automatically uses the visualization results to guide the training of networks.
We have rewritten the Discussion as follows:
In our study, we provide qualitative and quantitative analysis to prove the effectiveness of the proposed method. In qualitative analysis, firstly the comparison of visualization results of LIME and CAM methods demonstrates that LIME can reflect the contributions to the decision comprehensively and clearly. Secondly, we provide an intuitive comparison of generated images to subjectively prove the improvement of the quality of generated images. In quantitative analysis, MI value, GAN-test score and GAN-train score are calculated to prove the improvement of independence, authenticity and diversity of images generated by our method.
Next, we will discuss our method from the following three aspects: (1) How does LIME perform in comparison to CAM-based approaches? LIME aims at detecting positive and negative contribution pixels, while the CAM methods aim at providing the positive contribution region by highlighting the related area. The highlighted area in the heatmaps generated by CAM methods designed for optical images usually excessively covers the target or even deviates from the target, since SAR images have low resolution and contain many interference spots. In contrast, the LIME method can accurately demarcate the target area. Therefore, the "Clever Hans" phenomenon can be clearly observed in the visualization results of LIME. This is the reason why we select the dataset according to the result of LIME. (2) What are the limitations or failure modes of the technique? The limitation of this method is mainly the manual process of data selection. It is difficult to perform more detailed processing due to small extra-class differences and low resolution of SAR images. It probably would be better to use the cases in which the CNN showed “Clever Hans” decisions to guide the GAN’s training instead of discarding them entirely. However, this study aims to improve the quality of GANs by alleviating the "Clever Hans" phenomenon in DNNs instead of proposing a complicated and detailed data selection method. Therefore, the samples in which the CNN implements "Clever Hans"-type decision strategy are discarded entirely. (3) What other strategies except for data selection could be employed based on LIME (e.g. active learning) etc.? In our method, we use the result of LIME to guide the selection of SAR images. Besides, we can carry out some feature engineering work based on the result of LIME, such as removing some misleading features to focus on the generation of SAR target. At a deeper level, the result of LIME can guide us to select the best model, i.e. we can create a model that automatically uses the visualization results to guide the training of networks. These two LIME-based research directions will be our future work.
We proofread the manuscript carefully again and corrected some other mistakes labeled with blue color.