Unsupervised Domain Adaptation for Remote Sensing Semantic Segmentation with Transformer
Round 1
Reviewer 1 Report
Thanks for the very well-written manuscript that studies unsupervised domain adaptation for remote sensing semantic segmentation. Domain adaptation is a long-standing and significant topic in computer vision and remote sensing, as one can collect only so many labeled sample images for training. Its implications are significant and its applications are many.
The manuscript is very well written with thorough experimental analysis. Could the authors consider the following:
1. Unsupervised domain adaptation has its merits and disadvantages compared with supervised learning on a very large set of labels or with completely unsupervised learning from the outset. One would not expect comparative experiments (because of too many different experimental parameters), but some discussion of this would benefit the community in the earlier part of the manuscript (here I refer to the machine learning community). I appreciate that this is somewhat taken as given in this paper, but I suppose a couple of short paragraphs would help to settle the minds of the machine learning community.
2. The authors could also perhaps add the relationship to their previous work and describe how it evolved (what prompted you to study this topic, given your previously published work).
3. Data evidence is always satisfying, but numerical data do not always provide all the necessary proof, as they are often subject to interpretation. One can say that both 98% and 95% accuracy are acceptable given industry/community expectations, but what do we sacrifice to attain the small percentage gain in accuracy? I trust there is already some discussion of the resources required, but it may need to be slightly elaborated. The remote sensing community is often generous with computing resources and complexity, which is fine, but it is worth showing that the authors are aware of the implications of model complexity.
Overall, congratulations on the very well-thought-out and well-prepared manuscript. Thanks.
Author Response
Point 1: Unsupervised domain adaptation has its merits and disadvantages compared with supervised learning on a very large set of labels or with completely unsupervised learning from the outset. One would not expect comparative experiments (because of too many different experimental parameters), but some discussion of this would benefit the community in the earlier part of the manuscript (here I refer to the machine learning community). I appreciate that this is somewhat taken as given in this paper, but I suppose a couple of short paragraphs would help to settle the minds of the machine learning community.
Response 1: In Section 4.2, we devoted considerable space to comparative experiments with different hyper-parameters (e.g., λ and K), which is indeed tedious. We have removed some unnecessary descriptions to shorten the paragraphs in Section 4.2. Since other reviewers commented on this section, we have kept some descriptions to support our arguments.
Point 2: The authors could also perhaps add the relationship to their previous work and describe how it evolved (what prompted you to study this topic, given your previously published work).
Response 2: Since the previous version of the Introduction lacked coherence, we have revised it substantially to make the evolution and motivation of our method easier for readers to understand.
Point 3: Data evidence is always satisfying, but numerical data do not always provide all the necessary proof, as they are often subject to interpretation. One can say that both 98% and 95% accuracy are acceptable given industry/community expectations, but what do we sacrifice to attain the small percentage gain in accuracy? I trust there is already some discussion of the resources required, but it may need to be slightly elaborated. The remote sensing community is often generous with computing resources and complexity, which is fine, but it is worth showing that the authors are aware of the implications of model complexity.
Response 3: We have added Section 4.3 to analyze the computational complexity.
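(For context, a complexity analysis of this kind typically reports parameter counts and inference time. Below is a minimal, hypothetical PyTorch sketch of such a measurement; it is illustrative only and is not the code behind Section 4.3.)

```python
import time
import torch

def report_complexity(model, input_shape=(1, 3, 512, 512), runs=20):
    # Parameter count in millions.
    params = sum(p.numel() for p in model.parameters()) / 1e6
    x = torch.randn(*input_shape)
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up pass before timing
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        ms = (time.perf_counter() - start) / runs * 1000
    print(f"{params:.1f}M parameters, {ms:.1f} ms per image")
```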
We sincerely thank you for your constructive comments and suggestions.
Reviewer 2 Report
Line 59: What is CBAT?
Line 67: What is LSR?
Line 69: What is CMT?
Line 79: What is MixUp?
Line 82: What is CSC?
Line 82: What is SimCLR?
Line 85: What is ClassMix?
Line 88: What is DACS?
Lines 91-93 need rephrasing.
Line 168: In the figure, Q, K, and V need a brief description.
Line 236: Both datasets should be briefly described in separate paragraphs. What are the similarities and differences between the two?
Line 286: Why is DAFormer adopted? What are the reasons behind this?
Section 3.4 needs to explain which four experiments were done.
Section 3.5 needs details of the environment used for this comparison.
The language of the paper needs to be improved in the revised submission.
Thank you
Good Luck.
Author Response
Point 1: Line 59: What is CBAT?
Response 1: It’s a UDA method, and we replaced it with its authors’ names, “Zou et al.”.
Point 2: Line 67: What is LSR?
Response 2: It’s a UDA method, but we found that it is not strictly related to self-training, so we deleted this reference.
Point 3: Line 69: What is CMT?
Response 3: It’s a UDA method for RS images, and we replaced it with its authors’ names, “Yan et al.”.
Point 4: Line 79: What is MixUp?
Response 4: It’s a mixing strategy, and we replaced it with its authors’ names, “Yun et al.”.
Point 5: Line 82: What is CSC?
Response 5: It’s a UDA method for RS images, and we replaced it with its authors’ names, “Gao et al.”.
Point 6: Line 82: What is SimCLR?
Response 6: It’s a self-supervised method, and we replaced it with its authors’ names, “Chen et al.”.
Point 7: Line 85: What is ClassMix?
Response 7: It’s a mixing strategy, and we replaced it with its authors’ names, “Olsson et al.”.
Point 8: Line 88: What is DACS?
Response 8: It’s a UDA method, and we replaced it with its authors’ names, “Tranheden et al.”.
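(For readers unfamiliar with these mixing strategies, below is a minimal, hypothetical sketch of ClassMix-style cross-domain mixing as popularized by DACS: a random half of the classes in a source label map are pasted onto a target image and its pseudo-label. The function and variable names are illustrative, not the authors’ implementation.)

```python
import torch

def class_mix(src_img, src_lbl, tgt_img, tgt_pseudo):
    """src_img/tgt_img: (C, H, W) tensors; src_lbl/tgt_pseudo: (H, W) class maps."""
    classes = torch.unique(src_lbl)
    # Randomly select half of the source classes to paste onto the target.
    keep = classes[torch.randperm(len(classes))[: len(classes) // 2]]
    mask = torch.isin(src_lbl, keep)  # (H, W) boolean paste mask
    mixed_img = torch.where(mask, src_img, tgt_img)  # mask broadcasts over channels
    mixed_lbl = torch.where(mask, src_lbl, tgt_pseudo)
    return mixed_img, mixed_lbl
```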
Point 9: Lines 91-93 need rephrasing.
Response 9: Thanks for pointing this out. We have rephrased them (lines 131-133).
Point 10: Line 168: In the figure, Q, K, and V need a brief description.
Response 10: We have added a short description of why they are called Q, K, and V in the caption of Figure 3. These names are historical, so they do not carry special meaning in themselves; they simply denote the results of three kinds of encoding. Therefore, we think they can be left without an extended description.
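(For context, the names follow the standard scaled dot-product attention of Vaswani et al., in which the input features X are projected three times. This is the generic Transformer formulation, not notation specific to the manuscript:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
\qquad Q = XW_Q,\quad K = XW_K,\quad V = XW_V,
\]

where $d_k$ is the key dimension and $W_Q$, $W_K$, $W_V$ are learned projection matrices.)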
Point 11: Line 236: Both datasets should be briefly described in separate paragraphs. What are the similarities and differences between the two?
Response 11: We have put them in separate paragraphs (lines 273-280) and included some comparison in the following paragraphs; e.g., they differ in band modes, colors, locations, and buildings (lines 281-289).
Point 12: Line 286: Why is DAFormer adopted? What are the reasons behind this?
Response 12: Because it achieves the best results compared with recent Transformer-based methods for UDA semantic segmentation. We have added the reasons in detail (lines 308-313).
Point 13: Section 3.4 needs to explain which four experiments were done.
Response 13: We have briefly described the four experiments (lines 351-352).
Point 14: Section 3.5 needs details of the environment used for this comparison.
Response 14: We have added details of the environment used for the comparison (lines 362-363). Since previous methods used CNNs and ours uses a Transformer, many of the experimental settings cannot be kept consistent. We can only ensure that the same datasets and the same averaging of results are used for all experiments.
Point 15: The language of the paper needs to be improved in the revised submission.
Response 15: Thanks for your reminder. We have spent considerable time improving the language.
We sincerely thank you for your constructive comments and suggestions.
Reviewer 3 Report
This paper proposes unsupervised domain adaptation in a self-training way for remote sensing semantic segmentation, using a Transformer structure and cross-domain mixed sampling. Besides, Gradual Class Weights (GCW) and Local Dynamic Quality (LDQ) are designed to optimize the training strategies. The structure of the paper is complete. However, there are many errors and inconsistencies in the current version, which make the paper quite difficult to follow. Therefore, the paper cannot be accepted for publication without major revision.
1. Introduction: The research background, including the technology, issues, solutions, and motivations of this paper, is presented here. But it is unclear and lacks coherence, i.e., in the introduction of the three types of UDA semantic segmentation methods, the purpose of data enhancement, and the motivations of the two proposed strategies: Gradual Class Weights and Local Dynamic Quality.
1) When introducing the three types of UDA semantic segmentation methods, there is a lack of transition from self-training methods to adversarial-learning methods. And the reasons for using self-training UDA in what follows, i.e., the advantages of UDA in the self-training way, are not fully reflected here. The summary of existing methods (lines 71-76) should also be listed separately as a paragraph for a clearer description.
2) What is the purpose of data enhancement? Why does it show up so suddenly in this part? The problem to be solved and its connection with the UDA task need to be described here.
3) The two proposed strategies, Gradual Class Weights and Local Dynamic Quality, are significant innovations of the proposed method, but their motivations are not distinctly presented above.
2. Discussion:
1) As shown in Figure 10, P_ch is highest when λ=0.5. How can this phenomenon be explained? Table A1 seems to indicate that the best result is acquired when λ=0.5; however, λ was finally set to 0.7 in the experiments.
2) Judging from Table A2, the result is best when K=2. Why was K set to 3 in the experiments (line 376)? And I cannot find the finally selected K and intervals for the experiments in the Implementation Details.
3) How is it concluded that “As for the values of K, a larger one usually results in more robust quality” (line 421)? The result in Figure 12 may not support the claim that the proposed LDQ not only maintains the desirably high quality of correct pseudo-labels but also leaves erroneous ones with relatively low quality (line 415).
4) The result of GCM in Table 3 differs from that of w. GCW in Table 1.
3. Figures and Tables:
1) Figure 1 mismatches the context, i.e., Target Label, Target Image, and local dynamic confidence (LDC).
2) According to lines 326-328, the VAI-POT in Tables 3 and 4 should be POT-VAI.
4. Many description errors: source labels y_T (line 126); less information about the roads (line 183); “number of false positive pixels and number of false positive pixels” (line 285); “the performance of LDQ are more desirable than LDQ in clutter” (line 301); “a source domain to a target source” (line 310); missing content after “It indicates that” (line 368); and other repeated words.
5. Does the “image-wise quality” (line 362) include Equal Quality and Image-wise Quality?
6. Please carefully correct the grammar errors in the paper to make the expression scientific and precise.
Author Response
Point 1: When introducing the three types of UDA semantic segmentation methods, there is a lack of transition from self-training methods to adversarial-learning methods. And the reasons for using self-training UDA in what follows, i.e., the advantages of UDA in the self-training way, are not fully reflected here. The summary of existing methods (lines 71-76) should also be listed separately as a paragraph for a clearer description.
Response 1: We have largely rewritten the Introduction to make the motivation of our method easier for readers to understand. A separate paragraph has been added to describe the adversarial-learning methods (lines 56-68). The main reason we use the self-training UDA framework is that it does not use an auxiliary network, which would largely increase the computational complexity (lines 69-71). The summary of existing methods is also listed separately as a new paragraph (lines 84-89).
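(To make the contrast concrete: adversarial-learning UDA trains an auxiliary discriminator alongside the segmentation network, whereas self-training only reuses the model’s own predictions. Below is a minimal, hypothetical sketch of a generic self-training step with an EMA teacher; the names and the 0.9 confidence threshold are illustrative assumptions, not the authors’ implementation.)

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(teacher, student, alpha=0.999):
    # EMA update: the teacher slowly tracks the student's weights.
    for tp, sp in zip(teacher.parameters(), student.parameters()):
        tp.mul_(alpha).add_(sp, alpha=1 - alpha)

def self_training_step(student, teacher, src_img, src_lbl, tgt_img, opt):
    # Supervised loss on labeled source data.
    loss = F.cross_entropy(student(src_img), src_lbl)
    # Pseudo-labels for unlabeled target data from the teacher.
    with torch.no_grad():
        probs = torch.softmax(teacher(tgt_img), dim=1)
        conf, pseudo = probs.max(dim=1)
    # Keep only confident pseudo-labeled pixels (one common choice).
    tgt_loss = F.cross_entropy(student(tgt_img), pseudo, reduction="none")
    loss = loss + ((conf > 0.9).float() * tgt_loss).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    update_teacher(teacher, student)
    return loss.item()
```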
Point 2: What is the purpose of data enhancement? Why does it show up so suddenly in this part? The problem to be solved and its connection with the UDA task need to be described here.
Response 2: We have added a transition and the purpose of data enhancement (lines 90-93): data augmentation is introduced from a general point of view to improve the performance.
Point 3: The two proposed strategies, Gradual Class Weights and Local Dynamic Quality, are significant innovations of the proposed method, but their motivations are not distinctly presented above.
Response 3: We have listed the motivations for GCW and LDQ in the form of existing challenges (lines 124-130).
Point 4: As shown in Figure 10, P_ch is highest when λ=0.5. How can this phenomenon be explained? Table A1 seems to indicate that the best result is acquired when λ=0.5; however, λ was finally set to 0.7 in the experiments.
Response 4: We have added an explanation of why P_ch is highest when λ=0.5 (lines 453-457). Meanwhile, please note that the results in Section 4.2 and Table A1 are obtained with LDQ only. When GCW and LDQ are used together, the best result is acquired with λ=0.7 and K=3 (lines 411-415 and 540-543).
Point 5: Judging from Table A2, the result is best when K=2. Why was K set to 3 in the experiments (line 376)? And I cannot find the finally selected K and intervals for the experiments in the Implementation Details.
Response 5: When GCW and LDQ are used together, the best result is acquired with λ=0.7 and K=3 (lines 411-415). Due to computational constraints, only an incomplete set of hyper-parameter combinations has been tested, so we do not list all the experiments.
Point 6: How is it concluded that “As for the values of K, a larger one usually results in more robust quality” (line 421)? The result in Figure 12 may not support the claim that the proposed LDQ not only maintains the desirably high quality of correct pseudo-labels but also leaves erroneous ones with relatively low quality (line 415).
Response 6: These two conclusions are not rigorous, so we have removed the statements describing them.
Point 7: The result of GCM in Table 3 differs from that of w. GCW in Table 1.
Response 7: Thanks for pointing this out. We have modified Table 3 to be consistent with Table 1.
Point 8: Figure 1 mismatches the context, i.e., Target Label, Target Image, and local dynamic confidence (LDC).
Response 8: We have revised the text in Figure 1 and changed LDC to LDQ (line 154).
Point 9: According to lines 326-328, the VAI-POT in Tables 3 and 4 should be POT-VAI.
Response 9: We have modified them from VAI-POT to POT-VAI.
Point 10: Many description errors: source labels y_T (line 126); less information about the roads (line 183); “number of false positive pixels and number of false positive pixels” (line 285); “the performance of LDQ are more desirable than LDQ in clutter” (line 301); “a source domain to a target source” (line 310); missing content after “It indicates that” (line 368); and other repeated words.
Response 10: We have revised them.
source labels y_T (line 126) -> source labels y_T (line 168)
less information about the roads (line 183) -> less information about the cars (line 223)
number of false positive pixels (line 285) -> number of false negative pixels (line 331)
more desirable than LDQ in clutter (line 301) -> more desirable than GCW in clutter (line 346)
to a target source (line 310) -> to a target domain (line 356 - 357)
We removed the “It indicates that” (line 368).
Point 11: Does the “image-wise quality” (line 362) include Equal Quality and Image-wise Quality?
Response 11: Yes, and we have added “equal quality” next to “image-wise quality” (line 410).
Point 12: Please carefully correct the grammar errors in the paper to make the expression scientific and precise.
Response 12: Thanks for your reminder. We have spent considerable time improving the language.
We sincerely thank you for your constructive comments and suggestions.
Round 2
Reviewer 3 Report
The authors have addressed all my questions.