BiTSRS: A Bi-Decoder Transformer Segmentor for High-Spatial-Resolution Remote Sensing Images
Round 1
Reviewer 1 Report
1. All symbols and blocks shown in Figs. 1-3 should be explained in detail in the figures themselves.
2. Section 2 should be reorganized, since the current Subsection 2.2 could be regarded as a subset of Subsection 2.1.
3. The formulation of each loss term in Eq. 7 should be given.
4. The datasets used in the paper, e.g., Vaihingen and Potsdam, are widely used for semantic segmentation. Thus, more state-of-the-art results on these datasets should be compared in the paper, especially those of transformer-based methods.
5. A large number of references do not meet the publication requirements. Please check them and fill in the missing information.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
My main concern with this paper regards its contributions, how they are presented, and how they compare with results reported in the literature. The motivation behind the contributions is not clear and not well explained, particularly in the introduction. The only explanation given is that "the Swin Transformer focuses more on encoding, so we decided to focus on decoding". The overview of the model architecture (a couple of paragraphs at the beginning of Section 3) is also unclear. The description of each module is trivial and lacks depth: it is not clear which modules are taken off-the-shelf, which are adapted from prior work, and which are invented. This understates the claimed contributions and makes the method look like component collecting/assembling with no novelty. These issues need to be addressed in the revised version; otherwise, the contributions look lacking. It is also interesting to see that conventional fully convolutional networks achieve better results in some classes (almost all classes in the Vaihingen dataset); this also needs an explanation. Qualitative results show that in challenging scenarios (such as the barren class in the LoveDA dataset) the method struggles like the others, and in some cases it performs similarly to them (although it performs better on the clutter class).
Moreover, the compared methods are presented as state-of-the-art on the segmentation task, while this claim does not seem credible. There are many high-performing segmentation methods, such as ViT-Adapter [1], SegFormer [2], and BEiT [3], that achieve high performance on natural images (and might on RS images if tried). Most importantly, there are methods in the literature reported to achieve higher performance on the evaluated datasets. For example, UNetFormer [4] (among many others) achieves 81% mIoU, indicating a large gap with both the proposed method and the compared methods. This discrepancy is also seen for the other evaluated datasets. The difference in reported results should be explained and corrected; otherwise, it could call the credibility of the work into question.
Additionally, some closely related works [5,6] (encoder-decoder transformer architectures specifically designed for segmentation and detection in remote sensing, especially WAMI imagery) are overlooked and missing. They could be mentioned in the related work section, and the method could even be evaluated on the challenging scenarios they address, where small objects are involved.
[1] Chen, Zhe, et al. "Vision Transformer Adapter for Dense Predictions." arXiv preprint arXiv:2205.08534 (2022).
[2] Xie, Enze, et al. "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers." Advances in Neural Information Processing Systems 34 (2021): 12077-12090.
[3] Bao, Hangbo, Li Dong, and Furu Wei. "BEiT: BERT Pre-Training of Image Transformers." arXiv preprint arXiv:2106.08254 (2021).
[4] Wang, Libo, et al. "UNetFormer: A UNet-like Transformer for Efficient Semantic Segmentation of Remote Sensing Urban Scene Imagery." ISPRS Journal of Photogrammetry and Remote Sensing 190 (2022): 196-214.
[5] Negin, Farhood, et al. "Transforming Temporal Embeddings to Keypoint Heatmaps for Detection of Tiny Vehicles in Wide Area Motion Imagery (WAMI) Sequences." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
[6] Motorcu, Hakki, et al. "HM-Net: A Regression Network for Object Center Detection and Tracking on Wide Area Motion Imagery." IEEE Access 10 (2021): 1346-1359.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
The authors propose "A Bi-Decoder Transformer Segmentor for High-Spatial-Resolution Remote Sensing Images" (BiTSRS) for high-spatial-resolution remote sensing image segmentation, and the effectiveness of the method is analyzed experimentally. However, there are still some problems to solve before the manuscript can be considered for publication.
1. For the ITM module, the authors do not mention whether the boundary is extended (padded) during downsampling; neglecting the extraction of image edge information would cause information loss and affect the final result.
2. For the Swin encoder module, the authors do not specify the number of stages. Why does the four-stage structure work best? It is suggested that the authors add a comparative experiment.
3. The function of the FDCN module is not accurately expressed. FDCN adjusts the loss function during training, but its stated role of extracting feature-map details should be explained in more detail. The internal structure of FDCN is likewise not described. FDCN's role in the whole network, and the justification for including this module, should be highlighted.
4. For the algorithm comparison section, more recent advanced algorithms should be included in the comparison.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Explanations of the key symbols and blocks shown in Figs. 1-3 should be added in the figures themselves or in their captions. This will allow readers to understand the figures without reading the text.
The titles of Subsections 2.1 and 2.2 were not modified as stated in the response. Please check again.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The overall presentation of the paper has been improved, and the author has responded extensively to the reviewers' comments by addressing the raised concerns. However, my main concern regarding the reported accuracies and the comparison remains unanswered (this issue was also raised by two other reviewers). I gave UNetFormer as an example that achieves better performance, and there are other methods that also achieve higher results which are not reported. The pros and cons of UNetFormer are explained in the response, but nothing is mentioned about it in the text. Furthermore, the comparison table reports a lower score for UNetFormer than the results reported in its original paper. Either the comparison should use the results reported in the original paper, or it should be stated that the reported results come from re-running (possibly modified) code, not under the same conditions as the original work, but reproduced under common conditions to make the comparison fair. This is confusing, and it is the responsibility of the author to check the different methods in the literature and provide a reasonable comparison. If the comparisons are not fair (e.g., because of pretrained weights), this should be mentioned and explained. The discrepancy between the results reported in the original works and the compared results should be addressed in another revision, or a response should be provided to make this point clear.
Author Response
Please see the attachment.
Reviewer 3 Report
The problems have been solved, and there are no further issues.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf