CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation
Round 1
Reviewer 1 Report
The authors address the problem of semantic segmentation of remote sensing images and propose a hybrid convolutional and Transformer model called CTFuse. CTFuse has a serial structure in which the convolutional part is used first to extract small-size target information, and the Transformer part is then used to embed large-size target information. The proposed model also employs spatial and channel attention in both the convolutional and Transformer parts to better represent global and local information. The model is evaluated on the ISPRS Vaihingen and Potsdam datasets, where it achieves state-of-the-art (SOTA) results compared to several other existing models.
The paper is very well written and the results are valuable, so I recommend acceptance once the minor issues listed below have been addressed:
1) Line 225: CTfuse -> CTFuse
2) Line 311: 4.1.1. Vaihingen -> The Vaihingen dataset
3) Line 320: 4.1.2. Potsdam -> The Potsdam dataset
4) Table 2 caption: the vaihingen -> the Vaihingen
5) Table 4 caption: the vaihingen -> the Vaihingen
6) Figure 11 caption: Consider changing the caption to depict the content of the figure better (e.g. "Overview of the application of Grad-CAM visualization method").
7) Figure 12 caption: The Grad-CAM -> The Grad-CAM visualization
8) Figure 13 caption: on Vaihingen and Potsdom datasets -> on the Vaihingen and Potsdam datasets
9) Sharing the code through GitHub is highly appreciated, but I suggest writing a better description of the project (README.md) that includes the instructions needed to reproduce the experiments. You should also translate the comments in the code into English and, once your paper is published, include a link to it.
Author Response
(1) We made a modification on line 225.
(2) We made a modification on line 311.
(3) We made a modification on line 320.
(4) We made a modification in Table 2.
(5) We made a modification in Table 4.
(6) We made a modification in the Figure 11 caption.
(7) We made a modification in the Figure 12 caption.
(8) We made a modification in the Figure 13 caption.
(9) Thank you for your affirmation of our published code. We will upload the English version of the comments to GitHub later.
Author Response File: Author Response.pdf
Reviewer 2 Report
Here are some suggestions:
1. Authors should emphasize their motivation for combining Transformer and CNN.
2. Authors should adequately summarize existing Transformers, such as:
[1] DSformer: A Double Sampling Transformer for Multivariate Time Series Long-term Prediction
[2] TopFormer: Token pyramid transformer for mobile semantic segmentation
3. Add a description of the data set and the source. In addition, the author should provide the proportion of the training set, the validation set, and the test set.
4. References should be added for the evaluation metrics.
5. It is recommended to add ablation experiments.
Author Response
- In lines 4 to 6 of the Abstract we mention the motivation for combining Transformer and CNN.
- We have added the references you mentioned as references 55 and 56, respectively.
- The description of the dataset is provided on lines 300 to 326. Data sources are provided on lines 544 to 546.
- We added relevant references on line 327.
- Ablation experiments are provided in Table 1.
Author Response File: Author Response.pdf
Reviewer 3 Report
1. The workload of this paper is substantial, but the innovation is insufficient.
2. The abstract mostly discusses the combination of CNN, Transformer, and attention, but this is not reflected in the keywords.
3. The experimental part would be better divided into Local Evaluation and Benchmark Evaluation.
4. The tables in the experimental part report few evaluation metrics; the more important metric, Acc, is not given, and the units of the reported metrics are not indicated.
5. The optimal results in Table 1 and Table 4 are not highlighted.
Minor editing of English language required
Author Response
1. We thank the reviewer for acknowledging our workload.
2. We added "attention" to the keywords.
3. Thank you for your comments.
4. We report the F1 score in the article. Because F1 reflects both precision and recall, we use it instead of accuracy. The units are now also noted in the tables.
5. We highlighted the optimal results in Table 1 and Table 4.
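The difference between the per-class F1 score and overall accuracy discussed in point 4 can be sketched as follows. This is a minimal illustration only: the function names and the two-class pixel counts are hypothetical and are not taken from the paper or its code.

```python
def per_class_metrics(confusion, cls):
    """Return (precision, recall, F1) for one class.

    confusion[i][j] counts pixels of true class i predicted as class j.
    """
    tp = confusion[cls][cls]
    fp = sum(row[cls] for row in confusion) - tp  # predicted cls, true other
    fn = sum(confusion[cls]) - tp                 # true cls, predicted other
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def overall_accuracy(confusion):
    """Fraction of all pixels that were assigned their true class."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / total if total else 0.0


# Hypothetical pixel counts (assumption): class 0 is frequent, class 1 is rare.
confusion = [
    [90, 10],
    [5, 5],
]

precision, recall, f1 = per_class_metrics(confusion, 1)
oa = overall_accuracy(confusion)
print(f"rare class: precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
print(f"overall accuracy={oa:.3f}")
```

With these made-up counts, overall accuracy is about 0.86 while the F1 score of the rare class is 0.40, which shows why the two metrics can lead to different conclusions on class-imbalanced data.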
Author Response File: Author Response.pdf