An Efficient Hybrid CNN-Transformer Approach for Remote Sensing Super-Resolution
Round 1
Reviewer 1 Report (New Reviewer)
Comments and Suggestions for Authors
The authors present an Efficient Super-Resolution Hybrid Network (EHNet) based on a UNet-like architecture that adeptly fuses CNNs and the Swin Transformer.
I think the overall quality of the paper is good.
Figures 1, 2, and 3 are placed too far from their explanations in the text.
Is there a reason you used the Swin Transformer as the backbone?
I think it would be good to present a little more analysis of the LFEB.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report (Previous Reviewer 2)
Comments and Suggestions for Authors
This manuscript presents a lightweight hybrid transformer architecture for single-image super-resolution (SISR) of remote sensing images. The authors employed existing deep learning modules to improve SISR performance. The reviewer's comments are as follows.
1. Several previous methods report superior performance on the UCMerced and AID datasets compared to the proposed method. The benefit and main contributions of the proposed method should be highlighted.
2. More experiments should be conducted, at least with a scaling factor of 3, following the previous SISR literature on the UCMerced and AID datasets.
3. The quality of all figures should be improved. The proposed model cannot be reproduced from the current version of the figures.
4. It would be beneficial to explain the motivation for developing lightweight deep learning models for remote sensing imagery applications.
Comments on the Quality of English Language
1. Presentation should be improved. Several paragraphs contain only one or two sentences.
2. There are several inconsistencies in the use of abbreviations (e.g., FLOPs vs. Flops).
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
The authors have made a significant improvement to the earlier version of the manuscript, and the abstract section is now clearer. In this study, the authors propose the Efficient Super-Resolution Hybrid Network (EHNet), whose encoder features their novel Lightweight Feature Extraction Block (LFEB), which employs a convolution method, based on depth-wise convolution, that is more efficient than depth-wise separable convolution. Some observations are as follows:
1. Some abbreviations, such as SRCNN, are still not spelled out. Go through the manuscript carefully.
2. Table 1 is good, but what does "type" mean? It is recommended that all abbreviations be specified in the footnotes of the table.
3. The methodology is now fine, but in Fig. 3, what are the inputs and outputs of the Lightweight Feature Extraction Block (LFEB)?
4. Each equation should be numbered separately (see Eq. (6)). Recheck that all parameters of the equations are properly specified.
5. The discussion of the results could be improved, along with the limitations or challenges of the proposed work.
6. Some information is missing in the references, for example:
(a) Hu, X.; Naiel, M.A.; Wong, A.; Lamm, M.; Fieguth, P. RUNet: A robust UNet architecture for image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019; pp. 0-0. (Page error)
(b) Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Advances in neural information processing systems 2017, 30. (Page missing)
(c) DOIs could be added for each reference.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report (Previous Reviewer 1)
Comments and Suggestions for Authors
The authors have addressed all the necessary queries.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The authors have proposed the Efficient Super-Resolution Hybrid Network (EHNet), blending a lightweight convolutional module encoder with an advanced Swin-Transformer-based decoder to overcome various critical issues such as overfitting and mismatched semantic information. The study is important, but some major observations are as follows:
1. I suggest the authors add important outcomes (qualitative or quantitative) of the article to the abstract section.
2. Many abbreviations are not spelled out at their first occurrence in the abstract and in the manuscript. Please recheck the manuscript and make the necessary changes.
3. The keyword "convolution neural network" needs to be corrected.
4. L77: what is IPT? Specify it clearly.
5. Under Related Work, it would be good to add a comparative table of the state-of-the-art techniques.
6. Also, make sure that no paragraph exceeds 250 words.
7. The explanation of the methodology is fine, but there is a small spelling mistake in the caption of Fig. 3, Lightweight Feature Extraction Block (LFEB).
8. L298: avoid the use of "our"; specify the name of the encoder.
9. L381: could you define FLOPs somewhere in the manuscript?
10. There are some grammatical errors in lines L394 to L396.
11. Tables 1 and 5 are inadequately placed.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript presents a lightweight hybrid transformer architecture for single-image super-resolution of remote sensing images. The authors employed existing deep learning modules, including depth-wise separable convolution, the SE layer, and the CSP module, without a comprehensive analysis of their effects on the performance improvement. Overall, the academic novelty is insufficient for publication in this high-quality journal. The reviewer's comments are as follows.
1) The authors need to clarify the main contribution and highlight the novelty of the proposed method. A comprehensive analysis of the effectiveness of the main contribution is required.
2) The presentation of the manuscript is very poor. There are many typos and duplicated definitions of abbreviations, and the quality of the figures should be improved. Overall, the manuscript needs to be rewritten.
3) The explanations of the existing modules utilized in the proposed architecture should be elaborated.
4) The authors compared the proposed method with previous algorithms published before 2021. It is required to compare the performance with more recent methods.
5) The SE layer is an outdated feature recalibration module proposed in 2018. It is required to compare the effectiveness of the SE layer with recent attention mechanisms.
6) Figure 7 contains inadequate subfigures for the HR images.
7) There are errors in Equations (2) and (4).
8) The authors should provide additional information and conduct more experiments to elucidate why the SUB module is more suitable for transformer-based models than existing convolution-based upsampling methods.
Comments on the Quality of English Language
Comprehensive revision is required to improve the presentation of the current manuscript. The manuscript contains many typos and grammatical errors.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf