Peer-Review Record

A Transformer-Based Coarse-to-Fine Wide-Swath SAR Image Registration Method under Weak Texture Conditions

Remote Sens. 2022, 14(5), 1175; https://doi.org/10.3390/rs14051175
by Yibo Fan, Feng Wang * and Haipeng Wang
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Reviewer 4: Anonymous
Submission received: 19 January 2022 / Revised: 23 February 2022 / Accepted: 23 February 2022 / Published: 27 February 2022
(This article belongs to the Special Issue Synthetic Aperture Radar (SAR) Meets Deep Learning)

Round 1

Reviewer 1 Report

The authors focused on the study of image registration for wide-swath SAR data. In particular, the method mainly includes four steps: down-sampling image pre-registration, sub-image acquisition, dense matching, and transformation solution. Overall, the paper is interesting and the results are promising. I have the following comments:

1. The introduction is not well written. I am confused about which approaches were proposed for image registration in the field of computer vision and which were proposed for SAR images. What is the main issue in SAR image registration? In addition, please cite the state of the art on image registration in the area of CV [1] and in the area of SAR [2].

2. When introducing the architecture of the proposed HRNet, it would be beneficial to explain why the network is designed in this way. A more detailed illustration is needed.

3. The authors discussed the robustness of the proposed approach to rotation and translation. Is the proposed image registration approach robust to noise? For example, due to motion errors [3], SAR images obtained in practical applications often contain unfocused areas. Is the proposed approach robust to the noise caused by motion errors? Adding such experiments may be intractable; however, some discussion would be beneficial.

4. A language revision is necessary to fix some minor grammatical errors; e.g., the misuse of prepositions affects the readability of the article.

5. The authors should pay attention to the format of the references.

[1] Y. Zheng et al., "SymReg-GAN: Symmetric Image Registration with Generative Adversarial Networks," IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2021.3083543.
[2] Y. Xiang, N. Jiao, F. Wang and H. You, "A Robust Two-Stage Registration Algorithm for Large Optical and SAR Images," IEEE Transactions on Geoscience and Remote Sensing, doi: 10.1109/TGRS.2021.3133863.
[3] W. Pu, "SAE-Net: A Deep Neural Network for SAR Autofocus," IEEE Transactions on Geoscience and Remote Sensing, doi: 10.1109/TGRS.2021.3139914.

Author Response

Thank you very much for your valuable comments; please find the detailed response in the attached file.

Author Response File: Author Response.docx

Reviewer 2 Report

The presentation is good in its present form.

Author Response

Thank you very much for your encouragement.

Reviewer 3 Report

The authors proposed a hybrid approach for SAR image registration. Overall, the approach seems feasible. The authors also conducted adequate experiments and compared their results with other SOTA methods. There are some concerns that need to be clarified:

1. Figure 1 is presented but not referred to in the manuscript. It seems the explanation for Figure 1 appears in lines 44 to 82. I suggest the authors put the reference near these paragraphs.

2. In lines 104-108, the writing is somewhat awkward. I think it is not necessary to include the section title in the paragraph. Maybe simply saying "In Section 2, the proposed framework of ..." would be better?

3. Since k-means is a well-known method, I don't think Figure 6 is necessary.

4. In subsection 2.4.2, the authors introduce the equations of the Transformer architecture and then quickly jump to the Performer. As illustrated in Figure 2, the authors actually use the Performer model. Although the Performer model is based on the Transformer architecture, I suggest the authors use the term "Performer" as the title of subsection 2.4.2 to be consistent with the description in Figure 2. Please check the descriptions in lines 268 to 273 as well.

5. As described by the authors, this study was motivated by the LoFTR approach. However, there is no discussion of the differences between the proposed method and LoFTR. For example, the authors could explain the limitations of LoFTR and how it was adapted in this study to tackle the SAR image registration problem. Otherwise, it is hard for readers to grasp the relationship between the proposed method and LoFTR.

6. The loss function presented in this study is very similar to that in LoFTR. Maybe a reference is needed?

7. I think the focus of this manuscript is a bit off. The point of this study should be the DL part, whereas the authors devote many words to traditional feature descriptors and the superiority of DL in other applications. Perhaps it would be better to say more about those SAR image registration methods that use DL.

Author Response

Thank you very much for your valuable comments; please find the detailed response in the attached file.

Author Response File: Author Response.docx

Reviewer 4 Report

The paper proposes a coarse-to-fine registration approach for SAR images, with initial coarse registration done by traditional feature points and refinement by transformer.

The major problem with the paper is the complete lack of a clear description of the proposed algorithm. The algorithm (see line 274) is presented in an extremely superficial manner: no details and no exact sequence of steps are given; it is not clear what the input data of each step is, what the output is, or what exactly is done.

The sections following this brief description of the general algorithm do not clarify any of these problems either; instead, they provide additional literature review and information on the Transformer neural network architecture and other methods that already exist in the literature, rather than a detailed description of the proposed method. Essentially, within the "proposed method" (subsection 2.4) and within the whole paper, almost no information is given on the proposed algorithm.

To illustrate this issue, below is the only passage from the paper that describes how the authors used GMS (see line 168):
"The core idea of GMS is: the correct points was found higher in the neighbourhood of the correct points, birds of a feather flock together"
This is not an adequate level of description; the paper should present an exact and clear description of every step of the algorithm rather than "a core idea".
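For readers unfamiliar with GMS (Grid-based Motion Statistics), the idea the quoted sentence gestures at is that a correct match tends to be supported by many other matches between the neighborhoods of its two endpoints, whereas a false match is not. Below is a deliberately simplified sketch of that idea; the fixed radius and support threshold are illustrative assumptions, and the actual GMS algorithm uses a grid decomposition with a statistically derived threshold:

```python
import numpy as np

def neighborhood_support(kp_a, kp_b, matches, radius=50.0, min_support=6):
    """Keep a match only if enough other matches connect the
    neighborhoods of its two endpoints (simplified GMS idea).

    kp_a, kp_b : (N, 2) arrays of keypoint coordinates in images A and B.
    matches    : list of (i, j) index pairs into kp_a / kp_b.
    """
    pts_a = np.array([kp_a[i] for i, _ in matches])
    pts_b = np.array([kp_b[j] for _, j in matches])
    kept = []
    for m, (pa, pb) in enumerate(zip(pts_a, pts_b)):
        # Matches whose A-endpoint lies near pa AND whose B-endpoint lies near pb
        near_a = np.linalg.norm(pts_a - pa, axis=1) < radius
        near_b = np.linalg.norm(pts_b - pb, axis=1) < radius
        support = np.count_nonzero(near_a & near_b) - 1  # exclude the match itself
        if support >= min_support:
            kept.append(matches[m])
    return kept
```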


On the other hand, the comparison and analysis with SAR-SIFT is done extensively, but this comparison is not very relevant to the paper: SAR-SIFT is not a complete pipeline for image matching, only a feature point detector, so it would be fairer to compare the proposed approach with another complete pipeline. Such a comparison is done, with less detail, later in the paper (see Table 4). As seen from that table, the proposed method is on par with or slower than the competitors, so the favorable comparison with SAR-SIFT is misleading to the reader. The detailed math also contains inaccuracies: e.g., in equation (3), line 206, the second power of "w" is lost, and the parameters "s" and "s-hat", which stand for different things (the number of images and the number of scales), seem to be mixed up. The separability of the Gaussian filter is also not taken into account in the calculation.
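To make the reviewer's final point concrete: a 2D Gaussian kernel is separable, which reduces the per-pixel filtering cost from quadratic to linear in the kernel width w. A sketch of the standard argument (the notation here is ours, not the paper's):

```latex
% A 2D Gaussian factors into two 1D kernels:
%   G(x, y) = g(x)\,g(y),
% so the image can be convolved with g along rows, then along columns.
\[
  \underbrace{O(w^{2})}_{\text{direct } w \times w \text{ convolution}}
  \;\longrightarrow\;
  \underbrace{O(2w)}_{\text{two separable 1D passes}}
  \qquad \text{multiply-adds per pixel.}
\]
```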

The description of the results can be improved: (i) more information should be provided on the metrics (e.g., the NCM metric could be described in more detail: what subset of points is used for its calculation?); (ii) the hardware used to measure the running speed of the algorithms should be stated; (iii) the resolutions used to calculate the speed and accuracy reported in Table 4 should be given (subsection 3.4 suggests resolution scaling is a tunable parameter).
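For context on point (i): NCM (Number of Correct Matches) is conventionally computed by counting matched pairs whose reprojection error under a ground-truth transform falls below a pixel threshold. A minimal sketch under that convention; the function name, the 3-pixel threshold, and the use of a homography as the ground-truth transform are illustrative assumptions, not the paper's definition:

```python
import numpy as np

def ncm(pts_src, pts_dst, H_gt, threshold=3.0):
    """Count matches whose source point, warped by the ground-truth
    homography H_gt, lands within `threshold` pixels of its match.

    pts_src, pts_dst : (N, 2) arrays of matched coordinates; H_gt : (3, 3).
    """
    ones = np.ones((len(pts_src), 1))
    warped = np.hstack([pts_src, ones]) @ H_gt.T        # homogeneous warp
    warped = warped[:, :2] / warped[:, 2:3]             # back to Cartesian
    errors = np.linalg.norm(warped - pts_dst, axis=1)   # reprojection error
    return int(np.count_nonzero(errors < threshold))
```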

The English in the paper should be improved. To illustrate the point, some of the inconsistencies are listed below; however, the whole paper should be thoroughly proofread, as there are dozens of typos. A small portion of the grammar errors:
line 127 "is outstanding performed"
line 156  "if the pixel’s gray is distinguish"
line 231 "being good but failed"

As well as many misspellings:
line 29 "common" is not a correct word, "joint" fits better
line 33 "multi-temporal" what?
line 72 "methods[" (no space)
line 100 "sar" not capitalized, line 383 "cnn" not capitalized
line 125 "dose"

Author Response

Thank you very much for your valuable comments; please find the detailed response in the attached file.

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

The authors have revised the manuscript according to my suggestions. I think the quality of the manuscript is now adequate for publication.

A minor issue with the manuscript is that the cross-references are missing in lines 366 & 367.

Author Response

Thank you very much for the valuable comments on the cross-references. The error in the cross-referencing has been corrected, specifically on page 10, line 371 of the revised paper:

At the beginning of this work, we tried a variety of convolutional neural network models, including ResNets, EfficientNet [45], and FPN [46], and found that HRNet achieves the best results.

Author Response File: Author Response.docx

Reviewer 4 Report

After the revision, I feel the major issues have been fixed; however, the issue with the lack of a clear algorithm description is not completely gone. As stated in the authors' comments, "...we rewrite the specific steps of this part and give a detailed description of the input, output, process in conjunction with the bottom half of Figure 2". However, an image cannot replace a complete algorithm description. The steps described in the lines starting from 340 take the result of the K-means output, yet the input itself is described in another place of the paper; the whole algorithm is fragmented over the whole paper. While the paper may contain all the steps, readability would be drastically improved if it were put together in one paragraph.

Author Response

Thank you very much for the valuable comments.

The description of the algorithmic framework in this paper may indeed have been somewhat fragmented, so we have made the following changes to make the article more compact and more readable:

  1. We have positioned the cross-reference to Figure 2 more precisely, referring specifically to the lower half of Figure 2.
  2. In addition, for readability, we have supplemented the description: in the first half of the paragraph, we added a description of using ORB and GMS for coarse matching and K-means to obtain the coarse-matching sub-image pairs.
  3. At the end of the paragraph, the merging of point sets, the filtering, and the solving of the transformation model are mentioned. As the reviewer suggested, this modification results in a complete description of the overall matching framework for wide-swath SAR image registration and improves the completeness of the algorithm description.

The specific changes in the text are as follows (on page 9, line 338 to page 10, line 363):

The overall process consists of the following steps, as shown in the lower half of Figure 2:

  1. Feature extraction with HRNet (High-Resolution Net) [41]. Before this step, we combine ORB and GMS to obtain rough matching results, use the K-means++ method to find the cluster centers of the roughly matched feature points, and thereby obtain several pairs of roughly matched sub-images. The input of HRNet is each rough-matching image pair, and the output of the network is the pair of high- and low-resolution feature maps produced by HRNet's feature extraction and fusion.
  2. The low-resolution module. The input is the low-resolution feature map obtained from HRNet, which is flattened into one-dimensional form and augmented with positional encoding. The position-encoded one-dimensional feature vector is processed by the Performer [42] to obtain a feature vector weighted by the global information of the image.
  3. The matching module. The one-dimensional feature vectors obtained from the two images in the previous step are compared to obtain a similarity matrix, and a confidence matrix is obtained by applying softmax to the similarity matrix. Pairs whose confidence exceeds a threshold and which satisfy the mutual nearest-neighbor criterion are selected as the coarse-match predictions (a sketch of this selection follows the list).
  4. The refine module. For each coarse match obtained by the matching module, a window of size w × w is cropped from the corresponding position of the high-resolution feature map. The features contained in the window are weighted by the Performer, and the precise matching coordinates are finally obtained through cross-correlation and softmax (see the second sketch below). For each pair of roughly matched images, the output of this step is a set of matched point pairs with precise coordinates; after adding the initial offset from the rough matching, all point pairs are fused into one overall matched point set. After applying RANSAC filtering, the final overall set of matched point pairs is generated, and the spatial transformation solution is then computed.
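As a reading aid for step 3, the following is a minimal sketch of one way such a confidence matrix and mutual nearest-neighbor selection can be implemented, using the dual-softmax variant popularized by LoFTR; the function name, tensor shapes, and threshold value are illustrative assumptions, not the paper's code:

```python
import torch

def coarse_match(feat_a, feat_b, threshold=0.2):
    """feat_a: (Na, d) and feat_b: (Nb, d) are the flattened,
    position-encoded feature vectors of the two images.

    Returns (K, 2) index pairs of mutual nearest neighbors whose
    confidence exceeds the threshold."""
    sim = feat_a @ feat_b.T / feat_a.shape[1] ** 0.5  # similarity matrix
    conf = sim.softmax(dim=0) * sim.softmax(dim=1)    # dual-softmax confidence
    # Mutual nearest neighbors: entry (i, j) is maximal in both row i and column j
    mutual = (conf == conf.max(dim=1, keepdim=True).values) \
           & (conf == conf.max(dim=0, keepdim=True).values)
    keep = mutual & (conf > threshold)
    return keep.nonzero(as_tuple=False)
```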

There is also a process description of the overall algorithm framework in lines 122-128 on page 3, but the supplementary part here (page 9, line 338 to page 10, line 363) is more detailed.
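For step 4, a companion sketch of the final sub-pixel localization: the query descriptor is correlated against a w × w window of Performer-weighted high-resolution features, and the softmax-weighted expectation of the coordinates (a soft-argmax) gives the refined position. The names and the exact window handling are our assumptions:

```python
import torch

def refine_match(query, window):
    """query  : (d,) descriptor at a coarse-match location in image A.
    window : (w, w, d) Performer-weighted high-resolution features
             cropped around the coarse match in image B.

    Returns sub-pixel (x, y) coordinates within the window via soft-argmax."""
    w = window.shape[0]
    scores = window.reshape(-1, window.shape[-1]) @ query  # cross-correlation
    prob = scores.softmax(dim=0).reshape(w, w)             # match distribution
    ys, xs = torch.meshgrid(torch.arange(w, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32),
                            indexing="ij")
    # Expected coordinates under the softmax distribution (soft-argmax)
    return (prob * xs).sum(), (prob * ys).sum()
```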

Author Response File: Author Response.docx
