Multiscale and Multitemporal Road Detection from High Resolution SAR Images Using Attention Mechanism
Round 1
Reviewer 1 Report
This study proposes a network for road detection from SAR imagery. Its main contribution is the use of multiscale and multitemporal information.
The document is easy to read and follow.
The English needs extensive revision; there are many misspelled words.
The document is well supported by references, although most are old (only 7 of the 42 references are from 2020 or 2021).
The subject of the paper is interesting, with great potential for application.
The introduction section runs to about 2.5 pages, which seems too long. It should be shortened.
One of the weaknesses of this study is the small number of images used in the dataset.
Another limitation of this study is that Figure 10 clearly shows that the proposed method missed a lot of roads, especially the narrower ones.
At the beginning of the Abstract: “Road detection from images gas emerged”. Did you mean “has emerged”? Please correct.
In line 33 please correct “in the the road”.
In line 33 the authors say that “… synthetic aperture radar (SAR) has received a lot of attention in the the road extraction area recently…”. The authors should cite some recent works on road extraction at the end of this sentence.
Authors should replace the “Reference […]” structure when referring to other works with “Authors in […] ” or something similar.
Figures 3, 4 and 5 appear in the document before they are referenced in the text. Please correct.
In line 196 the authors say that “We apply data augmentation…”. The authors should state how many samples existed originally and how many the dataset contained in total after augmentation.
The authors use 397 images for training and 97 images for testing. The authors should explain why they are using such a small dataset.
In line 299 the authors should explain why regions whose area is less than eighty pixels are removed.
Table 2 should appear in the document before Figures 8 and 9.
Figures 8 and 9 should appear in the document only after they are referred to in the text. Please correct.
In the discussion and conclusion sections, the authors say that better results are obtained with more images. That finding is obvious and intuitive. The authors should instead discuss how the quality of the results varies with the number of images/samples, and conclude how many are necessary.
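For reference, the post-processing step questioned in the line 299 comment can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `remove_small_regions`, the use of 8-connectivity, and the toy mask are assumptions; only the eighty-pixel threshold comes from the manuscript.

```python
from collections import deque

import numpy as np


def remove_small_regions(mask, min_area=80):
    """Drop connected foreground regions smaller than min_area pixels.

    Hypothetical helper: 8-connectivity is an assumption, since the
    manuscript's connectivity rule is not stated in this report.
    """
    mask = np.asarray(mask, dtype=bool)
    height, width = mask.shape
    visited = np.zeros((height, width), dtype=bool)
    cleaned = np.zeros((height, width), dtype=bool)
    neighbours = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                  if (dr, dc) != (0, 0)]
    for r in range(height):
        for c in range(width):
            if not mask[r, c] or visited[r, c]:
                continue
            # Breadth-first search collects one connected region.
            region = [(r, c)]
            visited[r, c] = True
            queue = deque(region)
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in neighbours:
                    nr, nc = cr + dr, cc + dc
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and mask[nr, nc] and not visited[nr, nc]:
                        visited[nr, nc] = True
                        region.append((nr, nc))
                        queue.append((nr, nc))
            # Keep the region only if it reaches the area threshold.
            if len(region) >= min_area:
                rows, cols = zip(*region)
                cleaned[np.array(rows), np.array(cols)] = True
    return cleaned


# Toy mask: one 100-pixel block (kept) and one 9-pixel block (removed).
demo = np.zeros((20, 20), dtype=bool)
demo[1:11, 1:11] = True
demo[15:18, 15:18] = True
print(int(remove_small_regions(demo).sum()))
```

A step like this suppresses isolated false positives but also deletes short, genuine road fragments, which is why the choice of threshold deserves a justification.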
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The paper presents a method for road detection from multiple SAR images, combining multiscale and multi-temporal attention mechanisms. The paper is very well written and organized. Although it is based on three previous works, namely HRNet [35], criss-cross attention [36] and multi-scale attention [37], it combines ideas from these three papers very elegantly towards solving the specific semantic segmentation problem, i.e. road detection from multiple Terra-SAR images. I personally believe that the proposed method could have many interesting applications/extensions for satellite remote sensing, and therefore I think that a discussion of future work is missing and needs to be added to the “Conclusions” section.
I think that the manuscript can be accepted with minor revisions, as there are some issues that require corrections or further clarifications (see below):
l.1: “gas emerged” => “has emerged”
l.5: “Though some” => “Some”
p.2-3: You could replace “Reference” with author names.
Figure 2: This figure is copied from [35], so I think a) a reference is required and b) some additional details (e.g. 2-3 sentences) are needed in the paragraph referencing this figure, so that the reader can better understand this architecture (e.g. what and how many convolutions are performed, etc.).
p. 185: reolution => resolution
p. 208-209: You write “inspired by criss-cross attention” and you explain the algorithm you propose very elegantly, but some additional details are needed to explain why you use criss-cross attention instead of “non-local” attention (e.g. transformer architectures). An obvious answer is the performance advantages already demonstrated in [36], but you could elaborate a bit on this and on some of your choices (e.g. you don’t explain the criss-cross concept and you don’t seem to use the V (value) feature).
p.222: fristly => firstly
Paragraph after Figure 4: In this paragraph for many tensors you use (T-1) instead of the T that is shown in Figure 4. Please correct.
p.8, second line: “fea” => “feature”
p. 233: the notation used (“{…}/ F…”) is not clear; please explain or revise.
l. 251: I suggest replacing this “encoded” sequence of steps with plain text, e.g. the average reader may not be familiar with BN (Batch Normalization) or the ReLU function.
Section 3.1: Please elaborate on a) how you performed the annotation of the set and b) the availability of your dataset (and TerraSAR-X images in general). Also, you should mention here (and not later) that you split the images into 1024x1024 tiles, but it is still not clear how you select the train and test tiles (e.g. writing 397x1024x1024 is confusing). Please first explain in detail how you split the images into (non-overlapping?) tiles and how many tiles are created in this way.
p. 273-275: The dates provided do not match those in Table 1: 9/5/2013 (Area 1) and 4/3/2013 (Area 2).
Table 1: Please remove the white space at the start of the cells in the left column, so that the lines match.
p. 298: Why do you need this post-processing? Do you apply it in all methods, for fair comparison?
Figure 7: I thought MSMTHRNet(TAF) used ONLY temporal attention. Why do you show the multiscale attention structure in the Figure (as a second step)?
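To make the Section 3.1 request concrete, the kind of tiling description asked for could be as simple as the sketch below. It is an assumption-laden illustration, not the authors' procedure: the function name `split_into_tiles` is hypothetical, the tiles are taken to be non-overlapping, and partial tiles at the right and bottom borders are simply dropped. Only the 1024x1024 tile size comes from the manuscript.

```python
import numpy as np


def split_into_tiles(image, tile=1024):
    """Cut an image into non-overlapping tile x tile patches.

    Hypothetical helper: incomplete patches at the right and bottom
    borders are discarded, which is an assumption of this sketch.
    Returns a list of ((top, left), patch) pairs.
    """
    height, width = image.shape[:2]
    tiles = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            patch = image[top:top + tile, left:left + tile]
            tiles.append(((top, left), patch))
    return tiles


# Small stand-in for a SAR scene: a 5 x 7 array cut into 2 x 2 tiles.
scene = np.arange(5 * 7).reshape(5, 7)
tiles = split_into_tiles(scene, tile=2)
print(len(tiles), tiles[0][0], tiles[-1][0])
```

Reporting the tile origins in this way also makes it possible to state exactly which tiles go to the training set and which to the test set, and whether neighbouring tiles from the same scene end up on both sides of the split.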
Author Response
Please see the attachment
Author Response File: Author Response.pdf