A Remote Sensing Method for Crop Mapping Based on Multiscale Neighborhood Feature Extraction
Round 1
Reviewer 1 Report (Previous Reviewer 4)
Thank you for the updated manuscript.
-P2L80: The classification of methods into "supervised methods, machine learning methods, and deep learning methods" is a bit odd. Maybe the authors mean traditional machine learning methods (RF / SVM) and deep learning methods.
-P4L154: It is not clear what is meant by "the central pixel being between yes or no".
-P4L157: I still do not understand the argument for the spatial coordinates. The reasoning presented by the authors, to my understanding, is: 1. A pixel at a boundary will be surrounded by pixels with a lot of variability. 2. A pixel at the center of a parcel will have a more homogeneous surrounding. 3. Thus the spatial coordinates are crucial.
I do not see the logical step from 2 to 3. A simple convolution followed by pooling can tell if the region is homogeneous or not. How do the coordinates help? Also, just to be sure: we are talking about the integer pixel position, not the geo-location of the pixels, right?
I would like the authors to motivate this part of their methods more clearly even if I agree that the numerical experiments support the idea.
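The reviewer's point can be illustrated with a small sketch. It assumes that the coordinates are window-relative integer positions rescaled to [-1, 1] (the usual coordinate-convolution convention, not confirmed by the manuscript), and it uses a hand-written mean absolute deviation as a stand-in for what learned convolution and pooling layers could compute. The function names and the patch values are hypothetical.

```python
# Minimal sketch: window-relative coordinate channels and a pooled
# homogeneity score for a 5x5 patch. Pure Python, illustrative only.

def coord_channels(size):
    """Return (row, col) channels with integer positions scaled to [-1, 1]."""
    def scale(i):
        return 2.0 * i / (size - 1) - 1.0
    rows = [[scale(r) for _ in range(size)] for r in range(size)]
    cols = [[scale(c) for c in range(size)] for _ in range(size)]
    return rows, cols

def homogeneity_score(patch):
    """Mean absolute deviation from the patch mean (close to 0 means uniform)."""
    values = [v for row in patch for v in row]
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)

# Hypothetical single-band patches: one inside a parcel, one on its edge.
interior = [[0.8] * 5 for _ in range(5)]
boundary = [[0.8] * 2 + [0.2] * 3 for _ in range(5)]

# The coordinate channels depend only on the window size, so they are
# identical for every patch, whatever its content.
same_coords = coord_channels(5) == coord_channels(5)

interior_score = homogeneity_score(interior)
boundary_score = homogeneity_score(boundary)
```

Under these assumptions the two patches are separated by the pooled statistic alone, while the coordinate channels carry the same values for both, which is why the step from homogeneity to coordinates needs an explicit argument.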
Author Response
Please see the attachment.
Reviewer 2 Report (Previous Reviewer 2)
The article has been improved, and I have no further comments on it.
Author Response
Thank you for your constructive comments and approval, which have been very helpful in improving the manuscript.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Please check the attached PDF file.
Comments for author File: Comments.pdf
Author Response
Thank you for your constructive comments, which have been very helpful in improving the manuscript. We have carefully studied the comments and made revisions in accordance with the reviewers' suggestions. The quality of the writing and analysis of this manuscript has been improved and we hope that it will be approved by the reviewers. The following is a point-by-point response. Revised content is highlighted in yellow. Please check the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
My comments on the article titled ‘a remote sensing method for crop mapping based on multiscale neighborhood feature extraction’ are as follows:
1. The article introduces the reference data, but the dates on which your team collected them are missing. The Sentinel-2 data used for this study are also not described in detail. This information is important for the study and should be clearly stated in the article.
2. Fig. 1 is not appropriately annotated, as Hulunbeier is a city-level unit whereas Heilongjiang is a province. You may add more city boundaries within Heilongjiang.
3. Regarding the labeled fields in Figs. 8 to 12, the article does not describe how your team prepared the labels. The role played by the reference data is also not explained.
4. In Figs. 11 and 12, the green areas of the sowing pictures are labeled as soybean. This seems wrong.
5. Check all figures and add legends.
6. Check all acronyms and give the full name when each acronym first appears in the article.
7. The conclusions section should be rewritten. The main points of this article are not clearly presented.
Author Response
Thank you for your constructive comments, which have been very helpful in improving the manuscript. We have carefully studied the comments and made revisions in accordance with the reviewers' suggestions. The quality of the writing and analysis of this manuscript has been improved and we hope that it will be approved by the reviewers. The following is a point-by-point response. Revised content is highlighted in yellow. Please check the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
1. Line 221-224. What does channel information mean? Is it spectral information? Please clarify it in the paper.
2. Line 230. What does temporal channel mean?
3. Line 333. The explanation of Pe seems incorrect.
4. Are the results in Table 2 for Regions A-D or for the entire study area?
5. Lines 405-409 and 429-435 are better moved to the method part.
6. The MACN results in Fig. 12 show wider boundaries than those in the labels. Does the window size cause this problem? Since the authors claim that 5x5 is the optimal window size in this study, how do they explain this problem?
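For reference on comment 3, and assuming that Pe in the manuscript denotes the expected chance agreement in Cohen's kappa coefficient (the review does not state this), the standard definitions for a k-class confusion matrix are given below. Here n_{ii} is the number of correctly classified samples of class i, n_{i+} and n_{+i} are the row and column totals, and N is the total number of samples.

```latex
\kappa = \frac{P_o - P_e}{1 - P_e}, \qquad
P_o = \frac{1}{N} \sum_{i=1}^{k} n_{ii}, \qquad
P_e = \frac{1}{N^2} \sum_{i=1}^{k} n_{i+} \, n_{+i}
```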
Author Response
Thank you for your constructive comments, which have been very helpful in improving the manuscript. We have carefully studied the comments and made revisions in accordance with the reviewers' suggestions. The quality of the writing and analysis of this manuscript has been improved and we hope that it will be approved by the reviewers. The following is a point-by-point response. Revised content is highlighted in yellow. Please check the attachment.
Author Response File: Author Response.docx
Reviewer 4 Report
Summary:
This paper addresses the problem of crop type classification from multi-temporal satellite optical data. The authors propose a neural architecture for pixel-based prediction of crop type. The architecture takes as input three local spatial windows around the pixel of interest to provide some context and reduce variations in the predictions of neighbouring pixels. Additionally, the network comprises a coordinate convolution, and a spatial and spectral modulation block. The authors evaluate the performance of this model on a small dataset of ~2k parcels of maize and soybean, using two satellite observations per sample. They compare their approach to a CNN, an LSTM and a RF and show that the proposed approach brings a significant performance improvement compared to these baselines.
Strengths
- The proposed approach performs better than the selected baselines on the authors’ dataset.
Weaknesses
- The proposed approach is quite complex, and is not presented in a very clear way. More pressingly, the motivation is not very well articulated, and the missing implementation details and missing ablation study do not help convince the reader of the value of the different design choices.
- The choice of problem statement is also confusing. I do not understand why the authors do not address the problem as a semantic segmentation task where a complete image is processed at once. The available data allows for this framing, and this would directly solve the “local context” problem that the authors set out to solve.
- The dataset used for model evaluation is small (2K parcels) and is a simple binary classification task.
- The literature review missed a lot of relevant references on deep learning for crop type classification.
- In my opinion, the quality of writing could be improved by using the present tense and active form, avoiding hand-wavy motivations, and streamlining the presentation of the method, which is at the moment quite hard to follow.
For all those reasons, I find that the proposed paper is not mature for publication.
See detailed comments below.
Detailed feedback:
- P1L44: “Object-oriented classification method first segment the images […] and then use algorithms to classify the segmented results” There are more recent deep learning approaches that perform both steps at the same time by framing the task as panoptic segmentation (ref: Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks, Garnot et al. ICCV 2021).
- P2L91: I am not sure that the “mixed pixel problem” (i.e., boundary pixels with mixed semantic content) is the most pressing research question in this field at the moment. More pressing topics include, to name a few, self-supervised learning, geographical generalisation, long-tail distribution of crop types, efficient temporal encoding, low spatial resolution, and multi-modality.
- P2L84: The size of the input tensor grows linearly with the length of the time series, not exponentially.
- P3L110: The literature review misses many relevant references for deep learning models for crop mapping from satellite time series (Convolutional LSTMs for cloud-robust segmentation of remote sensing imagery, Multi-temporal land cover classification with sequential recurrent encoders, Self-attention for raw optical satellite time series classification, Satellite Image Time Series Classification with Pixel-Set Encoders and Temporal Self-Attention, Panoptic Segmentation of Satellite Image Time Series with Convolutional Temporal Attention Networks)
- P3L111: The fact that spatial context is not properly taken into account is not related to deep learning, it’s related to the problem statement.
- P3L118: the authors focus on a setting of “pixel based classification” where each pixel is classified separately. It is a good point that this setting suffers from a lack of contextual information. Yet, one common way to go beyond this limitation is to frame the problem as a semantic segmentation task. A model processes all the pixels of an image at once and makes a prediction for each pixel. The setting addressed by the authors looks like an intermediate option only taking information in a local 5x5 window around the pixel under consideration. I think the authors need to motivate this setting. Why not directly do semantic segmentation, which would directly solve problems (1) and (2) mentioned in this paragraph ?
- P3L125: It’s not clear why the absolute spatial position of a pixel would help the classification, the authors need to present some arguments there.
- P3L138: The first sentence of the paragraph is redundant with the previous paragraph.
- P4L168: The dataset size is relatively small, consider also testing the proposed methods on larger publicly available benchmarks (e.g., BreizhCrops, PASTIS, DENETHOR).
- P5L195: Why take only two dates ? The full time series over one year would surely help.
- P5L215: “the input was divided into three scales (1x1, 3x3, 5x5)” This is a bit ambiguous in my opinion. Do the authors mean that for each pixel they take three local windows of size 1, 3 and 5? If so, the three windows have the same resolution, so they are at the same scale. Also, all the information of the 1 and 3 windows is contained in the 5 window. So it’s not clear what the first two windows bring to the model. This should be motivated and evaluated in the ablation study (but isn't).
- P5L218: “Coordinate convolution was introduced in the input layer of the 3 × 3 and 5 × 5 branches to strengthen the spatial relationships between the central pixel and its neighboring pixels, thereby enhancing boundary information while effectively reducing feature losses.”
The motivation for the coordinate convolution seems a bit hand wavy. Can the authors explicitly expose how the coordinates are expected to help making better predictions?
- P5L226: The dataset used in the paper only has 2 dates per sample. It’s a bit of a stretch to talk about a temporal sequence. For that same reason the choice of an LSTM as a baseline is questionable.
- P5: It’s not clear if the convolutional layers are shared between dates or if there is one convolutional network for each of the two dates.
- P6Fig2: The symbol used for concatenation is typically used for matrix multiplication so it’s a bit confusing. Same comment for Figs 3 and 4
- P8L266: “Unlike the original channel attention module, the channel convolutional attention module replaced the two-layer multilayer perceptron (MLP) with two layers of full convolution. […] In this module, the input feature maps (h × w × c) were first squeezed by global pooling and maximum pooling to obtain two (1 × 1 × c) vectors. Then, they were fed into the first layer of full convolution separately and the size of the convolutional kernels was determined by the adaptive function, which was activated by the ReLu function after convolution. “
This does not make sense. Feeding a 1 x 1 x c feature map to a 1x1 convolution is effectively the same as using an MLP. So there is no difference with the original channel attention method.
- P10L338: please cite relevant work for the baseline models.
- P11L348: A lot of implementation details are missing: for the proposed approach learning rate, batch size, optimiser, number of epochs, stopping strategy. For the competing approaches it’s not obvious how the other architectures are applied to the dataset. Does the LSTM only see the central pixel ? Again applying an LSTM to only 2 dates is a bit overkill, an MLP would probably perform equally or better. How is the temporal dimension handled by the CNN? Which window size does the CNN have access to (1,3,5 ?). What hyper parameters for the competing methods ? What is the size of each model in terms of number of trainable parameters ?
- P14L436: why does the ablation study not report the performance metrics ? No conclusion can be drawn on the value of the coordinate convolution and the channel attention just based on qualitative results. Also as mentioned earlier I would also evaluate the interest of having 3 branches instead of just operating on the larger window size.
- General comment: I usually recommend writing articles in the present tense and in active form (e.g., “we use …”, “we propose to …”). I found the use of past tense and passive form (“convolutions were used”) harder to read.
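The equivalence raised in the comment on P8L266 can be checked numerically. The sketch below follows the reviewer's reading, in which the "full convolution" is a 1 × 1 convolution applied to the pooled 1 × 1 × c descriptor; the manuscript's adaptive kernel size is not modelled. All weights and inputs are hypothetical values chosen to be exact in floating point.

```python
# Minimal sketch: a 1x1 convolution on a 1 x 1 x c descriptor gives the
# same output as a fully connected layer with the same weights.

def conv2d(fmap, kernel, bias):
    """Valid 2-D cross-correlation.
    fmap: h x w x c_in, kernel: k x k x c_in x c_out, bias: c_out."""
    h, w, k = len(fmap), len(fmap[0]), len(kernel)
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            pixel = []
            for o in range(len(bias)):
                acc = bias[o]
                for di in range(k):
                    for dj in range(k):
                        for ci, v in enumerate(fmap[i + di][j + dj]):
                            acc += v * kernel[di][dj][ci][o]
                pixel.append(acc)
            row.append(pixel)
        out.append(row)
    return out

def dense(x, weights, bias):
    """Fully connected layer: y[o] = bias[o] + sum_i x[i] * weights[i][o]."""
    return [bias[o] + sum(x[i] * weights[i][o] for i in range(len(x)))
            for o in range(len(bias))]

# Hypothetical weights (c_in = 4, c_out = 2) and a pooled descriptor.
weights = [[0.5, 1.0], [-1.0, 0.0], [0.25, -0.5], [2.0, 0.75]]
bias = [0.5, -0.25]
squeezed = [3.0, 1.0, 4.0, 1.5]

mlp_out = dense(squeezed, weights, bias)
conv_out = conv2d([[squeezed]], [[weights]], bias)[0][0]
```

With a single spatial position there is no neighbourhood for the kernel to slide over, so the convolution reduces to the same weighted sum over channels as the dense layer.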
Author Response
Thank you for your constructive comments, which have been very helpful in improving the manuscript. We have carefully studied the comments and made revisions in accordance with the reviewers' suggestions. The quality of the writing and analysis of this manuscript has been improved and we hope that it will be approved by the reviewers. The following is a point-by-point response. Revised content is highlighted in yellow. Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 4 Report
I appreciate the effort of the authors to improve the article. In particular, the added numerical results for the ablation study give more evidence that the design choices made improve the performance. Yet I think the paper still has important weaknesses:
- The comparison with other architectures is not fair. Comparing the proposed approach with 16M trainable parameters to a CNN with only 900k parameters is not fair. The expressive power of a model is loosely related to the number of parameters. So a model with 16 times more parameters is expected to perform better. To prove that the proposed approach is better suited to the problem at hand, I recommend that the authors compare it to a standard CNN architecture with a similar number of parameters. Resnet-18 would be a good candidate (~11M parameters).
- The motivation for the chosen problem statement is still not clear (see comment in my previous review for P3L118). The available dataset provides dense annotation (one label for each pixel in the AOI). So the problem could be tackled as a semantic segmentation task where the model takes for example a 64x64 pixel input image and makes a prediction for each pixel. We still have pixel based predictions but the model has access “by design” to a larger region. This is a standard way of doing crop type mapping (https://openaccess.thecvf.com/content_CVPRW_2019/html/cv4gc/Rustowicz_Semantic_Segmentation_of_Crop_Type_in_Africa_A_Novel_Dataset_CVPRW_2019_paper.html ; https://www.mdpi.com/444188 ; Pre-season crop type mapping using deep neural networks). Can the authors expand on what motivates them to focus on their problem statement instead? What are the situations in which this framing is more relevant than semantic segmentation?
- The added motivation for including the coordinate convolution (P3L149) is still a bit of a tautology (~“it is essential to sense the position to facilitate the acquisition of the position”). Thus I still think that this part of the method should be motivated with more compelling arguments. Also, the ablation study for the coordinate convolution should use the same network but without giving the coordinates as additional inputs. I think the present ablation removes some branches of the network, so the decrease in performance can be attributed to the reduced size of the architecture. If the performance is lower with the same network but without the coordinates as input, then that would be a good sign that the coordinates are indeed valuable information. Obviously the network would be a bit smaller, because instead of h × w × (c + 2) feature maps it would have only h × w × c, but this difference would be marginal.
- The related work section of the introduction is poorly organised and does not help to get a clear view of the recent state of the art and how the proposed approach compares to those methods. Also, most of this related work section deals with crop type mapping addressed in a setting other than the one tackled by the authors. Also it’s “panoptic segmentation” not “panoramic”.
- I agree that the ambiguity of boundary pixels is a common problem in crop type mapping. But the “salt and pepper phenomenon” is not a standard concept in that literature. Maybe spatial inconsistency is a better wording.
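The remark that removing the two coordinate channels changes the model size only marginally can be made concrete with a rough count. The sketch below assumes a single input convolution with a 3 × 3 kernel, 10 spectral bands and 64 filters; these three values are hypothetical and are not taken from the manuscript. Only the 16M total comes from the review above.

```python
# Minimal sketch: extra trainable parameters introduced by feeding
# c + 2 instead of c channels to one input convolution layer.

def conv_params(kernel_size, in_channels, out_channels, use_bias=True):
    """Trainable parameters of a standard 2-D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if use_bias else 0)

KERNEL, BANDS, FILTERS = 3, 10, 64   # hypothetical layer configuration
TOTAL_PARAMS = 16_000_000            # model size quoted in the review

with_coords = conv_params(KERNEL, BANDS + 2, FILTERS)
without_coords = conv_params(KERNEL, BANDS, FILTERS)
extra = with_coords - without_coords   # equals KERNEL**2 * 2 * FILTERS
share = extra / TOTAL_PARAMS
```

Under these assumptions the coordinate channels add 1,152 weights per input layer, a few thousandths of a percent of the full model, so an ablation that keeps the architecture fixed and only drops the coordinate inputs would isolate their contribution.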