A Multi-Scale Convolution and Multi-Layer Fusion Network for Remote Sensing Forest Tree Species Recognition
Round 1
Reviewer 1 Report
The paper proposes a forest tree species classification method for remote sensing images to overcome the challenge posed by the visual similarity between different species. The network introduces the SMCAC module and the MSFF module to obtain robust feature representations and improve the accuracy of the tree species classification task.
However, the completeness of the work is not sufficient: (1) There are only three recent references related to the tree species classification task in the paper; it appears that the authors have not surveyed the research status comprehensively. (2) In "4. Discussion", Figures 12 and 13 cannot effectively illustrate the feature extraction ability of the modules; it is suggested to use the t-SNE visualization method to visualize the feature vectors of the network and demonstrate the robustness of the features. (3) In "3.1 Datasets", the preprocessing step is not explained clearly.
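For reference, a minimal sketch of the suggested t-SNE visualization, assuming the network's penultimate-layer feature vectors and the corresponding species labels have already been extracted into arrays (the file names `features.npy` and `labels.npy` are hypothetical):

```python
# Hypothetical sketch: project extracted feature vectors to 2-D with t-SNE
# and colour the points by tree species to inspect cluster separability.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

features = np.load("features.npy")   # (N, D) feature vectors from the network (assumed precomputed)
labels = np.load("labels.npy")       # (N,) integer species labels (assumed precomputed)

embedding = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(features)

plt.figure(figsize=(6, 5))
scatter = plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=5, cmap="tab10")
plt.legend(*scatter.legend_elements(), title="Species", fontsize=8)
plt.title("t-SNE of learned feature vectors")
plt.tight_layout()
plt.show()
```

Well-separated clusters in such a plot would support the claimed robustness of the learned features more directly than the current figures.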
Some obvious conceptual errors should be checked carefully: (1) According to Figure 4, the point-wise convolution C31 and the convolutional pyramid are parallel, yet these two operations are cascaded in formula (1) on page 6. (2) The introduction of the confusion matrix in Section 3.2 is wrong: each row of the confusion matrix represents the true values. (3) Why are the accuracy results of the proposed method different in the comparative experiment and the ablation experiment? In Table 3 the accuracy is 71.05%, while in Table 4 it is 89.63%.
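To illustrate the row-as-true-label convention referred to in point (2), a minimal check using scikit-learn, whose `confusion_matrix` follows the same convention (the toolkit choice here is only illustrative, not the paper's implementation):

```python
# Illustrative only: scikit-learn places true labels on rows and
# predicted labels on columns, matching the convention cited above.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
print(confusion_matrix(y_true, y_pred))
# [[1 1 0]   row 0: one class-0 sample correct, one misclassified as class 1
#  [0 2 0]   row 1: both class-1 samples correct
#  [0 0 1]]  row 2: the class-2 sample correct
```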
In addition, the writing in the paper is not rigorous:
1. The meanings of "Gap" in Table 1 and "Fa" in Equation (2) are not explained in the text.
2. In the second paragraph on page 8, there are errors in the mathematical expression of the feature map, such as 4/H × 4/H × 2C; please check whether the expression is correct.
3. The text in the fourth paragraph on page 8 is not relevant to the content of the paper.
4. The Kappa coefficient is not described in Section 3.2, but it is used directly to evaluate performance (a standard definition is sketched after this list for reference).
5. In Section 5, the name of the proposed network contains a spelling mistake, e.g., "SACA".
6. In the "References" section, the references are not in the correct format required by the journal. The content of the references is incomplete, and the expression of conference names is not uniform. The authors are urged to attach great importance to this.
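For reference, the standard definition of Cohen's Kappa coefficient that item 4 asks to be described, written here in its usual textbook form computed from the confusion matrix (not the paper's own notation):

```latex
% Cohen's Kappa from a confusion matrix N = (n_{ij}) with row sums n_{i+},
% column sums n_{+j}, and total sample count n.
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad
p_o = \frac{1}{n}\sum_{i} n_{ii}, \qquad
p_e = \frac{1}{n^{2}}\sum_{i} n_{i+}\, n_{+i}
```

Here $p_o$ is the observed agreement (overall accuracy) and $p_e$ is the agreement expected by chance.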
The English expression of the figure and table captions is not clear or standardized. Some figure captions are ambiguous, such as "Figure 10. Near-infrared light, green and blue, images processed in 3 bands".
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 2 Report
The authors proposed a multi-scale convolution and multi-layer fusion network for forest tree species recognition. The scientific soundness is good and the manuscript is well organized. However, the authors only ran experiments on one private dataset. To verify the generalization of the network, it would be better to validate the proposed network on other datasets.
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Reviewer 3 Report
This paper (# remotesensing-2556690) proposes a remote sensing image forest tree species classification method based on a Multi-scale Convolution and Multi-level Fusion Network (MCMFN) architecture. In particular, it is argued that two modules, Shallow Multi-scale Convolution Attention Combination (SMCAC) and Multi-layer Selection Feature Fusion (MSFF), have been adopted to improve classification accuracy. It is a very hot topic that many readers will be interested in. However, in order to clarify the knowledge newly revealed in the paper, it is recommended to revise it further from the following perspectives.
(1) The proposed MCMFN-based classification method would be of great help to researchers interested in implementation or verification if the learning process were summarized step by step, in algorithm format, at the end of Section 2 (Materials and Methods).
(2) To fully analyze a newly proposed classification algorithm, it is necessary to analyze not only its classification performance but also its computational complexity. The complexity analysis should be done both theoretically and experimentally.
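As an illustration of the experimental side of such an analysis, a minimal sketch assuming a PyTorch implementation of the network is available as `model` (a hypothetical name) and accepts standard image tensors:

```python
# Hypothetical sketch: report parameter count and average inference latency
# for a PyTorch model; a theoretical FLOP/MAC count would complement this.
import time
import torch

def profile(model, input_shape=(1, 3, 224, 224), runs=50, device="cpu"):
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    n_params = sum(p.numel() for p in model.parameters())
    with torch.no_grad():
        model(x)  # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        latency = (time.perf_counter() - start) / runs
    print(f"Parameters: {n_params / 1e6:.2f} M, avg latency: {latency * 1000:.1f} ms")
```

Packages such as fvcore or thop are commonly used to estimate the theoretical multiply-accumulate counts alongside such measurements.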
(3) In addition to the learning-related hyperparameters of the MCMFN algorithm shown in Table 2, the number of scales used for convolution and the number of levels used for fusion also seem to be topics worth discussing.
(4) Adding a reference (e.g., a URL) for the experimental dataset preprocessed from TreeSatAI would help researchers interested in implementation.
(5) In the ablation study (e.g., Table 4) and the comparison of overall accuracy (e.g., Figures 12 and 13), state-of-the-art (SOTA) methods should naturally be included in the comparative experiments. In particular, it seems necessary to compare against SOTA methods employing the ACmix attention mechanism.
(6) Newly coined terms should be presented with a source or accompanied by a clear definition. For example, the term "shallow features" in the sentence at Line 137 must be accompanied by a definition.
(7) The following arguments are very interesting: "Using traditional single-scale CNNs to extract feature information from remote sensing images leads to the loss of some effective feature information, directly impacting the classification performance." (Lines 189-191) and "To compensate for the spatial information lost in the convolution process, we use 3×3 point-wise convolution C_{31} to avoid losing crucial information between pixels." (Lines 203-204). If these claims were supported by ablation experimental results rather than by overall accuracy alone, the excellence of the proposed method could be demonstrated more convincingly.
(8) If any limitations of the proposed method have been discovered at this point, a detailed mention of them would be of great help to researchers interested in this topic.
(9) Some other minor issues include:
- In Figure 1, which shows an example of the large similarity between remotely sensed forest species, a detailed explanation of the similarities and differences between the species seems to need to be added to the caption.
- Shallow Multi-scale Convolution Attention Combination (SMCAC, Line 19) <=> Shallow Multi-Scale Dilated Attention Combination (SMCAC, Line 197): which one is correct? In addition, what is the SACA at Lines 452 and 468?
- In Expression (1), it is necessary to clearly define how the notations C_{31} and C_{3} denote different operations. In addition, the operations 'cat()', 'Gap', and '·' (dot) need to be clearly defined, even if their meanings can be inferred.
- "As illustrated in Figure 5, the ACmix attention mechanism [44] ..." (Line 216): where is [44]? What is the difference between a square and an inverted trapezoid?
- In addition, the contents of Figures 5 and 6 are very similar to those of the existing literature (e.g., Figures 1 and 2 in "On the Integration of Self-Attention and Convolution", CVPR 2022). An explanation of any differences would be appreciated.
- Typos still exist, e.g., "Covolution" in Figure 5; (H/4)×(H/4)×2C, ... should be (H/4)×(W/4)×2C, ... (Lines 268-277).
Author Response
Please see the attachment.
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
This version of the manuscript has been fully revised according to the review comments, and I think that it can be accepted.
Author Response
Thank you again for your review.
Reviewer 3 Report
As mentioned in the first round, this paper addresses a very hot topic that many readers will be interested in. However, in order to clarify the knowledge newly presented in the paper, it had been recommended to revise it from the following perspectives: an algorithmic description of the proposed method is required so that the reported results can be implemented or verified; the computational complexity of the newly proposed algorithm should be analyzed; comparative experiments should include SOTA methods, in particular those using the ACmix attention mechanism; the limitations of the proposed method found at this point should be discussed; and some other issues. Almost all of the above suggestions and opinions were properly addressed in the current revised version. However, one thing that still needs to be clearly addressed is the addition of SOTA methods to the comparative experiments. Reference [25], which has been included as a reference in the current version, adopts the ACmix attention mechanism. Therefore, reference [25] should be included in the comparative analysis to show the excellence of the proposed method; otherwise, an explanation should be given as to why it was excluded.
Author Response
Please see the attachment.
Author Response File: Author Response.docx