Article
Peer-Review Record

Continual Contrastive Learning for Cross-Dataset Scene Classification

Remote Sens. 2022, 14(20), 5105; https://doi.org/10.3390/rs14205105
by Rui Peng 1,2, Wenzhi Zhao 1,2,*, Kaiyuan Li 1,2, Fengcheng Ji 1,2 and Caixia Rong 1,2
Reviewer 1:
Reviewer 2:
Reviewer 3: Anonymous
Submission received: 24 July 2022 / Revised: 20 September 2022 / Accepted: 9 October 2022 / Published: 12 October 2022

Round 1

Reviewer 1 Report

The work describes the use of contrastive learning and knowledge distillation to alleviate the catastrophic forgetting problem for cross-dataset scene classification, i.e. learning the classifier continually from a stream of non-i.i.d. data. The authors propose a new method, CCLNet. Experiments were done using three scene classification datasets (AID, RSI, NWPU), where incremental learning was simulated by providing them one after another, resulting in three incremental steps. An ablation study and a comparison with other CL methods are provided. Additionally, many figures/visualizations support the experimental part of the work.

Strengths:

1. A good combination of existing methods for combating catastrophic forgetting (feature distillation, LwF) with an additional contrastive learning component inspired by recent self-supervised contrastive methods.

Weak points:

1. The way the article is written (line 289, line 27 saying "aggregate of the three datasets", and the method presented in Fig. 1) makes it unclear whether exemplars from previous steps are used or not. I assumed they are not, because the authors later compare against non-exemplar-based CL methods, i.e. LwF and EWC, but this should not raise any questions in this work. If exemplars are used, please specify the memory buffer size; otherwise state in the text that this is an "exemplar-free method", which will be clearer.

2. Algorithm 1 creates more confusion than clarification. The notation is not easy to follow and should be improved. First of all, the authors need to answer the question "What are we trying to present here?"; otherwise it is a waste of space. Just a few examples: (1) What happens if t = 1? The first line after the loop then reads D_1 <- D_0, but what is D_0? (2) There are no line numbers, so it is hard to refer to specific steps (!), but in the third for-loop values such as sim(f_new, f_old) are computed separately, and then what? There is no assignment. After this, the parameters W{*,b}_{t-1} are updated (I do not know why it is t-1 if the current task is t) with Eq. 12, but that equation only presents the final loss components. I can understand that the authors assumed a gradient-descent algorithm, later revealed to be Adam, but this is confusing. (3) W{*,b} is not described. (4) Where is the epoch loop? I guess 1..M, but I do not think that is correct, as within an epoch we iterate through all samples in D_t. (5) In "Require:", D_t^L and D_t^U are mentioned but never used later; they form D_t, but this is not made clear here. (A minimal restructuring sketch is given after this list.)

3. The ablation study is performed with reduced datasets (15%, 15%, 10%) instead of 30%, 30%, 20%. This makes it hard to compare the numbers from this experiment (Tab. 1) with those reported later, i.e. Tab. 2.

4. Ablation study: how is it possible that the methods differ so much in the first step, if this is the first incremental step? There is no distillation from a previous model, as we start at task t = 1. We see that each proposed element hurts current-task performance. It would be nice to see the combination CE + Contr, to check whether the contrastive loss adds anything for the current task. Otherwise, it is just a fight for stability and for not forgetting previous tasks, paying the price in plasticity.

5. Ablation, l. 476-477: "the optimal overall accuracy was obtained for the method which utilized spatial loss, class loss and contrast loss simultaneously". If we consider the overall average accuracy, this is not the best method; the row above is better at the final task. We see that this method starts at a very low point on the first task and also ends very low on the final one. That is why it looks so good in the forgetting comparison.

6. Why do the authors not consider adding trade-off parameters for each component of the final loss and tuning them, like the lambda parameter in LwF? In this way, more optimal values might be found, balancing stability/plasticity between knowledge distillation (Spa, Class) and learning new things (CE, Contr). (See the weighted-loss sketch after this list.)

7. Having only a single, fixed setting of 3 datasets is not convincing, nor do I consider it close to the "extensive experiments" for continual learning claimed in the conclusion (l. 696). Even with 3 datasets, more scenarios can be created, e.g. shuffled task orders, more splits, overlapping classes (domain-IL vs. class-IL), just to name a few.

8. Tab. 2: what hyperparameters were used for the compared methods, e.g. what lambda for EWC and LwF? How were they chosen? Why is the Step 1 starting point for those methods so low and not comparable with "Ours"? This is the first incremental step, so why do we see such a discrepancy here?

9. Some strong claims which are not true. A few examples:

- l. 88: "Nevertheless, the regularization-based continual learning method will forget the previous scene data without knowledge distillation due to the complicated cross-domain remote sensing scene images." Not true. Knowledge distillation is one method to alleviate forgetting, not the only one. What is more confusing is that KD is itself a form of regularization for continual learning: we regularize the new model with the old one.

- l. 108: "In addition, the images acquired from different satellites have the problems of variation in illumination, backgrounds, scale and noise that caused rapid forgetting." This does not sound like a cause of forgetting; it is more about concept shift or domain change/adaptation. We can talk about the forgetting phenomenon for the model, but I cannot say that some particular data cause more forgetting than other data in general. This needs to be made more precise or changed.

- l. 125: "Although the obtained contrastive information could enhance the feature representation further by leveraging contrastive learning," - it is not clear what the authors mean here...

- l. 136-138: once again the lack of KD "resulting in the failure...". A strong claim. Knowledge retention in general, yes; only through KD, no.

10. l. 247-253: repetition; it does not bring anything new, especially in this section.

11. Eq. 11: strange notation for the summation of (-,-)?

12. Very weak description of the training details. We know that CentOS 7.6 was used, but not the architecture of the MLP projection head. This is a strange prioritization of details for reproducible experiments. We can assume that the model is pretrained on ImageNet, but this is not stated in line 429.

13. Overall, the learning rate is very small; combined with a weight decay of 1e-3, this raises the simple question of how much the model can actually train. Instead of so many confusion-matrix plots, it would be good to show training/validation curves during the incremental training, or to additionally report plain fine-tuning with CE in Tab. 2, or kNN with a fixed backbone, as background information on how good the pretrained model is. (A kNN evaluation sketch is given after this list.)

14. Some typos:

- l.304: Unnecessary T letter.

- Eq.4 - cossim -> cos_sim

15. Even with zooming, some figures are not readable, e.g. Fig. 12 (!)

16. About the figures themselves: I would rather see fewer figures but more convincing experimental results and a better description of the methods' parameters.
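Illustrative sketch for point 2: a minimal Python restructuring of how Algorithm 1 might be presented, with explicit task, epoch and batch loops. This is a sketch under assumptions (the model returns both features and logits, Adam is the optimizer, and the contrastive term is omitted for brevity); it is not the authors' exact method.

```python
# Sketch only: assumed interface model(images) -> (features, logits).
# The contrastive loss of CCLNet is omitted here for brevity.
import copy
import torch
import torch.nn.functional as F

def train_continually(model, task_loaders, num_epochs, lr=1e-4, weight_decay=1e-3):
    model_old = None                                      # no teacher at the first task (t = 1)
    for t, loader_t in enumerate(task_loaders, start=1):  # tasks t = 1..T (datasets D_t)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
        for epoch in range(num_epochs):                   # epochs 1..M
            for images, labels in loader_t:               # iterate over all samples in D_t
                feats_new, logits_new = model(images)
                loss = F.cross_entropy(logits_new, labels)
                if model_old is not None:                 # distillation only from task 2 onward
                    with torch.no_grad():
                        feats_old, logits_old = model_old(images)
                    # feature-level (spatial) distillation via cosine similarity
                    loss = loss + (1 - F.cosine_similarity(feats_new, feats_old, dim=1)).mean()
                    # class-level distillation on the old model's soft predictions
                    loss = loss + F.kl_div(F.log_softmax(logits_new, dim=1),
                                           F.softmax(logits_old, dim=1),
                                           reduction="batchmean")
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        model_old = copy.deepcopy(model).eval()           # frozen teacher for task t + 1
    return model
```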
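Illustrative formula for point 6: one possible weighted form of the final loss, where the lambda_i are hypothetical trade-off parameters (not present in the manuscript) that would let the authors balance stability against plasticity:

```latex
\mathcal{L}_{total} = \mathcal{L}_{CE} + \lambda_{1}\,\mathcal{L}_{Contr} + \lambda_{2}\,\mathcal{L}_{Spa} + \lambda_{3}\,\mathcal{L}_{Class}
```

Setting lambda_2 = lambda_3 = 0 would recover plain fine-tuning, while large lambda_2, lambda_3 favour stability over plasticity.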
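Illustrative sketch for point 13: a kNN baseline on features from a frozen, ImageNet-pretrained backbone. The ResNet-50 choice, the loader names and k = 20 are assumptions for illustration, not details taken from the manuscript.

```python
# Sketch only: train_loader / test_loader are assumed to exist for the current dataset.
import torch
import torchvision
from sklearn.neighbors import KNeighborsClassifier

def extract_features(backbone, loader, device="cpu"):
    backbone.eval().to(device)
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            f = backbone(images.to(device)).flatten(1)  # (B, D) feature vectors
            feats.append(f.cpu())
            labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Frozen backbone: ResNet-50 with the classification head removed.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-1])

X_train, y_train = extract_features(backbone, train_loader)
X_test, y_test = extract_features(backbone, test_loader)
knn = KNeighborsClassifier(n_neighbors=20).fit(X_train, y_train)
print("kNN accuracy with frozen backbone:", knn.score(X_test, y_test))
```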

 

I opt for reject. This paper is based on a single experimental setting, and the results for task 1 are not clear, both in the ablation study and for the compared methods. For me, this work needs more than just another iterative revision.

Author Response

We greatly appreciate your comments, which have helped us to improve the quality of our manuscript. Our responses to your comments are provided in the attached document.

Author Response File: Author Response.docx

Reviewer 2 Report

Please revise the English. For example, "which namely catastrophic forgetting" is not clear. Also, in my opinion, although I see that others use it, dramatic words like "catastrophic" are not appropriate for a journal article: what would happen if an algorithm forgets old images, and why would that be catastrophic? Capitalization is missing in some places.

Add information about the distribution of classes in each dataset. I do not think accuracy is the best measure for a journal publication. The first sentence of Section 5.1 is unclear: what are the 15-15-10 for, and why the difference among the datasets? Explain. In Table 1, why are only some columns bolded? Is a 0.26% increase in accuracy enough to be considered an improvement? In Table 3, the authors need to compare with more recent methods; the classical ones alone are not enough.

Figure 6 is not very useful: too busy, with too little explanation.

The references are good but seem regionalized. I found one self-citation.

Author Response

Thank you for your suggestions; they have helped a lot to improve the quality of our manuscript!

Author Response File: Author Response.docx

Reviewer 3 Report

This paper develops a new algorithm to tackle the challenge of catastrophic forgetting in continual learning settings. The idea is to use knowledge distillation to encode information about previously learned tasks while updating the model to learn a new task. Additionally, contrastive learning is used to stabilize data representations in the embedding space. Experimental results on three real-world datasets are provided to demonstrate that the proposed algorithm is effective and leads to state-of-the-art performance when compared to prior works.
Prior works in the remote sensing literature have not addressed continual learning settings, and the paper is novel in this aspect. The results are also convincing. I have the following comments to be addressed before publication.

1. A group of continual learning algorithms uses generative learning to synthetically generate samples of previously learned tasks. For example:

a. Shin, H., Lee, J.K., Kim, J. and Kim, J., 2017. Continual learning with deep generative replay. Advances in Neural Information Processing Systems, 30.

b. Rostami, M., Kolouri, S., Pilly, P. and McClelland, J., 2020, April. Generative continual concept learning. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 34, No. 04, pp. 5545-5552).

c. Verma, V.K., Liang, K.J., Mehta, N., Rai, P. and Carin, L., 2021. Efficient feature transformations for discriminative and generative continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13865-13875).

The advantage of the above works is that they do not need a memory buffer. I think it would be advantageous to include them in the introduction section to be complete.

2. In Section 4.2, what is the rationale behind using the order AID, NWPU-45, and RSI-CB256? Is it possible to report performances on the other two possible task orders? I think comparing different task orders could be interesting.

3. I think having only 3 tasks is limiting. Is it possible to increase the number of tasks, e.g. by splitting the datasets according to different classes? The splitting approach is common in the continual learning literature to generate more tasks. (A splitting sketch is given after this list.)

4. From Table 2, it looks like not only is catastrophic forgetting addressed, but there is also positive backward transfer. This is an interesting result, but could you elaborate more on why we have this outcome?

5. I recommend improving the clarity of Figure 5, including enlargement, increasing the font sizes on the axes, etc.
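Illustrative sketch for comment 3: one common way to generate more tasks by partitioning a labelled dataset into disjoint class groups (class-incremental splits). The dataset object and the per-sample label list are assumptions for illustration, not details from the manuscript.

```python
# Sketch only: `dataset` is any indexable dataset and `labels` is the list of
# its per-sample class labels, aligned with the dataset indices.
import random
from torch.utils.data import Subset

def split_into_class_incremental_tasks(dataset, labels, num_tasks, seed=0):
    classes = sorted(set(labels))
    random.Random(seed).shuffle(classes)        # fixed class order per seed
    chunk = len(classes) // num_tasks
    tasks = []
    for t in range(num_tasks):
        start = t * chunk
        end = (t + 1) * chunk if t < num_tasks - 1 else len(classes)
        task_classes = set(classes[start:end])  # disjoint class group for task t
        indices = [i for i, y in enumerate(labels) if y in task_classes]
        tasks.append(Subset(dataset, indices))  # one Subset per incremental task
    return tasks
```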

Author Response

Thank you very much for your helpful comments, which have greatly improved the quality of our article!

Author Response File: Author Response.docx

Round 2

Reviewer 2 Report

The authors have answered my questions accordingly.

Reviewer 3 Report

The authors have addressed my concerns by performing new experiments and revising the paper.
