STDecoder-CD: How to Decode the Hierarchical Transformer in Change Detection Tasks
Abstract
1. Introduction
- (1) Transformer networks have a strong ability to model context, but very few works have applied transformers to CD. We therefore introduce TransCNN to the CD task to capture long-range contextual information within the spatial/temporal scope, which is beneficial for identifying relevant changes in multitemporal images.
- (2) Using a Siamese TransCNN as the backbone, three types of decoding strategies were investigated to provide practical guidance for change detection.
- (3) Extensive experiments on three CD datasets validated the effectiveness and efficiency of the proposed algorithms, i.e., STDecoder-CD and its variant networks.
2. Materials and Methods
2.1. Relevant Datasets
2.2. A Brief Introduction to TransCNN
2.3. Incorporation of the Sharpness-Aware Minimizer (SAM)
2.4. Design of the Decoder Module
2.4.1. Type I STDecoder-CD
- (1) The pre-change (t1) and post-change (t2) RS images are input into the two branches of the Siamese network simultaneously, and the Siamese network with shared parameters extracts the bitemporal representations separately.
- (2) Unlike UNet [6], only four blocks are attached to each Siamese branch. A “Conv Block” layer precedes the UNet-like encoder–decoder operation, and an “FCN” layer follows it. The “Conv Block” layer splits the input image into tokens of non-overlapping 16 × 16 patches, while the “FCN” classifier upsamples the decoded feature maps by a factor of four and identifies the change of interest from them.
- (3) The encoder’s network architecture (X) is completely different from that of the decoder (Y): the former adopts the Siamese TransCNN structure, and the latter retains the “Conv + BN + ReLU” structure used in UNet.
- (4) A deep supervision strategy was developed to enrich the model capacity with extra regularization. As shown in Figure 2a, during training the loss of each deep supervision branch is computed independently and backpropagated directly to the intermediate layers. Specifically, the outputs of the FCN layer and of each decoder block are fed into a plain 3 × 3 convolution layer followed by a bilinear upsampling layer, and their losses are then computed (four losses in total). In this way, the intermediate layers are effectively trained and their weights are suitably updated [12], alleviating the vanishing gradient problem.
- (5) As shown in Figure 2a, to fuse the hierarchical feature representations XnA and XnB (n = 1, 2, 3, 4), we apply concatenation at the top layer of the TransCNN backbone and subtraction at the remaining layers. This combination is termed Difference Detector I in this study. For a Siamese network, the concatenation weight-sharing mode lacks focus on the change information between the two phases, while the subtraction weight-sharing mode lacks focus on the pre-change information [30], so the best strategy is to combine the two weight-sharing modes within one network architecture (a minimal PyTorch sketch of this fusion scheme and the deep-supervision heads is given after this list). However, [30] also warns that first applying subtraction and subsequently applying concatenation causes a significant drop in accuracy.
2.4.2. Type II STDecoder-CD
2.4.3. Type III STDecoder-CD
2.5. The Loss Function
2.6. Evaluation Metrics
3. Results
3.1. Implementation Details
3.2. Comparison and Analysis
3.2.1. The Performance of the Proposed Models
- (1) FC-Siam-Diff [6] is a feature-difference method, which extracts multi-level features of bitemporal images from a Siamese UNet architecture, and their difference is used to detect changes.
- (2) STANet [5] is a Siamese-based spatial–temporal attention network for CD. Note that STANet is fine-tuned on the ImageNet-pretrained ResNet-18 model.
- (3) IFN [12] is a multi-scale feature concatenation method. It fuses multi-level deep representations of bitemporal images with image difference representations by means of attention modules for change map reconstruction, and it uses a pretrained VGG16 as the backbone network.
- (4) ChangeSTAR [17] is a multi-task architecture for joint semantic segmentation and object change detection. It can reuse any deep semantic segmentation architecture via the elaborately designed ChangeMixin module, and it uses a pretrained deep network (ResNet101) as the encoder. In this comparison, the number of iterations was set to 5600, and the batch size was set to 4 for 512 × 512 input images and to 16 for 256 × 256 images. The SGD optimizer with momentum = 0.9 and weight_decay = 0.0001 was used during training (the optimizer configuration is sketched after this list).
- (5) SNUNet-CD [13] uses a densely connected (NestedUNet) Siamese network for change detection.
- (6) SiUNet3+-CD [1] is a Siamese UNet3+-based CD network architecture. Note that SiUNet3+-CD uses the GN (Group Normalization) layer to normalize the inputs to a hidden layer, so that small batch sizes can be used when computational resources are limited [5]. Following [1], the maximum number of training epochs was set to 70, and the initial learning rate was 1 × 10−3.
- (7) BIT [19] is a transformer-based CD method. It uses a transformer encoder–decoder network to enhance the context information of CNN features via semantic tokens, followed by feature differencing to obtain the change map.
- (8) ChangeFormer [20] is a transformer-based Siamese network for CD. It uses a hierarchical transformer encoder in a Siamese architecture with a simple MLP decoder. We downloaded the pretrained model of ChangeFormer from http://www.dropbox.com/s/undtrlxiz7bkag5/pretrained_changeformer.pt?dl=0 (accessed on 20 February 2022).
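As a concrete reference for the ChangeSTAR training setup described in item (4), the snippet below shows how the reported SGD hyperparameters would typically be instantiated in PyTorch. The `model` placeholder and the learning rate are assumptions made for illustration; only the momentum, weight decay, iteration count, and batch sizes come from the text above.

```python
import torch

# Placeholder network: a real run would construct the ChangeSTAR model here.
model = torch.nn.Conv2d(6, 2, kernel_size=3, padding=1)

# SGD optimizer with the hyperparameters reported for the ChangeSTAR baseline.
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.03,              # assumption: the learning rate is not reported in this section
    momentum=0.9,
    weight_decay=0.0001,
)

num_iterations = 5600
batch_size_512 = 4        # for 512 x 512 input tiles
batch_size_256 = 16       # for 256 x 256 input tiles
```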
3.2.2. Efficiency Evaluation
3.2.3. Ablation Studies—Testing on the PIESAT-CD Dataset
- (1) For the Type I decoder, the CD performance of Difference Detector II is only slightly better than that of Difference Detector I in terms of recall, F1 score, and IOU, yet the former has 10.95M more parameters than the latter. This implies that the advantage of Difference Detector II is not obvious. Guo et al. [30] further pointed out that the CD accuracy of Difference Detector I is several percentage points higher than that of the SC or SD detector, where the SC detector fuses the hierarchical deep features using only Siamese concatenation operations and the SD detector uses only Siamese subtraction operations. Thus, jointly considering efficiency, accuracy, and robustness, Difference Detector I remains the optimal option for the Type I decoder.
- (2) For the Type II decoder, the CD performance of the “many-to-one” strategy is slightly better than that of the “many-to-many” strategy in terms of precision, recall, F1, and IOU. One possible reason is that the “many-to-one” strategy involves more Conv3ReLU operations, which enriches the representation of feature information. In addition, the “many-to-one” strategy preserves feature diversity by integrating the hierarchical deep features into a single feature vector with a parallel architecture, whereas the “many-to-many” strategy processes the hierarchical features in series; shifting from parallel to serial processing may lose information and degrade feature diversity. Moreover, the absence of deep supervision on Y2 and Y3 in Figure 8b may also hurt the accuracy of the “many-to-many” strategy.
- (3) In the BIT model [19], the transformer decoder (MHCA) was successfully used to enhance the original features with context-rich tokens by modeling long-range dependencies. However, as shown in Table 3, the CD performance of Type III without MHCA is slightly better than that of the original Type III with it, although the difference is small. This implies that the transformer decoder is not essential to Type III STDecoder-CD. One possible reason is that the deep features generated by TransCNN are already context-rich, so there may be no need to guide the network to model long-range relations again with MHCA. On the other hand, if the original features are derived from CNNs, which only capture the local spatial context at each level, introducing MHCA as the decoder is highly recommended (a minimal cross-attention sketch is given after this list). In addition, unlike FPN, the Type III decoder lacks a hierarchical design; as shown in Figure 4a, it directly concatenates the multi-scale encoding features, thereby producing sub-optimal results.
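To illustrate the MHCA-style decoder discussed in item (3), the sketch below uses PyTorch's `nn.MultiheadAttention` to let per-pixel features attend to a small set of semantic tokens, in the spirit of the BIT decoder [19]. The module name, dimensions, and residual/normalization details are illustrative assumptions rather than the BIT reference implementation.

```python
import torch
import torch.nn as nn


class TokenCrossAttentionDecoder(nn.Module):
    """Multi-head cross-attention: pixel features (queries) attend to semantic
    tokens (keys/values) to re-inject long-range context. Illustrative sketch only."""

    def __init__(self, dim=64, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat, tokens):
        # feat: (B, C, H, W) convolutional feature map; tokens: (B, L, C)
        b, c, h, w = feat.shape
        queries = feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
        out, _ = self.attn(queries, tokens, tokens)    # cross-attention over tokens
        out = self.norm(out + queries)                 # residual connection + LayerNorm
        return out.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    # Toy usage: 8 semantic tokens enhancing a 64-channel feature map
    decoder = TokenCrossAttentionDecoder(dim=64, num_heads=8)
    feat = torch.randn(2, 64, 32, 32)
    tokens = torch.randn(2, 8, 64)
    print(decoder(feat, tokens).shape)   # torch.Size([2, 64, 32, 32])
```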
4. Discussion
5. Conclusions
- (1) Owing to its powerful change feature extraction and representation capabilities, the TransCNN-based STDecoder-CD model is superior to the compared SOTA methods (STANet, ChangeFormer, BIT, SiUNet3+-CD, etc.) in terms of object localization, boundary delineation, and quantitative evaluation metrics.
- (2) Although different decoding strategies do not significantly affect the accuracy metrics, the proposed STDecoder-CD series architecture is not completely free from their influence: the Type I decoding module gives the best CD performance.
- (3) Modifying the three proposed decoder architectures via ablation or replacement has a limited impact on their CD performance.
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Zhao, B.; Tang, P.P.; Luo, X.Y.; Li, L.Z.; Bai, S. SiUNet3+-CD: A full-scale connected Siamese network for change detection of VHR images. Eur. J. Remote Sens. 2022, 55, 232–250. [Google Scholar] [CrossRef]
- Wiratama, W.; Lee, J.; Park, S.E.; Sim, D. Dual-dense convolution network for change detection of high-resolution panchromatic imagery. Appl. Sci. 2018, 8, 1785. [Google Scholar] [CrossRef] [Green Version]
- Wang, M.; Zhang, H.; Sun, W.; Li, S.; Wang, F.; Yang, G. A coarse-to-fine deep learning based land use change detection method for high-resolution remote sensing images. Remote Sens. 2020, 12, 1933. [Google Scholar] [CrossRef]
- Peng, D.; Zhang, Y.; Guan, H. End-to-end change detection for high resolution satellite images using improved UNet++. Remote Sens. 2019, 11, 1382. [Google Scholar] [CrossRef] [Green Version]
- Chen, H.; Shi, Z. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sens. 2020, 12, 1662. [Google Scholar]
- Caye Daudt, R.; Le Saux, B.; Boulch, A. Fully convolutional siamese networks for change detection. In Proceedings of the 25th IEEE International Conference on Image Processing (ICIP), Athens, Greece, 7–10 October 2018; pp. 4063–4067. [Google Scholar]
- Song, K.; Cui, F.; Jiang, J. An Efficient Lightweight Neural Network for Remote Sensing Image Change Detection. Remote Sens. 2021, 13, 5152. [Google Scholar] [CrossRef]
- Jiang, F.; Gong, M.; Zhan, T.; Fan, X. A semi-supervised GAN-based multiple change detection framework in multi-spectral images. IEEE Geosci. Remote Sens. 2019, 17, 1223–1227. [Google Scholar] [CrossRef]
- Peng, D.; Bruzzone, L.; Zhang, Y.; Guan, H.; Ding, H.; Huang, X. SemiCDNet: A semi-supervised convolutional neural network for change detection in high resolution remote-sensing images. IEEE Trans. Geosci. Remote 2021, 59, 5891–5906. [Google Scholar] [CrossRef]
- Tang, P.; Li, J.; Ding, F.; Chen, W.; Li, X. PSNet: Change detection with prototype similarity. Visual Comput. 2021, 1–10. [Google Scholar] [CrossRef]
- Zhang, W.; Lu, X. The spectral-spatial joint learning for change detection in multispectral imagery. Remote Sens. 2019, 11, 240. [Google Scholar] [CrossRef] [Green Version]
- Zhang, C.; Yue, P.; Tapete, D.; Jiang, L.; Shangguan, B.; Huang, L.; Liu, G. A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images. ISPRS J. Photogramm. 2020, 166, 183–200. [Google Scholar] [CrossRef]
- Fang, S.; Li, K.; Shao, J.; Li, Z. SNUNet-CD: A densely connected siamese network for change detection of VHR Images. IEEE Geosci. Remote Sens. 2021, 19, 1–5. [Google Scholar] [CrossRef]
- Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Li, H.; Zhu, J.; Liu, Y.; Li, H. DASNET: Dual attentive fully convolutional siamese networks for change detection of high-resolution satellite images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1194–1206. [Google Scholar] [CrossRef]
- Yang, X.; Hu, L.; Zhang, Y.M.; Li, Y.Q. MRA-SNet: Siamese Networks of Multiscale Residual and Attention for Change Detection in High-Resolution Remote Sensing Images. Remote Sens. 2021, 13, 4528. [Google Scholar] [CrossRef]
- Cao, Z.; Wu, M.; Yan, R.; Zhang, F.; Wan, X. Detection of small changed regions in remote sensing imagery using convolutional neural network. IOP Conf. Ser. Earth Environ. Sci. 2020, 502, 1–11. [Google Scholar] [CrossRef]
- Zheng, Z.; Ma, A.; Zhang, L.; Zhong, Y. Change is everywhere: Single-temporal supervised object change detection in remote sensing imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 15193–15202. [Google Scholar]
- Zheng, Z.; Zhong, Y.; Tian, S.; Ma, A.L.; Zhang, L.P. ChangeMask: Deep multi-task encoder-transformer-decoder architecture for semantic change detection. ISPRS J. Photogramm. 2022, 183, 228–239. [Google Scholar] [CrossRef]
- Chen, H.; Qi, Z.; Shi, Z. Remote sensing image change detection with transformers. IEEE Trans. Geosci. Remote 2021, 60, 1–14. [Google Scholar] [CrossRef]
- Bandara, W.G.C.; Patel, V.M. A Transformer-Based Siamese Network for Change Detection. arXiv 2022, arXiv:2201.01293. [Google Scholar]
- Wu, H.P.; Xiao, B.; Codella, N.; Liu, M.C.; Dai, X.Y.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual, 11–17 October 2021; pp. 22–31. [Google Scholar]
- Chen, X.; Hsieh, C.J.; Gong, B. When vision transformers outperform ResNets without pre-training or strong data augmentations. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 16–17 August 2021; pp. 1–20. [Google Scholar]
- Liu, Z.; Mao, H.; Wu, C.-Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 1–15. [Google Scholar]
- Liu, Y.; Sun, G.; Qiu, Y.; Zheng, L.; Chhatkuli, A.; Gool, L.V. Transformer in convolutional neural networks. arXiv 2021, arXiv:2106.03180. [Google Scholar]
- Srinivas, A.; Lin, T.Y.; Parmar, N.; Shlens, J.; Abbeel, P.; Vaswani, A. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 16519–16529. [Google Scholar]
- Dong, X.; Bao, J.; Chen, D.; Zhang, W.M.; Yu, N.H.; Yuan, L.; Chen, D.; Guo, B.N. Cswin transformer: A general vision transformer backbone with cross-shaped windows. arXiv 2021, arXiv:2107.00652. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.H.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021; pp. 1–22. [Google Scholar]
- d’Ascoli, S.; Touvron, H.; Leavitt, M.L.; Morcos, A.S.; Biroli, G.; Sagun, L. Convit: Improving vision transformers with soft convolutional inductive biases. In Proceedings of the 38th International Conference on Machine Learning, San Diego, CA, USA, 18–24 July 2021; pp. 2286–2296. [Google Scholar]
- Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-aware minimization for efficiently improving generalization. In Proceedings of the 9th International Conference on Learning Representations, Vienna, Austria, 3–7 May 2021; pp. 1–12. [Google Scholar]
- Guo, Y.; Long, T.; Jiao, W.; Zhang, X.; He, G.; Wang, W.; Yan, P.; Xiao, H. Siamese Detail Difference and Self-Inverse Network for Forest Cover Change Extraction Based on Landsat 8 OLI Satellite Images. Remote Sens. 2022, 14, 627. [Google Scholar] [CrossRef]
- Seferbekov, S.; Iglovikov, V.; Buslaev, A.; Shvets, A. Feature pyramid network for multi-class land segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–23 June 2018; pp. 272–275. [Google Scholar]
- Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
- Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; Li, J. Dice loss for data-imbalanced NLP tasks. arXiv 2019, arXiv:1911.02855. [Google Scholar]
- Hassani, A.; Walton, S.; Shah, N.; Abuduweili, A.; Li, J.; Shi, H. Escaping the big data paradigm with compact transformers. arXiv 2021, arXiv:2104.05704. [Google Scholar]
| Method | Precision | Recall | F1 Score | IOU |
|---|---|---|---|---|
| CDD | | | | |
| FC-Siam-Diff | 0.7830 | 0.6260 | 0.6920 | 0.5335 |
| STANet | 0.8520 | 0.8920 | 0.8720 | 0.7723 |
| IFN | 0.9500 | 0.8610 | 0.9030 | 0.8237 |
| ChangeSTAR * | 0.9280 | 0.9132 | 0.9205 | 0.8527 |
| SNUNet-CD * | 0.9140 | 0.8420 | 0.8765 | 0.7802 |
| SiUNet3+-CD | 0.9470 | 0.8700 | 0.9069 | 0.8296 |
| BIT * | 0.9506 | 0.9417 | 0.9461 | 0.8978 |
| ChangeFormer | 0.9363 | 0.8549 | 0.8938 | 0.8079 |
| Type I (ours) * | 0.9439 | 0.8972 | 0.9200 | 0.8517 |
| Type II (ours) * | 0.9290 | 0.8761 | 0.9017 | 0.8210 |
| Type III (ours) * | 0.9313 | 0.8823 | 0.9061 | 0.8284 |
| PIESAT-CD | | | | |
| FC-Siam-Diff * | 0.7690 | 0.4857 | 0.5954 | 0.4239 |
| STANet * | 0.7903 | 0.7660 | 0.7780 | 0.6366 |
| IFN * | 0.4389 | 0.4580 | 0.4483 | 0.2889 |
| ChangeSTAR * | 0.8298 | 0.8202 | 0.8250 | 0.7021 |
| SNUNet-CD * | 0.7839 | 0.8062 | 0.7949 | 0.6596 |
| SiUNet3+-CD * | 0.8009 | 0.7797 | 0.7901 | 0.6531 |
| BIT * | 0.8530 | 0.8047 | 0.8281 | 0.7067 |
| ChangeFormer | 0.8847 | 0.7686 | 0.8225 | 0.6986 |
| Type I (ours) * | 0.8851 | 0.8316 | 0.8575 | 0.7506 |
| Type II (ours) * | 0.8847 | 0.8294 | 0.8562 | 0.7485 |
| Type III (ours) * | 0.8710 | 0.8350 | 0.8526 | 0.7431 |
| CD_Data_GZ | | | | |
| FC-Siam-Diff ** | 0.7218 | 0.6501 | 0.6841 | 0.5198 |
| STANet * | 0.7868 | 0.1614 | 0.2679 | 0.1547 |
| IFN * | 0.4644 | 0.4859 | 0.4749 | 0.3114 |
| ChangeSTAR * | 0.8655 | 0.7001 | 0.7741 | 0.6314 |
| SNUNet-CD ** | 0.7708 | 0.7722 | 0.7715 | 0.6280 |
| SiUNet3+-CD ** | 0.7725 | 0.7550 | 0.7637 | 0.6177 |
| BIT * | 0.7990 | 0.6316 | 0.7055 | 0.5450 |
| ChangeFormer * | 0.6934 | 0.5179 | 0.5929 | 0.4214 |
| Type I (ours) ** | 0.8020 | 0.7647 | 0.7829 | 0.6433 |
| Type II (ours) ** | 0.8216 | 0.7513 | 0.7849 | 0.6459 |
| Type III (ours) ** | 0.7857 | 0.7274 | 0.7554 | 0.6070 |
Method | Parameters (M) | FLOPs (G) |
---|---|---|
FC-Siam-Diff | 1.35 | 4.71 |
STANet | 16.93 | 12.98 |
IFN | 50.71 | 82.26 |
ChangeSTAR | 52.58 | 19.53 |
SNUNet-CD | 12.03 | 54.77 |
SiUNet3+-CD | 27.00 | 216.72 |
BIT | 12.40 | 10.59 |
ChangeFormer | 29.75 | 21.19 |
Type I (ours) | 40.84 | 12.14 |
Type II (ours) | 52.45 | 18.20 |
Type III (ours) | 53.89 | 18.29 |
Decoder | Parameters | Precision | Recall | F1 Score | IOU |
---|---|---|---|---|---|
Original Type I | 40.84M | 0.8851 | 0.8316 | 0.8575 | 0.7506 |
Type I using Difference Detector II | 51.79M | 0.8829 | 0.8357 | 0.8587 | 0.7523 |
Original Type II | 52.45M | 0.8847 | 0.8294 | 0.8562 | 0.7485 |
Type II using the “many-to-many” strategy | 54.95M | 0.8834 | 0.8176 | 0.8492 | 0.7380 |
Original Type III | 53.89M | 0.8710 | 0.8350 | 0.8526 | 0.7431 |
Type III without transformer decoder | 53.16M | 0.8792 | 0.8302 | 0.8540 | 0.7452 |