1. Introduction
Change detection (CD) is a remote sensing image interpretation task that obtains a change map by comparing images of the same geographical region acquired at different times. Monitoring changes in specific areas of the Earth’s surface is crucial for individuals and institutions that must make critical decisions, so CD has received much attention as a research hotspot in the field of remote sensing. With the rapid advancement of remote sensing technology, images suitable for CD have become progressively easier to obtain, which has promoted the development of the field. At present, CD is widely used in land use monitoring [
1], disaster assessment [
2], urban development planning [
3], environmental monitoring [
4], and other fields.
Based on the unit of analysis of the model, traditional CD methods can be classified as pixel-based [
5] and object-based [
6] methods. The pixel-based method usually generates the change map based on the pixel differences between different temporal images. This method requires data to go through several image pre-processing steps, including radiation correction [
7], geometric correction [
8], etc. For the processed data, the appropriate CD method is used to obtain a set of differential features, and finally generate the change map with a threshold segmentation or clustering method. Celik et al. [
5] use principal component analysis (PCA) [
9] to extract orthogonal feature vectors and then implement CD with the k-means clustering algorithm [
10]. However, the pixel-based method does not fully consider contextual relationships among features during processing, which inevitably produces various types of noise and isolated change pixels that degrade the quality of change maps. The object-based method uses structural and geometric information between different temporal images to generate change maps, which largely suppresses isolated noise, but it also places high demands on the registration of the temporal images. Both methods not only require considerable manual intervention in the image pre-processing stage but also rely heavily on the experience of professionals for threshold setting. Therefore, there is an urgent need to develop automatic and efficient CD algorithms.
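As an illustration of the pixel-based pipeline described above, the following minimal sketch is loosely modeled on the PCA/k-means idea of [5]; the block size, number of components, and k-means initialization here are our own illustrative assumptions, not the published settings:

```python
import numpy as np

def pca_kmeans_cd(img1, img2, block=4, n_components=3, iters=20):
    """Toy pixel-based CD: PCA over local difference patches + 2-means.

    img1, img2: 2-D grayscale arrays of equal shape.
    Returns a binary change map of the same shape (1 = changed).
    """
    diff = np.abs(img1.astype(float) - img2.astype(float))

    # Collect a block x block neighbourhood around every pixel (edge-padded).
    pad = block // 2
    padded = np.pad(diff, pad, mode="edge")
    h, w = diff.shape
    feats = np.empty((h * w, block * block))
    idx = 0
    for i in range(h):
        for j in range(w):
            feats[idx] = padded[i:i + block, j:j + block].ravel()
            idx += 1

    # PCA via eigen-decomposition of the patch covariance matrix.
    feats -= feats.mean(axis=0)
    cov = feats.T @ feats / len(feats)
    _, vecs = np.linalg.eigh(cov)               # eigenvalues ascending
    proj = feats @ vecs[:, ::-1][:, :n_components]  # top components

    # Plain 2-means, initialized at the least and most extreme samples.
    norms = np.linalg.norm(proj, axis=1)
    centers = np.stack([proj[norms.argmin()], proj[norms.argmax()]])
    for _ in range(iters):
        d = np.linalg.norm(proj[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for k in range(2):
            if np.any(labels == k):
                centers[k] = proj[labels == k].mean(axis=0)

    # The cluster with the larger mean difference is called "changed".
    m0 = diff.ravel()[labels == 0].mean() if np.any(labels == 0) else 0.0
    m1 = diff.ravel()[labels == 1].mean() if np.any(labels == 1) else 0.0
    return (labels == int(m1 > m0)).reshape(h, w).astype(np.uint8)
```

On a toy image pair differing only in a square region, this sketch flags that region as changed, but it also exhibits exactly the weaknesses described above: no use of wider context, and sensitivity to noise near decision boundaries.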
In the early years, owing to the scarcity of labeled data, research on remote sensing image interpretation started in an unsupervised direction and has continued along it to this day [
11,
12]. Considering the nature of unsupervised methods, most unsupervised methods use the physical properties of the data. For example, in the field of land cover classification, the texture representation of medium resolution imaging spectrometer data can be used to better reveal the land cover type [
13], and in recent studies, Hidden Markov Models have been successfully applied to fully polarized land cover classification [
14,
15] with good results. However, thanks to the increasing accessibility of data, data-driven deep learning methods have developed rapidly in recent years across computer vision fields such as object detection [
16], target tracking [
17], super-resolution reconstruction [
18], and semantic segmentation [
19], where most deep learning methods have achieved better results than traditional ones. Therefore, CD methods are gradually evolving from traditional to deep learning-based approaches. Since convolutional neural networks (CNNs) [
20] are the most commonly applied models in computer vision tasks, CNN-based networks such as VGG [
21], U-Net [
22], and ResNet [
23] are widely introduced to the field of remote sensing CD to obtain better change maps.
Owing to the resemblance between the CD and semantic segmentation tasks, early CD models were obtained through simple modifications of semantic segmentation models, so most of them have single-stream structures. Single-stream CD models remain in use today because they are simple to adapt and computationally cheaper. However, the disadvantages of the single-stream CD model gradually emerged, so the Siamese network [
24] was introduced into the CD domain to form the Siamese CD model. Accordingly, we divide deep learning-based remote sensing CD models into single-stream models [
25,
26,
27,
28] and Siamese models [
29,
30,
31,
32]. In the single-stream CD model, Zheng et al. [
25] used U-Net as the backbone network, concatenated two three-channel images into one six-channel input, and embedded newly designed cross-layer blocks into the encoder stage of the backbone to integrate multi-level context information and multi-scale features; Peng et al. [
26] concatenated a pair of bi-temporal images and fed them to a U-Net++ network, used the global information to produce feature maps with higher spatial accuracy, and then assembled the multi-level feature maps into a high-accuracy change map. Unlike the single-stream CD model, the Siamese model processes the bi-temporal images separately with a Siamese backbone, after which the two sets of deep features undergo a series of processing steps to obtain the change map. Chen et al. [
29] proposed a Siamese spatial–temporal attention network, designing a self-attention mechanism for CD that models the spatio-temporal relationships within the feature context and ultimately yields a better change map. Zhang et al. [
30] proposed an image fusion network in which highly representative deep features are extracted by a Siamese network and then fed into a deeply supervised difference discrimination network to obtain the change map.
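The two input-handling strategies can be contrasted with a toy sketch, in which a single 1x1 convolution stands in for a full encoder; the weights and shapes below are illustrative assumptions, not those of any cited model:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    """A 1x1 convolution as a stand-in for a real encoder.
    x is (C_in, H, W); w is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", w, x)

# A pair of bi-temporal 3-channel (RGB) images.
img_t1 = rng.random((3, 8, 8))
img_t2 = rng.random((3, 8, 8))

# Single-stream: concatenate along channels, one 6-channel encoder.
w_single = rng.standard_normal((16, 6))
feat_single = conv1x1(np.concatenate([img_t1, img_t2]), w_single)

# Siamese: one SHARED 3-channel encoder applied to each image
# separately, followed by a difference of the two feature maps.
w_shared = rng.standard_normal((16, 3))
feat_diff = np.abs(conv1x1(img_t1, w_shared) - conv1x1(img_t2, w_shared))

print(feat_single.shape, feat_diff.shape)  # both are (16, 8, 8)
```

The key design difference is visible in the weights: the Siamese branch reuses `w_shared` for both dates, so identical inputs are guaranteed to produce a zero feature difference, whereas the single-stream encoder entangles both dates in one set of weights.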
Although existing deep learning-based CD approaches outperform most traditional approaches, they still have many limitations in structure and functionality. First, in terms of structure, most models follow a U-shaped encoder–decoder structure [
22]. The skip connection mechanism alleviates the vanishing-gradient problem of back-propagation to a certain extent, but because these connections are too coarse, the features of the two connected convolution layers often have large semantic differences, which increases the learning difficulty of the network. Second, in terms of function, most CD models suffer from missed detection of small targets, low robustness to pseudo-changes, and irregular edges of the extracted change regions, problems that urgently need to be solved in CD.
To address the above problems, we propose a model named, after its characteristics, the Hierarchical Attention Residual Nested U-Net (HARNU-Net). First, an improved U-Net++ [
33] serves as the backbone network for feature extraction, and the four levels of output features it extracts are fed into the Adjacent Feature Fusion Module (AFFM), which combines multi-level features with context information so that the output change map contains more regular change boundaries. Then, the four fused features are fed separately into the Hierarchical Attention Residual Module (HARM), which enhances features in a finer-grained space and effectively suppresses problems such as missed detection of small targets and pseudo-change interference. Finally, the four processed features are concatenated along the channel dimension and processed to obtain a precise change map.
The major contributions of this paper are as follows:
- 1.
We propose a novel and powerful network for remote sensing image CD, called HARNU-Net. Compared with the baseline network U-Net++, our network significantly reduces the missed-detection rate for small change regions and shows strong robustness to pseudo-changes.
- 2.
We propose HARM to effectively enhance features in a finer-grained space; it exploits the transferability of features across the hierarchy to filter out redundant information and provides powerful feature representation and analysis capabilities for the model. As a plug-and-play module, HARM can be easily transplanted to other models.
- 3.
The proposed AFFM effectively integrates multi-level features and context information, reducing the learning difficulty of the model during training and making the boundaries of the output change map more regular.
The rest of the paper is structured as follows.
Section 2 introduces the related work of the research. The proposed method is described in detail in
Section 3. In
Section 4, the comparative experiments between our model and seven other CD models are described in detail, and a sequence of ablation experiments verifies the reliability and validity of the proposed modules. The paper is concluded in
Section 5.
2. Related Work
The introduction of deep learning into CD has accelerated the development of the field, and mature CD models gradually free people from the tedious task of manually labeling change regions. Since the emergence of the Siamese network, whose structure is particularly well suited to the CD task, more and more researchers have applied it to CD. Zhan et al. [
34] first brought the Siamese network to the CD task and proposed a Siamese CD model; at the time, the features extracted by their model were more abstract and robust than those generated by traditional methods. The Fully Convolutional Network (FCN) [
35], which is widely applied to dense prediction tasks, is also adopted in the CD task. Rodrigo et al. [
36] proposed three CD models based on FCN, which were the first end-to-end trainable models in the CD domain. In recent years, FCN variants such as U-Net and U-Net++ have been broadly applied in the field of CD. Peng et al. [
26] used U-Net++ to design an end-to-end CD model that uses multi-level semantic feature maps to generate the change map.
Most CD model designs focus on the feature extraction phase, while the important feature fusion phase is often ignored or handled with only a coarse fusion strategy [
26,
27,
37]. Although a well-designed feature extraction part can help a CD model obtain more detailed information, the lack of a good feature fusion strategy may degrade the final CD results. A neural network is a hierarchical structure, and the output features at different levels differ in semantics. In many works, visualizing the features at different levels of a neural network helps us better understand how its intermediate layers work. Typically, low-level features are larger in size and contain more local detail but lack a global semantic view; high-level features, having passed through more convolution layers, are smaller in size and thus better summarize the global content of the image. In previous studies, to overcome the lack of detail and global information in any single feature, researchers used simple skip connection operations similar to those in FCN and U-Net to integrate high-level features with low-level features. For example, Li et al. [
37] proposed a CD model with a U-Net structure that simply connects features of different semantics. Although the skip connection operation may bring some performance gains, some of these simple connections not only fail to enhance the capabilities of the network but may even affect it negatively. We propose AFFM to address this issue. Following an adjacency strategy, AFFM performs complementary fusion operations on the features of different layers. This fusion approach preserves the feature information at each level while complementing the missing elements of each level's features with information from adjacent levels, laying the foundation for a final change map with regular boundaries.
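The adjacency idea can be illustrated with a small sketch. This is a generic adjacent-level fusion by resampling and concatenation, not the actual AFFM (whose internals are described in Section 3); the pyramid shapes and channel counts are assumptions for the example:

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def downsample2(x):
    """2x average pooling of a (C, H, W) feature map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def adjacent_fuse(features):
    """Fuse each pyramid level with its resampled neighbours only.

    features: list of (C, H, W) maps ordered high to low resolution.
    Each output keeps the level's own information and borrows detail
    (from the finer neighbour) and semantics (from the coarser one).
    """
    fused = []
    for i, f in enumerate(features):
        parts = [f]
        if i > 0:                      # finer neighbour, pooled down
            parts.append(downsample2(features[i - 1]))
        if i < len(features) - 1:      # coarser neighbour, upsampled
            parts.append(upsample2(features[i + 1]))
        fused.append(np.concatenate(parts, axis=0))
    return fused

# Four pyramid levels: 32x32 down to 4x4, each with 8 channels.
rng = np.random.default_rng(0)
feats = [rng.random((8, 32 // 2**i, 32 // 2**i)) for i in range(4)]
shapes = [f.shape for f in adjacent_fuse(feats)]
# The outermost levels gain one neighbour (16 channels),
# the two middle levels gain two (24 channels).
```

In contrast to a dense skip scheme that wires every level to every other, restricting fusion to adjacent levels keeps the semantic gap between connected features small, which is the property argued for above.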
When looking at a picture, human beings can quickly find the content they care about in complex scenes and ignore irrelevant information. This mechanism of diverting attention to the most important regions of an image while ignoring irrelevant parts is called the attention mechanism [
38]. Researchers have successfully introduced this mechanism into computer vision. Recently, the attention mechanism has shown great success in various fields, such as image classification [
39], semantic segmentation [
40], person re-identification [
41], etc. In the field of CD, Wang et al. [
42] added a fused channel–spatial attention module to the decoding part of their proposed CD network, demonstrating that such a fusion attention module can improve the accuracy of the output results; Chen et al. [
43] designed a dual-attention concatenated FCN, which captured long-range dependencies between features and thereby obtained more discriminative feature representations. Although all of the above methods achieve good results, they neglect to selectively strengthen and weaken features in a finer-grained space, which leads to frequent missed detections of small change regions. Our HARM employs an attention mechanism based on finer feature partitioning, which effectively enhances the feature representation in this fine-grained space and delivers excellent performance in detecting small change regions.
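For reference, the following is a minimal sketch of the generic channel and spatial attention discussed in this section (a squeeze-and-excitation/CBAM-style gate with assumed shapes and random weights; it is not the authors' HARM):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Reweight channels of a (C, H, W) map by a learned gate."""
    squeezed = x.mean(axis=(1, 2))                # global avg pool -> (C,)
    gate = sigmoid(w2 @ np.maximum(w1 @ squeezed, 0.0))  # tiny ReLU MLP
    return x * gate[:, None, None]

def spatial_attention(x):
    """Reweight each location by pooled channel statistics."""
    avg = x.mean(axis=0)                          # (H, W)
    mx = x.max(axis=0)                            # (H, W)
    # Stand-in for the usual learned conv over [avg; max].
    gate = sigmoid(avg + mx)
    return x * gate[None]

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
w1 = rng.standard_normal((4, 16))   # channel reduction ratio 4
w2 = rng.standard_normal((16, 4))
y = spatial_attention(channel_attention(x, w1, w2))
print(y.shape)  # (16, 8, 8)
```

Because both gates lie in (0, 1), such a module can only suppress or preserve responses, never amplify them; applying the gates over the whole map at once is exactly the coarse granularity that the finer feature partitioning of HARM is designed to improve on.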
5. Conclusions
In this paper, we proposed a Hierarchical Attention Residual Nested U-Net (HARNU-Net) for remote sensing image CD. To enhance the capacity of the backbone network U-Net++, we proposed A-R by remodeling its convolutional blocks, and A-R helped the backbone replenish the information missed during feature extraction. Meanwhile, to enhance the fusion of features at all levels, we used AFFM to fuse the features from the backbone according to the adjacency strategy, which effectively realized mutual enhancement between feature contexts. In addition, to strengthen feature characterization at finer granularity, we proposed HARM to eliminate invalid information from the features along the channel and spatial dimensions in a finer-grained space, which yielded change maps with better visual quality. In our experiments, we compared HARNU-Net with seven other state-of-the-art CD methods on three CD datasets, and our method achieved the best results in terms of both metrics and visual quality. Finally, we validated the usefulness of each module through ablation experiments. However, during our study we found that our model is at a disadvantage in terms of speed, so our subsequent research will focus on making the model lightweight and fast.