1. Introduction
Remote sensing change detection (CD) has long been an important technology in remote sensing image processing. It compares two remote sensing images of the same area acquired at different times, identifies the pixels whose semantics have changed between them, and presents the detection result as a binary (black-and-white) change map.
The ability and means of acquiring image data have improved markedly, as evidenced by the emergence of satellite remote sensing and the steady maturation of remote sensing imaging technology. As a result, a large number of high-resolution (HR) remote sensing images need to be processed. Compared with medium- and low-resolution images, HR remote sensing images provide richer geometric and spatial information, making it possible to monitor surface changes with greater accuracy. Remote sensing CD has thus become a rising star in the field of remote sensing imagery, but its results can be disturbed by many factors [
1]. On the one hand, there is the influence of the remote sensing system itself; on the other hand, there is interference from environmental factors, such as movement, deformation, or occlusion of the detected object, and recognition errors caused by camera motion, illumination changes, and seasonal variation. Resisting these interference factors while detecting the actual changes is very challenging for remote sensing CD tasks. In numerous domains, such as land management [
2,
3], urban development planning [
4,
5], environmental monitoring [
6,
7], disaster prediction [
8], and others, remote sensing CD has been extensively employed. In recent years, research on remote sensing CD has received increasing attention.
So far, many CD techniques have been proposed from different perspectives. In general, they can be divided into two categories: traditional methods and deep learning-based methods.
Traditional remote sensing CD techniques are mainly divided into pixel-based and object-based methods. Pixel-based methods analyze each pixel to determine its changes at different time points. The classical approach is the difference map method, which subtracts the remote sensing images at two time points to obtain a difference map and then determines the change area according to a threshold or other criteria [
9]. The ratio map method divides the remote sensing images at two time points to obtain the ratio map, and then determines the change area according to the threshold or other methods [
10]. The regression analysis method obtains pixel-level change by regressing one remote sensing image against the other, so as to detect surface change [
11]. These methods are theoretically simple and can quickly identify change areas. However, because the change threshold is difficult to determine, the change properties of the target area are hard to extract, and their ability to characterize the surface is limited, they usually cannot obtain complete change information. In order to make better use of the spectral information of the image, multivariate alteration detection (MAD) [
12], iteratively reweighted multivariate alteration detection (IR-MAD) [
13], principal component analysis (PCA) based on k-means clustering [
14], and change vector analysis (CVA) as proposed by Malila [
15], came into being. Most of these methods are unsupervised and have achieved good results in remote sensing CD tasks. The essence of MAD is canonical correlation analysis (CCA) from multivariate statistics, but it does not handle multiband remote sensing images well. The IR-MAD algorithm was therefore proposed; its core idea is to set the initial weight of each pixel to 1 and to assign new weights to the pixels in the two images iteratively. The CVA technique uses phase angle information to partition the changes. However, the stability of the algorithm cannot be guaranteed, and its effectiveness mostly relies on the quality of the spectral bands used in the computation. Another common algorithm is principal component analysis (PCA) [
16]. PCA transforms the image into a set of linearly independent representations through a linear transformation, which can be used to extract the main feature components of the data. However, PCA depends on the statistical characteristics of the image, so whether the data in the changed and unchanged regions are balanced has a strong impact on performance. Based on this, Celik [
17] proposed an unsupervised CD method that integrates PCA and k-means clustering. However, pixel-level methods are sensitive to the high-frequency information in high-resolution remote sensing images and are easily affected by geometric and radiometric correction errors, which limits their applicability; they are therefore mainly suitable for low- and medium-resolution images. Object-based methods, in contrast, are often used for HR remote sensing image CD because they allow a richer information representation: ground or surface features in the image are treated as objects, and their shapes, sizes, and positions at different time points are compared to determine the change. Ma et al. [
18] used different segmentation strategies and a series of segmentation scale parameters to test four common unsupervised CD algorithms on urban area images, and studied the influence of several factors on the object-based CD method. Subsequently, by merging multi-scale uncertainty analysis, Zhang et al. [
19] presented an object-based unsupervised CD technique. Zhang et al. [
20] proposed a method based on the law of cosines with a box–whisker plot to handle sample data that do not follow a Gaussian distribution, which outperforms traditional CD methods. To some extent, traditional pixel-based and object-based methods have aided the development of remote sensing CD. However, for high-resolution bitemporal images, with their complex texture features and rich image detail, CD remains challenging, and new CD approaches are required.
Natural language processing [
21], audio recognition [
22,
23], and image processing [
24,
25,
26,
27] have all made extensive use of deep learning techniques in recent years. Deep learning methods do not require handcrafted features for feature extraction and have strong learning capabilities [
28,
29]. Researchers’ interest in deep learning-based remote sensing image CD has grown rapidly as a result of deep learning’s achievements in image processing. Convolutional neural networks (CNNs) have been the subject of some outstanding research in the field of remote sensing CD [
30,
31], thanks to the ongoing advancements in technology. UNet [
32], fully convolutional network (FCN) [
33], and ResNet [
34] structures are commonly employed in the remote sensing CD domain for feature map extraction. In 2016, Gong [
35] applied a deep neural network to remote sensing image CD for the first time. The method consists of two parts. First, pre-classification results are obtained by fuzzy c-means (FCM) clustering, and the training sample set is selected from these results. Then, a deep neural network is constructed using Restricted Boltzmann Machines (RBMs), and the network parameters are fine-tuned by the back-propagation algorithm. Finally, the trained deep neural network outputs the CD results. In 2017, Zhan [
36] proposed an optical remote sensing image CD method based on a deep Siamese convolutional neural network. This method transforms the CD problem into an image segmentation problem and feeds the two temporal images into two deep convolutional neural networks with shared weights to extract image features. A feature distance map is then computed, and the CD results are obtained directly by clustering or threshold segmentation. This method effectively alleviates the high time cost of traditional deep-neural-network-based CD and realizes end-to-end remote sensing image CD. With deepening research, remote sensing CD models have been continuously optimized and improved. Wang et al. [
37] introduced the attention mechanism into the deep supervised network to capture the link between different scale changes between each module of encoding and decoding to achieve more accurate CD. Yin et al. [
38] used an attention fusion module to refine the reconstruction of the change map layer by layer in the decoding stage, demonstrating that introducing attention mechanisms into deep learning models can improve their performance and robustness.
The method of image fusion based on deep learning has been widely applied in image processing. The fundamental idea of multi-scale fusion is to utilize features from different scales to complement and enhance each other’s information, thereby improving the understanding and description of targets or scenes. Li et al. [
39] proposed a novel multi-focus image fusion method using local energy and sparse representation in the Shearlet domain. This method first decomposes the source image into low-frequency and high-frequency subbands through Shearlet transform. Then, it fuses the low-frequency subbands using sparse representation and the high-frequency subbands using local energy. Finally, the fusion image is reconstructed through inverse Shearlet transform. Zhang et al. [
40] introduced a Dimension-Driven Multi-Path Attention Residual Block (DDMARB) to effectively capture multi-scale features. Through the Channel Attention (CA) mechanism, these features are processed differently to better express the depth features of the data. Zhang et al. [
41] introduced a compact-structure feature distillation block (FD-Block) for high-dimensional information in the encoder stage. The FD module adopts a multi-path feature extraction scheme, which can fully utilize features from different levels of the high-dimensional input. Based on this, we propose to integrate multi-scale fusion with an attention mechanism to fully exploit the multi-scale information in images. This not only enhances the understanding and description of image content but also preserves the sharpness of all images in the fused result. However, fully integrating features of different scales and combining them with attention mechanisms raises a series of problems, such as inconsistent feature fusion, misalignment, and increased model complexity and training difficulty. Current deep learning algorithms do not combine the two well, so they cannot guarantee accurate discrimination between changed and unchanged regions and suffer from false and missed detections of small targets and edges. Based on this, we propose a multi-scale feature fusion Siamese network based on an attention mechanism (ABMFNet) for high-resolution remote sensing image CD to solve the above problems. The network contains three modules that assist training. To connect the low-level and high-level features of spatial difference information, we propose a cross-scale fusion module (CFM) and a bottom-up difference enhancement feature pyramid network (DEFPN); these two modules fuse the difference information output at different levels to ensure that the original image details can be recovered during the subsequent upsampling. In the process of fusing the difference information of two adjacent layers, we add a feature enhancement module (FEM), which enhances the representation of the feature pyramid and accelerates inference while achieving state-of-the-art performance. To address the insufficient feature fusion and feature connection between feature layers of different scales during fusion, we design an attention-based multi-scale feature fusion module (AMFFM). The main contributions of this paper are as follows:
To address a series of challenges in change detection for bitemporal high-resolution remote sensing images, including complex texture features, intricate image details, and the effective integration of features at different scales with attention mechanisms, while mitigating issues such as inconsistent fusion, misalignment, model complexity, and training difficulty, we propose ABMFNet. The method is end-to-end trainable and makes full use of the rich difference information in HR remote sensing images. It improves the detection of edges and small targets and provides a more effective solution for CD in HR remote sensing images.
We propose the Attention-based Multi-scale Feature Fusion Module (AMFFM), which not only addresses the issue of insufficient feature fusion and connection between different-scale feature layers, but also enables the model to automatically learn and select important features or regions in the image, concentrating more attention on these areas. Additionally, to better assist the AMFFM module in integrating differential information based on the attention mechanism, we design the Cross-scale Fusion Module (CFM) and the Difference Enhancement Feature Pyramid Network (DEFPN), facilitating the connection of spatial differential information between low-level and high-level features. Furthermore, the Feature Enhancement Module (FEM) is incorporated into DEFPN to enhance the representation and inference speed of the pyramid.
We evaluated our method on three datasets: BICD, a dataset produced by our laboratory, and the two public benchmark datasets LEVIR-CD and BCDD. The experimental results show that the proposed model achieves greater improvements and higher accuracy than previous algorithms for HR remote sensing image CD.
The rest of the article is organized as follows:
Section 2 introduces the detailed composition and selection advantages of each module of the entire network.
Section 3 introduces the three datasets, presents the ablation study on the BICD dataset, and demonstrates the performance of the model through experiments.
Section 4 discusses the effectiveness of the proposed modules and future prospects.
Section 5 is the final summary.
2. Methodology
With the wide attention that remote sensing image CD has received in the field of image processing, many high-quality CNN-based algorithms have been applied to this task in recent years [
42,
43]. The task of remote sensing image CD is essentially a two-class semantic segmentation process, and CNNs perfectly meet this task. Furthermore, CNNs possess advantages such as automatic feature extraction and shared convolutional kernels. These characteristics enable CNNs to effectively capture local patterns and abstract features within images, thereby demonstrating outstanding performance in tasks such as image classification [
44]. Therefore, in the encoding stage of the network, we select ResNet for feature extraction and then apply a simple element-wise subtraction to the extracted features. The CFM and DEFPN modules fuse the difference feature information of different sizes at different stages in preparation for the subsequent cross-scale fusion. Next, the AMFFM carries out the cross-scale fusion, using the attention mechanism to focus on changed regions while suppressing unchanged ones. Finally, the prediction map is obtained by upsampling.
Figure 1 shows the whole network structure.
2.1. Backbone
The task of the main network is to extract the image feature information of the bitemporal remote sensing image pairs. In the image classification task, the size of the receptive field has an important influence on the classification effect. In order to expand the receptive field, some methods use spatial pyramid pooling [
45]. However, the disadvantage of this method is that the calculation speed is slow and the memory consumption is large. Consequently, as backbone networks, we take into consideration some models that perform better in image classification tasks, like the visual geometry group network (VGG) [
46], the densely connected convolutional network (DenseNet) [
47], the residual network (ResNet) [
34], and the deep learning of deep separable convolution (Xception) [
48]. In CD tasks, shallow networks are generally not used because of the high resolution of the images; a shallow network cannot adequately extract the images’ feature information. The general idea is therefore to design the network as deep as possible. However, as the number of layers increases, redundant deep layers appear and the network becomes difficult to train. In addition, as the depth of a neural network increases, vanishing or exploding gradients may arise. Some architectures such as ResNet have addressed this issue to a certain extent. Nevertheless, overly deep networks may increase the risk of overfitting rather than merely reducing training effectiveness through gradient vanishing. Therefore, when designing a network, it is essential to strike a balance between depth and model complexity and to avoid overly deep structures. Through experimental comparison, we choose ResNet34 as the backbone network because it effectively tackles gradient vanishing through residual modeling. The advantage of the residual model is that it can better fit the classification function and obtain higher classification accuracy; this structure allows the network to be very deep while training remains effective. Its core idea is the skip connection. As shown in
Figure 1a, ResNet produces five feature maps from shallow to deep during feature extraction, but we only use the last four, which are downsampled by factors of 4, 8, 16, and 32 and have 64, 128, 256, and 512 channels, respectively. A Siamese structure extracts the features of the two temporal remote sensing images simultaneously, yielding feature maps of different sizes.
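As a concrete reference, the following is a minimal PyTorch sketch (not the authors' released code) of a weight-sharing Siamese ResNet34 encoder built from torchvision, returning the four feature maps described above; module and variable names are illustrative.

```python
# Minimal sketch: a Siamese ResNet34 encoder with shared weights that returns
# the four feature maps at 1/4, 1/8, 1/16, 1/32 resolution (64/128/256/512 ch).
import torch
import torch.nn as nn
from torchvision.models import resnet34


class SiameseResNet34(nn.Module):
    def __init__(self):
        super().__init__()
        net = resnet34()
        # Stem: conv + bn + relu + maxpool -> 1/4 resolution, 64 channels.
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def extract(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)          # 64, 128, 256, 512 channels
        return feats

    def forward(self, t1, t2):
        # The same weights are applied to both temporal images.
        return self.extract(t1), self.extract(t2)


if __name__ == "__main__":
    model = SiameseResNet34()
    a, b = torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256)
    f1, f2 = model(a, b)
    print([f.shape for f in f1])     # 1/4, 1/8, 1/16, 1/32 of the input size
```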
2.2. Difference Enhancement Feature Pyramid Network
For the extraction of difference features, we directly apply element-wise subtraction after the backbone. The four feature maps of different scales obtained in this way contain shallow and deep difference information, respectively. However, if we directly use these four difference features for the subsequent upsampling prediction, the deep features have a small spatial size and carry relatively little geometric information, while the shallow features contain few semantic features, which is not conducive to CD. Therefore, after the backbone, we add a feature fusion step. In computer vision tasks, feature pyramids [
49] are commonly used to fuse feature maps with different resolutions and have shown clear advantages. Based on this, this paper proposes a bottom-up difference enhancement feature pyramid structure for encoding.
The structure of this module is shown in
Figure 1c.
Figure 2a shows the operation of one layer; we fuse difference feature maps of different sizes by downsampling with pooling and convolution, followed by element-wise addition. The feature enhancement module is added to the fusion of the difference information of two adjacent layers, which increases the inference speed of the feature pyramid and improves the performance of the model.
Figure 2b shows the structure of the feature enhancement module. The difference feature obtained by element-wise subtraction has size C × H × W. We adopt a two-branch structure, in which one branch extracts channel change information and the other guides the original features. The upper branch captures the useful channel information of the difference features through a global average pooling layer, producing a C × 1 × 1 feature map, which is multiplied with the original features of the lower branch to restore the feature size; the result is then added element-wise to the original features and, finally, the output features are obtained through a 1 × 1 convolution layer. The above process can be written as
F_out = δ(BN(Conv_1×1(F + F ⊗ σ(GAP(F))))),
where δ represents the activation function, BN indicates batch normalization, Conv_1×1 denotes a two-dimensional convolution with a kernel size of 1, σ is the activation function applied to the pooled channel descriptor, ⊗ represents element-wise multiplication, and GAP(·) is global average pooling.
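A minimal PyTorch sketch of the FEM as we read the description above; the exact ordering of the 1 × 1 convolution, batch normalization, and activations is our assumption, not a reproduction of the authors' code.

```python
# Feature enhancement module (FEM) sketch: a GAP-based channel gate guides the
# original difference features, which are then refined by a 1x1 conv block.
import torch
import torch.nn as nn


class FEM(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # GAP: C x H x W -> C x 1 x 1
        self.gate = nn.Sigmoid()                    # channel re-weighting
        self.out = nn.Sequential(                   # 1x1 conv + BN + activation
            nn.Conv2d(channels, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, d):
        w = self.gate(self.gap(d))                  # upper branch: channel statistics
        enhanced = d * w                            # broadcast multiply restores C x H x W
        return self.out(enhanced + d)               # add back the original features
```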
The multi-scale difference features obtained by element-wise subtraction are denoted D_i (i = 1, …, 4); each level is first fused with the feature-enhanced and pooled difference features of the previous layer. On the one hand, the feature-enhanced features are produced as the output P_i and, on the other hand, they are passed to the next layer as input through the same operation. The difference enhancement feature pyramid can be written as
P_i = FEM(D_i + Pool_2(P_{i−1})),  P_1 = FEM(D_1),
where FEM(·) represents the feature enhancement module and Pool_2(·) represents a 2× average pooling downsampling layer. The difference features are computed as
D_i = F_i^{T1} ⊖ F_i^{T2},
where ⊖ denotes element-wise subtraction and F_i^{T1}, F_i^{T2} are the backbone features of the two temporal images at level i.
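The bottom-up fusion can be sketched as follows, reusing the FEM sketch above. The 1 × 1 projections used to align channel counts between adjacent levels are our assumption, since the text only specifies pooling and element-wise addition.

```python
# DEFPN sketch: each level adds the 2x average-pooled output of the previous
# (finer) level before feature enhancement; channel alignment is simplified.
import torch
import torch.nn as nn


class DEFPN(nn.Module):
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.fems = nn.ModuleList([FEM(c) for c in channels])
        # project each level to the next level's channel count before pooling
        self.projs = nn.ModuleList(
            [nn.Conv2d(channels[i], channels[i + 1], 1) for i in range(len(channels) - 1)]
        )
        self.pool = nn.AvgPool2d(kernel_size=2, stride=2)    # 2x downsampling

    def forward(self, diffs):                                 # diffs: D1..D4, fine -> coarse
        outs, prev = [], None
        for i, d in enumerate(diffs):
            if prev is not None:
                d = d + self.pool(self.projs[i - 1](prev))    # fuse previous level
            p = self.fems[i](d)                               # feature enhancement
            outs.append(p)
            prev = p
        return outs                                           # P1..P4
```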
2.3. Cross-Scale Fusion Module
In the process of the DEFPN module fusing different features at different stages, the shallow feature information is not accurately modeled in relation to the deep features, so the subsequent feature fusion associated with each semantic class does not reach its best state. Currently, there are many cross-scale fusion architectures. The CSFF module proposed by Chen et al. [
50] integrates contextual information of features to better achieve target feature extraction, addressing the issue of inter-class similarity. Inspired by this, we design a cross-scale fusion module, which captures the context dependencies of different stages through a guiding mechanism, and adds semantic and detailed information to shallow features to generate richer representations. The structure is shown in
Figure 3.
The multi-scale feature maps obtained by element-wise subtraction are denoted D_i (i = 1, …, 4); since the features of each stage have different channel numbers and resolutions, we first use a 1 × 1 convolution layer and bilinear interpolation to bring them to the same size of 64 × 128 × 128. The expanded feature maps are denoted E_i; the adjacent feature maps E_i are then densely linked to obtain four new features N_i. Finally, a concat operation stitches them along the channel dimension, and a 1 × 1 convolutional layer restores the number of channels to obtain the final output feature. The output therefore jointly encodes the detailed information of the shallow layers and the deep semantic information, yielding a feature map with richer context and a stronger representation. The above process can be written as
E_i = Up(Conv_1×1(D_i)),  F_CFM = Conv_1×1(Concat(N_1, N_2, N_3, N_4)),
where Up(·) represents bilinear interpolation upsampling followed by batch normalization and an activation function, and Concat(·) refers to the concatenation operation.
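A sketch of the CFM under stated assumptions: in particular, how the adjacent expanded maps are "densely linked" is not fully specified above, so this version simply sums each map with its immediate neighbours as one plausible reading.

```python
# Cross-scale fusion module (CFM) sketch: reduce channels, upsample to a common
# size, link neighbouring maps, concatenate, and restore the channel count.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CFM(nn.Module):
    def __init__(self, in_channels=(64, 128, 256, 512), mid_channels=64, size=(128, 128)):
        super().__init__()
        self.size = size
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels]
        )
        self.fuse = nn.Sequential(                    # restore channels after concat
            nn.Conv2d(mid_channels * len(in_channels), mid_channels, kernel_size=1),
            nn.BatchNorm2d(mid_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, diffs):                         # D1..D4 at decreasing resolution
        expanded = [                                  # 1x1 conv + bilinear upsampling
            F.interpolate(self.reduce[i](d), size=self.size,
                          mode="bilinear", align_corners=False)
            for i, d in enumerate(diffs)
        ]
        linked = []
        for i, e in enumerate(expanded):              # dense links between neighbours
            n = e.clone()
            if i > 0:
                n = n + expanded[i - 1]
            if i < len(expanded) - 1:
                n = n + expanded[i + 1]
            linked.append(n)
        return self.fuse(torch.cat(linked, dim=1))    # concat, then 1x1 conv
```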
2.4. Attention-Based Multi-Scale Feature Fusion Module
In recent years, despite in-depth study of remote sensing CD tasks, many deep learning approaches still use only basic feature extraction networks, which produces unsatisfactory results. On the one hand, simple feature extraction networks are heavily influenced by semantic interference such as shooting angle, seasonal change, and illumination variation, and they often struggle to accurately label changed regions when objects are densely arranged or have complex shapes. On the other hand, they fail to fully exploit multi-scale information. Fusing multi-scale features helps establish connections between them, thereby enhancing the performance of the network. There are two common types of multi-scale feature fusion networks: parallel multi-branch networks, as seen in Inception [
51], SPPNet [
52], DeepLabV2 [
53], and PSPNet [
54]; and serial skip-connection structures, as used in FPN [
49], UNet [
32], HRNet [
55], PANet [
56], and BiFPN [
57]. These architectures perform feature extraction over different receptive fields. Although multi-scale fusion adequately considers features at different scales to merge local details and global structure, it cannot automatically learn and select important features or regions in the image and focus more attention on them. Therefore, we introduce attention mechanisms into multi-scale fusion. The edge information of the changed region can be better captured by layer-by-layer fusion of adjacent features, and the attention mechanism [
58] can better focus on the changing region and suppress the invariant region. Based on this, we propose an attention-based multi-scale feature fusion module (AMFFM).
Figure 4 describes the entire module in detail. The input is the four difference feature maps of different scales P_i obtained by the DEFPN module. Feature maps of adjacent scales are fed into the AFFM for feature fusion. During fusion, the resolution of the semantic features in the multi-scale feature maps is progressively restored, and attention to the changed regions is enhanced. We therefore use 6 AFFMs to obtain a feature map restored to 1/4 of the input image size; subsequently, 4× upsampling and a classification head composed of convolutions are used to obtain the classification results. The AFFM is shown in
Figure 4b. The feature maps of different stages have different channel numbers and sizes; therefore, a 2× upsampling operation is applied to the deep features, which are then concatenated with the adjacent upper-layer features to obtain a feature map of size 2C × H × W. Since direct cross-level feature fusion does not take into account the importance and interactivity of the channel and spatial dimensions, we send the resulting feature map to the attention module, where it is recalibrated along the spatial and channel dimensions, strengthening the network’s attention to the most noteworthy feature information. Based on the CAM and SAM of CBAM, we propose a new attention module, whose structure is shown in
Figure 4c. Unlike CBAM, which connects CAM and SAM in series, we adopt a three-branch parallel design. One branch guides the original features, and the other two extract the change information of the channel and spatial dimensions of the original feature map, respectively. The three branches are fused by skip connections to achieve mutual guidance: the original feature map is first refined by CAM and SAM to obtain channel and spatial attention, which is multiplied element-wise with the original features to restore the size and then concatenated with the original features; next, a 1 × 1 convolution keeps the number of channels consistent; finally, the three branches are added element-wise to obtain the final output feature map. The above process can be written as:
F_c = Conv_1×1(Concat(F, F ⊗ M_c(F))),  F_s = Conv_1×1(Concat(F, F ⊗ M_s(F))),  F_out = F + F_c + F_s.
Here, Conv_1×1 represents a two-dimensional convolution with a kernel size of 1 followed by batch normalization and an activation function, F represents the input feature map, M_c represents channel attention, M_s represents spatial attention, F_c is the feature map after channel attention and convolution processing, and F_s is the feature map after spatial attention and convolution processing.
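One possible reading of the three-branch attention block is sketched below in PyTorch; it reuses the ChannelAttention and SpatialAttention sketches given in Sections 2.4.1 and 2.4.2 below, and the exact branch wiring is our assumption rather than the authors' code.

```python
# Three-branch attention sketch: an identity branch plus channel- and
# spatial-attention branches, fused by concatenation, 1x1 convs, and addition.
import torch
import torch.nn as nn


class TripleBranchAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.cam = ChannelAttention(channels)        # see Section 2.4.1 sketch
        self.sam = SpatialAttention()                # see Section 2.4.2 sketch
        self.reduce_c = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.reduce_s = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True))

    def forward(self, f):
        fc = self.reduce_c(torch.cat([f, f * self.cam(f)], dim=1))   # channel branch
        fs = self.reduce_s(torch.cat([f, f * self.sam(f)], dim=1))   # spatial branch
        return f + fc + fs                                           # three-branch sum
```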
The size of the output feature map is 2C × H × W. We then use a 3 × 3 convolution block to keep the number of channels of the feature map consistent. The overall computation of the AFFM can be written as
F_AFFM = Conv_3×3(NAM(Concat(F_i, Up_2(F_{i+1})))),
where Conv_3×3 is a 3 × 3 convolution block, NAM(·) represents the new attention module, and Up_2(·) is bilinear interpolation 2× upsampling.
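A sketch of a single AFFM step built on the attention block above; the 1 × 1 convolution used to align the deeper map's channels before upsampling is our assumption.

```python
# AFFM sketch: upsample the deeper map, concatenate with the adjacent shallower
# map, refine with the three-branch attention, then a 3x3 conv block.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AFFM(nn.Module):
    def __init__(self, channels, deep_channels=None):
        super().__init__()
        deep_channels = deep_channels or channels
        # align the deeper map's channels with the shallower one (assumption)
        self.align = nn.Conv2d(deep_channels, channels, kernel_size=1)
        self.attn = TripleBranchAttention(2 * channels)
        self.out = nn.Sequential(                     # 3x3 conv block, back to C channels
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, shallow, deep):
        deep = F.interpolate(self.align(deep), scale_factor=2,
                             mode="bilinear", align_corners=False)
        fused = torch.cat([shallow, deep], dim=1)     # 2C x H x W
        return self.out(self.attn(fused))
```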
Figure 5 shows the heat maps produced after we use the AFFM module. (a) and (b) are the original images, (c) is the label, (d) is the heat map of the backbone network without the AFFM module, and (e) is the heat map with the AFFM module. Areas where attention was originally weak or largely misplaced can be seen: red in the feature map marks highly attended areas, while blue and green mark areas with lower weight. After the AFFM module is introduced, the feature map covers these areas more effectively and accurately. Next, the CAM and SAM modules are introduced.
2.4.1. Channel Attention Module
Figure 6 shows the structure of the channel attention module, which keeps the channel dimension unchanged, compresses the spatial dimension, and focuses on the meaningful information in the input features. The input feature map has size 2C × H × W. First, the spatial information is squeezed by average pooling and max pooling operations to refine the channels, compressing the feature map into two tensors of size 2C × 1 × 1. The two tensors are then fed into a shared multi-layer perceptron (MLP) with one hidden layer, and the outputs are combined into a single feature vector by element-wise summation. Finally, the importance of each channel is obtained by assigning it a weight through Sigmoid excitation. The formula can be described as
M_c(F) = σ(MLP(AvgPool(F)) + MLP(MaxPool(F))),
where σ represents the Sigmoid activation function, F represents the input feature map, and M_c(F) refers to the feature map obtained by CAM.
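A minimal sketch of the CBAM-style channel attention described above; the channel reduction ratio of 16 in the shared MLP is an assumption, not a value given in the text.

```python
# Channel attention module (CAM) sketch: average/max pooling, a shared
# one-hidden-layer MLP, element-wise summation, and Sigmoid excitation.
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.max_pool = nn.AdaptiveMaxPool2d(1)
        self.mlp = nn.Sequential(                    # shared MLP with one hidden layer
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        # squeeze spatial information with average and max pooling, sum, excite
        w = self.mlp(self.avg_pool(f)) + self.mlp(self.max_pool(f))
        return self.sigmoid(w)                       # per-channel weights, C x 1 x 1
```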
2.4.2. Spatial Attention Module
Figure 7 shows the structure of the spatial attention module, which keeps the spatial dimension unchanged, compresses the channel dimension, and focuses on the position information of the target. Its input is the output obtained by multiplying the original features element-wise with the CAM weights. First, the channel information is squeezed by average pooling and max pooling operations to refine the spatial dimension, compressing the feature map into two tensors of size 1 × H × W. A concat operation then stacks the two maps, and a 7 × 7 convolution transforms them into a single-channel feature map. Finally, the SAM output is obtained through a Sigmoid function. The formula is as follows
M_s(F) = σ(Conv_7×7(Concat(AvgPool(F), MaxPool(F)))),
where σ represents the Sigmoid activation function, Conv_7×7 represents a two-dimensional convolution with a kernel size of 7 followed by batch normalization and an activation function, and M_s(F) refers to the feature map obtained by SAM.
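A minimal sketch of the spatial attention described above; here the pooling is taken along the channel axis, as the text specifies.

```python
# Spatial attention module (SAM) sketch: channel-wise average and max pooling,
# concatenation, a 7x7 convolution with BN, and a Sigmoid weight map.
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, f):
        avg = torch.mean(f, dim=1, keepdim=True)       # 1 x H x W
        mx, _ = torch.max(f, dim=1, keepdim=True)      # 1 x H x W
        return self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
```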
2.5. Loss Function
In the training phase, we use the BCEWithLogitsLoss binary classification loss function. BCEWithLogitsLoss combines the binary cross entropy loss with a Sigmoid function. It not only avoids the numerical instability of applying the Sigmoid function separately, but also mitigates the vanishing gradient problem during training. Its calculation first maps the predicted score through the Sigmoid function and then computes the binary cross entropy between the result and the real label. The formula is as follows
L = −[y · ln(σ(x)) + (1 − y) · ln(1 − σ(x))],
where ln is the natural logarithm, x is the predicted value output by the model, σ is the Sigmoid function, and y is the true label. The loss compares the predicted value with the real value, measures the difference between them, and thereby helps the model gradually improve its predictions.
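A brief usage sketch of PyTorch's BCEWithLogitsLoss for a binary change map; the tensor shapes and batch size are illustrative only.

```python
# BCEWithLogitsLoss applies the Sigmoid and the binary cross entropy in one
# numerically stable operation, so the network outputs raw logits.
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(6, 1, 256, 256, requires_grad=True)      # raw outputs, no sigmoid
labels = torch.randint(0, 2, (6, 1, 256, 256)).float()        # 1 = changed, 0 = unchanged

loss = criterion(logits, labels)
loss.backward()                                                # gradients for training
```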
2.6. Learning Rate and Evaluation Index
We use the Adam optimizer for our network and BCEWithLogitsLoss as the loss function. The choice of learning rate is very important in deep learning tasks: if the learning rate is too high, the model will not converge and the loss will explode; if it is too low, convergence becomes very slow and training the fusion network becomes much harder. In our experiments, we set the decay factor to 0.9, the batch size to 6, the initial learning rate to 0.001, and the maximum number of training iterations to 200. The evaluation metrics employed are mean intersection over union (MIoU), precision (PR), recall (RC), pixel accuracy (PA), and the F1 score (F1). PA is the proportion of correctly predicted pixels among all pixels; PR is the proportion of predicted change pixels that are truly changed; RC is the proportion of truly changed pixels that are correctly detected; and the F1 score summarizes the overall quality of the prediction. The higher these values, the better the prediction. MIoU is the mean ratio of intersection to union over the two classes, i.e., the changed and unchanged regions in the CD task. Combining the different evaluation metrics allows a more comprehensive assessment of model performance. The formulae are as follows
PA = (TP + TN) / (TP + TN + FP + FN),  PR = TP / (TP + FP),  RC = TP / (TP + FN),
F1 = 2 · PR · RC / (PR + RC),  MIoU = ½ · [TP / (TP + FP + FN) + TN / (TN + FP + FN)].
In these formulae, TP (true positive) is a correctly predicted changed area; FP (false positive) is an unchanged area incorrectly predicted as changed; TN (true negative) is a correctly predicted unchanged area; and FN (false negative) is a changed area incorrectly predicted as unchanged.
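A minimal sketch of computing these metrics from binary prediction and ground-truth masks; the small epsilon terms guarding against division by zero are an implementation detail, not part of the definitions.

```python
# Evaluation metrics from confusion counts; MIoU averages the IoU of the
# changed and unchanged classes, matching the two-class CD setting above.
import numpy as np


def evaluate(pred, gt):
    """pred, gt: binary arrays where 1 marks changed pixels."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)
    fp = np.sum(pred & ~gt)
    fn = np.sum(~pred & gt)
    tn = np.sum(~pred & ~gt)
    eps = 1e-10

    pa = (tp + tn) / (tp + tn + fp + fn)
    pr = tp / (tp + fp + eps)
    rc = tp / (tp + fn + eps)
    f1 = 2 * pr * rc / (pr + rc + eps)
    iou_change = tp / (tp + fp + fn + eps)
    iou_unchanged = tn / (tn + fp + fn + eps)
    miou = (iou_change + iou_unchanged) / 2
    return dict(PA=pa, PR=pr, RC=rc, F1=f1, MIoU=miou)
```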