1. Introduction
Remote sensing image change detection aims to identify pixel-level changes between dual-temporal images, which is a crucial research focus within the fields of pattern recognition and computer vision [
1]. Presently, it has extensive application in diverse domains, including monitoring natural disasters [
2], tracking urban expansion [
3], analyzing agricultural changes [
4], and studying environmental evolution [
5].
Before the ascent of deep learning, traditional change detection methods primarily involved comparing pixels. These methods typically required handcrafted features to describe pixel disparities and thus depended on considerable expertise and experience [
6]. They also struggle to accurately distinguish change from non-change areas when confronted with occlusions or complex scene changes. With the progress of technology, deep learning has found widespread application in many fields. By constructing multi-layer neural structures, deep models learn abstract image features, which reduces reliance on expert knowledge and makes them well suited to remote sensing imagery [
7,
8]. While detection results have improved to some extent, deep-learning-based methods still struggle to effectively capture the features of changed areas, indicating that there remains room for further improvement. Deep-learning-based change detection has nevertheless become the mainstream approach; depending on the task, existing methods can be categorized into three forms: pixel-level, object-level, and scene-level detection [
9].
Pixel-level change detection uses independent pixels as detection units and extracts change information by analyzing pixel differences with pixel-by-pixel operations, which is commonly employed in the initial stages of change detection. Typical approaches include the differential technique, the ratio method, and other direct pixel comparison methods [
10]. But these methods often fail to use image features effectively, which limits accuracy. In response to these limitations, scholars have proposed statistics-based detection methods, such as change vector analysis, principal component analysis, and texture-based analysis, as well as post-classification comparison methods that compare pixels after classification. However, these methods tend to rely on fixed features, so they are susceptible to environmental changes in images, such as lighting variations or shadows, resulting in poor performance in actual scenarios [
11]. Subsequently, machine learning methods, such as artificial neural networks, support vector machines, decision trees, and random forests, gained traction in change detection. Compared with traditional methods, machine learning approaches demonstrate significant improvements in accuracy. After the widespread adoption of deep learning, many researchers began treating pixel-level change detection as a semantic segmentation problem, applying models from the segmentation field to change detection tasks [
12,
13,
14,
15].
Object-level change detection utilizes various feature information from dual-temporal images and segments objects within images [
16]. As a key to object-level change detection, object generation necessitates ensuring the consistency of object boundaries at different times. In the early stages, traditional algorithms, such as the Robert operator, the Laplacian operator, or region segmentation algorithms, were commonly employed for image segmentation [
17]. However, these methods fell short in obtaining object boundaries. Currently, there are three approaches to object generation: single-temporal segmentation boundary, multi-temporal segmentation, and combined segmentation [
18]. A single boundary is applied across all temporal intervals in the first approach, which avoids the need for complex cross-temporal object matching and alignment. Although the single-temporal method entails lower overall computational complexity, the detection results are not sufficiently accurate. In contrast, the multi-temporal segmentation approach yields finer segmentation objects by using boundary superimposition, resulting in greater robustness. The combined segmentation method involves multi-temporal remote sensing image bands, addressing the limitations of single-temporal approaches and enhancing detection accuracy. However, this method also introduces the challenge of computational complexity. In sum, object-level change detection methods focus not only on changes in the pixel value but also on changes in objects, which contain more semantic information.
Scene-level change detection employs multi-temporal remote sensing images as units to assess changes across all pixels at the same time [
19]. It integrates both local and global information, effectively reducing the influence of noise. However, scene-level methods focus on the whole scene, which requires substantial computer memory when processing large scenes. In 2019, an enhanced U-Net++ network was introduced by Peng et al. [
20]. By combining a fully convolutional neural network and the U-Net structure, the model can not only adapt to images of any size for end-to-end training but also improve detection accuracy. Building upon Peng et al.’s work, Lin proposed cutting remote sensing images into regular image blocks and feeding them into the network to determine whether changes have occurred, which represents a new direction for scene-level change detection [
21]. Subsequently, by optimizing the size of image blocks, Li et al. proposed a model with further enhanced performance and computational efficiency in scene-level change detection [
22].
While object-level and scene-level change detection offer higher detection accuracy, they require predefined objects or scenes, which demands considerable manpower and time. Pixel-level change detection, on the other hand, relies on a simpler data source and has more flexible application scenarios. Consequently, researchers’ attention is currently largely directed towards the simpler pixel-level approach, which is also the focus of this study. At present, pixel-level change detection models comprise three stages: feature extraction, feature fusion, and upsampling. In feature extraction and feature fusion, exploiting multi-scale features from dual-temporal images has been demonstrated to be an effective way to predict subtle changes and enhance change detection accuracy [
23]. Therefore, scholars have tried to combine a variety of excellent feature extraction modules to extract multi-scale feature information and fuse it to improve accuracy in pixel-level change detection [
24,
25,
26]. There are two main strategies. The first replaces the CNN backbone with a Transformer to extract stronger feature representations. Although Transformer-based models have a larger receptive field and can better grasp the change region, their ability to process local details at the edges of change regions is limited, so the predicted change regions usually present blurred edges. The second strategy is based on a U-Net structure to fuse contextual feature information [
27,
28,
29]. This type of structure can integrate multi-level contextual information; nevertheless, it is hindered by an unsophisticated upsampling method, which restricts the model’s learning capability. At present, therefore, neither strategy can fully integrate image information or generate sufficiently accurate change maps, leaving these models prone to false and missed detections as well as to blurred edges in the change map.
To address the issue of inadequate feature representation and extraction in detection models and to mitigate edge blurring to provide distinct predictions of the boundary of change areas, a change detection model incorporating cross-layer feature fusion and edge constraints is proposed. The primary contributions of our study can be outlined as follows:
A fusion network based on a CNN and Transformer was designed as the feature extractor. In the feature extraction stage, the CNN structure is used to extract local feature information, and Transformer is used to extract global feature information. The features are then fused through the spatial feature interaction module and the feature fusion module, which strengthens the correlation between local and global information, reduces false and missed detections, and improves accuracy.
A boundary constraint module based on the MLP structure was added, which aggregates fragmented edge information and uses it to constrain the boundaries of the change areas in the feature map. To improve the learning ability of the model, the Bilinear and Pixel Shuffle methods are used for upsampling in the spatial and channel dimensions, respectively.
The remainder of this paper is organized as follows:
Section 2 presents an overview of the change detection literature.
Section 3 provides the overall details of the model design.
Section 4 introduces the experimental part of our proposed model, and finally,
Section 5 summarizes our findings and future work.
3. Methods
In this section, we provide a detailed description of the proposed model’s specific structure. To address the issues of insufficient feature fusion and blurred boundaries in current change detection methods, we propose a model that employs cross-layer feature fusion and feature exchange. Our approach is based on the concept of feature interaction and integrates both the local information extracted by a CNN and the global information extracted by Transformer. The edge information obtained by the Sobel operator constrains the boundaries of the change area, enhancing detection accuracy and reducing edge blurring. Specific technical details and the technical roadmap are illustrated in
Figure 1.
The proposed method comprises feature extraction, feature fusion, and upsampling stages, following the fundamental structure of current change detection methods. For the input dual-temporal images $T_1$ and $T_2$, the Sobel operator is first used to extract the edge maps $E_1$ and $E_2$. The edge maps are then concatenated with the original images along the channel dimension to form the initial inputs, denoted by $X_1$ and $X_2$, respectively.
After a dual-branch backbone network (CT-backbone), multi-scale feature maps are extracted, denoted by $F_i^1$ and $F_i^2$ ($i = 1, \dots, 5$), respectively, where the index $i$ indicates the size of the feature map relative to the original image. The specific correspondence is shown in Table 1; in the following, the same labels have the same meaning.
The extracted multi-scale features $F_i^1$ and $F_i^2$ are then passed to the cross-layer feature fusion module, which enhances the representation of the multi-scale features and yields an improved representation, denoted by $C_i$ ($i = 1, \dots, 5$). Simultaneously, the edge maps $E_1$ and $E_2$ extracted by the Sobel operator are integrated through an MLP module to obtain the edge constraints $B_i$ ($i = 1, \dots, 5$). Then, $C_i$ and $B_i$ are concatenated along the channel dimension and passed through a convolutional layer to produce the feature maps $D_i$ ($i = 1, \dots, 5$), which contain rich semantic and edge information.
To effectively utilize both spatial and channel information, the proposed model upsamples each $D_i$ with the Bilinear and Pixel Shuffle methods in parallel, producing $U_i^{b}$ and $U_i^{p}$ ($i = 1, \dots, 5$), respectively. Both are then fed to the feature interaction module to fully integrate spatial and channel information, and the outputs are denoted by $\hat{U}_i^{b}$ and $\hat{U}_i^{p}$ ($i = 1, \dots, 5$). Finally, the fused outputs at the different scales are upsampled to the same size by bilinear interpolation at the corresponding magnifications, concatenated, and passed through a deconvolutional layer to obtain the final change map $\hat{Y}$, the model output. The above is the main flow of the proposed model; the following subsections discuss the implementation details of each module.
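As a concrete illustration of the input preparation described above, the following sketch extracts Sobel edge maps from the two input images and concatenates them with the originals along the channel dimension. The tensor names and the absence of any normalization are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the input preparation: Sobel edge maps E1, E2 are extracted
# from the dual-temporal images T1, T2 and concatenated with them along the
# channel dimension to form the backbone inputs X1, X2 (names are illustrative).
import torch
import torch.nn.functional as F

def sobel_edges(img: torch.Tensor) -> torch.Tensor:
    """Return a per-channel Sobel gradient magnitude map for a (N, C, H, W) image."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]], device=img.device)
    ky = kx.t()
    c = img.shape[1]
    kx = kx.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    ky = ky.view(1, 1, 3, 3).repeat(c, 1, 1, 1)
    gx = F.conv2d(img, kx, padding=1, groups=c)   # horizontal gradients
    gy = F.conv2d(img, ky, padding=1, groups=c)   # vertical gradients
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

T1 = torch.rand(1, 3, 256, 256)   # pre-change image
T2 = torch.rand(1, 3, 256, 256)   # post-change image
E1, E2 = sobel_edges(T1), sobel_edges(T2)
X1 = torch.cat([T1, E1], dim=1)   # input of the pre-change branch
X2 = torch.cat([T2, E2], dim=1)   # input of the post-change branch
```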
3.1. Dual-Branch Backbone Network Based on CNN and Transformer
To enhance the model’s ability to extract image features, a dual-branch backbone using both a CNN and Transformer is proposed. The detailed structure is depicted in
Figure 2.
For the input images $X_1$ and $X_2$, the proposed method first uses two convolutional layers to extract shallow feature information. Each layer is a convolutional module with a 3 × 3 kernel, a stride of 2, and a padding of 1, and produces the feature map $F_j^k$ of the $j$-th stage ($j \in \{1, 2\}$) for the pre-change image ($k = 1$) and the post-change image ($k = 2$). Afterwards, Transformer stages are used to obtain the deeper features $F_j^k$ ($j \in \{3, 4, 5\}$); each stage consists of a downsampling module followed by the standard multi-head attention module of the Transformer model [39]. To further improve the feature extraction capability, a spatial feature interaction module is added.
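The following minimal PyTorch sketch illustrates one plausible realization of this CT-backbone: two strided 3 × 3 convolutions for the shallow stages, followed by Transformer stages that downsample and apply multi-head self-attention. The channel widths, number of heads, use of a strided convolution as the downsampling module, and weight sharing between the two temporal branches are assumptions, not the paper's exact configuration.

```python
# Sketch of a CT-backbone: two strided convolutions, then Transformer stages
# (downsampling + multi-head self-attention). Widths/heads are assumptions.
import torch
import torch.nn as nn

class TransStage(nn.Module):
    """Downsampling followed by standard multi-head self-attention over spatial tokens."""
    def __init__(self, in_ch, out_ch, heads=4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.norm = nn.LayerNorm(out_ch)
        self.attn = nn.MultiheadAttention(out_ch, heads, batch_first=True)

    def forward(self, x):
        x = self.down(x)
        n, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (N, H*W, C)
        tokens = self.norm(tokens)
        tokens, _ = self.attn(tokens, tokens, tokens)   # standard multi-head attention
        return tokens.transpose(1, 2).reshape(n, c, h, w)

class CTBackbone(nn.Module):
    """Backbone applied to each temporal branch (weight sharing assumed); returns 5 scales."""
    def __init__(self, in_ch=6, widths=(32, 64, 128, 256, 512)):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, widths[0], 3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(widths[0], widths[1], 3, stride=2, padding=1)
        self.trans = nn.ModuleList(
            TransStage(widths[i], widths[i + 1]) for i in range(1, 4)
        )

    def forward(self, x):
        feats = []
        x = self.conv1(x); feats.append(x)
        x = self.conv2(x); feats.append(x)
        for stage in self.trans:
            x = stage(x); feats.append(x)
        return feats                                     # [F_1, ..., F_5]

backbone = CTBackbone()
features = backbone(torch.rand(1, 6, 256, 256))
print([f.shape for f in features])
```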
Fang et al. [
40] pointed out that feature interaction is essential to change detection. The core of change detection lies in detecting areas that share the same spatial position but differ in their temporal characteristics. Distinguishing whether an image depicts the scene before or after a change only helps analysts confirm that the appearance or disappearance of targets is reasonable; for the model, whether a target appears or disappears is irrelevant, since the change itself is what matters. In other words, feature interaction does not alter the semantic information of change, which makes it feasible for change detection. On one hand, the model can perceive contextual information between the image pair by exchanging features; on the other hand, the data distributions of the dual-temporal images become more similar after the exchange, so the model automatically adapts to both distributions. The spatial information exchange is implemented as

$$\tilde{F}_j^1 = M \odot F_j^2 + (1 - M) \odot F_j^1, \qquad \tilde{F}_j^2 = M \odot F_j^1 + (1 - M) \odot F_j^2,$$

where the feature maps have shape $n \times c \times w$, with $n$, $c$, and $w$ denoting the batch size, channel number, and spatial size, respectively; $\odot$ denotes element-wise multiplication; and $M$ is an exchange mask composed of 1s and 0s indicating the areas to exchange and not to exchange. In this model, the weight map output by the multi-head attention module is the criterion for determining whether an exchange should occur: if the weight at a position exceeds the threshold $\theta_s$, it is a region to be exchanged and $M = 1$; otherwise, $M = 0$.
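A minimal sketch of this spatial exchange, assuming the attention weight map is already normalized to [0, 1], is given below; the tensor names and the threshold value are illustrative.

```python
# Sketch of spatial feature exchange: positions whose attention weight exceeds
# a threshold are swapped between the two temporal branches.
import torch

def spatial_exchange(f1, f2, attn, theta_s=0.5):
    """Swap features of the two branches at spatial positions where attn > theta_s.

    f1, f2: (N, C, H, W) feature maps of the pre-/post-change branches.
    attn:   (N, 1, H, W) weight map from the multi-head attention module, in [0, 1].
    """
    m = (attn > theta_s).to(f1.dtype)        # binary exchange mask M
    f1_new = m * f2 + (1 - m) * f1           # exchanged pre-change features
    f2_new = m * f1 + (1 - m) * f2           # exchanged post-change features
    return f1_new, f2_new

f1 = torch.rand(1, 64, 32, 32)
f2 = torch.rand(1, 64, 32, 32)
attn = torch.rand(1, 1, 32, 32)              # stand-in for the attention weight map
g1, g2 = spatial_exchange(f1, f2, attn)
```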
By using feature interaction, the dual-branch backbone network based on the CNN and Transformer can extract both local and global information while enhancing interaction among features through spatial information exchange. This leads to a greater similarity in the data distributions between pre- and post-change images, thereby further highlighting the change areas.
3.2. Cross-Layer Feature Fusion
In deep learning, features at different levels often correspond to spatial information at different scales. Cross-layer feature fusion makes it possible to effectively integrate features at different scales, providing the model with a richer and more diverse feature representation and thereby helping it understand the semantic information in the images more comprehensively. After the backbone, the model extracts the feature maps $F_i^1$ and $F_i^2$ at various scales ($i = 1, \dots, 5$) from the dual-temporal images. These feature maps are then directed to the cross-layer feature fusion module for comprehensive feature fusion, which is illustrated in
Figure 3.
In Figure 3, the feature map $F_i^k$ is first subjected to a convolution operation to extract a more abstract representation, denoted by $G_i^k$. Multi-scale feature maps are then extracted with two branches: the left branch (indicated by the orange arrow) downsamples the feature map twice, while the right branch (indicated by the light-blue arrow) downsamples it four times. Afterwards, the downsampled feature maps and the initial inputs at the corresponding scales are concatenated along the channel dimension to obtain $C_i^k$, which integrates multi-scale features and multi-level contextual information. The same method is applied to both temporal branches ($k = 1, 2$).
After the dual-temporal feature maps undergo this comprehensive fusion in the cross-layer feature fusion module, the results $C_i^1$ and $C_i^2$ ($i = 1, \dots, 5$) are concatenated along the channel dimension and passed through a convolutional module to generate $C_i$, which integrates the multi-scale feature information of both images:

$$C_i = \mathrm{Conv}\big(\mathrm{Concat}(C_i^1, C_i^2)\big).$$
In contrast to CLNet, the cross-layer fusion method accomplishes the extraction and fusion of multi-scale features through the utilization of two asymmetric branches. This approach ensures that the intermediate feature map captures both higher-level and lower-level context information, enhancing the model’s feature-capturing capabilities. The improved extraction and representation aid the model in effectively discerning change areas.
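Since Figure 3 is not reproduced here, the following PyTorch sketch shows one plausible wiring of the idea under stated assumptions: the 2×- and 4×-downsampled features of a shallow scale are concatenated with the backbone features of the two deeper scales, and the two temporal branches are then merged by a 1 × 1 convolution. The channel widths and the use of average pooling as the downsampling operator are assumptions.

```python
# Sketch of cross-layer fusion: asymmetric 2x/4x downsampling branches inject
# shallow-scale context into deeper scales; dual-temporal maps are then merged.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerFusion(nn.Module):
    def __init__(self, shallow_ch, mid_ch, deep_ch):
        super().__init__()
        self.pre = nn.Conv2d(shallow_ch, shallow_ch, 3, padding=1)    # abstract representation G_i
        self.fuse_mid = nn.Conv2d(shallow_ch + mid_ch, mid_ch, 1)     # inject 2x-downsampled context
        self.fuse_deep = nn.Conv2d(shallow_ch + deep_ch, deep_ch, 1)  # inject 4x-downsampled context

    def forward(self, f_shallow, f_mid, f_deep):
        g = self.pre(f_shallow)
        g2 = F.avg_pool2d(g, 2)                      # left branch: downsample by 2
        g4 = F.avg_pool2d(g, 4)                      # right branch: downsample by 4
        c_mid = self.fuse_mid(torch.cat([g2, f_mid], dim=1))
        c_deep = self.fuse_deep(torch.cat([g4, f_deep], dim=1))
        return c_mid, c_deep

def temporal_merge(c1, c2, conv):
    """Concatenate the two temporal branches along channels and fuse: C = Conv([C^1, C^2])."""
    return conv(torch.cat([c1, c2], dim=1))

clf = CrossLayerFusion(64, 128, 256)
f3, f4, f5 = torch.rand(1, 64, 32, 32), torch.rand(1, 128, 16, 16), torch.rand(1, 256, 8, 8)
c4_1, c5_1 = clf(f3, f4, f5)
merge = nn.Conv2d(128 * 2, 128, 1)
c4 = temporal_merge(c4_1, c4_1.clone(), merge)       # same-shape stand-in for the second branch
```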
3.3. MLP-Based Edge Information Extraction Module
To enhance the accuracy of predicted change area boundaries, an edge information extraction module based on the MLP structure was incorporated. An MLP consists of an input layer, one or more hidden layers, and an output layer. In each hidden layer, every neuron is fully connected to the neurons of the preceding layer, forming a fully connected structure [
41]. This full connection facilitates the aggregation of features to a significant extent. In the experiment, the number of hidden layers was set to 2 for the preliminary aggregation of fragmented edges. Simultaneously, the GELU activation function was employed to accelerate the convergence of the model.
Due to the lack of semantic information, the simple edge maps $E_1$ and $E_2$ obtained from the dual-temporal images with the Sobel operator are often fragmented and disconnected. To address this, the MLP module is employed to aggregate the fragmented edge information, yielding edge details across multiple scales. The hidden layers of the MLP map the input features through nonlinear activation functions, allowing complex nonlinear transformations in a multidimensional space; this enables the MLP to capture richer and more complex patterns from the input features. Additionally, each neuron in every hidden layer is connected to all neurons in the previous layer, with each connection carrying a weight parameter. This connectivity facilitates information propagation and feature aggregation between hidden layers, enabling the MLP to extract rich information from the input features and aggregate it into higher-level representations. This strategy captures detailed edge information in change areas and mitigates edge blurring. Passing the edge maps through the MLP yields the features $B_i$, where $B_i$ represents the output of the $i$-th layer of the MLP structure; in total, five hierarchical edge features $B_1, \dots, B_5$ are extracted. These features are then fused with the features $C_i$ obtained in the cross-layer feature fusion stage and fed to the upsampling module.
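A minimal sketch of such an edge-aggregation MLP is given below: a per-pixel MLP with two hidden layers and GELU activations is applied to the stacked Sobel edge maps. Resizing the edge maps to each feature scale before applying the MLP is an assumption used here to produce the five hierarchical outputs; the exact multi-scale scheme follows Figure 1.

```python
# Sketch of MLP-based edge aggregation: two hidden layers, GELU activations,
# applied per pixel to the stacked edge maps at each scale (scale handling assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeMLP(nn.Module):
    """Per-pixel MLP over the channel dimension: in_ch -> hidden -> hidden -> out_ch."""
    def __init__(self, in_ch=6, hidden=32, out_ch=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_ch, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
            nn.Linear(hidden, out_ch),
        )

    def forward(self, edges):                       # edges: (N, in_ch, H, W)
        n, c, h, w = edges.shape
        tokens = edges.flatten(2).transpose(1, 2)   # (N, H*W, in_ch)
        out = self.mlp(tokens)
        return out.transpose(1, 2).reshape(n, -1, h, w)

edge_mlp = EdgeMLP()
E = torch.cat([torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)], dim=1)  # stand-in for E1, E2
B = [edge_mlp(F.interpolate(E, scale_factor=1 / 2 ** i)) for i in range(1, 6)]  # B_1..B_5
print([b.shape for b in B])
```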
3.4. Upsampling and Prediction Module
After the feature extraction and fusion modules, the feature maps $D_i$ ($i = 1, \dots, 5$) are obtained by merging the multi-level edge information with the multi-scale feature maps. To improve the integration of spatial and channel information, the Pixel Shuffle and Bilinear upsampling methods are applied to each feature map individually:

$$U_i^{p} = \mathrm{PixelShuffle}(D_i), \qquad U_i^{b} = \mathrm{Bilinear}(D_i),$$

where $U_i^{p}$ and $U_i^{b}$ denote the outcomes of the Pixel Shuffle and Bilinear upsampling, respectively ($i = 1, \dots, 5$). Then, the channel information interaction module is applied to exchange channel information between the upsampled results $U_i^{p}$ and $U_i^{b}$, as illustrated in detail in
Figure 4.
For instance, for the feature maps $U_i^{p}$ and $U_i^{b}$ of the $i$-th layer, the initial step transforms them into channel descriptors by global average pooling. Applying softmax then converts these descriptors into channel weights, and the channels whose weights exceed the threshold $\theta_c$ are selected for exchange, resulting in the exchanged feature maps $\hat{U}_i^{p}$ and $\hat{U}_i^{b}$. Through channel information exchange, the model is encouraged to capture the information of each feature more comprehensively; at the same time, the diversity and richness of the features improve, making the final representations more distinguishable. Afterwards, Bilinear upsampling by factors of 16, 8, 4, and 2 restores the feature maps to the original image size. Finally, the obtained feature maps are concatenated along the channel dimension, and the final output is obtained by a deconvolution.
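The following sketch illustrates this dual upsampling and channel exchange under simplifying assumptions: a 1 × 1 convolution expands the channels before a 2× Pixel Shuffle, and the channel weights are min-max normalized before the 0.5 threshold is applied so that roughly the stronger channels are exchanged; neither detail is specified in the text.

```python
# Sketch of dual upsampling (Pixel Shuffle + bilinear) and channel exchange
# between the two upsampled results. Channel widths and thresholding are assumed.
import torch
import torch.nn.functional as F

def dual_upsample(d, ps_conv):
    """Return 2x Pixel Shuffle and 2x bilinear upsamplings of the feature map d."""
    up_ps = F.pixel_shuffle(ps_conv(d), 2)                    # channel -> space rearrangement
    up_bi = F.interpolate(d, scale_factor=2, mode="bilinear", align_corners=False)
    return up_ps, up_bi

def channel_exchange(a, b, theta_c=0.5):
    """Swap channels whose (min-max normalized) softmax weight exceeds theta_c."""
    n, c, _, _ = a.shape
    w = torch.softmax(a.mean(dim=(2, 3)), dim=1)              # (N, C) channel weights
    w = (w - w.min(1, True).values) / (w.max(1, True).values - w.min(1, True).values + 1e-6)
    mask = (w > theta_c).view(n, c, 1, 1).to(a.dtype)         # channels selected for exchange
    return mask * b + (1 - mask) * a, mask * a + (1 - mask) * b

d = torch.rand(1, 64, 16, 16)
ps_conv = torch.nn.Conv2d(64, 64 * 4, 1)                      # expand channels for 2x Pixel Shuffle
u_ps, u_bi = dual_upsample(d, ps_conv)
u_ps_hat, u_bi_hat = channel_exchange(u_ps, u_bi)
```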
3.5. Loss Function
For change detection models, MSE loss is widely used, as it effectively assesses performance over the entire image. The proposed model also uses it as part of the loss function:

$$L_{\mathrm{mse}} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2.$$

As datasets may exhibit an imbalance between the numbers of positive and negative samples, dice loss is also incorporated into the loss function. Dice loss is known for its sensitivity to imbalanced data, which serves to mitigate the effects of data imbalance, and is defined as

$$L_{\mathrm{dice}} = 1 - \frac{2 \sum_{i=1}^{n} y_i \hat{y}_i + \varepsilon}{\sum_{i=1}^{n} y_i + \sum_{i=1}^{n} \hat{y}_i + \varepsilon}.$$

In the above equations, $n$ represents the total number of pixels, and $y_i$ and $\hat{y}_i$ represent the real change map and the model prediction map, respectively, with values ranging from 0 to 1. To avoid division by zero when no change regions are present, a smoothing factor $\varepsilon$ is included in the dice loss. The final loss is, therefore, defined as the combination of the MSE loss and the dice loss.
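A minimal sketch of this combined loss, assuming an equal weighting of the two terms and a smoothing factor of 1.0 (neither value is given in the text), is shown below.

```python
# Sketch of the combined MSE + dice loss; equal weighting and smooth=1.0 are assumptions.
import torch

def change_detection_loss(pred, target, smooth=1.0):
    """pred, target: (N, 1, H, W) tensors with values in [0, 1]."""
    mse = torch.mean((pred - target) ** 2)
    inter = (pred * target).sum()
    dice = 1 - (2 * inter + smooth) / (pred.sum() + target.sum() + smooth)
    return mse + dice

pred = torch.rand(2, 1, 256, 256)
target = (torch.rand(2, 1, 256, 256) > 0.5).float()
loss = change_detection_loss(pred, target)
```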
4. Results
4.1. Datasets and Experimental Setup
The experiments were conducted with three public datasets: LEVIR-CD [
42], WHU Building [
43], and xBD [
44].
The LEVIR-CD dataset collects Google Earth remote sensing images of multiple cities, including Austin and Lakeway in Texas, USA. It presents a large number of illumination changes due to seasonal effects, and the building change areas are small and dense, which makes it more challenging to determine the actual change areas. The WHU Building dataset was proposed by the Wuhan University team. Compared with the LEVIR-CD dataset, it has larger buildings, and the change areas are sparser. The xBD dataset was proposed by MIT and contains remote sensing images before and after 19 natural disasters such as earthquakes, volcanoes, and floods, and the change areas are mostly irregular, making detection more difficult.
In our experiments, the original images were cropped into non-overlapping 256 × 256 patches and then randomly allocated to the training, validation, and test sets in a ratio of 7:2:1. Notably, the xBD dataset classifies change areas into four damage levels: no damage, minor damage, major damage, and destroyed. As our focus lies solely on change areas, we treated the latter three classes as change during experimentation. The models were trained from scratch for 30 epochs with an initial learning rate of 0.001 and a batch size of 16; after the initial 15 epochs, the learning rate was decreased by 10% every 5 epochs. The thresholds for spatial feature exchange ($\theta_s$) and channel feature exchange ($\theta_c$) were both set to 0.5. To ensure an equitable performance comparison, the loss functions of all the comparative methods were replaced with the proposed model’s loss function, neutralizing performance discrepancies due to varied loss functions.
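A minimal sketch of this training schedule follows, assuming an Adam optimizer and a placeholder model (the optimizer is not specified in the text); the exact epoch at which the first 10% reduction occurs is one reading of the schedule.

```python
# Sketch of the training schedule: 30 epochs, lr 0.001, batch size 16, lr reduced
# by 10% every 5 epochs after the first 15. Optimizer and model are placeholders.
import torch

model = torch.nn.Conv2d(6, 1, 3, padding=1)        # placeholder for the full model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: 1.0 if epoch < 15 else 0.9 ** ((epoch - 15) // 5 + 1),
)

for epoch in range(30):
    # ... iterate over 256x256 training patches in batches of 16, backpropagate, then:
    optimizer.step()        # placeholder for the actual per-batch updates
    scheduler.step()
```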
4.2. Comparison Method
DSIFN [
45]: A change detection model that uses cross-layer connections for feature fusion.
SNUNet [
46]: A change detection model employing multi-layer feature fusion with dense connections that combines a Siamese network and NestedUNet to extract sophisticated features and incorporates channel attention and deep supervision techniques to enhance the recognition ability of intermediate features.
BIT [
36]: A detection model based on Transformer that uses Transformer to build an encoder–decoder structure, enhances the feature information of the context through semantic tokens and feature differences, and obtains the change map.
ChangeFormer [
38]: A change detection model that enhances its feature extraction capabilities by replacing the convolutional neural network with Transformer. Additionally, it utilizes the MLP structure to enhance feature differences.
SGSLN [
47]: A strategy that exchanges dual encoder–decoder backbones for binary change detection; a temporal fusion attention module is employed to effectively fuse dual-temporal features for enhanced detection.
4.3. Evaluation Metrics
To quantitatively assess the models’ performance, five evaluation metrics were selected to measure the disparities between the predicted change maps and the actual change maps: precision (P), recall (R), F1 score (F1), overall accuracy (OA), and mean intersection over union (mIoU). Precision is the ratio of correctly predicted changed pixels to all predicted changed pixels, while recall is the ratio of correctly predicted changed pixels to all truly changed pixels. The F1 score is the harmonic mean of precision and recall. Overall accuracy reflects the proportion of correctly predicted pixels in the entire pixel count. The mean intersection over union provides a comprehensive assessment of detection performance for both change and non-change areas. The five indicators are calculated as

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R},$$

$$OA = \frac{TP + TN}{TP + TN + FP + FN}, \qquad mIoU = \frac{1}{2} \left( \frac{TP}{TP + FP + FN} + \frac{TN}{TN + FP + FN} \right),$$

where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative pixels, respectively.
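For reference, the following sketch computes the five metrics from binary prediction and ground-truth maps; the variable names are illustrative.

```python
# Sketch of P, R, F1, OA, and mIoU computed from a binary confusion matrix.
import torch

def metrics(pred, target):
    """pred, target: binary (0/1) tensors of the same shape."""
    tp = ((pred == 1) & (target == 1)).sum().float()
    tn = ((pred == 0) & (target == 0)).sum().float()
    fp = ((pred == 1) & (target == 0)).sum().float()
    fn = ((pred == 0) & (target == 1)).sum().float()
    p = tp / (tp + fp + 1e-6)
    r = tp / (tp + fn + 1e-6)
    f1 = 2 * p * r / (p + r + 1e-6)
    oa = (tp + tn) / (tp + tn + fp + fn)
    iou_change = tp / (tp + fp + fn + 1e-6)
    iou_nochange = tn / (tn + fp + fn + 1e-6)
    return {"P": p, "R": r, "F1": f1, "OA": oa, "mIoU": (iou_change + iou_nochange) / 2}

pred = (torch.rand(256, 256) > 0.5).long()
target = (torch.rand(256, 256) > 0.5).long()
print(metrics(pred, target))
```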
4.4. Results and Discussion
The experimental results show the adaptability of the proposed model across diverse change scenarios of varying scales. It demonstrated impressive performance not only on the WHU Building dataset, which contains larger structures, but also on the LEVIR-CD and xBD datasets, which contain smaller and denser change areas. The results are discussed below in both quantitative and qualitative terms.
Quantitative results: As shown in
Table 2 and
Table 3, the proposed model outperformed the other five models, achieving the best scores across the five evaluation indicators. On the LEVIR and xBD datasets, while the precision (P), recall (R), and F1 scores of the proposed model show only marginal improvement over the second-best method, the mIoU index exhibits a notable increase of approximately 1.5 percent compared with the other models. This is attributed to the boundary constraint module, which heightens the model’s sensitivity to change area edges. Consequently, edge blurring and the merging of adjacent areas are reduced, aligning the predicted change areas more closely with their actual shapes. The change areas in the WHU dataset are larger and their edges exhibit more regular shapes, so on this dataset the proposed model outperformed the other methods, which have no explicit edge constraints, in terms of both accuracy and mIoU.
Qualitative results: As shown in
Figure 5, on the LEVIR dataset, the change areas in images a and b exhibit denser and more regular boundaries. In image a, the SGSLN model exhibits suboptimal detection in the region highlighted by the blue box. This is due to the influence of the house shadow, resulting in fragmented results and an inability to accurately delineate the change area. Similarly, the SNUNet, DSIFN, BIT, and ChangeFormer methods are also affected by the shadow in this region, exhibiting varying degrees of overlap in their detection outcomes and poorly distinguishing change areas. In contrast, the proposed method demonstrated superior visual performance with minimal connected areas. Similarly, in image b, the proposed model outperformed the others significantly in the yellow box. The change area in image c presents an irregular shape, posing greater detection challenges than the first two images. However, the results illustrate that the proposed model exceled at capturing the region and preserving the shape of the change area, exhibiting no instances of missed detection or blurred boundaries. The WHU Building dataset features larger change areas with a more regular pattern than the LEVIR dataset.
Figure 6 shows the results for the WHU dataset. In images e and f, it is evident that the proposed model provided more comprehensive predictions of change areas and achieved greater accuracy at the boundaries.
Compared with the LEVIR dataset, the xBD dataset presents denser and smaller change areas characterized by numerous irregular shapes. Similar to the above, the proposed model achieved superior results when facing these challenges. As shown in
Figure 7, the region highlighted by the green box in image g reveals that all the models except ChangeFormer generated false detection results, erroneously identifying the top portion of land as a change area. Despite ChangeFormer having better performance in discerning change areas, its edge prediction notably lagged behind that of the proposed method. Likewise, within the green-marked area in image h, only the proposed method achieved exceptional detection outcomes for the irregular segments within the change area.
Overall, whether it was the xBD dataset with small and dense change areas, the LEVIR dataset with more common change area shapes, or the WHU dataset with larger change areas, the proposed model achieved superior outcomes. This is largely attributed to the efficacy of our boundary constraint module. By integrating boundary constraints, the proposed model achieves two key objectives: On one hand, it effectively discriminates among various change areas in dense regions and reduces regional overlap. On the other hand, it ensures that the predicted boundaries closely match their actual values.
To further validate the effectiveness of the model, distinct colors were employed to represent true positive (TP; white), true negative (TN; black), false positive (FP; red), and false negative (FN; green) results, as depicted in
Figure 8. The proposed model outperformed the others in various aspects. It effectively avoided false positive (FP) instances, indicated by the red regions. In images a, d, and h, the proposed model closely approximated the real values along the edges. Moreover, it significantly reduced the occurrence of missed detection, evident in the fewer green regions compared with the results of the other methods. This distinction is particularly noticeable in images a, c, and i.
To evaluate the effectiveness of the edge constraint module, ablation experiments were conducted on the LEVIR dataset.
Table 4 showcases the quantitative findings, and
Figure 9 illustrates the qualitative results. The model with the edge constraint module demonstrated improvements across different metrics compared with the model without it, with a notable increase in the mIoU metric. This improvement highlights the role of the edge constraint module in accurately predicting the change region. Visually, the change areas closely aligned with the true values at the edges, providing empirical evidence of the efficacy of the boundary constraint module.
Furthermore, the influence of employing two upsampling methods during the upsampling stage was taken into account. According to the results in
Table 4, it is clear that the combined use of the Pixel Shuffle and Bilinear upsampling methods can significantly boost detection accuracy. This increase stems from the concurrent integration of channel and spatial information, thereby improving the model’s capability to precisely capture change regions.
5. Discussion
Selecting appropriate models and algorithms is crucial for change detection: applying precise algorithms or models improves detection accuracy and yields more desirable results, which aligns with the current development trends in change detection.
Although previous CNN-based studies have demonstrated the powerful feature extraction capabilities of deep learning methods, the boundaries of the identified regions often remain unclear. This shortcoming is mainly due to the inadequate aggregation of contextual information during feature extraction. The network proposed in this study achieves contextual feature aggregation through a cross-layer feature fusion module and significantly enhances the precision of change regions by integrating spatial and channel information via a dual-branch upsampling module. Additionally, the introduction of a boundary constraint module, which consolidates fragmented edge information through an MLP module, effectively strengthens boundary constraints within change regions and reduces boundary blurring. These improvements are not only academically significant but also provide more precise and reliable solutions for practical change detection tasks, especially in natural disaster assessment and urban building change monitoring.
Despite the superiority of our method across multiple datasets, there is still room for further improvement. Future research can be carried out in the following aspects: First, more types of datasets and application scenarios can be explored to verify the generality and adaptability of the method. Second, although Transformers effectively capture global information, their computational requirements pose challenges for model training. To advance and extend the proposed method, future research could explore lightweight change detection methods aimed at enhancing the practicality and efficiency of existing methods.