1. Introduction
Building change detection (BCD) is one of the most significant research directions in remote sensing image processing [1]. By identifying structures at the same geographical location in images acquired at different times, it is possible to determine whether substantial changes have occurred to the buildings in the area, where a substantial change refers to a change in physical attributes, such as the conversion of wasteland or roads into buildings [2,3]. BCD is widely used in land resource management [4], environmental monitoring [5], urban planning [6], and post-disaster reconstruction [7]. Therefore, developing effective BCD methods is of the utmost importance.
Generally, there are two types of BCD methods: traditional methods and deep learning (DL)-based methods. Traditional methods are further divided into pixel-based and object-based methods [8]. Pixel-based methods usually generate difference maps by comparing spectral or texture information between pixels and then obtain BCD results through threshold segmentation or clustering algorithms [9,10]. However, treating pixels independently ignores contextual information, which introduces considerable noise [11]. Moreover, pixel-based methods are mainly suitable for low-resolution images with simple detail [12]. Hay et al. [13] introduced the concept of objects to remote sensing images, and a significant amount of research on object-based methods has followed [14,15,16]. Exploiting the rich spectral, texture, structural, and geometric information in bi-temporal images, the core idea is to segment the images into disjoint objects and analyze their differences [17]. By utilizing objects' spectral and spatial characteristics, object-based methods can improve detection accuracy [18]. The effectiveness of these methods, however, depends on the object segmentation algorithm and does not take semantic information into account, so they are easily disrupted by pseudo-changes [19]. Therefore, the generalization performance of both types of methods falls short of realistic needs due to limitations on applicable image resolutions and specific conditions [20,21].
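As a concrete illustration of the pixel-based pipeline described above (difference map, then threshold segmentation), the following NumPy sketch computes an absolute difference image and segments it with a minimal Otsu threshold. All names are illustrative, and this is a toy version of the classical approach, not the method of any cited work.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Pick the threshold that maximizes between-class variance (Otsu)."""
    hist, edges = np.histogram(values, bins=bins)
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2
    best_t, best_var = centers[0], -1.0
    for k in range(1, bins):
        w0, w1 = p[:k].sum(), p[k:].sum()   # class weights below/above split
        if w0 == 0 or w1 == 0:
            continue
        m0 = (p[:k] * centers[:k]).sum() / w0  # class means
        m1 = (p[k:] * centers[k:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2         # between-class variance
        if var > best_var:
            best_var, best_t = var, centers[k]
    return best_t

def pixel_based_cd(img_t1, img_t2):
    """Difference map + Otsu segmentation -> binary change mask."""
    diff = np.abs(img_t1.astype(float) - img_t2.astype(float))
    return diff > otsu_threshold(diff.ravel())
```

Because each pixel is thresholded independently, the mask inherits exactly the noise and missing-context problems the text describes.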
With the continuous improvement of satellite earth observation capabilities, it has become easier to obtain remote sensing images with high spatial and temporal resolution [22,23,24], and details of ground objects are revealed more effectively [25]. In recent years, DL has demonstrated excellent results in a variety of computer vision tasks [26,27,28,29,30]. Compared to traditional methods, DL not only improves feature extraction ability but also improves detection efficiency, so it is widely used in BCD as well [31,32]. Because convolutional neural networks (CNNs) can process large amounts of data, methods such as SNUNet [33], STANet [34], and SCADNet [35] have performed reasonably well on BCD. By fusing the shallow and deep features of CNNs in a suitable manner, the BCD performance on high-resolution remote sensing images can be improved [36]. Although CNNs provide deep features rich in semantic information, which are conducive to identifying internal differences in buildings, they lack more detailed information [37]. In addition, while deep convolution can extract the details of shallow features, thereby preserving building edge contours, an inadequate amount of semantic information results in detection errors [38]. Compared to CNNs, the Transformer is capable of extracting feature information and modeling global dependency structures, which reduces the probability of feature information being lost during model computation [39]. Chen et al. [40] transform the input bi-temporal images into high-level semantic tokens and model the context in a compact token-based space-time model. Utilizing the relationship between each pixel and the semantic information enhances the feature representation of the original pixel space, further highlighting the changing buildings. However, for a Transformer to generalize well to other problems, a large number of high-quality samples must be used to train the network [41,42].
Although many excellent BCD methods have been proposed in recent years, two problems persist. On the one hand, current mainstream methods produce relatively simple feature forms that do not adequately represent the characteristics of changing buildings, so they are easily influenced by pseudo-changes. On the other hand, there are large scale differences in the extracted feature information, and ignoring the fusion of these heterogeneous features loses valuable information, which makes it difficult to accurately identify local–global feature expressions between buildings. Therefore, we propose a progressive context-aware aggregation network for BCD. The critical goal of our proposed method is to extract local–global changing building features effectively and to fuse the extracted multi-scale and multi-level features more reasonably. Finally, we use a fully convolutional network (FCN) to further refine the feature map after dense fusion and obtain the BCD results. Our major contributions are summarized below:
- (1)
We design the progressive context-aware aggregation module to stack deep convolution and self-attention, thus leveraging the complementary feature extraction capabilities of both. Deep convolution extracts shallow change information about buildings in bi-temporal images, while self-attention further acquires high-level semantic information. As a result, our extracted local–global features contain not only locally useful information but also globally complementary information.
- (2)
We propose the multi-scale and multi-level dense reconstruction (MMDR) module, which groups extracted local–global features according to pre- and post-temporal sequences and gradually reconstructs them, making the local–global information fusion more reasonable. Each group is connected through our multi-level dense reconstruction strategy. In addition, subsequent groups are able to reconstruct information based on prior reconstruction information provided by the previous group. This promotes the retention of effective information during the reconstruction process, and further enhances the ability to recognize areas that are changing within the building.
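The paper gives no code for MMDR; purely as an illustration of the dense reconstruction idea in contribution (2), where each group's fusion also receives the previous group's reconstruction, a toy NumPy sketch might look like this (the mean is a stand-in for the learned fusion, and all names are illustrative):

```python
import numpy as np

def dense_reconstruct(groups):
    """Toy sketch of multi-level dense reconstruction.

    groups: list of groups, each a list of same-shape feature levels.
    Each group is fused together with the PRIOR group's reconstruction,
    so later groups can build on previously retained information.
    """
    prior = None
    outputs = []
    for feats in groups:
        stacked = feats if prior is None else feats + [prior]
        prior = np.mean(stacked, axis=0)  # stand-in for learned dense fusion
        outputs.append(prior)
    return outputs
```

The design point being illustrated is the carry-over of `prior`: without it, each group would be fused in isolation and the information retained by earlier groups would be lost.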
3. Experiments and Results
We first introduce the four BCD datasets (LEVIR-CD, SYSU-CD, WHU-CD, S2Looking-CD) used in the experiments in detail, followed by the evaluation metrics and parameter settings. After that, we analyze the ablation experiment results. Finally, we compare our method with eight comparison methods in comprehensive visual and quantitative experiments.
3.1. Datasets
To demonstrate the effectiveness of our proposed method, we use four common BCD datasets: LEVIR-CD, SYSU-CD, WHU-CD, and S2Looking-CD. Each dataset contains bi-temporal images as well as labels of the changed buildings. The datasets are introduced as follows:
The LEVIR-CD [34] dataset is a collection of architectural images created by Beihang University, containing original Google Earth images collected between 2002 and 2018. Each original image has a resolution of 0.5 m and a size of 1024 × 1024 pixels. The changes involve barren land, residential areas, garages, grasslands, and other building modifications. To facilitate faster computation, each image was divided into 256 × 256 pixel patches without overlap. We therefore used 3096 pairs for training, 432 pairs for validation, and 921 pairs for testing.
Figure 3 illustrates six different scenarios from the LEVIR-CD dataset.
The SYSU-CD [39] dataset was released by Sun Yat-Sen University. It contains a total of 20,000 pairs of 256 × 256 pixel aerial images with a resolution of 0.5 m, captured in Hong Kong between 2007 and 2014. The majority of the changes in the dataset comprise the construction of urban and mountain buildings, urban roads, and coastline expansion. The images were divided into training, validation, and testing sets at a ratio of 6:2:2, resulting in 12,000, 4000, and 4000 pairs, respectively.
Figure 4 illustrates six different scenarios from the SYSU-CD dataset.
The WHU-CD [41] dataset is a BCD dataset released by Wuhan University and contains one image pair of 15,354 × 32,507 pixels with a resolution of 0.5 m, taken in Christchurch between 2012 and 2016. The reconstruction of buildings after earthquakes and the transformation of wasteland into buildings are the two major types of building change. We cropped the original images into 256 × 256 pixel patches without overlap and obtained 7432 pairs in total. Based on a 7:1:2 ratio, we divided them into training, validation, and test sets of 5201, 744, and 1487 pairs, respectively.
Figure 5 illustrates six different scenarios from the WHU-CD dataset.
The S2Looking-CD [45] dataset consists of 5000 bi-temporal very high-resolution image pairs taken between 2017 and 2020 from three types of satellites, the Gaofen (GF), SuperView (SV), and Beijing-2 (BJ-2) satellites, offering a wider perspective that provides richer change information. The imaging area covers a wide range of rural areas throughout the world with a variety of complex features. Each image has 1024 × 1024 pixels at a resolution of 0.5–0.8 m. We cropped each image into 512 × 512 pixel patches with a 50% overlap on each side (256 pixels both horizontally and vertically) to obtain 45,000 pairs. We then divided the images at a ratio of 7:1:2, resulting in 31,500, 4500, and 9000 pairs for training, validation, and testing, respectively.
Figure 6 illustrates six different scenarios from the S2Looking-CD dataset.
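The two tiling schemes used above (non-overlapping 256 × 256 crops for LEVIR-CD and WHU-CD, 512 × 512 crops with 50% overlap for S2Looking-CD) can be sketched with one function; the name is illustrative.

```python
import numpy as np

def tile(image, size, stride):
    """Crop an image into size x size tiles with the given stride.

    stride == size      -> no overlap
    stride == size // 2 -> 50% overlap on each side
    """
    h, w = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, stride)
            for c in range(0, w - size + 1, stride)]
```

For a 1024 × 1024 image this yields 16 non-overlapping 256-pixel tiles, or 9 half-overlapping 512-pixel tiles, which matches the 45,000 = 5000 × 9 pair count reported for S2Looking-CD.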
3.2. Experimental Details
3.2.1. Evaluation Metrics
In order to evaluate the CD methods, we use six metrics: Precision, Recall, F1-score, mIOU, OA, and Kappa. Among these metrics, a higher Precision denotes a lower false detection rate, whereas a higher Recall indicates a lower missed detection rate. F1-score, mIOU, OA, and Kappa values range from 0 to 1, with higher values representing stronger performance. Furthermore, we consider IOU_0 and IOU_1, which represent the IOU of the unchanged and changed pixels, respectively, giving eight indicators in total. More specifically, we calculate the evaluation metrics as follows:
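Using the standard confusion-matrix notation (TP, FP, TN, FN, with total pixel count $N = TP + FP + TN + FN$), the eight indicators are defined as:

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP+FP}, &
\mathrm{Recall} &= \frac{TP}{TP+FN}, \\
\mathrm{F1} &= \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}
                   {\mathrm{Precision}+\mathrm{Recall}}, &
\mathrm{OA} &= \frac{TP+TN}{N}, \\
\mathrm{IOU}_1 &= \frac{TP}{TP+FP+FN}, &
\mathrm{IOU}_0 &= \frac{TN}{TN+FP+FN}, \\
\mathrm{mIOU} &= \frac{\mathrm{IOU}_0+\mathrm{IOU}_1}{2}, &
\mathrm{Kappa} &= \frac{\mathrm{OA}-p_e}{1-p_e},
\end{align}
\qquad
p_e = \frac{(TP+FP)(TP+FN)+(TN+FN)(TN+FP)}{N^2}.
```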
In the formulas above, TP stands for True Positive, FP for False Positive, TN for True Negative, and FN for False Negative. Note that the expected agreement p_e is an intermediate variable in the calculation of Kappa.
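As a cross-check, all eight indicators can be computed directly from the four confusion counts; a minimal sketch (the function name is illustrative):

```python
def bcd_metrics(tp, fp, tn, fn):
    """Compute the eight BCD indicators from confusion-matrix counts."""
    n = tp + fp + tn + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou_1 = tp / (tp + fp + fn)          # IOU of changed pixels
    iou_0 = tn / (tn + fp + fn)          # IOU of unchanged pixels
    miou = (iou_0 + iou_1) / 2
    oa = (tp + tn) / n                   # overall accuracy
    # expected agreement: the intermediate variable for Kappa
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (oa - pe) / (1 - pe)
    return dict(precision=precision, recall=recall, f1=f1,
                iou_0=iou_0, iou_1=iou_1, miou=miou, oa=oa, kappa=kappa)
```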
3.2.2. Parameter Settings
All of our experiments are conducted using the PyTorch DP framework. A single NVIDIA Tesla A100 GPU with 80 GB of memory is used. During model training, the batch size is set to 24, and the maximum number of training epochs for each model is 400. We use AdamW as the optimizer, with an initial learning rate of 0.00035 and a weight decay rate of 0.001. To prevent overfitting during training, we use early stopping.
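The early-stopping scheme mentioned above can be sketched as a generic training loop. The patience value and all function names here are assumptions for illustration; the paper only states that early stopping is used with a 400-epoch cap.

```python
def train_with_early_stopping(train_epoch, validate, max_epochs=400, patience=20):
    """Stop training once the validation score has not improved for
    `patience` epochs; return the best epoch and its score."""
    best_f1, best_epoch = -1.0, -1
    for epoch in range(max_epochs):
        train_epoch(epoch)
        f1 = validate(epoch)
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch   # checkpoint would be saved here
        elif epoch - best_epoch >= patience:
            break                             # no improvement: stop early
    return best_epoch, best_f1
```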
3.3. Ablation Experiment
Table 2 shows the results of the ablation experiments, which evaluate the effectiveness of the progressive context-aware aggregation and MMDR modules. Owing to ASPP's ability to enlarge the receptive field, adding ASPP to the baseline increases the F1-score of the network from 89.71% to 90.84%. Adding the MMDR module to the baseline enhances performance further, with the F1-score reaching 91.65%. This is because, in a dense reconstruction strategy, prior feature information can be taken into account during fusion, resulting in more abundant feature information.
To better understand the effectiveness of embedding deep convolution and self-attention in our progressive context-aware aggregation module, we examine four different embedding modes (C-C-C-C, C-C-C-T, C-T-T-T, C-C-T-T), where C represents deep convolution and T represents self-attention. The experimental results indicate that pure deep convolution (C-C-C-C) has the least ideal performance among the four, although its Precision still reaches 92.52%. Replacing the last layer of deep convolution with self-attention (C-C-C-T) significantly improves the Recall value, which reaches 91.19%, because the self-attention mechanism captures global change information effectively and reduces missed detections. Based on this, we substitute the last three layers with self-attention (C-T-T-T), but the results are not satisfactory, with F1-score and Recall of 91.68% and 90.81%, respectively. This is because a single layer of deep convolution cannot fully extract shallow local change features, so the advantages of the self-attention mechanism are not maximized into robust global change features. Finally, when we replace the second layer with deep convolution (C-C-T-T), our F1-score and mIOU are both the highest, at 92.02% and 91.64%, respectively. We thus establish that reasonably embedding deep convolution and self-attention can significantly enhance BCD performance.
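To make the embedding-order ablation concrete, here is a toy sketch of the C-C-T-T pipeline with stand-in stages: a 3 × 3 mean filter for "C" (local mixing only) and global softmax self-attention for "T" (every pixel attends to all pixels). It illustrates only the local-then-global ordering, not the actual module, and all names are illustrative.

```python
import numpy as np

def conv_stage(x):
    """Toy 'C' stage: 3x3 mean filter -- strictly local mixing."""
    h, w = x.shape
    padded = np.pad(x, 1, mode="edge")
    return sum(padded[i:i + h, j:j + w]
               for i in range(3) for j in range(3)) / 9.0

def attention_stage(x):
    """Toy 'T' stage: softmax self-attention over all pixels -- global mixing."""
    tokens = x.reshape(-1, 1)
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)
    return (w @ tokens).reshape(x.shape)

def cctt(x):
    """The C-C-T-T embedding order chosen in the ablation study."""
    for stage in (conv_stage, conv_stage, attention_stage, attention_stage):
        x = stage(x)
    return x
```

Swapping the stage tuple reproduces the other three ablation orders (e.g. four `conv_stage` entries for C-C-C-C).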
3.4. Visual Comparative Experiments
We select eight popular CD methods for visual and quantitative comparison experiments to demonstrate the effectiveness of the proposed method: four fully convolutional methods (FC-EF [46], FC-Siam-conc [46], FC-Siam-diff [46], and CDNet [47]), one LSTM-based method (LUNet [48]), two Transformer-based methods (IFNet [49] and BITNet [40]), and one combined CNN-Transformer method (MSCANet [50]).
Figure 7 illustrates the BCD results of the various methods for five different scenarios from the LEVIR-CD dataset, covering small, medium, large, and dense building changes. Our method detects all real changes without false positives in almost all scenarios. Although there is only a small building change in the scene in the first row, the change in texture is not significant, resulting in poor results for the other methods, while our method detects the main body of the changing building as well as its boundary information. Due to illumination effects, the three FC-based methods, CDNet, LUNet, and MSCANet all miss one building change in the fourth row; some edges are hidden in shadows, making it difficult to determine the exact boundary. Although all methods detect the actual change region well in the scene in the fifth row, the building edges extracted by the eight comparison methods clearly contain a significant number of falsely detected regions, whereas our method extracts more accurate details of the building edges.
Figure 8 illustrates the BCD results of the various methods on the SYSU-CD dataset. In the first row, although our method has some missed detection areas, there is only a small amount of false detection. In contrast, the false detections of FC-Siam-diff, IFNet, and BITNet are particularly prominent, misjudging road and vegetation changes as building changes. Because buildings are under construction, most methods perform poorly in the second and third rows. Despite the presence of false and missed detections in the results of our method, they are fewer than in the other methods. Because the actual change labels have a sinuous texture, no method effectively extracts the building edge information in the fourth row. Additionally, CDNet, LUNet, IFNet, BITNet, and MSCANet all incorrectly identify vegetation changes as building changes on the left side. In the fifth row, FC-Siam-conc incorrectly detects a vegetation change as a building change, resulting in a large area of red false positive pixels, while our method has the fewest false detections.
Figure 9 displays the visualized BCD results for the five scenes collected from the WHU-CD dataset. There is a drastic change in the bi-temporal images in the first row; as a result, CDNet, LUNet, IFNet, and BITNet fail to produce optimal detection results. LUNet incorrectly detects parking spaces as building changes. BITNet produces no false detections, but it completely misses the area where the change actually occurs. As can be seen in the third row, FC-Siam-diff, CDNet, LUNet, and IFNet all have false detections due to the shadow created by the light, with LUNet being the most obvious example. The fourth row contains relatively more irregular building variations, and all methods perform well, but our method extracts more delicate edges. The fifth row illustrates that IFNet cannot recognize large building changes effectively, while the other approaches also have a large number of error detections. However, our method identifies the actual change area almost perfectly.
Figure 10 illustrates the BCD results of the various methods for five different scenarios from the S2Looking-CD dataset. Due to cloud, lighting, and shooting-angle influences in this dataset, all methods perform worse than on the first three datasets. Although there are only small building variations in the first row, the strong cloud interference leads to unsatisfactory detections by the comparison methods. The second and third rows represent building expansion scenes, which are missed by several methods (LUNet, IFNet) because the buildings in the pre-temporal images are already under construction. Our method detects whether buildings are newly constructed additions more effectively because our self-attention mechanism maintains the global validity of the feature information. In the fourth and fifth rows, the illumination of the post-temporal images is extremely weak, which undoubtedly adds to the difficulty of BCD. Especially in the fifth row, the changed areas are numerous and dense, and the other eight comparison methods produce a large number of false positive and false negative results. In contrast, our method distinguishes unchanged and changed areas effectively regardless of lighting conditions.
3.5. Quantitative Comparative Experiments
Table 3 reports the overall comparison results for the eight evaluation metrics on the LEVIR-CD dataset. Our method achieves the best results in terms of Precision (93.41%), F1-score (92.02%), IOU_0 (98.07%), IOU_1 (85.22%), mIOU (91.64%), OA (98.26%), and Kappa (91.04%). Notably, MSCANet obtains the highest Recall (91.85%), indicating that the effective combination of CNN and Transformer can improve a network's ability to perceive building changes. Our proposed method has a slightly lower Recall than LUNet, IFNet, BITNet, and MSCANet, but our F1-score is still 2.7% higher than BITNet's, demonstrating that our method has the strongest comprehensive performance.
Table 4 presents the quantitative comparison results of our method and the comparison methods on the SYSU-CD dataset. Among the comparison methods, FC-EF achieves the best Recall, with a Precision and F1-score of 64.58% and 75.13%, respectively, which are 20.79% and 5.53% lower than ours. Benefiting from its multi-scale feature fusion strategy, IFNet obtains an F1-score of 80.98%, exceeding ours by 0.32%. Our MMDR module reduces the loss of key information in the process of multi-scale feature fusion; therefore, we have the highest Precision of 85.37%, which is 4.75% higher than the second-ranked BITNet.
Table 5 shows that we outperform the other eight methods in terms of Precision, F1-score, IOU_0, IOU_1, mIOU, OA, and Kappa on the WHU-CD dataset, achieving 91.44%, 89.22%, 98.97%, 80.55%, 89.76%, 99.01%, and 88.71%, respectively. These results further confirm the validity of the progressive context-aware aggregation and MMDR modules. Since the LSTM module in LUNet is not sensitive to building changes before and after the earthquake, its overall indicators are unsatisfactory. The results indicate that FC-Siam-diff achieves the highest Recall of 94.30%, which is 7.18% higher than our method. Although our method does not achieve the highest Recall, owing to our focus on preventing false positives, our IOU_0 and IOU_1 values reach 98.97% and 80.55%, respectively, indicating that our proposed method correctly identifies both unchanged and changed areas.
Table 6 shows the comprehensive comparison results of all methods on the S2Looking-CD dataset. All methods achieve lower performance than on the above three datasets due to the presence of a large number of side-looking images and irrelevant changes, such as seasonal and lighting variations. Moreover, the S2Looking-CD dataset contains fewer instances of changing buildings than the other three datasets; the IOU_0 values of all methods are therefore over 98%, whereas the IOU_1 values are below 50%. Under cloud interference, CDNet's Precision is poor, which indicates that its contraction and expansion blocks have difficulty detecting building changes. Notably, our Precision and F1-score values are still higher than those of the other comparison methods, reaching 69.68% and 65.36%. This shows that our proposed method can also handle difficult datasets.
Figure 11 illustrates the results of the various methods for the eight indicators in the form of cumulative distribution curves. LUNet, IFNet, BITNet, and our method perform exceptionally well, with all eight indicator values ahead of the other comparison methods. We outperform all comparison methods except on the Recall metric. Although LUNet, IFNet, BITNet, and MSCANet narrowly outperform our method on Recall, our Precision is 5% higher than even the second-place BITNet. Additionally, the highest F1-score indicates that we achieve the most comprehensive BCD results.
Figure 12 illustrates the results of the eight metrics as box diagrams. The horizontal line in each box represents the median, and the lines below and above the box indicate the minimum and maximum values, respectively. According to the box diagrams, the Precision distribution shows the greatest difference, suggesting that our method is the most effective at distinguishing actual changing pixels. Although our Recall is lower than that of LUNet, IFNet, BITNet, and MSCANet, the discrepancy is not significant. Additionally, the data distributions for the other six metrics are ideal, exceeding those of the other methods by a wide margin. Therefore, our proposed method demonstrates the highest degree of reliability among the nine methods tested.
3.6. Computational Efficiency Experiment
To assess the efficiency of the various methods, we use two metrics: the number of parameters (Params) and floating-point operations (FLOPs). Note that as a model's Params and FLOPs decrease, its complexity and computational cost decrease as well.
Each method is tested on two images of the same size (1 × 3 × 224 × 224 pixels), and the comparative results are presented in Table 7. Due to the simplicity of their models, the parameters of the three FC-based methods and CDNet are the smallest. BITNet's light backbone also contributes to its impressive model efficiency. Because of its deeply stacked convolutional networks, IFNet performs poorly in terms of model efficiency. We note that both our Params and FLOPs values are the highest, reaching 61.41 M and 77.01 G, respectively; the deep convolutions and large global receptive fields of our method compel us to increase its capacity.
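Params and FLOPs figures like those in Table 7 follow from standard per-layer counting. For a single 2D convolution layer the counts are as below (a sketch; the 64-channel example in the test is hypothetical, not a layer of any compared network):

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Trainable parameters of a k x k convolution layer."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def conv2d_flops(c_in, c_out, k, h_out, w_out):
    """FLOPs for one forward pass, counting a multiply-accumulate as 2 ops."""
    return 2 * k * k * c_in * c_out * h_out * w_out
```

Summing these quantities over all layers (plus the attention layers, whose cost grows with the square of the token count) yields the model totals; the quadratic attention term is one reason our global receptive fields inflate the FLOPs count.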
4. Discussion
This paper develops a progressive context-aware aggregation module capable of extracting local–global feature information from bi-temporal images. We investigate how to stack deep convolution and self-attention so as to utilize the strengths of both. In all four combinations in the ablation experiments, deep convolution always comes first. This is because, during the initial stage of feature extraction, the network mainly focuses on shallow local change information, where deep convolution is effective. Self-attention has the advantage of capturing long-range correlations across the spatial and temporal positions of different features, which is critical for extracting high-level semantic information. Our final embedding combination is C-C-T-T: we extract the superficial representation information of changing buildings through two layers of deep convolution, and then apply two layers of self-attention to obtain high-level semantic information about the changing buildings while ensuring the global validity of the feature information.
Figure 13 illustrates the attention maps generated by the four stages of the progressive context-aware aggregation module. Stage1 and Stage2 employ deep convolution to obtain shallow change features, while Stage3 and Stage4 utilize self-attention to excavate deep change semantics. Deep convolution extracts shallow change features regardless of whether the changes are small or large, allowing the network to locate and focus on the actual area of change. Integrating this a priori change information into Stage3 and Stage4 further enhances the global validity of the feature information.
There is no doubt that in all examples of the four datasets, the progressive context-aware aggregation module detects actual changes in the image from shallow to deep levels. Finally, all attention is focused on the changing area. Even though these analyses serve as post hoc CD explanations, they may provide evidence that our proposed method is based on credible information and that CD predictions are based on factors relevant to buildings.
5. Conclusions
In this paper, we propose an effective BCD method. The progressive context-aware aggregation module enables the network to extract rich local–global feature information from bi-temporal images more effectively through the reasonable stacking of deep convolution and self-attention. To make the fusion of feature information more reasonable, the MMDR module groups the extracted feature information based on pre- and post-temporal sequences and learns the key change information of the prior groups through multi-level dense fusion. Extensive experimental results demonstrate that our proposed method outperforms the other eight methods on the LEVIR-CD, SYSU-CD, WHU-CD, and S2Looking-CD datasets, on which our Precision values reached 93.41%, 85.37%, 91.44%, and 69.68%, respectively.
The results obtained with the proposed method are encouraging. Nevertheless, all of these results are based on labeled datasets, which are labor- and time-intensive to collect and annotate. In contrast, unlabeled data are easier to obtain, so our main future research focus will be on applying our method to unlabeled data via self-supervised BCD.