1. Introduction
Building information has wide applications in many domains, such as urban planning, cadastral mapping, geographic information censuses, and land-cover change analysis [1,2,3]. Building feature extraction algorithms have a profound effect on promoting intelligent city construction [4,5,6]. With the emergence of advanced sensors and platforms, multi-modal remote sensing data, such as LiDAR and aerial images, can be obtained. These remote sensing data provide accurate spatial information and abundant spectral features. For instance, high-spatial-resolution remote sensing images contain fine-grained 2D geometric structure and texture, while high-precision 3D spatial information can be acquired via LiDAR technology. Multi-modal feature fusion is conducive to improving the accuracy and efficiency of building detection.
Feature construction using remote sensing data is vital for improving method performance. However, building feature extraction is challenging due to many factors, such as scene complexity, shadow, multiple scales, occlusion, illumination, and diverse shapes. Some traditional methods apply spectral features and establish morphological indices to distinguish buildings from the background [7]. However, these indicators can change significantly with season and environment. Additionally, the diversity of geographical objects makes discrimination with shallow features difficult due to intra-class spectral variation and inter-class similarity. Other methods derive geometric structure information, such as corners, height variation, and normal vectors, to extract buildings from 3D point clouds [8,9,10]. Recently, multi-modal feature-fusion-based methods have been developed and have effectively improved building extraction by combining 2D and 3D information [11,12,13]. Nevertheless, traditional approaches based on prior knowledge are suited to specific data and are vulnerable to parameter settings. Moreover, semantic segmentation alone cannot provide object-level information, such as the position, size, and count of each building. Therefore, automatic building instance segmentation has high algorithmic complexity and remains a challenge in remote sensing data processing.
Deep learning algorithms have shown great potential in target detection, semantic segmentation, and classification using image data in recent years. Instead of relying on prior knowledge, deep neural networks can learn multi-level features. In particular, these methods provide an automatic and robust solution for building extraction. Convolutional neural networks (CNNs), the most popular network architecture in deep learning, have been widely used in remote sensing image processing. Compared with traditional approaches, CNNs can extract hierarchical features, from shallow to deep levels of semantic representation, by stacking convolutional blocks. Rather than over-relying on manual design, a deep convolutional network can automatically complete multiple tasks and flexibly assemble modules to meet different requirements.
Some algorithms based on CNNs have achieved excellent building detection and extraction performance. For instance, Wen et al. [12] constructed a detection framework for rotated building objects using an improved Mask R-CNN [13]. Meanwhile, atrous convolution and inception blocks were introduced into the backbone network to optimize feature extraction. This network obtained object-level segmentation results from complex backgrounds. Similarly, Ji et al. redesigned a U-Net structure and created the WHU building dataset [14] for building extraction and instance change detection using Mask R-CNN. Moreover, some methods use a one-stage, anchor-free instance segmentation framework to improve speed and segmentation accuracy. For example, Zhang et al. proposed a one-stage change detection network combined with a spatial–channel attention mechanism for newly built buildings [15]. Multi-scale built-up change regions were effectively detected in the public LEVIR and AICD datasets. Wu et al. improved the detectors based on CenterMask [16], following a one-stage detection paradigm, and established an attention-guided mask branch to segment building instances [17]. Experimental results showed that the method achieved better speed and accuracy than state-of-the-art methods.
Although deep learning methods provide various feature optimization strategies and can achieve excellent performance in processing remote sensing data, several issues still need to be addressed: (1) Most remote sensing datasets are mainly designed for the semantic segmentation of buildings or natural scenes; an instance segmentation dataset of buildings needs to be constructed for model training and evaluation. (2) Many methods use remote sensing data of a single modality, such as optical remote sensing images. However, some land covers, particularly roads, cars, and artificial ground, have spectral information similar to buildings, and misclassification often occurs in semantic segmentation networks. (3) Detectors based on the region proposal framework require a large number of predefined anchors and a filtering mechanism, which significantly reduces the training and inference efficiency of the model. (4) Buildings display significant shape differences and scale variability. Some anchor-free detectors cannot accurately regress locations due to the fixed grid computation in convolution. The main features of small buildings are lost after down-sampling, while large buildings often occupy most of a sample patch, and the global context is insufficient due to the limited receptive field [18]. These factors can cause inaccurate detection and segmentation.
To address the above problems, we designed a new convolutional neural network (CNN) framework for object-level building extraction. Some novel modules were developed to integrate multi-modal remote sensing data advantages, including 2D, high-spatial-resolution images and 3D LiDAR point cloud data. The main contributions of this study are summarized as follows:
We constructed an end-to-end instance segmentation CNN, combining anchor-free detection and semantic segmentation methods. Meanwhile, a local spatial–spectral perceptron was developed to optimize and fuse multi-modal features. This module can interactively compensate for spectral and spatial information in the convolutional operators and effectively recalibrates the semantic features. Furthermore, a cross-level global feature fusion module was constructed to enhance long-range context dependence;
An adaptive center point detector (ACPD), based on CenterNet, was proposed for multi-scale buildings with complex shapes, introducing explicit deformable convolution under supervised learning to enhance the size regression ability and the semantic intensity of the central points;
We created a building instance segmentation dataset using high-resolution aerial imagery and LiDAR data. This dataset provides highly precise instance labels for model training and evaluation.
4. Model Loss Function
The multi-task learning model adopts different losses, including semantic segmentation loss and object regression loss, to optimize network training. The Softmax function is used to normalize the predicted results in the segmentation task. We used the binary cross-entropy loss $L_{seg}$ for pixel-wise segmentation. Following the CenterNet loss function [21], $L_k$ and $L_{off}$ denote the focal loss and offset loss for the center point regression, respectively.
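The two regression losses follow the standard CenterNet formulation; a minimal Keras/TensorFlow sketch is given below. The tensor layout and the focal-loss hyperparameters (α = 2, β = 4, as in CenterNet) are assumptions for illustration rather than the exact implementation used here.

```python
import tensorflow as tf

def center_focal_loss(y_true, y_pred, alpha=2.0, beta=4.0, eps=1e-7):
    """Penalty-reduced pixel-wise focal loss L_k on the center heatmap (CenterNet-style)."""
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    pos_mask = tf.cast(tf.equal(y_true, 1.0), tf.float32)   # exact center pixels
    neg_mask = 1.0 - pos_mask
    pos_loss = -tf.pow(1.0 - y_pred, alpha) * tf.math.log(y_pred) * pos_mask
    neg_loss = (-tf.pow(1.0 - y_true, beta) * tf.pow(y_pred, alpha)
                * tf.math.log(1.0 - y_pred) * neg_mask)
    num_pos = tf.maximum(tf.reduce_sum(pos_mask), 1.0)
    return (tf.reduce_sum(pos_loss) + tf.reduce_sum(neg_loss)) / num_pos

def offset_loss(offset_true, offset_pred, center_mask):
    """L1 offset loss L_off, evaluated only at ground-truth center locations."""
    mask = tf.expand_dims(center_mask, axis=-1)              # (B, H, W, 1)
    num_pos = tf.maximum(tf.reduce_sum(mask), 1.0)
    return tf.reduce_sum(tf.abs(offset_true - offset_pred) * mask) / num_pos
```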
To predict the deformable offsets under supervised learning, we established self-supervision and strong-supervision loss functions for the scale and position regression. In the self-supervision, $E_1$~$E_4$ are the extreme points located on the boundary of the bounding box, so their coordinates satisfy fixed geometric relationships. As presented in Figure 4, the predicted central coordinates are the midpoints of $E_1$~$E_4$ in the abscissa and ordinate. The network applies the smooth $L_1$ function to constrain the above relationship, as defined in Equation (19). Similarly, in the strong supervision, the ACPD applies the offsets to regress the scale of the bounding box for each object, as defined in Equation (20), where $S_u$ and $S_v$ denote the width and height of the ground-truth bounding box, respectively. Finally, the total loss can be expressed by Equation (21), where constant coefficients are used to adjust the loss proportions in the multi-task training (following CenterNet [21], the coefficients were set to 1, 1, and 0.1 in the experiments).
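A minimal sketch of the self-supervised geometric constraint of Equation (19) is shown below. How the extreme points are packed into tensors, and which of E1~E4 mark the horizontal and vertical extents, are assumptions made only for illustration.

```python
import tensorflow as tf

def smooth_l1(x, delta=1.0):
    """Smooth L1 (Huber-like) penalty used for the scale and position regression."""
    abs_x = tf.abs(x)
    return tf.where(abs_x < delta, 0.5 * tf.square(x), abs_x - 0.5)

def self_supervision_loss(center_pred, extremes_pred):
    """Constrain the predicted center to be the midpoint of the extreme points E1~E4.

    center_pred:   (N, 2) predicted center coordinates (x, y).
    extremes_pred: (N, 4, 2) extreme points on the bounding-box boundary.
    """
    # Assumed convention: E1/E2 span the horizontal extent, E3/E4 the vertical extent.
    mid_x = 0.5 * (extremes_pred[:, 0, 0] + extremes_pred[:, 1, 0])
    mid_y = 0.5 * (extremes_pred[:, 2, 1] + extremes_pred[:, 3, 1])
    residual = tf.stack([center_pred[:, 0] - mid_x,
                         center_pred[:, 1] - mid_y], axis=-1)
    return tf.reduce_mean(smooth_l1(residual))

# Total loss of Equation (21), sketched with CenterNet-style weighting:
# L_total = L_seg + L_k + 1.0 * L_off + 0.1 * L_size (+ the geometric constraint terms).
```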
5. Experiments
5.1. Dataset Description
Many public, open building extraction datasets have been created to train advanced deep neural network models and verify their performance or accuracy [14]. However, these building datasets mainly serve the semantic segmentation of buildings with a single data source. Therefore, to train the proposed model and verify its effectiveness, we created an open building instance segmentation dataset using multi-modal remote sensing data (BISM), including high-spatial-resolution multispectral images and LiDAR data.
Table 1 displays the metadata information for the BISM dataset.
Generally, the BISM dataset covers 60 km² in Boston, Massachusetts, the United States, and comprises approximately 39,527 building objects, accounting for 23.39% of the total experimental area. The experimental area consists of various features, as shown in Figure 5. Some details are shown in the yellow rectangle for close-up inspection. Category imbalance brings challenges to the reasonable design of the model structure. Additionally, these buildings exhibit diverse textures and colors with complex geometric shapes. The above factors help to assess the potential of different models for automatic building interpretation and to evaluate their generalization ability. Multispectral aerial orthoimages were obtained in 2013 from the United States Geological Survey (USGS) [42] with 0.3 m spatial resolution and red–green–blue–near-infrared (RGB-NIR) channels. Thirty orthoimages with a size of 5000 × 5000 pixels were integrated and cropped into a mosaic image of 26,624 × 24,576 pixels. LiDAR point data (.las format) were derived from the National Oceanic and Atmospheric Administration (NOAA) [43] in 2013 with an estimated point spacing of 0.35 m, vertical accuracy of 5.2 cm, and horizontal accuracy of 36 cm.
5.2. Data Preprocessing
In the BISM dataset, noise points and outliers in the 3D LiDAR point cloud data were removed using the open-source software CloudCompare [44]. Raster products were generated for the different input strategies, including the normalized difference vegetation index (NDVI), digital elevation model (DEM), digital surface model (DSM), and normalized digital surface model (nDSM), as shown in Figure 5. The cloth simulation filter (CSF) [45] algorithm was used to generate the DEM and nDSM. Finally, the products derived from the LiDAR point cloud data were rasterized and resampled to a spatial resolution of 0.3 m.
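A minimal sketch of this raster derivation step is given below, assuming co-registered GeoTIFF products at 0.3 m resolution; the file names and the rasterio-based I/O are placeholders, not the actual processing chain.

```python
import numpy as np
import rasterio

# Read co-registered rasters (file names are placeholders).
with rasterio.open("ortho_rgbnir.tif") as src:
    red = src.read(1).astype(np.float32)
    nir = src.read(4).astype(np.float32)
with rasterio.open("dsm.tif") as src:
    dsm = src.read(1).astype(np.float32)
with rasterio.open("dem.tif") as src:          # CSF-derived terrain model
    dem = src.read(1).astype(np.float32)

# NDVI from the red and near-infrared channels.
ndvi = (nir - red) / np.maximum(nir + red, 1e-6)

# nDSM: above-ground height, i.e., the surface model minus the terrain model.
ndsm = np.clip(dsm - dem, 0.0, None)
```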
LiDAR point cloud data and images were geographically registered in the same projection coordinate system to reduce spatial shift. Due to the oblique errors of photogrammetry, we used the DEM derived from LiDAR to orthorectify the images. However, the edges of some buildings are not accurate in the DSM due to the sparsity of the point cloud data and the inherent errors of some interpolation algorithms, and some buildings still have oblique facades and shadows. Therefore, for the ground truth, we comprehensively considered the spatial relationship between the image and DSM to draw the correct vector boundary for each building. Concretely, we manually edited polygon vectors (.shp format) in ArcGIS software by visual interpretation, referencing OpenStreetMap (OSM).
For model training, the entire dataset was cropped into 2496 tiles with a size of 512 × 512 pixels. These tiles were divided into several subsets, including a training subset (1747 tiles), a validation subset (500 tiles), and a test subset (248 tiles). A data augmentation strategy was applied during model training to increase the number of samples and enhance the model's generalization ability. The sample patches were processed with fundamental image transformations, such as 180° rotation, adding random noise, and mirroring along the vertical or horizontal directions. As a result, the training subset was increased to 5241 tiles. We used the minimum bounding box to mark the location and extent of each building object. In addition, a subset (3000 tiles) was created from distinct areas of the WHU dataset [14] for the subsequent experimental comparison. The WHU subset was augmented and divided into a training subset (4200 tiles), a validation subset (600 tiles), and a test subset (300 tiles).
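The tiling and augmentation described above can be sketched as follows; the functions assume float-valued image arrays and a NumPy random generator, and are illustrative rather than the exact pipeline used to build the subsets.

```python
import numpy as np

def tile_image(image, mask, size=512):
    """Crop a large mosaic and its label mask into non-overlapping size x size tiles."""
    h, w = image.shape[:2]
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            yield image[y:y + size, x:x + size], mask[y:y + size, x:x + size]

def augment(tile, mask, rng):
    """Augmentations used here: 180-degree rotation, mirroring, random noise."""
    if rng.random() < 0.5:                      # rotate 180 degrees
        tile, mask = np.rot90(tile, 2), np.rot90(mask, 2)
    if rng.random() < 0.5:                      # horizontal mirror
        tile, mask = tile[:, ::-1], mask[:, ::-1]
    if rng.random() < 0.5:                      # vertical mirror
        tile, mask = tile[::-1, :], mask[::-1, :]
    if rng.random() < 0.5:                      # additive Gaussian noise on the image only
        tile = tile + rng.normal(0.0, 0.01, tile.shape).astype(tile.dtype)
    return tile, mask
```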
5.3. Experimental Configuration and Metrics
In the model training phase, all experiments were completed on the Keras/TensorFlow platform with 3 × 32 GB RAM and an NVIDIA Tesla V100 GPU. Each network model was trained for 400 epochs with an initial learning rate of 0.001 and a batch size of 16. The Adam algorithm was applied to optimize the training parameters with a momentum rate of 0.9. The learning rate was decreased if the validation accuracy did not improve within five epochs. The weight parameters in ResNet50 were initialized from a model pre-trained on the public ImageNet dataset. Other network layers were initialized by the Xavier [45] method. Several experiments were completed to verify the performance of the proposed method.
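The training configuration translates directly to Keras; a minimal sketch is shown below. `build_proposed_network`, `total_loss`, and the batched `train_ds`/`val_ds` datasets are placeholder names, not part of the released code.

```python
import tensorflow as tf

model = build_proposed_network()   # placeholder for the proposed network constructor

# Adam with initial learning rate 0.001 and first-moment decay (momentum) 0.9.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, beta_1=0.9)
model.compile(optimizer=optimizer, loss=total_loss)   # total_loss as in Equation (21)

callbacks = [
    # Reduce the learning rate when validation accuracy stalls for five epochs.
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_accuracy", mode="max",
                                         factor=0.5, patience=5),
]

# train_ds / val_ds are assumed to be tf.data pipelines already batched with size 16.
model.fit(train_ds, validation_data=val_ds, epochs=400, callbacks=callbacks)
```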
The proposed network model uses a multi-task learning paradigm to obtain instance segmentation results, whose accuracy is affected by both segmentation and detection. As a result, we evaluated its performance in an ablation experiment. Intersection over union (IOU), F1-score, and average precision (AP) are widely used in the evaluation of semantic segmentation and object detection. These metrics can be calculated from other evaluation parameters, including TP, FP, FN, precision, and recall. In the detection task, an object is marked as a true positive (TP) when the IOU between the predicted bounding box and its ground truth is greater than the threshold; otherwise, it is a false positive (FP). If an object is not identified, it is counted as a false negative (FN). In the experiment, the IOU threshold was set to 0.5. Precision and recall are defined by Equations (22) and (23), and AP is calculated by Equation (24). Similarly, to obtain the metrics for the semantic segmentation task, we only counted the pixels of each object within its bounding box. The predicted probability of each pixel was obtained by the Softmax function. The F1-score was applied to the segmentation task, as defined by Equation (25).
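For reference, the basic quantities behind these metrics can be computed as in the short sketch below (the box format and the AP integration over the precision–recall curve are stated in the comments, not shown in full).

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / max(area_a + area_b - inter, 1e-9)

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-score as in Equations (22), (23), and (25)."""
    precision = tp / max(tp + fp, 1e-9)
    recall = tp / max(tp + fn, 1e-9)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)
    return precision, recall, f1

# A detection counts as TP when its IOU with the ground truth exceeds 0.5;
# AP (Equation (24)) is then the area under the precision-recall curve.
```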
5.4. Ablation Study on Multi-Modal Data
To verify the influence of multi-modal data on the model prediction, we arranged seven groups using different input strategies in the BISM dataset, as shown in Figure 6. When RGB-NIR-NDVI was input, we retained only the backbone network and removed the others. If the input contained LiDAR products, both the backbone and branch networks were retained: multispectral images were fed into the backbone network, while the DEM/DSM/nDSM was fed into the branch network.
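The routing of the two modalities can be sketched with the Keras functional API as below; `backbone_encoder`, `branch_encoder`, and `fuse_multimodal` are placeholder names standing in for the ResNet50-based backbone, the LiDAR branch, and the LSSP fusion, and the channel counts follow the input strategies listed above.

```python
import tensorflow as tf

# Two input heads: spectral channels for the backbone, LiDAR rasters for the branch encoder.
spectral_in = tf.keras.Input(shape=(512, 512, 5), name="rgb_nir_ndvi")   # RGB + NIR + NDVI
lidar_in = tf.keras.Input(shape=(512, 512, 2), name="ndsm_dem")          # nDSM + DEM

backbone_feats = backbone_encoder(spectral_in)   # assumed ResNet50-based backbone
branch_feats = branch_encoder(lidar_in)          # assumed lightweight LiDAR branch

# When only spectral data are used (e.g., RGB-NIR-NDVI), the branch is simply omitted
# and backbone_feats feeds the detection and segmentation heads directly.
fused = fuse_multimodal(backbone_feats, branch_feats)   # stands in for the LSSP fusion
```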
We first completed a comparative analysis with RGB, RGB-NDVI, and RGB-NIR inputs. Compared with the RGB-only input, it can be observed in Figure 6 that the accuracy improved by 0.3% AP and 1.6% F1-score when using RGB-NIR. Similarly, RGB-NDVI as input achieved a slight increase compared to RGB-NIR. However, when the DEM was introduced into the model, the prediction accuracy decreased by 2% AP and 1.5% F1-score, probably because the DEM cannot represent the height variation of the buildings and introduced noise into model training. In contrast, when RGB-NDVI-DSM was fed into the network, the prediction accuracy increased significantly, by 1.5% AP and 6.7% F1-score compared to RGB, which indicates that LiDAR features can increase the accuracy of segmentation. The accuracy did not change appreciably with RGB-NDVI-nDSM. However, the last group, with both DEM and nDSM, achieved better results than the other groups, with 89.8% AP and 86.3% F1-score. Therefore, we used RGB-NDVI-nDSM-DEM as the multi-modal input in the subsequent experiments.
5.5. Contributions of Modules in LSSP and CLGF
Based on the above analysis, the LSSP fuses local features from different modal information, while CLGF integrates global context. Hence, we compared and analyzed the complementary ability of these two modules using the BISM dataset. Firstly, in the experiments with the LSSP, we verified the model performance using the spatial perceptron (SPA) and the spectral perceptron (SPE) in the LSSP, respectively. Meanwhile, some hyperparameters were determined by quantitative analysis and comparison. Each sub-module was applied to the model separately, and others were removed. Input data included multispectral images and LiDAR products. The feature maps from the four stages were fused through FPN. The fused feature maps with a size of 128 × 128 were used for regression and upsampled to 512 × 512 for the segmentation. ResNet50 and the CenterNet detectors were combined as the backbone model (BM) for the comparative analysis.
The SPA module is governed by two parameters, l and k. l determines the influence on the optical features when the LiDAR feature is decomposed into multiple spatial representations. To test the module's sensitivity, l was set to 2/3, 1/4, 1/8, or 1/16 of the number of input channels, and k was set to 3, 5, or 7. Table 2 presents the accuracy of the results under the different parameter settings. It can be observed that the AP and F1-score improved when l was increased from 1/16 to 1/4, but the performance remained stable or even decreased when l was set to 2/3, probably because more parameters were introduced into the network structure, which increased the difficulty of training optimization. Similarly, as k increased, the AP did not show regular changes, but the F1-score increased with a larger k. Hence, balancing computational efficiency and performance, l = 1/4 and k = 5 were used in the experiments.
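To make the roles of l and k concrete, the toy sketch below shows one plausible way such parameters could enter a spatial perceptron: l scales the number of channels into which the LiDAR feature is decomposed, and k is the kernel size of the spatial convolution. This is an illustration only and does not reproduce the actual SPA structure described in the method section.

```python
from tensorflow.keras import layers

def spatial_perceptron_sketch(optical_feat, lidar_feat, l=0.25, k=5):
    """Illustration of how l and k act; the real SPA module may differ."""
    channels = optical_feat.shape[-1]
    reduced = max(int(channels * l), 1)
    # Decompose the LiDAR feature into l*C spatial representations with a k x k kernel.
    spatial = layers.Conv2D(reduced, kernel_size=k, padding="same",
                            activation="relu")(lidar_feat)
    # Recalibrate the optical features with the spatial cues.
    gate = layers.Conv2D(channels, kernel_size=1, activation="sigmoid")(spatial)
    return optical_feat * gate
```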
Figure 7 shows the impact of the different modules on the accuracy of the results in the ablation studies. Although AP presented only a slight increase of 1.3%~1.6% when using BM + SPA or BM + SPE compared to BM, the segmentation results improved significantly, with an F1-score about 5% higher than BM. In general, the LSSP module improved the prediction accuracy from 89.8% to 91.6% AP and from 86.3% to 90.3% F1-score. Compared with BM, the detection accuracy using CLGF increased only slightly, by 0.4% AP, but the segmentation accuracy improved by 3.8% F1-score. In addition, the detection and segmentation tasks had a large accuracy deviation between AP and F1-score when using BM or BM + SPE, while the other combinations had relatively small variations, indicating that the SPA contributed more to the segmentation task than the SPE. In general, the detection accuracy was higher than the segmentation accuracy since the ROI region imposes limitations on the segmentation results. As a result, the above analysis demonstrates that the proposed modules can improve the prediction accuracy, and their combination gains more.
Figure 8 displays the results of feature variation and instance segmentation. Test image A and test image B contain buildings with different scales and textures very similar to roads. To analyze the influence of the LSSP on the shallow encoders, we used the feature maps from stage 2 to generate a heatmap by calculating the mean value along the channel dimension. Visually, it can be observed in the heatmap that the LSSP enhanced the boundary information and regional features. Although the predicted extent of some buildings is inaccurate, multi-scale buildings are correctly segmented. In contrast, some false negatives occur when using BM in image A. Additionally, as shown in image B, BM has a weak detection ability for large buildings. Its prediction results were easily affected by the road features, as shown in the details, which implies that the LSSP can effectively fuse spatial–spectral features to distinguish heterogeneous features.
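The heatmap visualization used here (channel-mean of a stage feature map) can be reproduced with the short sketch below; the variable names for the feature map and image tile are placeholders.

```python
import numpy as np

def feature_heatmap(feature_map):
    """Average an (H, W, C) feature map over channels and normalize to [0, 1] for display."""
    heat = feature_map.mean(axis=-1)
    return (heat - heat.min()) / max(heat.max() - heat.min(), 1e-9)

# Overlay example (placeholders):
#   plt.imshow(rgb_tile)
#   plt.imshow(feature_heatmap(stage2_feat), alpha=0.5, cmap="jet")
```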
For the CLGF, Figure 9 displays the heatmaps overlaid on the raw images, revealing the feature variation. The background features exhibited a weak response after using CLGF, especially for roads and ground. Large-scale building areas have inconsistent feature responses in local regions, as shown in the heatmaps marked by black ellipses. In contrast, CLGF alleviated this heterogeneity and recalibrated the feature distribution. The result indicates that this module can assist the network in filtering out redundant information and enhancing semantic correlations.
5.6. Center Point Detector Accuracy Analysis
In this experiment, the multispectral and DSM images were fed into the model, and the other modules were removed. A threshold of 0.5 was set for the final center points. As shown in Table 3, compared with BM, the AP increased by about 2.2% in the BISM dataset for BM + ACPD, which implies that the ACPD can improve detector accuracy. Meanwhile, the segmentation accuracy improved by 1.9% F1-score.
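Center points are typically extracted from the predicted heatmap by keeping local maxima above the threshold, as in CenterNet; a minimal sketch of this decoding step, with the 3 × 3 max-pooling NMS as an assumed detail, is given below.

```python
import tensorflow as tf

def extract_centers(heatmap, threshold=0.5):
    """Keep local maxima of the center heatmap above the threshold (CenterNet-style NMS).

    heatmap: (1, H, W, 1) predicted center-point probabilities.
    Returns the (y, x) indices of the retained center points.
    """
    pooled = tf.nn.max_pool2d(heatmap, ksize=3, strides=1, padding="SAME")
    peaks = tf.where(tf.logical_and(tf.equal(heatmap, pooled), heatmap > threshold))
    return peaks[:, 1:3]   # drop the batch and channel indices
```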
The prediction results from three images are visualized in Figure 10 to illustrate the performance. Dense, small-scale buildings with shadows and cement surfaces exist in test image A. BM + ACPD achieved better building segmentation results, whereas some buildings were not detected by BM, as shown by the green ellipse.
Test image B has large, tall buildings with vegetation distributed among them. These buildings exhibit various shapes in which some central positions do not lie within the buildings themselves. As shown in the green ellipse of Figure 10, the CenterNet detector predicted multiple centers, which deteriorated the position precision. In contrast, the ACPD exhibited better performance on large buildings: the problem of multiple centers was alleviated, and the semantic information was enhanced.
Test image C contains narrow, long buildings. Compared with CenterNet detections, the ACPD identified these central positions with less deviation. Background features using BM were misclassified where central points had a weak response, as shown in the red ellipse. As a result, the ACPD can improve center regression ability, especially for buildings with complex shapes.
Furthermore, we selected other typical samples in the test data and conducted comparative experiments with the ACPD and CenterNet to verify the detection ability of the proposed module for complex-shaped buildings. Figure 11 shows examples of buildings with various irregular boundaries. Although the predicted centers of some buildings had a slight deviation, such as in the first and last rows, and false negatives appeared for some small buildings, the ACPD corrected the central positions of large-scale buildings better than the CenterNet detector. In addition, it can be observed that the same building in the sixth column presented multiple prediction centers via CenterNet, which deteriorated the semantic segmentation results, as shown in the third column. The buildings in the second and last rows were not wholly detected since many FP samples, such as some roads, were misclassified. In contrast, the ACPD optimized the center point feature and reduced the interference of background information. Comparing the results of the third and fourth columns shows that the proposed method can significantly suppress FP samples and improve the semantic segmentation performance.
5.7. Comparisons with State-of-the-Art Methods
This section compares the proposed model with other state-of-the-art instance segmentation methods. The BISM and WHU datasets [14] were used to verify the generality of the different methods. Since the WHU dataset only contains RGB images, we used the ACPD and CLGF modules with a backbone network. For comparison and analysis, all methods adopted ResNet50 as the primary network structure, combined with different types of detectors. As mentioned, we selected advanced algorithms, including Mask R-CNN [13], PANet [19], SOLOv2 [22], and CenterMask [16], to verify the advantages of the proposed network. All experimental configurations were kept the same.
Table 4 and Table 5 show the accuracy of the different methods on the two datasets. The test datasets were classified into three scales, small (s), middle (m), and large (l), and evaluated by AP and F1-score at each scale to assess multi-scale performance. These methods generally achieved better results on the BISM dataset than on the WHU dataset, indicating that multi-modal data contribute to building detection and segmentation performance.
In the BISM dataset, the proposed model performed better in multi-scale AP and F1-score than the others, especially for large-scale buildings. Although its inference time was not the best, its speed was higher than that of the two-stage detection methods. PANet was superior to the other comparison methods but had a low APl, a long inference time, and a low F1-score for large-scale buildings. The CenterNet detection framework enabled CenterMask to achieve the best inference time and a high-precision APl; however, this network performed poorly for small-scale buildings. SOLOv2 had the lowest detection accuracy, with an APl of 88.3%, but outperformed the other methods on small buildings. A large number of buildings introduces more parameters and increases the difficulty of model training. In addition, the prediction of global masks, which depends on manual settings, is not suitable for large numbers of dense buildings. Mask R-CNN had the lowest segmentation accuracy, with an F1-score of 86.4%, due to poor performance on small-scale buildings, which indicates that low spatial resolution is not conducive to high-precision segmentation of small buildings.
In the WHU dataset, although our method did not achieve the best results in APm and F1-scorem, its overall accuracy outperformed the other methods. CenterMask obtained better detection results, with 82.1% AP, while SOLOv2 had a higher F1-score of 78.2% compared to the others. Although PANet achieved the best performance for medium-size buildings, with an APm of 83.8% and an F1-scorem of 82.3%, it presented weak prediction ability for small-scale and large-scale buildings. Regarding detection efficiency and accuracy, the one-stage methods had advantages over the two-stage methods for building instance segmentation.
In addition, to verify the model's generalization ability, we removed the LSSP and the encoder branch for the LiDAR products and used only RGB images as input in the BISM dataset. As shown in Table A1 of Appendix A, PANet had an F1-score 1.2% higher than ours in the segmentation task. Across the different scales, the proposed model performed poorly on small-scale buildings, with an F1-scores of 79.8%. Nevertheless, our method achieved the best performance in the detection task compared with the other methods, with 88.5% AP. In addition, compared to the accuracy shown in Table 4, AP decreased by 4.4% and the F1-score decreased by about 10%, which indicates that the LSSP, by fusing multi-modal features, can significantly improve the segmentation accuracy.
Four typical areas in the test dataset were visualized to analyze the performance of the different methods. As shown in the first column of Figure 12a,b, test images A and B belong to the BISM dataset, while test images C and D come from the WHU dataset. Test image A contains some buildings with complex shapes. The proposed method achieved better performance than the other models on complex-shaped buildings, as illustrated in Figure 12b. Obviously, the other networks had a weak ability to regress the position of narrow, long buildings. In addition, Mask R-CNN and PANet were sensitive to shadow changes, which were misclassified as small holes in the roofs. Although SOLOv2 produced accurate results for some buildings with a relatively regular shape, it performed poorly on "T"-shaped buildings.
Test image B covers relatively small-scale buildings with abundant vegetation, as shown in the second row of Figure 12a. The methods obtained better segmentation results than for test image A, but CenterMask produced more false negatives in the detection task. In contrast, test image C contains large-scale industrial plants that cover over 50% of the area of a patch. Moreover, some buildings are closely arranged, as displayed in the third row of Figure 12b. Our method and Mask R-CNN outperformed the others, especially for large buildings. PANet misdetected many buildings and could not accurately regress the size of large building regions. Although SOLOv2 and CenterMask identified most buildings, the segmentation results were incomplete for some large-scale buildings. Test image D contains sparsely distributed small buildings surrounded by cement ground and vegetation in the suburbs. Our method exhibited better results than the others for small buildings, whereas Mask R-CNN and SOLOv2 were sensitive to the road features.
6. Discussion
To further verify the performance of the proposed method, we completed a comparative experiment with traditional methods using the commercial software ENVI 5.3. In the object-based segmentation process of ENVI, we used an edge-detection-based method to create objects and the support vector machine (SVM) algorithm to classify them. The results of the semantic segmentation are presented in Figure A1 of Appendix A. The proposed method requires building large-scale training sample datasets. In contrast, the object-based method can segment the building areas simply and efficiently; however, its segmentation accuracy and performance are inferior to the proposed method. In addition, object-based segmentation methods cannot obtain end-to-end, object-level extraction results and require post-processing such as clustering or vectorization. Hence, we used the pixel-wise overall accuracy (OA) and F1-score as evaluation indicators.
As shown in Table A2 of Appendix A, the proposed method achieved an OA more than 9% higher and an F1-score more than 13% higher than ENVI. Figure A1 of Appendix A shows that the object-based method had good segmentation ability for small-scale buildings with regular texture. However, many misclassifications existed for roofs and roads with complex textures or similar colors. Therefore, it is difficult to distinguish buildings from other objects with similar spectral and spatial information using only shallow semantic features.
The above experiments confirmed that the proposed method can improve the performance of building instance segmentation with high efficiency. However, some issues remain to be explored and optimized. Some buildings cannot be distinguished in overlapping regions of the bounding boxes, which interferes with the automatic extraction of building information; it is therefore necessary to develop rotated-object detection to obtain the correct orientation of the buildings. LiDAR products are rasterized into 2D images containing only elevation variation and 2D spatial information, so a more effective combination of 3D spatial and spectral information could be explored. Furthermore, instead of 2D instance detection, 3D object-level building detection is a further research direction. In addition, creating large-scale training datasets is time-consuming and costly. In further research, knowledge distillation or transfer learning combined with a semi-supervised training mode is worth exploring to reduce the dependence on supervised samples.
The proposed method cannot identify multi-story heights on the same building roof, whereas LiDAR data can provide the corresponding elevation information. Thus, in further research, we will continue to improve the detectors to enhance the sensitivity of the CNN to 3D position information. Furthermore, the proposed modules consume considerable computational memory due to the high-dimensional feature matrix operations, so it is necessary to further optimize the module structure and reduce memory consumption. In addition, the simultaneous acquisition of LiDAR and image data is not easy and has high operating costs, and the proposed method contains many parameters to train, which increases the algorithm's complexity and brings difficulties to practical application. Further research will improve the encoders to provide users with flexible input modes using a lightweight network structure. Meanwhile, we will improve the datasets and provide richer building labels (including spatial blocks at multiple heights and in different functional zones).