1. Introduction
Gross floor area (GFA), which can be calculated as the product of a building's number of stories (NoS) and its base area (BA), is an important indicator of the usable area of buildings. Acquiring GFA over wide areas is highly relevant to many applications, such as urban planning, population estimation, and damage assessment in the aftermath of earthquakes. For example, the floor area ratio, one of the most important indicators of building density, is defined as the ratio of the total GFA of all buildings in a region of interest to the area of that region. Computing the floor area ratio therefore requires the GFA of every building in the region. In other words, instance-wise GFA acquisition is the core problem in floor area ratio acquisition.
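As a minimal illustration of these definitions, the sketch below computes instance-wise GFA as NoS × BA and aggregates it into a floor area ratio. The `Building` class, the function names, and all numbers are hypothetical and serve only as an example.

```python
from dataclasses import dataclass


@dataclass
class Building:
    """Hypothetical record for one building instance."""
    num_stories: int      # NoS
    base_area_m2: float   # BA, in square metres


def gross_floor_area(b: Building) -> float:
    """GFA = NoS x BA, as defined in the text."""
    return b.num_stories * b.base_area_m2


def floor_area_ratio(buildings, region_area_m2: float) -> float:
    """Sum of all buildings' GFA divided by the area of the region."""
    if region_area_m2 <= 0:
        raise ValueError("region area must be positive")
    return sum(gross_floor_area(b) for b in buildings) / region_area_m2


# Made-up example: three buildings in a 10,000 m^2 region.
buildings = [Building(6, 500.0), Building(18, 800.0), Building(2, 300.0)]
far = floor_area_ratio(buildings, 10_000.0)
```

The ratio can only be computed once every building in the region has its own GFA, which is why the instance-wise quantity is the one that matters.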
Because remote sensing can observe the ground over wide areas in little time, it has become an important means of GFA acquisition. In general, GFA acquisition by remote sensing involves two processes: the acquisition of BA and the acquisition of NoS. The former can be realized by segmenting building footprints in remote sensing images, which has been widely researched and applied [1,2]. The latter is more complex, to the point that one might believe that NoS information cannot be extracted directly from remote sensing images [3]. Under crude assumptions, building height and NoS can be roughly converted into each other, so extracting building height is often the prerequisite for extracting NoS by remote sensing [4,5]. Research on GFA estimation therefore tends to focus on building height acquisition, which is more difficult than BA acquisition, and the main difference between these studies lies in how the building height is extracted. Exploiting the advantages of active remote sensing, light detection and ranging (LiDAR) and synthetic aperture radar (SAR) data were used to extract building height in [6,7,8]. The normalized digital surface model derived from optical stereo images was used to extract building height information in [9,10,11,12]. These methods achieve higher accuracy but are difficult to apply over wide areas because of their long processing time and high data acquisition cost. To overcome these shortcomings, many studies try to extract GFA from monocular optical images, which are convenient to acquire and process. Among them, Refs. [13,14] extracted building shadows from high-spatial-resolution monocular optical images and measured their lengths. Building height can then be estimated from the shadow length using geometric models that consider the relative positions of the sun, the sensor, and the buildings. These methods rely on the key assumption that complete building shadows can be extracted from the images, which does not always hold because shadows can be obscured by other buildings. Moreover, all of the GFA acquisition methods mentioned above rely on hand-crafted rules for converting building height to NoS, which are not applicable to a wide range of applications. Apart from these methods, which extract NoS and BA separately before deriving GFA, a few methods estimate GFA by end-to-end regression from monocular optical images. Among them, Ref. [15] extracted the area of building shadows and then regressed GFA using a learned linear regression model; this method shares the shortcomings of the shadow-based methods mentioned above. Ref. [16] used a deep convolutional neural network (CNN) to regress pixel-wise GFA, obtained by averaging the GFA of all buildings in a given grid cell; its spatial resolution is too low to provide instance-wise GFA information.
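To make the shadow-based route and its height-to-NoS rule concrete, the sketch below uses a deliberately simplified geometry: flat ground, a near-nadir sensor, and a fully visible shadow, so that height equals shadow length times the tangent of the sun elevation. The geometric models in the cited work also account for the sensor position; the function names and the 3 m storey height are assumptions made only for illustration.

```python
import math


def height_from_shadow(shadow_length_m: float, sun_elevation_deg: float) -> float:
    """Simplified model (assumption): flat terrain, nadir view, complete shadow."""
    if not 0.0 < sun_elevation_deg < 90.0:
        raise ValueError("sun elevation must be in (0, 90) degrees")
    return shadow_length_m * math.tan(math.radians(sun_elevation_deg))


def stories_from_height(height_m: float, storey_height_m: float = 3.0) -> int:
    """Hand-crafted conversion rule; the 3 m storey height is an assumed value."""
    return max(1, round(height_m / storey_height_m))


# Full shadow of 30 m at 45 degrees sun elevation.
height = height_from_shadow(shadow_length_m=30.0, sun_elevation_deg=45.0)
nos = stories_from_height(height)

# If a neighbouring building hides half of the shadow, the estimate halves too.
nos_occluded = stories_from_height(height_from_shadow(15.0, 45.0))
```

Both weaknesses noted above are visible here: an incomplete shadow directly biases the height, and the NoS depends on a fixed storey height that need not hold across building types.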
To address the shortcomings of the above NoS estimation methods, Ref. [3] proposed the NoS R-CNN, a deep neural network that jointly detects building objects and estimates the NoS of the detected buildings from monocular optical images without estimating building height in advance. The NoS R-CNN is modified from the Mask R-CNN [17], an instance segmentation network. Because the NoS R-CNN is designed for both building detection and NoS estimation, the building footprint instance segmentation outputs of the network are used only in the training stage, to compute an auxiliary loss, and are not used in the inference stage. However, if the footprint segmentation outputs are reused in the inference stage to obtain BA and are combined with the NoS outputs for the detected buildings, the NoS R-CNN can be used directly for instance-wise GFA estimation. Rather than extracting building height and then designing rules to convert it to NoS, the NoS R-CNN treats NoS as an attribute of a building, which facilitates end-to-end prediction from images. This design inspired us to ask whether NoS and BA can each be estimated in an end-to-end manner in order to estimate GFA. Is it even possible to estimate GFA end-to-end without separately extracting NoS and BA in advance? To answer these questions, we propose three methods, based on the NoS R-CNN, for instance-wise building GFA estimation from monocular optical images. Furthermore, we carried out experiments on our dataset to compare and analyze the results of the proposed methods.
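The reuse of the segmentation output described above can be sketched as follows: the BA of an instance is taken as the number of footprint pixels times the ground area of one pixel, and is then multiplied by the predicted NoS. This is an illustrative sketch rather than the implementation used in this paper; the function names, the 0.5 threshold, and the ground sampling distance `gsd_m` are assumptions.

```python
import numpy as np


def base_area_from_mask(mask_prob: np.ndarray, gsd_m: float,
                        threshold: float = 0.5) -> float:
    """BA in m^2 from a per-pixel footprint probability map.

    gsd_m is the assumed ground sampling distance (metres per pixel).
    """
    footprint = mask_prob >= threshold          # binarise the predicted mask
    return float(footprint.sum()) * gsd_m ** 2  # pixel count x area per pixel


def gfa_from_outputs(mask_prob: np.ndarray, nos_pred: float, gsd_m: float) -> float:
    """Combine the two per-instance outputs: GFA = NoS x BA."""
    return nos_pred * base_area_from_mask(mask_prob, gsd_m)


# Synthetic instance: a 20 x 30 pixel footprint at 0.5 m per pixel, 5 stories.
mask = np.zeros((64, 64), dtype=np.float32)
mask[10:30, 20:50] = 0.9
gfa = gfa_from_outputs(mask, nos_pred=5.0, gsd_m=0.5)
```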
The main contributions of this paper are as follows:
To the best of our knowledge, the proposed approach is the first to directly estimate instance-wise gross floor area from monocular optical satellite images. Compared with existing related methods, our approach has three key innovations:
- (a) Compared with methods based on LiDAR and SAR data [6,7,8] or optical stereo images [9,10,11,12], our approach uses only monocular optical satellite images in the inference stage, which is more convenient in terms of data acquisition and processing.
- (b) Compared with the building shadow-based methods [13,14,15], which can only be applied in a limited set of simple scenarios, our approach is not limited to specific application scenarios and can be applied over wide areas.
- (c) Compared with the CNN-based method [16], which can only generate pixel-wise GFA at low spatial resolution, our method jointly detects building objects and estimates the instance-wise GFA of the detected buildings, which provides finer-grained spatial information and can serve a wider range of downstream tasks.
We design three GFA estimation methods, derived from different training and inference strategies within a unified network architecture (i.e., the NoS R-CNN), with varying degrees of end-to-end learning. The performances of the three methods are reported and compared based on experimental results on our dataset.
The rest of this paper is organized as follows. The second section describes the network architecture, the loss functions of the three proposed methods, the dataset, and the experimental configuration. Then, the results on our dataset are reported and analyzed. The fourth section presents the discussion. Finally, the conclusion is drawn in the last section.
4. Discussion
4.1. Comparison of Three Proposed GFA Estimation Methods
By the definition of GFA, the performance of NoS and BA estimation directly affects the performance of GFA estimation. To explore the performance differences among the three proposed methods, we analyzed their performance on both the NoS and BA estimation tasks. For MBB and BABB, GFA prediction in the inference stage depends on the predictions of both NoS and BA, which are obtained explicitly and are used to evaluate BA and NoS estimation for these two methods. For GBB, NoS and BA predictions are not needed for GFA estimation in the inference stage but are used in the training stage to compute auxiliary losses that improve GFA estimation. Here, the outputs of the BA and NoS branches of GBB in the inference stage are used to evaluate its BA and NoS estimation. We use the MAE in prediction modes A and B as the metric in this section, and the performances of the three methods are shown in Table 7, where “NoS” and “BA” indicate the NoS and BA estimation tasks, respectively.
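For reference, the MAE used here is the mean of the absolute differences between predictions and ground truth over the evaluated buildings. The sketch below is a generic implementation; which buildings enter the average depends on the prediction mode defined earlier in the paper, and the sample values are made up.

```python
def mean_absolute_error(predictions, targets) -> float:
    """MAE = (1 / n) * sum(|prediction_i - target_i|)."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)


# Made-up NoS predictions and ground truth for three buildings.
nos_mae = mean_absolute_error([5, 10, 3], [6, 8, 3])
```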
We discuss MBB and BABB first. In mode A, the two methods perform similarly on BA estimation, but MBB is better on NoS estimation. The stronger NoS estimation of MBB may therefore explain its better GFA estimation in mode A. In mode B, MBB is worse than BABB on NoS estimation but clearly better on BA estimation. The large advantage of MBB on BA estimation may explain its better GFA estimation in mode B.
We then discuss BABB and GBB. In mode A, GBB performs worse than BABB on both NoS and BA estimation, but the gap is small, so the two methods perform similarly on GFA estimation. In mode B, the two methods perform almost identically on BA estimation, but BABB is better than GBB on NoS estimation. The better NoS estimation of BABB may therefore explain its better GFA estimation in mode B.
The NoS branches of the three methods were trained identically, i.e., with the same loss weight, network architecture, and other hyper-parameters. However, their NoS estimation performances differ. A possible reason is that the total multitask loss differs among the three methods, and the losses of different tasks can influence each other. The same reasoning may also explain the performance differences in BA estimation between BABB and GBB.
4.2. Comparison between the Two BA Estimation Methods
The main difference between BABB and MBB is the way BA is estimated: BABB estimates BA in an end-to-end fashion, whereas MBB does not. To further compare the BA estimation of BABB and MBB, we plot the joint distribution of the predicted and GT BA in Figure 8, which consists of four subfigures. The left and right columns show the results of MBB and BABB, respectively, and the top and bottom rows show the results in prediction modes A and B, respectively. We plot the line y = x in each subfigure; the closer a point lies to this line, the smaller its prediction error. The predictions and GT are positively correlated in all four subfigures, and the distributions are roughly symmetric about the line y = x. For buildings with small BA, the error is also small. As BA increases, the number of predictions with large errors increases. Figure 8a,b shows that the distributions of MBB and BABB in mode A are broadly the same, which is consistent with the results in Table 7. Figure 8c shows that MBB underestimates many buildings with small BA. This may be because buildings with small BA are relatively difficult to detect completely for segmentation, which leads to underestimation of BA. The same effect is present but less obvious in mode A, possibly because the TP samples contain fewer buildings with small BA, as such buildings are difficult to detect. This underestimation mainly occurs for buildings with small BA, whose errors are generally small, so it does not lead to a large MAE. Figure 8d shows that this underestimation is not obvious for BABB, but the predictions of BABB are more scattered, especially for buildings with large BA, which leads to relatively larger errors. Consequently, BABB performs worse than MBB in terms of MAE.
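The two error patterns described above, systematic underestimation versus scatter, can be told apart numerically by reporting the mean signed error (bias) alongside the MAE for each BA range. The sketch below is one hypothetical way to do so; the function name, the bin edge, and the sample values are invented for illustration.

```python
from bisect import bisect_right


def bias_and_mae_by_bin(pred, gt, edges):
    """Group samples by GT value and return {bin_index: (bias, mae, count)}.

    Bin i collects samples with edges[i-1] <= gt < edges[i]; bin 0 is below
    the first edge and the last bin is at or above the last edge.
    """
    bins = {}
    for p, g in zip(pred, gt):
        bins.setdefault(bisect_right(edges, g), []).append(p - g)
    return {
        idx: (sum(errs) / len(errs),                  # signed: < 0 means underestimation
              sum(abs(e) for e in errs) / len(errs),  # magnitude only
              len(errs))
        for idx, errs in sorted(bins.items())
    }


# Invented BA values in m^2: two small and two large buildings.
gt = [50.0, 80.0, 900.0, 1200.0]
pred = [30.0, 60.0, 1100.0, 1000.0]
stats = bias_and_mae_by_bin(pred, gt, edges=[200.0])
```

In this toy example the small-BA bin has a negative bias with a small MAE, while the large-BA bin has zero bias but a large MAE, mirroring the contrast between consistent underestimation and scattered predictions.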
4.3. End-to-End Fashion Is Not a Panacea
Compared with traditional machine learning methods, which depend on hand-designed features or individual component modules, one of the most prominent advantages of deep learning is its inherent feature engineering capability based on end-to-end learning. Many studies [28,29,30] have shown the performance advantages of end-to-end designs over non-end-to-end designs, as well as successes on novel tasks that are very difficult to accomplish without end-to-end models, such as height estimation from monocular images [31,32,33]. To take full advantage of end-to-end design, the three GFA estimation methods proposed in this paper use end-to-end design to varying degrees. For the BA estimation task, BABB directly predicts BA from the region feature, whereas MBB segments the footprint and then converts it to BA. The degree of end-to-end learning of the two methods is therefore BABB > MBB. For the GFA estimation task, GBB directly predicts GFA from the region feature, whereas both BABB and MBB need to extract BA and NoS separately before obtaining GFA. Considering the above analysis, the degrees of end-to-end learning of the three proposed methods can be ranked as GBB > BABB > MBB. BABB and MBB, which extract BA and NoS separately, depend on the accuracy of both subtasks. Intuitively, these two less end-to-end methods carry a larger risk of poor performance than GBB, because a large error in either subtask leads to a poor final result. However, according to the experimental results in this paper, for both the GFA and BA estimation tasks, there is an inverse relationship between model performance and the degree of end-to-end learning. This result indicates that a more end-to-end design may not be the best choice for the GFA and BA estimation tasks and is not better than a less end-to-end design in every circumstance. We analyze the possible reasons for the poorer performance of the more end-to-end methods in our experiments as follows:
Insufficient training data. Compared with less end-to-end models, more end-to-end models generally need more training data to reach satisfactory performance because of their data-driven mechanism. Although our experiments use much more training data than previous research, the amount may still be insufficient for the more end-to-end models to realize their full advantages.
Inappropriate model design. Because, to the best of our knowledge, this paper is the first attempt to directly estimate instance-wise GFA end-to-end using a deep convolutional neural network, there are no existing model designs for reference. We essentially follow the detection pipeline of the Mask R-CNN for GFA/BA prediction, and this design may not be appropriate for the GFA/BA estimation tasks. Alternatively, a more end-to-end model design may be inherently unsuitable for these tasks.
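The intuition stated earlier in this subsection, that a large error in either subtask degrades the final GFA of the two-stage methods, can be made explicit with a short derivation. Here $N$, $A$, and $G = NA$ denote the true NoS, BA, and GFA, hats denote predictions, and the relative errors $e_N$ and $e_A$ are symbols introduced only for this illustration.

```latex
\hat{G} = \hat{N}\,\hat{A} = N(1+e_N)\,A(1+e_A)
\quad\Longrightarrow\quad
\frac{\hat{G}-G}{G} = e_N + e_A + e_N e_A .
```

To first order, the relative GFA error of a method that multiplies separate NoS and BA predictions is the sum of the two relative errors, so it grows when either is large and shrinks only when the two happen to have opposite signs.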
4.4. Is the CNN Irreplaceable for This Task?
With the help of deep learning, CNNs have achieved great success in many image processing tasks, such as semantic segmentation and object detection. In this paper, we used three CNN-based methods for GFA estimation. A worthwhile question is whether comparable performance on the GFA estimation task can be achieved with a more lightweight model without a CNN. To answer this question, two popular traditional regression models that use hand-crafted features, i.e., the multi-layer perceptron (MLP) and the random forest (RF), were introduced in the experiments in this section.
The implementations of the above two methods are based on [34]. For the MLP, 5 hidden layers with 100 neurons each were used. For the RF, 100 trees with depths of less than 10 were used. The other hyper-parameters of the two methods were kept at the default settings in [34]. The dataset used in this section is the same as that in Section 2.4. The hand-crafted feature for every building instance is obtained as follows. The segmentation GT of the building footprint is used as the inner mask, and the inner mask is dilated by 10 pixels to obtain the outer mask. The inner mask is used to extract the mean and standard deviation of every channel of the input image, which gives a feature vector with 6 elements as the inner feature of the building instance. The outer feature is obtained in the same way using the outer mask. The inner and outer masks are illustrated in Figure 9. The inner and outer features are concatenated to give a feature vector with 12 elements as the feature of each building instance for the MLP and the RF. The performances of the CNN-based models proposed in this paper and of the two traditional methods using the hand-crafted features are shown in Table 8. In Table 8, CNN, MLP, and RF estimated GFA by separately extracting BA and NoS, whereas CNN-ETE, MLP-ETE, and RF-ETE estimated GFA from the instance feature in an end-to-end (ETE) manner.
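The feature construction and baseline configuration described above can be sketched as follows. This is an illustration under stated assumptions, not the code used in the experiments: scikit-learn and SciPy are assumed as the libraries (the text only cites [34]), the outer mask is taken to be the full dilated mask rather than a ring, dilation uses SciPy's default structuring element, and "depth less than 10" is read as `max_depth=9`. All function names are hypothetical.

```python
import numpy as np
from scipy import ndimage
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor


def masked_stats(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Per-channel mean and standard deviation over the masked pixels.

    image: (H, W, C) array; mask: (H, W) boolean array.
    Returns 2 * C values (means first, then standard deviations).
    """
    pixels = image[mask]  # shape (n_pixels, C)
    return np.concatenate([pixels.mean(axis=0), pixels.std(axis=0)])


def instance_feature(image: np.ndarray, inner_mask: np.ndarray,
                     dilation_px: int = 10) -> np.ndarray:
    """Concatenate inner-mask and outer-mask statistics (12 values for RGB)."""
    inner_mask = inner_mask.astype(bool)
    # Assumption: the outer mask is the inner mask grown by `dilation_px` pixels.
    outer_mask = ndimage.binary_dilation(inner_mask, iterations=dilation_px)
    return np.concatenate([masked_stats(image, inner_mask),
                           masked_stats(image, outer_mask)])


# Synthetic 3-channel image and a rectangular footprint standing in for the GT mask.
rng = np.random.default_rng(0)
image = rng.random((128, 128, 3))
inner = np.zeros((128, 128), dtype=bool)
inner[40:80, 50:90] = True
feature = instance_feature(image, inner)

# Baseline regressors with the stated sizes; everything else left at defaults.
mlp = MLPRegressor(hidden_layer_sizes=(100,) * 5)          # 5 hidden layers x 100 neurons
rf = RandomForestRegressor(n_estimators=100, max_depth=9)  # 100 trees, depth < 10
```

One such vector would be computed per building instance and stacked into a feature matrix, on which either regressor is fitted with NoS, BA, or GFA as the target.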
The CNN-based methods perform much better than the MLP and RF on the BA, NoS, and GFA estimation tasks. These results indicate that traditional methods using hand-crafted features, i.e., the MLP and RF, may not be suitable for these difficult tasks, which are extremely hard even for ordinary people because of the limited information on NoS in monocular optical satellite images. Although the CNN-based methods seem more cumbersome, they showed an irreplaceable capability owing to their data-driven mechanism.
The MLP in an end-to-end manner (MLP-ETE) performs better than the MLP, which is the opposite of the relationship between CNN-ETE and CNN. These results further demonstrate the necessity and value of the experiments described in Section 3.2 for comparing the performance of the three proposed methods.
5. Conclusions
In this paper, three methods for instance-wise GFA estimation from monocular optical images, i.e., MBB, BABB, and GBB, are proposed for the first time. The three methods are based on the NoS R-CNN and use end-to-end design to varying degrees. Compared with existing GFA estimation methods, the proposed methods are low-cost, universal, and flexible. Experiments on our dataset from nine large cities in China were carried out to compare the performances of the three proposed methods. The results show that the building detection performance of the three methods is almost equal to that of the vanilla Mask R-CNN and that the GFA estimation performance ranking is MBB > BABB > GBB, which is the reverse of the order of their degrees of end-to-end learning. The results are analyzed in detail to explore the reasons for the performance gaps between the three methods, and we believe that the BA/GFA estimation tasks are more difficult for the more end-to-end methods. The quantitative and qualitative evaluations indicate that the proposed methods are promising for accurate GFA estimation in potential applications using large-scale remote sensing images. Large-scale instance-wise GFA acquisition is costly in time, labor, and money. With the development of remote sensing techniques, high-resolution monocular optical satellite images have become increasingly easy to obtain. Although stereo-image-based or shadow-based methods can be used for GFA acquisition, they cannot be applied over wide areas because of their high data acquisition cost or inherent limitations. With the methods proposed in this paper, models can be trained on existing data and then applied to monocular optical satellite images of large areas lacking GFA information to obtain instance-wise GFA in a rapid, cost-effective manner.
We hope that this paper can provide new perspectives on related approaches and downstream tasks.