Precise Extraction of Buildings from High-Resolution Remote-Sensing Images Based on Semantic Edges and Segmentation
Round 1
Reviewer 1 Report
The work introduced a novel Dense D-LinkNet network for building edge detection, which is very interesting. I think it is well described and makes a significant contribution to the field. Also, the paper is very well written and presented. Here are some of my comments:
- Some of the equations are not formatted correctly. Please have a look into that.
- The post-processing part proposed in the work is a bit ad hoc. However, it improves the results. How do you select the parameters, and how sensitive is the performance to them?
- Some of the figures, like Figure 8 and onwards, are hard to compare visually. Some closely zoomed-in versions could be helpful to show the main differences among the methods.
Author Response
Response to Reviewer 1 Comments
Thank you very much for your generous help with our paper and the reviewers’ valuable comments concerning our manuscript entitled “Precise Extraction of Buildings from High-Resolution Remote Sensing Images Based on Semantic Edges and Segmentation” (remotesensing-1303136). We have carefully revised the paper according to these comments. The point-by-point response to the reviewers’ comments is listed below.
P.S. In the revised version, the parts in red are the changes made based on the reviewers’ comments. The line numbers mentioned in the following responses all refer to the revised manuscript.
Point 1: Some of the equations are not formatted correctly. Please have a look into that.

Response 1: Thank you very much for your comment. We carefully checked our formulas and corrected the erroneous ones (Formulas 2, 4, and 5). At the same time, we have clarified the ambiguous notation of some symbols. Thank you for your correction.
Point 2: The post-processing part proposed in the work is a bit ad hoc. However, it improves the results. How do you select the parameters, and how sensitive is the performance to them?
Response 2: Thank you very much for your comment. In our post-processing method, two parameters have the greatest impact on the results: the binarization threshold and the parameter of the watershed algorithm.
The first parameter is the binarization threshold. The per-pixel probabilities output by the CNN are rendered as a grayscale image with values of 0-255, so an appropriate threshold must be chosen for binarization; because this is the first step of post-processing, it directly affects the final result. A high threshold removes low-confidence pixels and reduces false detections, whereas a low threshold retains most pixels and reduces missed detections. Other papers generally use the median value of 127 as the threshold. Since our post-processing method can remove redundant pixels, we prefer to keep more pixels at this stage. We therefore choose 100 as the threshold and obtain higher accuracy than with a threshold of 127.
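For reference, this thresholding step can be sketched in a few lines with OpenCV (the file names are placeholders; this is an illustrative sketch rather than our exact implementation):

import cv2

# Assumed input: the 8-bit grayscale probability map (0-255) saved by the network.
prob_map = cv2.imread("building_prob.png", cv2.IMREAD_GRAYSCALE)

# Threshold at 100 instead of the usual 127 so that more moderately confident
# pixels survive; the later post-processing removes the redundant ones.
_, binary = cv2.threshold(prob_map, 100, 255, cv2.THRESH_BINARY)

cv2.imwrite("building_mask.png", binary)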
The second parameter concerns the watershed algorithm, in which the generation of seed points affects the final building positioning accuracy. We obtain seed points by applying morphological erosion to the semantic polygons, so the number of erosion iterations is the parameter of interest. Too many iterations make small polygons disappear, so some buildings are lost; too few iterations leave the seed points too large, which degrades the watershed result. After many experiments, we chose 5 iterations, which achieved the best results.
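For reference, a minimal sketch of this seed generation and watershed refinement could look as follows (using OpenCV; the kernel size, the sure-background handling, and the file names are illustrative assumptions, not our exact implementation):

import cv2
import numpy as np

# Assumed inputs: the binarized building mask and the original RGB tile.
mask = cv2.imread("building_mask.png", cv2.IMREAD_GRAYSCALE)
image = cv2.imread("tile.png")

kernel = np.ones((3, 3), np.uint8)

# Seeds: erode the semantic polygons for five iterations; more iterations would
# make small polygons vanish, fewer would leave seeds too large.
seeds = cv2.erode(mask, kernel, iterations=5)

# Sure background from a dilation of the mask; the band in between is the
# "unknown" region the watershed has to decide on.
sure_bg = cv2.dilate(mask, kernel, iterations=5)
unknown = cv2.subtract(sure_bg, seeds)

# One positive label per seed region, background = 1, unknown = 0.
_, markers = cv2.connectedComponents(seeds)
markers = markers + 1
markers[unknown == 255] = 0

markers = cv2.watershed(image, markers)

# Pixels labeled -1 are the watershed ridges, i.e. the refined building outlines.
outlines = (markers == -1).astype(np.uint8) * 255
cv2.imwrite("refined_outlines.png", outlines)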
At the same time, we have added the corresponding explanations to the paper.
From line 287 to line 290, we explained the binarization threshold parameter.
“The choice of the binarization threshold is important. Since our post-processing method can remove redundant pixels, more pixels are needed at this stage to ensure good results. Thus, we choose a threshold of 100 instead of the usual 127.”
From line 321 to line 323, we explained the number of iterations of the erosion operation.
“The seed points come from the erosion of the semantic polygons. In order to prevent the seed points from disappearing due to excessive erosion, we chose, after many experiments, to iterate the erosion five times to obtain the seeds.”
Point 3: Some of the figures, like Figure 8 and onwards, are hard to compare visually. Some closely zoomed-in versions could be helpful to show the main differences among the methods.
Response 3: Thank you very much for your comment. Regarding this issue, we intended the result figures to show enough detail for readers to gain an intuitive overall impression of our model's capabilities. We have also taken into account the difficulty of comparing the results visually and have addressed it: in Figures 8-11, we have added red circles to mark the regions where our model shows relatively large advantages, so that readers can identify them more easily. Thank you for your correction.
Author Response File: Author Response.docx
Reviewer 2 Report
The proposed method, "Precise Extraction of Buildings from High-Resolution Remote Sensing Images Based on Semantic Edges and Segmentation," is an exciting approach to locating buildings, especially building boundaries. There are a few aspects that need to be addressed before further processing of this article:
The proposed method makes use of the D-LinkNet core structure with additional skip connections. However, the process of adding these skip connections is not described. At what levels were these connections added, and how can new connections at different levels help to improve the performance? I suggest adding a more detailed analysis or description of these aspects of model development.
Another thing missing in the overall analysis of the proposed study is a comparison between the core structure (LinkNet) and the proposed D-LinkNet method. As a reader, I wanted to know how the newly proposed method differs from the base architecture in practice. It would also be great to show a memory/parameter-based comparison to indicate how the newly designed system differs in size (number of parameters).
I suggest improving your results section with more information. It is sparse and could be extended to summarize different aspects of the proposed method.
Author Response
Response to Reviewer 2 Comments
Thank you very much for your generous help with our paper and the reviewers’ valuable comments concerning our manuscript entitled “Precise Extraction of Buildings from High-Resolution Remote Sensing Images Based on Semantic Edges and Segmentation” (remotesensing-1303136). We have carefully revised the paper according to these comments. The point-by-point response to the reviewers’ comments is listed below.
P.S. In the revised version, the parts in red are the changes made based on the reviewers’ comments. The line numbers mentioned in the following responses all refer to the revised manuscript.
Point 1: The proposed method makes use of the D-LinkNet core structure with additional skip connections. However, the process of adding these skip connections is not described. At what levels were these connections added, and how can new connections at different levels help to improve the performance? I suggest adding a more detailed analysis or description of these aspects of model development.
Response 1: Thank you for your comment. Regarding this issue, we have added a paragraph to the paper describing how the full-scale skip connections are integrated into the model. In addition, we have added brief descriptions of the multi-scale supervision and edge guidance modules to supplement the details of our model development.
From line 187 to line 194:
“We mark the five encoding layers of the encoder as E1, E2, E3, E4, E5 from top to bottom, and likewise mark the five decoding layers of the decoder as D1, D2, D3, D4, D5 from top to bottom. In Figure 2, we have made the corresponding marks. Each decoding layer combines low-level edge features and high-level semantic features through channel concatenation. So, D1 is the combination of E1, D2, D3, D4, D5; D2 is the combination of E1, E2, D3, D4, D5; D3 is the combination of E1, E2, E3, D4, D5; D4 is the combination of E1, E2, E3, E4, D5; and D5 is the combination of E1, E2, E3, E4, and the features of E5 after a series of dilated convolutions.”
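For illustration, one such full-scale decoding layer could be sketched in PyTorch as follows (the channel widths, the bilinear resampling, and the fusion convolution are assumptions for the sketch, not the exact DDLNet implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class FullScaleDecoderBlock(nn.Module):
    # Illustrative sketch of one full-scale decoding layer (e.g., D3): encoder
    # features E1-E3 and deeper decoder features D4-D5 are resampled to D3's
    # resolution and fused by channel concatenation.
    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        # One 3x3 convolution per incoming feature map to a common width,
        # followed by a fusion convolution over the concatenation.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=3, padding=1) for c in in_channels_list]
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(out_channels * len(in_channels_list), out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, features, target_size):
        # Bring every incoming feature map to the target resolution.
        resized = [
            F.interpolate(conv(f), size=target_size, mode="bilinear", align_corners=False)
            for conv, f in zip(self.reduce, features)
        ]
        return self.fuse(torch.cat(resized, dim=1))

# Example: D3 built from E1, E2, E3, D4, D5 with assumed channel counts.
block = FullScaleDecoderBlock([64, 128, 256, 64, 64], out_channels=64)
e1 = torch.randn(1, 64, 128, 128)
e2 = torch.randn(1, 128, 64, 64)
e3 = torch.randn(1, 256, 32, 32)
d4 = torch.randn(1, 64, 16, 16)
d5 = torch.randn(1, 64, 8, 8)
d3 = block([e1, e2, e3, d4, d5], target_size=(32, 32))  # same resolution as E3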
Point 2: Another thing missing in the overall analysis of the proposed study is a comparison between the core structure (LinkNet) and the proposed D-LinkNet method. As a reader, I wanted to know how the newly proposed method differs from the base architecture in practice. It would also be great to show a memory/parameter-based comparison to indicate how the newly designed system differs in size (number of parameters).
Response 2: Thank you for your comment. Regarding this issue, in lines 164-168 of our article, we briefly introduced the relationship between D-LinkNet and LinkNet: “D-LinkNet is built with the LinkNet [30] for road extraction tasks and adds dilated convolution layers in its center part.”
D-LinkNet is the work of other authors (Reference 29) and is based on LinkNet (Reference 30). In its center part, i.e., the layer immediately after the last encoder layer, a series of dilated convolutional layers is added. Dilated convolution effectively enlarges the receptive field of the network while preserving detail, and enlarging the receptive field is a very effective way for a CNN to retain spatial context information.
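For illustration, a center block of this kind could be sketched in PyTorch as follows (the dilation rates, channel width, and additive fusion of intermediate outputs are assumptions for the sketch, not the exact D-LinkNet code):

import torch
import torch.nn as nn

class DilatedCenterBlock(nn.Module):
    # Sketch of a D-LinkNet-style center part: a cascade of 3x3 dilated
    # convolutions whose intermediate outputs are summed, enlarging the
    # receptive field without reducing the spatial resolution.
    def __init__(self, channels=512, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.stages = nn.ModuleList(
            [
                nn.Sequential(
                    nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                    nn.ReLU(inplace=True),
                )
                for d in dilations
            ]
        )

    def forward(self, x):
        out = x
        feat = x
        for stage in self.stages:
            feat = stage(feat)  # cascade: each stage sees a larger context
            out = out + feat    # accumulate the intermediate outputs
        return out

center = DilatedCenterBlock(channels=512)
y = center(torch.randn(1, 512, 32, 32))  # spatial size is preserved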
In our model, we retain the core structure of D-LinkNet and, to address its shortcomings, add full-scale skip connections, multi-scale supervision, and an edge guidance module to further improve performance. Therefore, our main comparison baseline is D-LinkNet rather than LinkNet. Compared with LinkNet, DDLNet additionally contains dilated convolutional layers, full-scale skip connections, multi-scale supervision, and the edge guidance module.
The table below gives the memory/parameter-based comparison.
Model | Params
DDLNet | 60.782M
D-LinkNet | 31.098M
LinkNet | 21.643M
DDLNet is a multi-task network. Its advantage lies in the combination of low-level edge features and high-level semantic features, as well as in multi-scale supervision. Multi-task learning and the reuse of multi-scale features increase the number of model parameters, but since the additional parameters lead to better results, we consider the increase worthwhile.
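For reference, parameter counts of this kind can be obtained with a small helper such as the following (a sketch; the stand-in module is only an example, and in practice the DDLNet, D-LinkNet, and LinkNet definitions would be passed in instead):

import torch.nn as nn

def count_params(model: nn.Module) -> float:
    # Return the number of trainable parameters in millions.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Stand-in module for demonstration only.
toy = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.Conv2d(64, 1, 1))
print(f"toy model: {count_params(toy):.3f}M")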
Point 3: I suggest improving your results section with more information. It is sparse and could be extended to summarize different aspects of the proposed method.
Response 3: Thank you for your comment. Regarding this issue, we have revised and extended the results section. We added three summary paragraphs: a summary of the semantic edge detection results, a summary of the semantic segmentation results, and a final summary of the overall method.
The summary of the semantic edge detection results is from line 456 to line 463
“The capability of DDLNet for semantic edge detection tasks is demonstrated on two different data sets. RCF does not perform well on our dataset. BDCN can effectively extract the building boundary and ensure the integrity of the boundary, but its edge is blurred and insufficient in accuracy. DexiNed can produce more accurate and visual edges, but its edge integrity is difficult to guarantee. DDLNet can achieve edge integrity beyond DexiNed and BDCN and effectively extract the boundaries of buildings. It means that it is effective to provide more high-level semantic features for low-level edge features to realize semantic edge extraction.”
The summary of the semantic segmentation results is from line 491 to line 496
“The capability of DDLNet for semantic segmentation tasks is demonstrated on two different data sets. U-Net3+ and DDLNet achieve boundary IoU scores that greatly surpass those of U-Net and D-LinkNet, which proves that the full-scale skip connection is effective in improving the boundary of the polygon from semantic segmentation. The result of DDLNet proves that making full use of low-level edge information is helpful in extracting buildings from high-resolution remote sensing images.”
The final summary of the overall method is from line 530 to line 538
“In summary, we conducted comparative experiments on two different datasets with other SOTA models to verify whether our methods could obtain high-quality results. Experiments confirmed that our model DDLNet achieved better results than the other SOTA models in both the semantic edge detection task and the semantic segmentation task under all evaluation metrics, which not only indicates that our model performs well in building extraction but also indicates that the edge guidance module and the full-scale skip connection are conducive to the automatic extraction of buildings in a network. Moreover, our post-processing method is effective and further improves the building extraction results, which helps to improve the vectorization of the result.”
Round 2
Reviewer 1 Report
Thanks for duly addressing the points.
Reviewer 2 Report
Authors have answered all questions.