Efficient Depth Map Creation with a Lightweight Deep Neural Network
Round 1
Reviewer 1 Report
The paper addresses the problem of lightweight depth estimation from stereo images. It uses asymmetric convolution filters to reduce the parameters and increase the accuracy. While this is an interesting approach to investigate, the paper lacks details and analysis of the approach. More precisely:
- After reading the paper, many design choices remain unclear. Why has the right patch been extended horizontally? The paper says to extend the street (Line 124), is it therefore something dataset specific?
- The method uses horizontal asymmetric filters, but not vertical ones. Why? Are the horizontal ones more important?
- An ablation study is missing to show the advantage of adding the pooling layers and the asymmetric filters. Page 6 mentions a Table 2 but I cannot find it.
- Table 1: References missing, and differences between the three Luo’s variants are unclear. The approach has only been compared to methods from 2016-2018. More recent ones are missing.
- E.g., Yan Wang, Zihang Lai, Gao Huang, Brian H. Wang, Laurens van der Maaten, Mark E. Campbell, Kilian Q. Weinberger. Anytime Stereo Image Depth Estimation on Mobile Devices. ICRA 2019.
- The experiments were performed on 4 Nvidia GeForce RTX 2080 Ti GPUs, which I would not call specifically lightweight.
- It might be helpful to have some visual results with ground truth if available.
- In the conclusion it is not clear which variation of Luo et al. from Table 1 has been used for the stated comparison.
- Missing related work about asymmetric convolutional filters, e.g.:
- Ding X, Guo Y, Ding G, et al. ACNet: Strengthening the Kernel Skeletons for Powerful CNN via Asymmetric Convolution Blocks. ICCV 2019.
- Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CVPR 2016.
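As background for the comments on asymmetric filters, the following is a minimal sketch of why replacing a square k x k convolution with 1 x k (horizontal) and k x 1 (vertical) filters reduces the parameter count, in the spirit of the factorization discussed by Szegedy et al. The channel counts and the helper name `conv_params` are hypothetical and are not taken from the paper under review.

```python
def conv_params(kernel_h, kernel_w, c_in, c_out, bias=True):
    """Number of learnable parameters in a 2D convolution layer."""
    weights = kernel_h * kernel_w * c_in * c_out
    return weights + (c_out if bias else 0)


# Hypothetical channel counts, chosen only for illustration.
c_in = c_out = 64

square = conv_params(3, 3, c_in, c_out)       # full 3 x 3 filter
horizontal = conv_params(1, 3, c_in, c_out)   # 1 x 3 (horizontal) filter only
vertical = conv_params(3, 1, c_in, c_out)     # 3 x 1 (vertical) filter only
factorized = horizontal + vertical            # 1 x 3 followed by 3 x 1

print(f"3x3:        {square}")
print(f"1x3 only:   {horizontal}")
print(f"1x3 + 3x1:  {factorized}")
print(f"1x3 only uses {horizontal / square:.0%} of the 3x3 parameters")
```

A horizontal-only filter saves the most parameters but sees no vertical context, which is why the choice between horizontal-only and horizontal-plus-vertical filters deserves an explicit justification.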
Author Response
Dear reviewer,
We deeply appreciate your valuable comments.
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors use padding and pooling layers to prevent the features from overlapping at the center. They also propose a network that applies an asymmetric filter to further reduce the amount of computation.
The algorithm should be explained in more detail. For example, is the CNN used for supervised learning? If so, the method used to label the samples should be explained. The training process should be explained in detail, and the characteristics of the proposed algorithm should be emphasized to strengthen the contribution of the study.
The definitions of corresponding parameters in the Experiment are imprecise. For example, the meaning of the numbers in brackets in Table 1 should be explained. It is recommended that the pixel error should be explained.
It is recommended to provide more quantitative comparisons to highlight the contribution of research. The results in Figure 4 should be expressed in quantitative terms.
The conclusion mentioned, "The proposed technique facilitates good disparity matching with a 10% less error for a 45% lower calculation". Whether 10% less and 45% lower are obtained from Table 1 should be explained in the Experiment.
In its current form, the paper is not presented clearly.
Author Response
Dear reviewer,
We deeply appreciate your valuable comments.
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors have addressed most of my concerns in the revised paper and improved the clarity of the paper.
However, my main concern regarding the positioning of the work remains. The abstract talks about embedded computer vision algorithms; however, this remains future work in the conclusion. The proposed algorithm reduces the computation runtime and achieves better accuracy than other methods. However, the number of parameters is comparable to the others in Table 1. The method is lightweight compared to the method closest in accuracy, but in absolute terms and with respect to all the methods in Table 1 it is more on the medium side.
Minor:
Line 219 several Siemens architectures -> Siamese
Line 113 that extracts disparity and a fast speed and a small parameter -> rewrite
Author Response
We deeply appreciate your valuable comments.
Please see the attachment.
Thank you.
Author Response File: Author Response.pdf
Reviewer 2 Report
The results in Figure 4 are presented by visual verification. It is recommended to present them in quantitative terms. The conclusion mentioned, "… matching with a 10% less error for a 45% lower calculation compared to the existing method proposed by Luo et al. with (37 x 37) input sizes.…" However, the result is compared not only with Luo's method, but also with MC-CNN-fast, Patrick's, and AnyNet. If we compare with MC-CNN-fast, the runtime is reduced by |0.11-0.13|/0.13=15%, or if we compare with AnyNet, it is |0.20-0.13|/0.13=54%. The results presented by the research should be more clearly stated. It is recommended to provide more quantitative comparisons to highlight the contribution of the research.
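The arithmetic in this comment can be checked with a short sketch. The runtimes 0.11, 0.13 and 0.20 are the values quoted in the comment, and 0.13 is used as the baseline exactly as in the reviewer's formula; the function name `relative_difference` is hypothetical.

```python
def relative_difference(value, baseline):
    """Absolute difference between value and baseline, relative to baseline."""
    return abs(value - baseline) / baseline


baseline = 0.13  # baseline runtime used in the reviewer's formula

# Runtimes quoted in the comment for the two comparisons.
vs_mc_cnn_fast = relative_difference(0.11, baseline)
vs_anynet = relative_difference(0.20, baseline)

print(f"MC-CNN-fast comparison: {vs_mc_cnn_fast:.0%}")
print(f"AnyNet comparison:      {vs_anynet:.0%}")
```

Reporting each comparison with its baseline stated explicitly, as above, would make it clear which method a quoted percentage refers to.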
Author Response
We deeply appreciate your valuable comments.
Please see the attachment.
Thank you.
Author Response File: Author Response.pdf
Round 3
Reviewer 2 Report
I have no further suggestions.