A Voxel Generator Based on Autoencoder
Round 1
Reviewer 1 Report
The authors have successfully developed a voxel generator called VoxGen based on an autoencoder. It adopts the modified VGG16 and ResNet18 to improve the effectiveness of feature extraction and mixes the deconvolution layer with the convolution layer in the decoder for generating and polishing the output voxels. Moreover, it allows users to select the fully-flatten layer or the global average pooling layer to optimize the quality of the generated voxels or reduce the computation cost of model training.
Paper is nicely written.
Author Response
Dear Reviewer:
We would like to thank you very much for your comment on this work.
Sincerely,
Tyngyeu
Reviewer 2 Report
This paper implemented a voxel generator called VoxGen based on the autoencoder framework. It consists of an encoder to extract image features and a decoder to map feature values to voxel models. The novelty of the method is ordinary.
1. The main structure of VoxGen is the modified VGG16 and ResNet18, 34, or 50 for feature extraction. It is a very common structure. How about DenseNet101 for this voxel generation?
2. The authors used a 3D dataset, ShapeNet, that provides multiple-viewpoint color images for training VoxGen to generate 3D voxels. Please introduce the dataset in the experiment section, including the numbers of training and testing samples.
3. Why does VoxGen adopt one channel in its encoder and decoder? Is it only to reduce the cost, or because of the 3D voxel task?
4. It would be better to show more 3D object results for visualization and comparison across different methods.
5. The result comparisons should include more recently published articles.
Author Response
Dear Reviewer:
Thank you very much for your comments. Our response to your comments is described as follows.
1. The main structure of VoxGen is the modified VGG16 and ResNet18, 34, or 50 for feature extraction. It is a very common structure. How about DenseNet101 for this voxel generation?
Response: DenseNet's full name is Densely Connected Convolutional Network. Unlike ResNet, it does not enhance feature learning by deepening and widening the network; instead, it passes the feature output of each layer to every subsequent layer as input to achieve feature reuse. Since the features from shallow to deep layers are all fed into the last convolutional layer, feature utilization is improved and the incidence of vanishing gradients is reduced. However, feature reuse requires a larger amount of memory; if the load is exceeded, GPU memory will be exhausted during model training. Furthermore, although densely connected blocks make gradient backpropagation easy, they are computationally complex. Considering our limited computing resources, we did not use DenseNet in this study.
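For illustration, the feature-reuse pattern that makes DenseNet memory-hungry can be sketched in a few lines of TensorFlow/Keras (the framework we train with); the layer count, growth rate, and input shape below are illustrative, not the actual DenseNet configuration:

    import tensorflow as tf
    from tensorflow.keras import layers

    def dense_block(x, num_layers=4, growth_rate=32):
        # Each layer receives the concatenation of all preceding feature
        # maps, so features are reused but the channel count (and hence
        # the memory footprint) grows with every layer.
        for _ in range(num_layers):
            y = layers.BatchNormalization()(x)
            y = layers.ReLU()(y)
            y = layers.Conv2D(growth_rate, 3, padding="same")(y)
            x = layers.Concatenate()([x, y])  # feature reuse happens here
        return x

    inputs = tf.keras.Input(shape=(64, 64, 1))
    outputs = dense_block(layers.Conv2D(32, 3, padding="same")(inputs))
    block = tf.keras.Model(inputs, outputs)

Every Concatenate call enlarges the tensors that must stay resident for backpropagation, which is exactly the memory pressure described above.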
The performance comparison between DenseNet and ResNet has been discussed in past research [1], which shows that DenseNet outperforms ResNet in some cases. Our experimental results in this paper also show that VGG16 outperforms ResNet for generating car voxels, and that ResNet34 and ResNet50 are not necessarily more effective than ResNet18 for voxel generation. Therefore, the effectiveness of DenseNet for voxel generation is worth studying in the future.
[1] Huang, G.; Liu, Z.; van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. 2016. https://arxiv.org/pdf/1608.06993.pdf
2. The authors used a 3D dataset, ShapeNet, that provides multiple-viewpoint color images for training VoxGen to generate 3D voxels. Please introduce the dataset in the experiment section, including the numbers of training and testing samples.
Response: In this paper, we used four object classes, including car, chair, lamp, and table, from the ShapeNet database. We used 80% of the samples of these four object classes for training and 20% for testing. The total numbers of training and test data are 481,896 and 120,164, respectively. We have revised the paper as previously described, from the 279th line to the 282nd line on page 8.
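For clarity, the split can be sketched as follows (a minimal sketch; split_dataset is an illustrative helper, and the real loader code in our training scripts also pairs each image with its ground-truth voxel):

    import random

    def split_dataset(samples, train_ratio=0.8, seed=42):
        # Shuffle once with a fixed seed, then cut 80% / 20%.
        samples = list(samples)
        random.Random(seed).shuffle(samples)
        cut = int(len(samples) * train_ratio)
        return samples[:cut], samples[cut:]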
3. Why does VoxGen adopt one channel in its encoder and decoder? Is it only to reduce the cost, or because of the 3D voxel task?
Response: Thanks for this question. We conducted an experiment using three channels in the encoder and the decoder of VoxGen. However, our experimental results show that using three channels contributes no significant performance improvement but increases the computation cost of model training and inference. Therefore, we finally decided to adopt one channel in the structure of VoxGen.
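One way to realize the single-channel input, assuming RGB renderings are converted to grayscale before entering the encoder (a minimal sketch, not necessarily our exact preprocessing code):

    import tensorflow as tf

    def to_single_channel(rgb_image):
        # Collapse an (H, W, 3) RGB image to an (H, W, 1) grayscale image,
        # matching a one-channel encoder input.
        return tf.image.rgb_to_grayscale(rgb_image)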
4. It would be better to show more 3D object results for visualization and comparison across different methods.
Response: Thanks for this comment. We have added a visualization comparison with another approach, as shown in Figure 8 on page 11.
5. The result comparisons should include more recently published articles.
Response: Thanks for the comment. Recently, Peng et al. [24] proposed a transformer-based encoder-decoder called TMVNet, which outperforms previous methods for 3D reconstruction. This method uses 2D CNN encoders to extract multiple-viewpoint image features and passes the extracted features to two transformer encoders to generate 3D feature vectors. Finally, it uses 3D CNN decoders to map the 3D vectors into voxel features and fuses these voxel features to generate the output voxel. However, this method is not effective enough at reconstructing fine details and non-smooth edges of object voxels. Our experimental results show that VoxGen performs better for the lamp and table classes but worse for the car and chair classes. It seems that the two methods have different strengths for different object classes. Generally speaking, the strength of VoxGen is its deeper and wider encoder network and a loss function that prevents the majority samples from dominating the output voxel. In contrast, the strength of TMVNet is its effectiveness in aggregating position information from multiple-view images. As previously described, we have revised the related work and performance evaluation, as shown from the 141st line to the 148th line on page 3 and from the 351st line to the 358th line on page 10.
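As a generic illustration of the balancing idea mentioned above (the exact loss formulation is given in the paper; the weighted binary cross-entropy below is only one common way to write such a loss, with an illustrative weight):

    import tensorflow as tf

    def balanced_voxel_bce(y_true, y_pred, w_occupied=0.85, eps=1e-7):
        # Weighted binary cross-entropy over a voxel occupancy grid, where
        # y_true is a float grid in {0, 1}. Up-weighting the (usually sparse)
        # occupied voxels keeps the far more numerous empty voxels from
        # dominating the gradient.
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        loss = -(w_occupied * y_true * tf.math.log(y_pred)
                 + (1.0 - w_occupied) * (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(loss)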
Reviewer 3 Report
The motivation of this manuscript is clear: the quality of the generated 3D object models still has much room for improvement. Accordingly, the authors designed and implemented a voxel generator called VoxGen based on the autoencoder framework in this manuscript. It consists of an encoder that extracts image features and a decoder that maps feature values to voxel models. The main characteristics of VoxGen are exploiting modified VGG16 and ResNet18 to enhance the effect of feature extraction and mixing the deconvolution layer with the convolution layer in the encoder to enhance the features of the generated voxels. The experimental results have shown that VoxGen outperforms related approaches in terms of the volumetric IOU values of the generated voxels. However, I still have the following comments:
1. The innovation is general: ‘exploiting modified VGG16 and ResNet18 to enhance the effect of feature extraction and mixing the deconvolution layer with the convolution layer in the encoder to enhance the feature of generated voxels.’ A global framework needs to be shown to illustrate this contribution.
2. Regarding “VoxGen provides an option of using a global average pooling layer or a fully flattened layer in the flatten layer of the encoder architecture. This option allows users to minimize the time cost of training the voxel generator or optimize the accuracy of 3D model reconstruction”: a global average pooling layer and a fully flattened layer are mature technologies that already exist.
3. Could the related work be divided into subheadings?
4. I think Figure 2 and Figure 3 are too simple to be illustrative.
5. There are too few formulas, and the contributions lack theoretical support.
6. “The batch size (Batch Size) was 128”, but your device is only an “Nvidia GeForce® RTX 3090 GPU” for your model. Please provide details.
7. What data are used for training and testing?
Author Response
Dear Reviewer:
Thank you very much for your comments. Our response to your comments is described as follows.
1. The innovation is general: ‘exploiting modified VGG16 and ResNet18 to enhance the effect of feature extraction and mixing the deconvolution layer with the convolution layer in the encoder to enhance the feature of generated voxels.’ A global framework needs to be shown to illustrate this contribution.
Response: Thanks for the comment. We have evaluated the impact of the modified VGG16 and ResNet18 on the effectiveness of feature extraction in Section 4.1 and have measured the effect of mixing the deconvolution layer with the convolution layer in Section 4.2. According to these evaluation results, we have concluded which combination of encoder and decoder is suitable for each kind of object class, thereby showing the contribution of this paper.
2. Regarding “VoxGen provides an option of using a global average pooling layer or a fully flattened layer in the flatten layer of the encoder architecture. This option allows users to minimize the time cost of training the voxel generator or optimize the accuracy of 3D model reconstruction”: a global average pooling layer and a fully flattened layer are mature technologies that already exist.
Response: Thanks for the comment. As far as we know, no related work supports users in selecting between the global average pooling layer and the fully flattened layer within the same voxel-generation architecture, even though both are mature technologies. Moreover, we evaluated and compared the effect of using global average pooling or the fully-flatten layer in different encoder-decoder combinations, and finally gave users a guideline for selecting the appropriate kind of flatten layer, as shown in Section 4.3.
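For clarity, the two options can be sketched with the TensorFlow/Keras API (the feature-map shape and latent width below are illustrative):

    import tensorflow as tf
    from tensorflow.keras import layers

    def flatten_stage(feature_map, use_gap=True):
        # Global average pooling keeps one value per channel, so the
        # following dense layer is small and cheap to train; fully
        # flattening keeps every spatial activation, which costs more
        # parameters but can improve reconstruction accuracy.
        if use_gap:
            return layers.GlobalAveragePooling2D()(feature_map)
        return layers.Flatten()(feature_map)

    inputs = tf.keras.Input(shape=(8, 8, 512))      # illustrative shape
    latent = layers.Dense(1024)(flatten_stage(inputs, use_gap=False))
    encoder_head = tf.keras.Model(inputs, latent)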
3. Could the related work be divided into subheadings?
Response: After careful consideration, we did not divide the related work into subheadings because of the length and integrity of the content.
4. I think Figure 2 and Figure 3 are too simple to be illustrative.
Response: Thanks for this comment. Figure 2 and Figure 3 show the overall structure of VoxGen, which consists of an encoder based on modified VGG16 or ResNet and a decoder. We have detailed the structures of the encoder and the decoder from page 4 to page 7.
5. There are too few formulas, and the contributions lack theoretical support.
Response: Thanks for this comment. Most related works cited in this paper, such as 3D-R2N2 [15], V3DOR [16], Pix2Vox [17], and TMVNet [24], used only performance evaluation to support their contributions. Following these referenced papers, we also support our contributions with a series of experiments that evaluate the impact of different architectural designs for the encoder and the decoder and compare our proposed method with related methods in terms of the accuracy (i.e., IOU value) of voxel generation.
6. “The batch size (Batch Size) was 128”, but your device is only an “Nvidia GeForce® RTX 3090 GPU” for your model. Please provide the details.
Response: When using only a single Nvidia GeForce® RTX 3090 GPU for training, the GPU memory is insufficient to hold the whole training dataset at once. To address this problem, we use the fit_generator method of the TensorFlow/Keras framework to train the proposed voxel generator. Instead of sending the entire dataset to GPU memory, this training method transmits only 128 training samples per batch to the GPU for computation, until the entire dataset has been processed within the same epoch. Although this increases the cost of memory copies between the CPU and the GPU, it effectively avoids training failures caused by insufficient GPU memory.
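A minimal sketch of this batched training scheme (load_image and load_voxel are illustrative stubs standing in for our real loaders, and the tensor shapes are placeholders):

    import numpy as np

    def load_image(path):
        # Illustrative stub; the real loader reads and preprocesses one image.
        return np.zeros((128, 128, 1), dtype=np.float32)

    def load_voxel(path):
        # Illustrative stub; the real loader reads one ground-truth voxel grid.
        return np.zeros((32, 32, 32, 1), dtype=np.float32)

    def batch_generator(sample_paths, batch_size=128):
        # Yield (image, voxel) batches so that only one batch of 128 samples
        # needs to reside on the GPU at a time.
        while True:  # Keras fit_generator expects an endless generator
            np.random.shuffle(sample_paths)
            for i in range(0, len(sample_paths) - batch_size + 1, batch_size):
                chunk = sample_paths[i:i + batch_size]
                yield (np.stack([load_image(p) for p in chunk]),
                       np.stack([load_voxel(p) for p in chunk]))

    # model.fit_generator(batch_generator(train_paths),
    #                     steps_per_epoch=len(train_paths) // 128, epochs=...)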
7. What data are used for training and testing?
Response: In this paper, we used four object classes, including car, chair, lamp, and table, from the ShapeNet database. We used 80% of the samples of these four object classes for training and 20% for testing. The total numbers of training and test data are 481,896 and 120,164, respectively. We have revised the paper as previously described, from the 279th line to the 282nd line on page 8.
Reviewer 4 Report
Article ID: applsci-1949918
A Voxel Generator based on Autoencoder
The paper describes the design and test results of a voxel generator (VoxGen) implemented using an autoencoder artificial neural network. The VoxGen autoencoder was developed to reduce the effort and minimise the development costs of generating 3D model voxels.
The paper presents a study, results, and discussion for different encoder and decoder models and flatten layers, including a performance evaluation.
The authors presented better results than those obtained for 3D-R2N2, V3DOR-AE and Pix2Vox described in the literature.
This is a good-quality, well-structured paper presenting valid research.
There are a few minor and major problems in the paper that need to be clarified and corrected before publication. The additional issues are summarised below (MA = major, MI = minor; x/z means page/line):
MI 1/18: always explain an acronym or abbreviation when it is used for the first time, e.g., IOU in the abstract (1/19), VR/AR (1/25), LiDAR (1/44).
This paper is suitable for MDPI publication; however, a few minor corrections are needed before publication.
Author Response
Dear Reviewer:
Thank you very much for your comments. Our response to your comments is described as follows.
1. MI 1/18: always explain an acronym or abbreviation when it is used for the first time, e.g., IOU in the abstract (1/19), VR/AR (1/25), LiDAR (1/44).
Response: These acronyms are now spelled out in full when they are used for the first time, as shown on pages 1 and 2.
Reviewer 5 Report
The authors have developed a voxel generator based on an autoencoder, modifying VGG16 and ResNet18 to improve the effectiveness of feature extraction. The authors also propose mixing the deconvolution layer with the convolution layer in the decoder part for generating the output voxels. Experimental results prove the effectiveness of the proposed method based on the modification of the ResNet and VGG networks. However, the proposed "VoxGen" network should be better explained:
- more justification of the design choices;
- more equations related to the modifications;
- improved text and figure quality; etc.
For this reason, I accept the paper with major corrections.
Author Response
Dear Reviewer:
Thank you very much for your comment. We have revised the paper by adding more discussion and figures on the related work, the proposed framework, and the performance evaluation according to all the reviewers' comments, as shown in the revised version.
Reviewer 6 Report
This paper focuses on the design and implementation of a voxel generator based on the autoencoder framework, motivated by the rapid development of graphics processors and deep learning models.
This type of work is suitable for publication in this reputed journal. However, some points that can improve the quality of the paper are given below.
1. The abstract of the paper is not written properly. (Also include a performance analysis.)
2. How are the features of the preprocessed input image extracted, and how are the extracted features then transformed?
3. Compare this feature extraction technique with other feature extraction techniques.
Author Response
Dear Reviewer:
Thank you very much for your comments. Our response to your comments is described as follows.
1. The abstract of the paper is not written properly. (Also include a performance analysis.)
Response: Thanks for the comment. In this paper, we have conducted a series of experiments to evaluate the impact of each architectural design in the proposed voxel generator and have compared our methods with related work. We have given a detailed analysis explaining the experimental results in Section 4. Therefore, considering the length of the abstract, we describe only brief conclusions of our experiments there.
2. How are the features of the preprocessed input image extracted, and how are the extracted features then transformed?
Response: As shown at the 184th line on page 4, we detail the structure of the encoder based on modified VGG16 and ResNet18/34/50, which is responsible for extracting the features of the preprocessed input image and mapping the extracted features to the latent code. In this paper, we increased the width of the fully connected layer in VGG16 and ResNet18/34/50 to involve more image features in voxel generation, and we provide an option for users to select the fully-flatten layer or global average pooling to minimize the computation cost or optimize the accuracy of 3D model reconstruction, as described at the 213th line on page 5.
3. Compare this feature extraction technique with other feature extraction techniques.
Response: In Section 4.1, we have compared the feature extraction architecture of the proposed method with that of V3DOR-AE in terms of the IOU value of generated voxels.
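For reference, the volumetric IOU metric used throughout these comparisons reduces to a few lines of NumPy (assuming a binarization threshold of 0.5, which is illustrative):

    import numpy as np

    def volumetric_iou(pred, target, threshold=0.5):
        # |intersection| / |union| of occupied voxels between a predicted
        # occupancy grid (probabilities) and a binary ground-truth grid.
        p = pred >= threshold
        t = target.astype(bool)
        union = np.logical_or(p, t).sum()
        return float(np.logical_and(p, t).sum() / union) if union else 1.0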
Round 2
Reviewer 2 Report
The authors have modified the manuscript in accordance with the comments.
Reviewer 3 Report
All of my comments have been addressed. I think it is ready for publication!
Reviewer 5 Report
I still believe that the methods could be better explained, but taking into account the responses to the reviewers' questions and the corrections made by the authors, I accept the paper.
Reviewer 6 Report
I agree to accept this article in present form.