PointBLIP: Zero-Training Point Cloud Classification Network Based on BLIP-2 Model
Round 1
Reviewer 1 Report (New Reviewer)
Comments and Suggestions for Authors
In fact, the creation of manually annotated data for 3D point clouds is a tedious and time-consuming process. Thus, zero- and few-shot learning for 3D point clouds is a very interesting topic. The authors present the PointBLIP architecture, which combines features extracted from texts and images for the 3D point cloud classification task. Compared to the SoTA methods, the authors use the BLIP architecture instead of CLIP to extract the text features. Additionally, they propose different feature measurement strategies depending on the zero-shot or few-shot application, resulting in improved results compared to the SoTA methods.
Furthermore, the authors render images of the point cloud from different perspectives using a ray tracing method instead of a simple projection of the point cloud. This is a very interesting approach. In fact, the comparison of projected and ray-traced images demonstrates that the latter better represent the geometric characteristics of the objects.
However, is there a difference in execution time between them? Would it be possible to apply this technique to larger objects, i.e., with more than 4 views? Why do you select only 4 views? Are there missing parts due to the limited number of views?
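For context on the execution-time question, here is a minimal numpy sketch of the simple multi-view projection baseline that the ray tracing method is contrasted with; the y-axis rotation, the 4 views, and the 224-pixel resolution are illustrative assumptions, not the authors' settings:

import numpy as np

def rotate_y(points, angle_rad):
    # Rotate an (N, 3) point cloud around the y-axis.
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, 0.0, s],
                    [0.0, 1.0, 0.0],
                    [-s, 0.0, c]])
    return points @ rot.T

def project_view(points, resolution=224):
    # Orthographic projection of a point cloud to a depth image.
    xy = points[:, :2]
    mins, maxs = xy.min(axis=0), xy.max(axis=0)
    pix = ((xy - mins) / (maxs - mins + 1e-8) * (resolution - 1)).astype(int)
    img = np.zeros((resolution, resolution))
    order = np.argsort(points[:, 2])  # crude z-buffer: the largest-z point
    img[pix[order, 1], pix[order, 0]] = points[order, 2]  # per pixel wins
    return img

points = np.random.randn(2048, 3)  # stand-in for a real object's points
views = [project_view(rotate_y(points, k * np.pi / 2)) for k in range(4)]

Timing such a projection against the ray tracer over the same 4 views would directly answer the execution-time question.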
Additionally, in lines 285-286 the authors describe the max-max similarity. By keeping only the maximum similarity between the point features and the text features, the authors discard all features except those with the maximum similarity score. Do you think that this influences the performance of PointBLIP? The authors showed that max-max similarity performs better than the average. However, could selecting, for example, the top-5 similarities perform better than the max-max similarity?
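To make the suggested alternative concrete, a minimal numpy sketch contrasting the max-max strategy with a hypothetical top-k variant; the similarity-matrix shape and k = 5 are illustrative assumptions, not values from the paper:

import numpy as np

def max_max_similarity(sim):
    # sim: (num_image_features, num_text_features) similarity matrix.
    # Keep only the single largest entry, as in the max-max strategy.
    return sim.max()

def top_k_similarity(sim, k=5):
    # Hypothetical alternative: average the k largest similarities
    # instead of keeping only the maximum.
    flat = np.sort(sim.ravel())[::-1]
    return flat[:k].mean()

sim = np.random.rand(32, 8)  # stand-in for image-text similarities
print(max_max_similarity(sim), top_k_similarity(sim, k=5))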
Finally, in lines 494-495 the authors argue that the performance of PointBLIP does not rely on the feature extraction capability of the VLP model. They also state that BLIP-2 performs worse than CLIP in the zero-shot classification scenario. Did you try combining CLIP, instead of BLIP, with the proposed similarity measurements?
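Such an ablation would only swap the embedding backbone while keeping the similarity measurement fixed; a minimal sketch assuming precomputed, L2-normalized view and text-prompt embeddings (the shapes, prompt counts, and max-max scoring are illustrative assumptions):

import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(view_embs, class_text_embs):
    # view_embs: (num_views, d) embeddings of rendered views of one object.
    # class_text_embs: one (num_prompts, d) array per candidate class.
    # Score each class by max-max similarity; only the backbone that
    # produced the embeddings (CLIP vs. BLIP-2) would need to change.
    scores = [(view_embs @ text_embs.T).max() for text_embs in class_text_embs]
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
views = l2norm(rng.normal(size=(4, 512)))  # e.g., 4 rendered views
classes = [l2norm(rng.normal(size=(3, 512))) for _ in range(10)]  # 10 classes, 3 prompts each
print(zero_shot_classify(views, classes))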
Overall, the proposed manuscript is very interesting and informative.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report (New Reviewer)
Comments and Suggestions for Authors
Please find my comments attached. I enjoyed reading your manuscript and learned a lot about the clearly introduced method. I strongly appreciated your discussion. However, I am missing a concrete "real" application and the related difficulties that may arise (computational time, etc.). Thanks for clarifying here.
Comments for author File: Comments.pdf
Author Response
Thank you very much for taking the time to review this manuscript. One potential application of this research is in 3D scene understanding tasks that require the classification of unfamiliar objects, such as indoor navigation, SLAM, or 3D object detection. The zero-shot and few-shot point cloud classification methods in our study can be utilized in these scenarios. Meanwhile, the real-time rendering of ray-traced point cloud images may pose a potential challenge; this is a current limitation of this work and worthy of further research.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper presents a zero-training point cloud classification network based on a visual-language model. It describes two classification methods: zero-shot and few-shot classification. The zero-shot method is compared with fully trained methods, and the few-shot method is compared with relevant methods by testing K-shot classification performance. An ablation study follows.
In general, this is a very good research paper almost ready to be published.
Some comments are given:
1. Typos and possible minor improvements in writing, see: line 65, ‘large’ was typed as ‘lasrge’; Table 1 caption, ‘overall’ as ‘overrall’, and an indication of ‘accuracy’ is probably missing.
2. Some terms should be explained; for example, does ‘N-way learning’ in Section 4.4.1 refer to the number of target classes?
3. Some terms should be better explained. Lines 79-86 introduce the Max-Max and Max-Min similarity strategies for two different methods. They are better explained in Section 3.3, but the explanation here is easily misunderstood. It is suggested to rearrange the writing in this paragraph.
Comments on the Quality of English Language
The English is not bad; only minor editing is needed.
Reviewer 2 Report
Comments and Suggestions for Authors
The authors propose PointBLIP, a new method for zero-/few-shot point cloud classification exploiting VLMs. At the beginning of the manuscript, I was really impressed by the authors' idea and its innovation. The performance claimed at the very beginning was super impressive, so I went deeply through the manuscript. It was very difficult to follow, logically, what was done to achieve the results. I am skeptical about the idea of using ray tracing to produce simulated images and then using them as the basis for the experiments; this part is completely obscure. Second, apart from the first contribution that I read in the introduction (the introduction of PointBLIP), the others should be condensed into one, which in this field would be more than enough!
Unfortunately, looking at the results, the comparison with other SoTA methods reveals only shallow improvements, and dealing with classification means showing in detail where the performance increases and, foremost, why.
The authors state they are pioneers, and I would be more than happy about this. As of now, to me, the experiments are not replicable, and, following the paper, both the formulas and the architectures come from previous papers.
All in all, it is a very good idea, but it lacks the proof required to be published in Remote Sensing. Another comment concerns whether the manuscript is appropriate for the journal, since I cannot see how the method can be used in domains like Earth observation or environment modeling.
Comments on the Quality of English Language
The English is fine, though occasionally redundant.
Reviewer 3 Report
Comments and Suggestions for Authors
The paper develops PointBLIP, a novel zero-training network for point cloud classification. The framework bridges the gap between point cloud and image data via ray tracing rendering. It also proposes distinct feature measurement strategies for the zero-shot and few-shot classification tasks, respectively. Meanwhile, experimental results on four benchmark datasets (including synthetic and real scan datasets) demonstrate the performance of the proposed network. Overall, this study achieves performance highly comparable to full-training methods. It is recommended for publication after minor revisions.
Comments on the Quality of English Language
The quality of the English language in the provided text is generally good. However, there are some spelling errors, such as "lasrge" in line 65 and "pointNLIP" in line 358. Please check the whole paper and make further improvements.