1. Introduction
Point clouds are one of the most commonly used formats for 3D data, comprising sets of points sampled from a scene. In various 3D computer vision applications, point clouds serve either as the sole data source [1,2,3,4,5,6,7] or as essential complementary data [8,9] for understanding 3D scenes. Point cloud classification is a fundamental task in 3D scene understanding, and classifying point clouds in open-world scenarios or for previously unknown categories has become a topic of intense interest [10,11]. Deploying 3D systems at this level requires a large amount of manually annotated data. Despite the increasing availability of point cloud data enabled by advances in 3D scanning technology, the volume of usable point cloud data remains far from sufficient. Moreover, annotating point cloud data is considerably more challenging than annotating 2D images due to its scattered and unordered nature [10], which complicates the collection of point cloud datasets for deep learning applications.
Visual-language pre-trained (VLP) models, learned from image–text pairs [12,13,14], have revolutionized 2D visual recognition over the last few years. Benefiting from existing large-scale pre-training data, these models exhibit an exceptional understanding of the open world at the 2D image level [11,13,14]. Inspired by this, many downstream recognition tasks have been adapted to VLP models, and this extension also applies to point cloud classification. Several recent works have explored how to transfer this knowledge to point cloud classification tasks [11,15,16]. These transfer-based approaches follow a common process: (1) encoding the projected point cloud images and the textual prompt separately, each as a single feature; (2) aligning the image–text pair features and selecting the best-matching category by cosine similarity. Typically, the point cloud is projected into multi-view depth images, and all image features are aggregated into a single feature using predefined view weights.
We identify several limitations in these VLP-based methods: (1) Prompt Ambiguity: Textual prompts for each category may come from predefined templates or be generated by a large language model, but selecting the specific prompt used for classification still relies on human expertise. (2) Image Domain Gap: VLP-based methods project point clouds as depth images, which differ significantly from the image domain of the VLP training data. (3) View Weight Confusion: Point clouds are typically projected into images from multiple viewpoints, and the encoded image features are aggregated into a single feature through view weights. Predefined view weights require manual fine-tuning rather than automatic adjustment, making it difficult to determine, without prior knowledge, which viewpoint is more beneficial for classification. (4) Feature Deviation: Encoded features of objects with unusual shapes may deviate from those of common shapes in the same category, as a single image encoder may not capture the distinctive characteristics that separate them from other categories.
In this work, we introduce PointBLIP, a zero-training point cloud classification network. PointBLIP is built upon the BLIP-2 visual-language pre-trained model [14], enabling it to handle zero-shot and few-shot point cloud classification tasks. Unlike previous methods, PointBLIP introduces improved approaches to input data construction and feature similarity comparison to address the aforementioned challenges.
To improve the quality of input data, we employ the ray tracing method to render point clouds into simulated images instead of projecting them into depth maps, thereby making the input images more closely aligned with the image domain of VLP models. Simultaneously, we utilize a large language model to generate shape-specific and more discriminative textual prompts. We treat multiple textual prompts for the same object category as a semantic description set for textual input, collectively enriching the descriptive features and eliminating the need for manual selection.
PointBLIP compares multiple image features against multiple text features. The image features are derived from multiple projection perspectives of the point cloud and from the encoding process of the BLIP-2 image encoder, while the text features originate from the semantic description set on the textual side. Diverging from previous approaches that aggregate a single feature with predefined view weights, we directly compare the similarities between multiple image features and multiple text features without any weight adjustments. We conceptualize this multi-feature comparison as taking place, at its finest granularity, within a feature grid. To measure reliable feature similarity from the feature grid, we employ distinct strategies tailored to the zero-shot and few-shot classification tasks; the choice between them depends on whether the object is compared against features that carry an explicit semantic description.
PointBLIP boosts the baseline's performance and even surpasses state-of-the-art models. In zero-shot point cloud classification, PointBLIP surpasses state-of-the-art methods by 1% to 3% on three benchmark datasets, including the synthetic ModelNet dataset and the real-scan ScanObjectNN dataset. In few-shot point cloud classification, PointBLIP improves on other VLP-based methods by approximately 20% under similar conditions. To the best of our knowledge, we propose the first zero-training few-shot point cloud classification network. Notably, despite requiring no training, PointBLIP remains comparable to fully trained few-shot classification models on two standard datasets, ModelNet40-FS and ShapeNet70-FS.
Our contributions are summarized as follows:
We introduce PointBLIP, a zero-training point cloud classification network based on a visual-language pre-trained model.
We improve the input data quality for VLP-based methods. By employing ray tracing rendering, we narrow the gap between point cloud and image data. Additionally, we introduce a shape-specific textual prompt-generation method.
We employ distinct feature-measurement strategies tailored to zero-shot and few-shot classification tasks. A Max-Max-Similarity strategy entails comparing the similarities between images and prompts for zero-shot classification tasks, while a Max-Min-Similarity strategy compares the similarities between point cloud images and example images for few-shot classification tasks.
Comprehensive experiments conducted on benchmark datasets demonstrate that PointBLIP surpasses state-of-the-art performance in both zero-shot and few-shot classification tasks. Moreover, in the few-shot setting, PointBLIP remains comparable to fully trained few-shot classification methods.
3. Method
In Section 3.1, we revisit Bootstrapping Language–Image Pre-training with frozen unimodal models (BLIP-2) and present the essential components upon which PointBLIP relies. In Section 3.2, we describe the methods employed to enhance input data quality. In Section 3.3, we explain how PointBLIP performs zero-shot point cloud classification. Finally, in Section 3.4, we elaborate on how PointBLIP conducts few-shot point cloud classification.
3.1. A Revisit of BLIP-2
BLIP-2 is a versatile and computationally efficient vision–language pre-training method that leverages off-the-shelf pre-trained vision and language models [14]. It comprises two stages: the vision-and-language representation stage and the vision-to-language generative stage. The former aligns image and text representations, while the latter generates linguistic interpretations of images. In this study, we primarily exploit the cross-modal capabilities of the vision-and-language representation stage, which serves as a feature encoder.
We describe the feature-encoding process at inference time in detail. For each image, the BLIP-2 image encoder extracts a fixed number of encoded features instead of a single feature. The image encoder takes a fixed number of learnable query embeddings as input, which interact with image features from the frozen CLIP [12] image encoder through cross-attention layers. It then outputs the set of learned queries as the feature representation of the image. It is worth noting that, unlike traditional image encoders, BLIP-2 encodes an image into multiple features, each representing a semantic aspect of the image. The image-encoding process can be formulated as

$F = \mathrm{E}_{\mathrm{img}}(I) \in \mathbb{R}^{n \times c}$, (1)

where I is an input image, n is the number of queries, and c is the embedding dimension.
In contrast, the text encoder encodes all words into output embeddings but uses only the [CLS] token as a single classification feature. The text-encoding process can be formulated as

$f = \mathrm{E}_{\mathrm{text}}(T) \in \mathbb{R}^{c}$, (2)

where T is a descriptive sentence for the corresponding image.
We construct the PointBLIP network from the image and text encoders of BLIP-2. The fundamental distinction from previous work is that we encode one image into multiple features, whereas previous work encodes one image into a single feature. We adopt this approach for the following reasons:
- (1)
Enhancement in feature descriptive capabilities. Encoding into multiple features is advantageous for extracting more extensive and comprehensive information from an image. Multiple features imply that the encoder interprets the image from different semantic perspectives.
- (2)
Advantageous for filtering out irrelevant information. Since the interpretations of multiple features differ, there is an opportunity to independently extract the features of interest while filtering out irrelevant ones. In contrast, the traditional image-encoding process encodes all image information, including noise, into a single feature.
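For concreteness, the following is a minimal sketch of extracting these multi-feature image representations and single [CLS] text features with BLIP-2. It assumes the LAVIS reference implementation (the model name blip2_feature_extractor, the extract_features interface, and the printed shapes follow the LAVIS feature-extractor examples and should be verified against the installed version); the image path is a hypothetical placeholder.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess  # assumed LAVIS interface

device = "cuda" if torch.cuda.is_available() else "cpu"
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor", model_type="pretrain", is_eval=True, device=device
)

# One rendered point cloud view (hypothetical file name) and one textual prompt.
raw_image = Image.open("airplane_view0.png").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
text = txt_processors["eval"]("an elongated, winged structure with a tail")

# Image side: n learned queries, each a c-dimensional embedding (Equation (1)).
image_feats = model.extract_features({"image": image}, mode="image").image_embeds_proj
print(image_feats.shape)  # expected [1, n, c], e.g., [1, 32, 256]

# Text side: keep only the [CLS] token as the classification feature (Equation (2)).
text_out = model.extract_features({"text_input": [text]}, mode="text")
cls_feat = text_out.text_embeds_proj[:, 0, :]  # [1, c]
```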
3.2. Prompting for Image and Text
To address the domain gap of projected point cloud images and to generate meaningful textual prompts, we introduce two approaches for constructing the input images and textual prompts used by our method.
3.2.1. Ray Tracing for Point Cloud
Despite existing methods for improving image quality [11], projecting point clouds as depth maps still leaves a domain gap between point cloud renderings and the images used to train VLP models. Since VLP training data predominantly originate from the real world, we contend that point clouds should be transformed into stereoscopic, clearly outlined 2D shapes. We therefore introduce a ray tracing method to render point clouds into simulated images.
In this process, each point in the point cloud is represented as a white sphere of radius r whose surface diffusely reflects incident rays. We use parallel, inclined light sources to illuminate the spheres. To enhance the clarity of the complete outline, rays undergo a finite number of diffuse reflections on the spheres. For each point cloud object, we render images from four different perspectives around the object to obtain a comprehensive view. A comparison between the projection method and our rendering method is shown in Figure 1.
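As an illustration of this setup, the following sketch builds the per-point sphere scene and the four surrounding camera positions described above; the actual diffuse-reflection rendering is delegated to a ray tracer, and the radius, camera distance, and elevation values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
import open3d as o3d

def build_sphere_scene(points: np.ndarray, radius: float = 0.02) -> o3d.geometry.TriangleMesh:
    """Represent every point as a small white sphere, as in Section 3.2.1."""
    scene = o3d.geometry.TriangleMesh()
    for p in points:
        sphere = o3d.geometry.TriangleMesh.create_sphere(radius=radius, resolution=8)
        sphere.translate(p)
        scene += sphere
    scene.compute_vertex_normals()
    scene.paint_uniform_color([1.0, 1.0, 1.0])  # white, diffusely reflecting surfaces
    return scene

def four_view_cameras(center: np.ndarray, distance: float = 2.5, elevation: float = 0.5):
    """Four camera positions evenly spaced around the object (illustrative values)."""
    return [
        center + distance * np.array([np.cos(a), np.sin(a), elevation])
        for a in np.deg2rad([0.0, 90.0, 180.0, 270.0])
    ]

# points = np.load("chair.npy")                 # hypothetical (N, 3) point cloud
# mesh = build_sphere_scene(points)
# cameras = four_view_cameras(points.mean(axis=0))
# Each (mesh, camera) pair is then rendered by a path/ray tracer under a parallel,
# inclined light source to produce one of the four simulated images.
```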
3.2.2. Comparative Textual Prompts
Taking inspiration from PointCLIPv2 [11], we use the large language model GPT-3 [39] to generate 3D-specific text with category-wise shape characteristics as textual input. Since raw point clouds lack texture information, we argue that the textual input should distinguish categories by shape. We introduce two rules into the generation command to obtain distinctive descriptive sentences: (1) specify all categories when providing the command to GPT-3; (2) request that GPT-3 give the most distinctive shape description.
Figure 1.
Visualization results comparing projection and ray tracing on the ModelNet dataset. The visualizations on the left, with white backgrounds, depict the outcomes obtained through realistic projection in PointCLIPv2, whereas those on the right showcase our visualizations utilizing ray tracing. The point cloud images generated through ray tracing exhibit a closer resemblance to the visual style observed in the real-world scene.
An example of generating a textual prompt for the airplane category in the ModelNet dataset is illustrated as follows:
Question: The following object categories: airplane, bathtub, bed… (list all category names in ModelNet). Describe the shape differences between the airplane and other categories in one short sentence.
GPT-3 Answering: The airplane stands out with its elongated, winged structure and tail, distinctly different from the predominantly static and boxy forms of the other categories.
In our work, we generate a set of descriptive sentences for each category as textual prompts, and all sentences are used when classifying that category. Using “CLASS” as a placeholder for the name of a target category, we adopt the following three sentence-generation templates (a prompt-construction sketch follows the list):
- (1)
CLASS.
- (2)
Answering from GPT-3: The following object categories: … (list all category names). Describe the shape differences between the CLASS and other categories in one short sentence.
- (3)
Answering from GPT-3: The following object categories: … (list all category names). Use one sentence to describe. In what aspects does CLASS look different from other categories in terms of shape?
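The templates can be assembled programmatically; the sketch below shows one way to do so, where query_llm is a hypothetical placeholder for whichever LLM client issues the GPT-3 request and returns its one-sentence answer.

```python
from typing import Callable, List

def build_prompt_set(cls: str, all_classes: List[str],
                     query_llm: Callable[[str], str]) -> List[str]:
    """Assemble the three textual-prompt templates from Section 3.2.2 for one category."""
    category_list = ", ".join(all_classes)
    q2 = (f"The following object categories: {category_list}. "
          f"Describe the shape differences between the {cls} and other categories "
          f"in one short sentence.")
    q3 = (f"The following object categories: {category_list}. "
          f"Use one sentence to describe. In what aspects does {cls} look different "
          f"from other categories in terms of shape?")
    # Template (1) is the bare class name; (2) and (3) are the GPT-3 answers.
    return [cls, query_llm(q2), query_llm(q3)]

# Example (hypothetical): prompts = build_prompt_set("airplane", modelnet_classes, query_llm)
```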
3.3. Zero-Shot Point Cloud Classification
The zero-shot point cloud classification process is illustrated in Figure 2. Following the data-prompting methods detailed in Section 3.2, we generate simulated point cloud images from V observation perspectives and textual prompts for L target categories. The BLIP-2 image encoder produces n learned queries for each perspective image, so the image feature set $\mathcal{F} = \{F_1, \dots, F_V\}$ (where $F_v \in \mathbb{R}^{n \times c}$) is extracted from one point cloud, with c the embedding dimension. For the textual branch, assuming each category contains q textual prompts, the text feature set $\mathcal{W} = \{W_1, \dots, W_L\}$ (where $W_l \in \mathbb{R}^{q \times c}$) is extracted from the textual prompts of all categories. Our objective is to determine the most likely category for the source point cloud.
Previous methods typically compare the cosine similarity between single, aggregated features of images and text; e.g., PointCLIPv2 calculates the final zero-shot classification logits by weighted-summing the multi-view image features into a single feature, formulated as

$\mathrm{logits} = \sum_{v=1}^{V} \alpha_v\, f_v\, W_t^{\top}$, (3)

where $f_v \in \mathbb{R}^{c}$ is the single image feature of view v, $W_t \in \mathbb{R}^{L \times c}$ stacks the single text feature of each category, and $\alpha_v$ represents the view weights. However, our approach contains an additional dimension for feature comparison: the multi-feature sets $\mathcal{F}$ and $\mathcal{W}$ cannot be directly used to calculate similarity following Equation (3), and $\alpha_v$ may cause view weight confusion without prior knowledge.
We take two steps to address this issue. First, we establish a minimal unit, called a feature grid, between the image feature set $\mathcal{F}$ and the text feature set $\mathcal{W}$. For the zero-shot classification task, we define the feature grid as a matrix of cosine similarities between the image features of one perspective and all text features of one category, formulated as

$S_{v,l} = \hat{F}_v\, \hat{W}_l^{\top} \in \mathbb{R}^{n \times q}$, (4)

where $\hat{F}_v$ and $\hat{W}_l$ denote the L2-normalized features, so the feature grid is a cosine similarity matrix. For the L-category classification task, each point cloud generates $V \times L$ feature grids. We employ a strategy called Max-Max-Similarity to calculate a similarity from each feature grid; the process is illustrated in Figure 3a. Max-Max-Similarity takes the maximum value over the rows and columns of the feature grid matrix, which serves as the basis for the next classification step. Its aim is to provide the maximum similarity level between a simulated point cloud image and a specific category.

Second, we form a larger similarity matrix G from the similarity values of the feature grids to obtain the most relevant category, which can be formulated as

$G \in \mathbb{R}^{V \times L}, \quad G_{v,l} = \max_{i,j}\, (S_{v,l})_{ij}$. (5)

We search for the maximum value in matrix G and take the category corresponding to this maximum value as the classification result, formulated as

$\hat{l} = \underset{l}{\arg\max}\; \max_{v}\, G_{v,l}$, (6)

where $\arg\max_{l}$ represents the function searching for the category index of the maximum value in the matrix.
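To make the zero-shot procedure concrete, here is a minimal NumPy sketch under the assumption that the multi-view image features (V, n, c) and per-category text features (L, q, c) have already been extracted; array shapes and helper names are ours, not the paper's.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def zero_shot_classify(F: np.ndarray, W: np.ndarray) -> int:
    """F: (V, n, c) multi-view image features; W: (L, q, c) per-category text features."""
    F, W = l2_normalize(F), l2_normalize(W)
    V, L = F.shape[0], W.shape[0]
    G = np.empty((V, L))
    for v in range(V):
        for l in range(L):
            grid = F[v] @ W[l].T        # feature grid of cosine similarities, Equation (4)
            G[v, l] = grid.max()        # Max-Max-Similarity, Equation (5)
    return int(G.max(axis=0).argmax())  # category of the global maximum in G, Equation (6)

# Example with stand-in features (V=4 views, n=32 queries, c=256, L=40 classes, q=3 prompts):
# pred = zero_shot_classify(np.random.randn(4, 32, 256), np.random.randn(40, 3, 256))
```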
3.4. Few-Shot Point Cloud Classification
The process of few-shot point cloud classification is illustrated in
Figure 4. In this scenario, a limited number of point clouds from “unseen” categories are labeled as a reference, and our method aims to recognize new, unlabeled query point clouds under this few-shot condition. The labeled example point clouds constitute the support set
, encompassing
N classes with
K examples for each class, where
represents an example point cloud. Our objective is to identify unlabeled point clouds based on the support set
.
For an unlabeled query point cloud, we generate simulated images from V observation perspectives and extract image features with the BLIP-2 image encoder as $\mathcal{F}^{Q} = \{F^{Q}_1, \dots, F^{Q}_V\}$ (where $F^{Q}_v \in \mathbb{R}^{n \times c}$). Simultaneously, the example images generated from the support set are encoded into $\mathcal{F}^{E} = \{F^{E}_1, \dots, F^{E}_M\}$ (where $F^{E}_m \in \mathbb{R}^{n \times c}$ and $M = N \times K \times V$) following the same process. In the traditional few-shot classification procedure, we would calculate the feature similarity between query images and example images for matching. However, the BLIP-2 image-encoding process introduces a challenge: since each query and example image produces n features instead of a single feature, we must measure a single similarity between two sets of image features. Some image features may describe irrelevant information, such as background and texture, which is unsuitable for comparison, and there is no explicit way to determine which feature captures the category-discriminative characteristics that benefit the comparison.
To address this issue, we again establish a feature grid to compare multiple image features. In the few-shot classification task, the construction of the feature grid and the measurement strategy differ from the zero-shot method in Section 3.3. This process rests on the following assumption: if two objects belong to the same category, the similarity of their most challenging-to-match feature will still be higher than for objects of other categories.
Specifically, we define this feature grid as a matrix of cosine similarities between query and example image features, formulated as

$S_{v,m} = \hat{F}^{Q}_v\, (\hat{F}^{E}_m)^{\top} \in \mathbb{R}^{n \times n}$. (7)

For each query point cloud in the classification process, $V \times M$ feature grids can be constructed. We calculate one similarity from each feature grid, representing the matching degree between the query image and the example image, using a Max-Min-Similarity strategy; the process is illustrated in Figure 3b. We first calculate the maximum value of each row of the feature grid, representing the maximum matching level between each query image feature and the example image features. We then take the minimum over this collection of row maxima, representing the maximum similarity level of the most challenging-to-match feature, and treat this value as the similarity between the query image and the example image.
Next, we use the similarities from the feature grids to form a matrix G reflecting the similarities between all query images and all example images. The whole process can be formulated as

$G \in \mathbb{R}^{V \times M}, \quad G_{v,m} = \min_{i}\, \max_{j}\, (S_{v,m})_{ij}$. (8)

Finally, we search for the maximum value in matrix G and take the category corresponding to this maximum value as the classification result, which can be formulated as

$\hat{y} = \mathrm{Cat}\big(\underset{m}{\arg\max}\; \max_{v}\, G_{v,m}\big)$, (9)

where $\mathrm{Cat}(\cdot)$ represents the function returning the category of the example corresponding to the maximum value in G.
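Analogously to the zero-shot sketch above, here is a minimal NumPy sketch of the few-shot matching, assuming query features of shape (V, n, c), example features of shape (M, n, c), and an array giving the category label of each example image; names and shapes are ours.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def few_shot_classify(FQ: np.ndarray, FE: np.ndarray, example_labels: np.ndarray) -> int:
    """FQ: (V, n, c) query features; FE: (M, n, c) example features; example_labels: (M,)."""
    FQ, FE = l2_normalize(FQ), l2_normalize(FE)
    V, M = FQ.shape[0], FE.shape[0]
    G = np.empty((V, M))
    for v in range(V):
        for m in range(M):
            grid = FQ[v] @ FE[m].T             # feature grid, Equation (7): (n, n) similarities
            G[v, m] = grid.max(axis=1).min()   # Max-Min-Similarity, Equation (8)
    best_example = int(G.max(axis=0).argmax()) # example image with the highest similarity
    return int(example_labels[best_example])   # its category is the prediction, Equation (9)

# Example (hypothetical shapes for a 5-way 1-shot task rendered from 4 views):
# pred = few_shot_classify(np.random.randn(4, 32, 256), np.random.randn(20, 32, 256), labels)
```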