3.1. Methods
We begin by introducing the task of conventional 3D scene semantic segmentation (C3DS) and that of zero-shot 3D scene semantic segmentation (Z3DS). First, the mathematical symbols and notions used in this work are listed in Table 1 to facilitate the subsequent description. Generally speaking, the task of C3DS is to classify each point in a 3D point cloud scene into a semantic label belonging to a closed set fixed by the task. Due to scene complexity, accurate classification of points requires a deep understanding of the scene context. Note that the closed-set assumption in C3DS is somewhat impractical, since realistic scenes often contain classes outside this closed set. Hence, the goal of Z3DS is to enable models to deal with unseen classes. In Z3DS, models are trained on a set of classes (namely, seen classes) that have labeled training data, and are then required to classify testing data belonging to another set of classes (namely, unseen classes). Since the unseen classes form an open set, Z3DS models can handle more realistic scenes. A more formal definition is given below.
Suppose we are given a 3D point cloud scene set at the training stage; for the sake of clarity, it can also be represented as a 3D point set $\mathcal{D}^{tr}=\{(x_i,y_i)\}_{i=1}^{N}$, where $x_i$ is a point in a 3D point cloud scene, $y_i$ is the label of $x_i$, belonging to a seen class label set $Y^s$, and $N$ is the number of points in the dataset. Usually, $x_i$ is a $(3+a)$-D feature vector with 3-D X-Y-Z Euclidean coordinates and $a$-D additional features (e.g., 3-D R-G-B color features). In the deep learning community, a feature extractor $F$ is usually used to transform the original $(3+a)$-D point feature vectors into higher-level point feature vectors. Hence, in this paper, $x_i$ is also used to refer to the feature vector obtained after processing by the feature extractor $F$. Suppose we are additionally given a semantic feature set $\mathcal{E}=\{e_y \mid y\in Y\}$, where $e_y$ is the semantic feature of class $y$ and $Y$ is a class label set that includes not only the seen class label set $Y^s$, but also an unseen class label set $Y^u$, where $Y^u$ has no intersection with $Y^s$ (i.e., $Y=Y^s\cup Y^u$ and $Y^s\cap Y^u=\emptyset$). At the testing stage, the task is to segment the testing 3D point cloud scenes by classifying each point in these scenes, i.e., classifying a corresponding 3D point set, here denoted by $\mathcal{D}^{te}$. We then have the following definitions:
C3DS: Suppose the labels of the points in the testing dataset $\mathcal{D}^{te}$ belong to the seen classes $Y^s$; given the training dataset $\mathcal{D}^{tr}$ and the semantic feature set $\mathcal{E}$, the goal is to learn a mapping $f_{C3DS}:\mathcal{D}^{te}\rightarrow Y^s$.
Z3DS: Suppose the labels of the points in the testing dataset $\mathcal{D}^{te}$ belong to the unseen classes $Y^u$; given the training dataset $\mathcal{D}^{tr}$ and the semantic feature set $\mathcal{E}$, the goal is to learn a mapping $f_{Z3DS}:\mathcal{D}^{te}\rightarrow Y^u$.
Generalized Z3DS (GZ3DS): Suppose the labels of the points in the testing dataset $\mathcal{D}^{te}$ belong to the whole class set $Y$; given the training dataset $\mathcal{D}^{tr}$ and the semantic feature set $\mathcal{E}$, the goal is to learn a mapping $f_{GZ3DS}:\mathcal{D}^{te}\rightarrow Y$.
Note that, since the 3D scene semantic segmentation problem is actually a point-level classification problem, we represent the point cloud set as a corresponding point set, so that segmenting 3D scenes is equivalent to classifying 3D points. C3DS assumes that testing points belong only to the seen class label set $Y^s$ (i.e., classes with training samples); that is to say, C3DS is unable to handle the novel class objects that often appear in realistic scenarios. Z3DS assumes we have the prior that testing points belong only to the unseen class label set $Y^u$ (i.e., classes without training samples). Although this prior is not always available in practice, Z3DS demonstrates, in principle, the ability of a model to infer unseen class objects, and it is of practical value with the aid of additional modules (such as a novel class detector). GZ3DS deals with seen classes and unseen classes simultaneously, which is a more practical setting.
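To make the difference between the three settings concrete, the following minimal sketch (with hypothetical class names) shows that only the label space over which testing points are classified changes:

```python
# A minimal sketch of the three evaluation settings; the class names are
# illustrative, not the actual benchmark splits used in this work.
SEEN_CLASSES = ["wall", "floor", "chair", "table"]        # Y^s: labeled at training time
UNSEEN_CLASSES = ["sofa", "bookcase"]                      # Y^u: no labeled training points

LABEL_SPACE = {
    "C3DS":  SEEN_CLASSES,                                 # closed set: test points are from Y^s
    "Z3DS":  UNSEEN_CLASSES,                               # test points are from Y^u only
    "GZ3DS": SEEN_CLASSES + UNSEEN_CLASSES,                # test points may come from all of Y
}
```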
In this paper, we propose the SeCondPoint framework to tackle all three kinds of 3D point cloud semantic segmentation tasks above; it utilizes language-level semantics to condition the modeling of the point feature distribution and the point feature generation. As shown in Figure 2, the proposed SeCondPoint consists of three parts: a backbone segmentation network for extracting point features from input 3D point clouds, a language-level semantics-conditioned feature-modeling network for learning the point feature distribution, and a semantics-enhanced feature classifier for classifying both seen class and unseen class points in 3D point clouds for semantic segmentation. By sampling an arbitrary number of point features from the learned feature distribution, the semantics-enhanced feature classifier can be well trained to segment both seen class and unseen class points. We point out that any existing (or newly proposed) segmentation network can serve as the backbone network under the SeCondPoint framework. Since our main goal is to propose the novel SeCondPoint framework, rather than a novel segmentation network, we simply adopt existing segmentation networks as backbones here. In the following, we describe the language-level semantics-conditioned feature-modeling network and the semantics-enhanced feature classifier, respectively.
3.1.1. Language-Level Semantics-Conditioned Feature-Modeling Network
Conditional generative adversarial networks (CGANs) have recently received much attention due to their excellent ability to model conditional data distributions. They generally consist of a generator and a discriminator: the generator synthesizes new samples given a condition and a Gaussian noise, and the discriminator is used to adversarially train the generator so that the generated data follow the desired conditional distribution. Here, we adopt this idea to model 3D point features conditioned on language-level semantics. Suppose the point features of input point clouds are given by a segmentation backbone; our goal is to learn a point feature conditional generative network that models the conditional distribution of point features given the semantic features. To this end, we first introduce language-level semantic features of both seen and unseen (novel) object classes. There exist many models for extracting language-level semantics in the literature [53,54,55]; here, we use the semantic embeddings of the object class names extracted by the existing language model [53], considering that our current goal is to show the usefulness of semantics-conditioned feature modeling in 3D scene segmentation, rather than to compare different semantics.
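As a rough sketch of how such class-name semantics can be obtained, the snippet below assumes a word2vec-style 300-D embedding model loaded through gensim; the specific language model of reference [53] may differ.

```python
# Sketch: 300-D language-level class semantics from pretrained word embeddings.
# The choice of "word2vec-google-news-300" is an assumption for illustration.
import numpy as np
import gensim.downloader as api

word_vectors = api.load("word2vec-google-news-300")

def class_semantic(name: str) -> np.ndarray:
    # Average word vectors for multi-word class names (e.g., "coffee table").
    tokens = [t for t in name.lower().split() if t in word_vectors]
    return np.mean([word_vectors[t] for t in tokens], axis=0)

class_names = ["wall", "floor", "chair", "sofa", "bookcase"]   # illustrative classes
semantic_features = {c: class_semantic(c) for c in class_names}
```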
The architecture of the feature conditional generative network is shown in Figure 2, where the input of the generator is the concatenation of a semantic feature of an object class and a Gaussian noise, the output of the generator is a synthesized point feature of the corresponding object class, and the discriminator discriminates the real point features extracted by the backbone network from those produced by the generator. Formally, suppose we are given a backbone network $F$, a generator $G$, and a discriminator $D$. For a given input 3D point with its corresponding class label $y$, we first extract the point feature $x$ of the input 3D point with the backbone network $F$. Then, the generator $G$ is used to generate a fake point feature $\tilde{x}$ conditioned on the corresponding semantic feature $e_y$, and the discriminator $D$ is used to discriminate the generated fake point feature $\tilde{x}$ from the real point feature $x$. In order to learn a point feature distribution that cannot be distinguished from the real one by the discriminator, the generator is adversarially trained against the discriminator under the framework of the Wasserstein generative adversarial network (WGAN) [56] as follows:

$$\min_{G}\max_{D}\;\mathbb{E}_{(x,y)\sim\mathcal{F}^{tr}}\big[D(x,e_y)\big]-\mathbb{E}\big[D(\tilde{x},e_y)\big]-\lambda\,\mathbb{E}\big[\big(\|\nabla_{\hat{x}}D(\hat{x},e_y)\|_{2}-1\big)^{2}\big],\qquad(1)$$

where $\mathcal{F}^{tr}$ is the labeled training point feature set extracted by the backbone network from the training 3D point clouds. The first two terms in (1) are the original objectives of WGAN, which aim to minimize the Wasserstein distance between the distribution of the generated point features and that of the real point features, and $\tilde{x}$ is generated by the generator $G$ conditioned on the corresponding semantic feature $e_y$ and a standard Gaussian noise $z$, i.e., $\tilde{x}=G(e_y,z)$. The third term in (1) is the gradient penalty for the discriminator, where $\hat{x}=\alpha x+(1-\alpha)\tilde{x}$, with $\alpha$ sampled from a uniform distribution, i.e., $\alpha\sim U(0,1)$, is used to estimate the gradients, and $\lambda$ is a hyper-parameter weighting the gradient penalty term, usually set to 10, as suggested in [56]. By optimizing this min–max objective, the generator finally learns the conditional distribution of the real point features of each class given its semantic feature. In other words, an arbitrary number of point features for each class can be synthesized by sampling from the learned conditional distribution.
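A minimal PyTorch-style sketch of one such adversarial update is given below, assuming simple fully connected `G` and `D` modules that take concatenated inputs (as described in Section 3.1.3) and a noise dimensionality equal to the semantic dimensionality; names such as `gp_weight` are illustrative, not the authors' implementation.

```python
import torch

def wgan_gp_step(G, D, x_real, sem, opt_G, opt_D, gp_weight=10.0):
    """One adversarial update for the semantics-conditioned feature generator.
    x_real: (B, d) real point features from the backbone (assumed detached);
    sem: (B, s) class semantic features."""
    z = torch.randn(x_real.size(0), sem.size(1))             # standard Gaussian noise
    x_fake = G(torch.cat([sem, z], dim=1))                   # fake features, x~ = G(e_y, z)

    # --- discriminator (critic) update ---
    alpha = torch.rand(x_real.size(0), 1)                    # alpha ~ U(0, 1)
    x_hat = (alpha * x_real.detach() + (1 - alpha) * x_fake.detach()).requires_grad_(True)
    d_hat = D(torch.cat([x_hat, sem], dim=1))
    grad = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    gp = ((grad.norm(2, dim=1) - 1) ** 2).mean()             # gradient penalty term of Eq. (1)
    d_loss = (D(torch.cat([x_fake.detach(), sem], dim=1)).mean()
              - D(torch.cat([x_real, sem], dim=1)).mean() + gp_weight * gp)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- generator update ---
    g_loss = -D(torch.cat([x_fake, sem], dim=1)).mean()      # push generated features toward real
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```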
To further improve the feature distribution modeling of the proposed generative network, we propose a feature-geometry-based Mixup training approach, which increases the discriminability between adjacent object classes in the feature space. Generally, in 3D scene semantic segmentation, feature learning is mainly driven by geometrical information; that is, geometrically adjacent points generally have similar features in the learned feature space. Due to this feature similarity between geometrically adjacent object classes, confusions often arise among such classes. Inspired by Mixup [16], we design a model-learning approach as follows. Given a point feature dataset produced by the backbone, we first compute the class feature centers $\{\mu_c\}_{c=1}^{C}$, where $\mu_c=\frac{1}{N_c}\sum_{y_i=c}x_i$ is the class feature center of class $c$, $N_c$ is the number of point features belonging to class $c$, and $C$ is the number of classes. Then, we compute the Euclidean similarity matrix $A$ between these feature centers, which represents the semantic similarity between classes. Next, we mix up semantically similar samples as follows: given a point feature $x_i$ from class $c$, we first randomly choose one of the closest $I$ classes to class $c$ according to the similarity matrix $A$, denoted by $c'$; then, we sample a point feature from class $c'$, denoted by $x_j$; finally, an intermediate feature sample is synthesized by interpolating between $x_i$ and $x_j$ with a scalar $\beta$ sampled from the uniform distribution $U(0,1)$ as:

$$x_{mix}=\beta x_i+(1-\beta)x_j,\qquad(2)$$

where $x_{mix}$ is the synthesized point feature, $e_c$ and $e_{c'}$ are the corresponding semantic features of $x_i$ and $x_j$, and the mixed semantic feature $e_{mix}$ is synthesized in the same way as $x_{mix}$. According to (2), we can mix up a large number of samples that are located between two geometrically adjacent classes in the feature space. Finally, we add these interpolated samples to the original training set for training the feature generative network. With such training, the model is expected to better discriminate adjacent object classes. Here, we use a hyper-parameter, defined as the ratio of interpolated samples to real samples, to control the scale of interpolation. This procedure is summarized in Algorithm 1.
Algorithm 1 Feature-geometry-based Mixup training
Input: training point feature set $\mathcal{F}^{tr}$; initialized $G$ and $D$.
Output: trained $G$ and $D$.
1: Compute the class feature centers $\{\mu_c\}_{c=1}^{C}$ with $\mathcal{F}^{tr}$;
2: Compute the similarity matrix $A$ between class feature centers;
3: for n = 1 to N do
4:   Randomly sample a feature $x_i$ from a class $c$;
5:   Randomly select a class $c'$ from the closest $I$ classes to class $c$ according to $A$;
6:   Randomly sample a feature $x_j$ from class $c'$;
7:   Mix up a sample $(x_{mix}, e_{mix})$ with $x_i$ and $x_j$ based on Equation (2);
8: end for
9: Train $G$ and $D$ with the $N$ mixed samples combined with $\mathcal{F}^{tr}$;
10: return trained $G$ and $D$.
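A compact NumPy sketch of this Mixup procedure follows, assuming point features `X`, integer labels `y`, and per-class semantic vectors `E` are already available (all names are illustrative):

```python
import numpy as np

def geometry_mixup(X, y, E, num_mix, I=3, rng=np.random.default_rng(0)):
    """X: (N, d) point features; y: (N,) labels; E: (C, s) class semantics.
    Returns num_mix interpolated (feature, semantic) pairs following Eq. (2)."""
    C = E.shape[0]
    centers = np.stack([X[y == c].mean(axis=0) for c in range(C)])    # class feature centers
    dist = np.linalg.norm(centers[:, None] - centers[None], axis=-1)  # Euclidean distances
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :I]                         # I closest classes per class

    X_mix, E_mix = [], []
    for _ in range(num_mix):
        i = rng.integers(len(X))                                      # anchor feature from class c
        c = y[i]
        c2 = rng.choice(nearest[c])                                   # geometrically close class c'
        j = rng.choice(np.flatnonzero(y == c2))                       # partner feature from c'
        beta = rng.uniform()                                          # beta ~ U(0, 1)
        X_mix.append(beta * X[i] + (1 - beta) * X[j])                 # mixed point feature
        E_mix.append(beta * E[c] + (1 - beta) * E[c2])                # mixed semantic feature
    return np.stack(X_mix), np.stack(E_mix)
```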
3.1.2. Semantics-Enhanced Feature Classifier
Once the language-level semantics-conditioned feature-modeling network has been trained, a large number ($K$) of point features can be generated for each object class, conditioned on the corresponding semantic feature $e_y$ and $K$ different random noises $z$ sampled from the standard Gaussian distribution $\mathcal{N}(0,\mathbf{I})$. Specifically, we generate point features according to:

$$\tilde{x}=G(e_y,z),\quad z\sim\mathcal{N}(0,\mathbf{I}),\qquad(3)$$

where $\tilde{x}$ is the generated point feature. In the following, we describe the feature generation and classifier learning for the three different tasks, i.e., C3DS, Z3DS, and GZ3DS.
Conventional 3D scene semantic segmentation: In C3DS, the testing points are assumed to come only from seen classes. Hence, we generate a large number of point features for each seen class in $Y^s$, conditioned on the seen class semantic features, according to (3). The set of generated point features and corresponding labels is denoted by $\tilde{\mathcal{F}}^{s}$. Then, we train a semantics-enhanced classifier $f_{cls}$ (with parameters $\theta$) on $\tilde{\mathcal{F}}^{s}$ as follows:

$$\min_{\theta}\;\mathbb{E}_{(\tilde{x},y)\sim\tilde{\mathcal{F}}^{s}}\Big[L_{ce}\big(P(f_{cls}(\tilde{x};\theta)),y\big)\Big],\qquad(4)$$

where $L_{ce}$ and $P$ are a cross-entropy loss function and a softmax function, respectively. Note that $f_{cls}$ could be any classifier, not constrained to a linear classifier as in existing segmentation networks. After training the classifier, given a real testing point feature $x$, we predict its label $y$ by:

$$y=\arg\max_{c\in Y^s}\;P\big(f_{cls}(x;\theta)\big)_{c}.\qquad(5)$$

In sum, for a given testing 3D point cloud, we achieve point cloud semantic segmentation by classifying the feature of each 3D point extracted by the backbone network with the learned semantics-enhanced feature classifier.
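The following PyTorch-style sketch illustrates Equations (3)-(5) for the C3DS case, assuming a trained generator `G`, a matrix of seen-class semantics `E_seen`, and a linear classifier; the helper name, `K`, and the epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_semantics_enhanced_classifier(G, E_seen, feat_dim, K=500, epochs=10):
    """E_seen: (C_s, s) seen-class semantic features; K features are generated per class, Eq. (3)."""
    C_s, s = E_seen.shape
    with torch.no_grad():
        sem = E_seen.repeat_interleave(K, dim=0)              # (C_s*K, s) repeated semantics
        z = torch.randn(C_s * K, s)                           # standard Gaussian noise
        X_gen = G(torch.cat([sem, z], dim=1))                 # generated point features
    y_gen = torch.arange(C_s).repeat_interleave(K)            # pseudo-labels for generated features

    clf = nn.Linear(feat_dim, C_s)                            # Eq. (4): softmax + cross-entropy
    opt = torch.optim.Adam(clf.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(clf(X_gen), y_gen).backward()
        opt.step()
    return clf

# Eq. (5): prediction for real testing point features x (one backbone feature per row):
# pred = clf(x).argmax(dim=1)
```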
Zero-shot 3D scene semantic segmentation: Thanks to the language-level semantics-conditioned point feature modeling, the proposed framework has the flexibility to segment novel class objects in 3D point cloud scenes (covering both Z3DS and GZ3DS) whenever their corresponding semantics are available, which is impossible for existing segmentation networks. In Z3DS, the task is to classify only unseen class points. To this end, we sample a large number of point features for each unseen class in $Y^u$ from the learned conditional distributions, conditioned on the unseen class semantic features, according to (3). That is to say, the semantic features $e_y$ are taken from the unseen class label set $Y^u$, in contrast to C3DS, where the semantic features $e_y$ are taken from the seen class label set $Y^s$. Then, we train a semantics-enhanced classifier $f_{cls}$ in a similar way to (4) and classify the real testing unseen class points as in (5).

Generalized zero-shot 3D scene semantic segmentation: In GZ3DS, the testing points can come from either seen classes or unseen classes. Hence, according to (3), we generate a large number of point features for every class in $Y$, conditioned on all semantic features; the training of the semantics-enhanced classifier $f_{cls}$ and the classification of point features are the same as in C3DS and Z3DS. The only difference among C3DS, Z3DS, and GZ3DS is their conditioning classes and classification spaces, which in turn demonstrates the flexibility of the proposed framework.
3.1.3. Implementation Details
As for the backbone networks, in the C3DS tasks we validated the proposed SeCondPoint framework with three typical 3D point cloud segmentation backbones: DGCNN [45], RandLA-Net [9], and SCF-Net [36]. The architectures of these networks are exactly the same as in their original papers, since we directly used their public codes and models. We chose these three networks not only because their codes and models are publicly available and commonly used, but also because they are representative, classical networks. Specifically, DGCNN is a graph-based architecture that is usually used to process block-sized, small-scale point clouds, while RandLA-Net and SCF-Net are two different point-based networks that can directly handle large-scale point clouds. In the Z3DS tasks, we used DGCNN [45] and RandLA-Net [9] to validate the proposed SeCondPoint framework, given that their architectures are quite different. Other backbone networks can be adapted to Z3DS seamlessly in the same way.
In the feature generative network, the input of the feature generator is the concatenation of a semantic feature and a standard Gaussian noise, and its output is a synthesized point feature. It is implemented as a three-layer fully connected neural network (FCNN), whose input unit number and hidden unit number are 600 (300-D for the semantic feature and 300-D for the Gaussian noise) and 1024, respectively, in all tasks, and whose output unit number equals the point feature dimensionality of the corresponding backbone (DGCNN, RandLA-Net, or SCF-Net). The input of the feature discriminator is the concatenation of a semantic feature and a point feature, and its output is a real/fake discrimination score. It is also implemented as a three-layer FCNN, whose output unit number and hidden unit number are 1 and 1024, respectively, and whose input unit number equals the sum of the corresponding point feature dimensionality and the semantic feature dimensionality. The generator and the discriminator are adversarially trained for 20 epochs with a batch size of 32 and the Adam optimizer. After training the generative network, the generator is used to synthesize an arbitrarily large number of point features of a specified class from its semantic feature and standard Gaussian samples, which are then used for semantics-enhanced classifier training.
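A sketch of these two networks under the stated layer sizes is given below; the activation functions and the exact point feature dimensionality `feat_dim` depend on the backbone and are assumptions here.

```python
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Three-layer FCNN: (300-D semantics + 300-D noise) -> 1024 -> 1024 -> point feature."""
    def __init__(self, feat_dim, sem_dim=300, noise_dim=300, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, sem_and_noise):           # (B, 600) concatenated semantics and noise
        return self.net(sem_and_noise)

class FeatureDiscriminator(nn.Module):
    """Three-layer FCNN: (point feature + 300-D semantics) -> 1024 -> 1024 -> scalar score."""
    def __init__(self, feat_dim, sem_dim=300, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + sem_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_and_sem):            # (B, feat_dim + 300) concatenated input
        return self.net(feat_and_sem)
```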
As for the semantics-enhanced classifier, different tasks have different class spaces, e.g., GZ3DS classifies points over the whole class space, whereas Z3DS classifies points over the unseen class space only; however, they share the same architecture and implementation. The input and the output of the classifier are a point feature and the class logits, respectively. It is implemented as a one-layer FCNN, whose input unit number and output unit number are the point feature dimensionality of the corresponding backbone (DGCNN, RandLA-Net, or SCF-Net) and the class number determined by the task, respectively. The classifier is trained from scratch for 10 epochs with a batch size of 4096 and the Adam optimizer.