3.1. Methods
We begin by introducing the task of conventional 3D scene semantic segmentation (C3DS) and that of zero-shot 3D scene semantic segmentation (Z3DS). First, the mathematical symbols and notions used in this work are listed in Table 1 to facilitate the subsequent description. Generally speaking, the task of C3DS is to classify each point in a 3D point cloud scene into a semantic label belonging to a closed set fixed by the task. Due to scene complexity, accurate classification of points requires a deep understanding of the scene context. Note that the closed-set assumption in C3DS is somewhat impractical, since realistic scenes often contain classes outside this closed set. Hence, the goal of Z3DS is to enable models to deal with unseen classes. In Z3DS, models are trained on a set of classes (namely, seen classes) that have labeled training data, and are then required to classify testing data belonging to another set of classes (namely, unseen classes). Since the unseen classes form an open set, Z3DS models can handle more realistic scenes. A more formal definition is given below.
Suppose we are given a 3D point cloud scene set at the training stage; for the sake of clarity, it can also be represented as a 3D point set $\mathcal{D}^{tr}=\{(x_i,y_i)\}_{i=1}^{N}$, where $x_i$ is a point in a 3D point cloud scene, $y_i$ is the label of $x_i$, belonging to a seen class label set $Y^s$, and $N$ is the number of points in the dataset. Usually, $x_i$ is a $(3+a)$-D feature vector with 3-D X-Y-Z Euclidean coordinates and $a$-D additional features (e.g., 3-D R-G-B color features). In the deep learning community, a feature extractor $F$ is usually used to transform the original $(3+a)$-D point feature vectors into higher-level point feature vectors. Hence, in this paper, $x_i$ is also used to refer to the feature vector obtained after processing by the feature extractor $F$. Suppose we are additionally given a semantic feature set $\mathcal{E}=\{e_y \mid y\in Y\}$, where $e_y$ is the semantic feature of class $y$ and $Y$ is a class label set that includes not only the seen class label set $Y^s$, but also an unseen class label set $Y^u$, where $Y^u$ has no intersection with $Y^s$ (i.e., $Y=Y^s\cup Y^u$ and $Y^s\cap Y^u=\emptyset$). At the testing stage, the task is to segment the testing 3D point cloud scenes by classifying each point in these scenes, i.e., classifying a corresponding 3D point set, here denoted by $\mathcal{D}^{te}$. We then have the following definitions:
C3DS: Suppose the labels of the points in the testing dataset $\mathcal{D}^{te}$ belong to the seen classes $Y^s$; given the training dataset $\mathcal{D}^{tr}$ and the semantic feature set $\mathcal{E}$, the goal is to learn a mapping $f_{C3DS}:\mathcal{D}^{te}\rightarrow Y^s$.
Z3DS: Suppose the labels of the points in the testing dataset $\mathcal{D}^{te}$ belong to the unseen classes $Y^u$; given the training dataset $\mathcal{D}^{tr}$ and the semantic feature set $\mathcal{E}$, the goal is to learn a mapping $f_{Z3DS}:\mathcal{D}^{te}\rightarrow Y^u$.
Generalized Z3DS (GZ3DS): Suppose the labels of the points in the testing dataset $\mathcal{D}^{te}$ belong to the whole class set $Y$; given the training dataset $\mathcal{D}^{tr}$ and the semantic feature set $\mathcal{E}$, the goal is to learn a mapping $f_{GZ3DS}:\mathcal{D}^{te}\rightarrow Y$.
Note that, since the 3D scene semantic segmentation problem is actually a point-level classification problem, we represent the point cloud set as a corresponding point set, so that segmenting 3D scenes is equivalent to classifying 3D points. C3DS assumes that testing points belong only to the seen class label set $Y^s$ (i.e., classes with training samples); that is to say, C3DS is unable to handle the novel class objects that often appear in realistic scenarios. Z3DS assumes we have the prior that testing points belong only to the unseen class label set $Y^u$ (i.e., classes without training samples). Although this prior is not always available in practice, Z3DS demonstrates, in principle, the ability of a model to infer unseen class objects, and it is of practical value with the aid of additional modules (such as a novel class detector). GZ3DS deals with seen classes and unseen classes simultaneously, which is a more practical setting.
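To make the difference between the three settings concrete, the following minimal sketch (with hypothetical class names) shows that only the label space over which testing points are classified changes:

```python
# A minimal sketch of the three evaluation settings; the class names are
# illustrative, not the actual benchmark splits used in this work.
SEEN_CLASSES = ["wall", "floor", "chair", "table"]        # Y^s: labeled at training time
UNSEEN_CLASSES = ["sofa", "bookcase"]                      # Y^u: no labeled training points

LABEL_SPACE = {
    "C3DS":  SEEN_CLASSES,                                 # closed set: test points are from Y^s
    "Z3DS":  UNSEEN_CLASSES,                               # test points are from Y^u only
    "GZ3DS": SEEN_CLASSES + UNSEEN_CLASSES,                # test points may come from all of Y
}
```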
In this paper, we propose the SeCondPoint framework to tackle all three kinds of 3D point cloud semantic segmentation tasks above; it utilizes language-level semantics to condition the modeling of the point feature distribution and the point feature generation. As shown in Figure 2, the proposed SeCondPoint consists of three parts: a backbone segmentation network for extracting point features from input 3D point clouds, a language-level semantics-conditioned feature-modeling network for learning the point feature distribution, and a semantics-enhanced feature classifier for classifying both seen class and unseen class points in 3D point clouds for semantic segmentation. By sampling an arbitrary number of point features from the learned feature distribution, the semantics-enhanced feature classifier can be well trained to segment both seen class and unseen class points. We point out that any existing (or newly proposed) segmentation network can serve as the backbone network under the SeCondPoint framework. Since our main goal is to propose the novel SeCondPoint framework, rather than a novel segmentation network, we simply adopt existing segmentation networks as backbones here. In the following, we describe the language-level semantics-conditioned feature-modeling network and the semantics-enhanced feature classifier, respectively.
3.1.1. Language-Level Semantics-Conditioned Feature-Modeling Network
Conditional generative adversarial networks (CGANs) have recently received much attention due to their excellent ability to model conditional data distributions. They generally consist of a generator and a discriminator: the generator synthesizes new samples given a condition and a Gaussian noise, and the discriminator is used to adversarially train the generator so that the generated data follow the desired conditional distribution. Here, we adopt this idea to model 3D point features conditioned on language-level semantics. Suppose the point features of input point clouds are given by a segmentation backbone; our goal is to learn a point feature conditional generative network that models the conditional distribution of point features given the semantic features. To this end, we first introduce language-level semantic features of both seen and unseen (novel) object classes. There exist many models for extracting language-level semantics in the literature [53,54,55]; here, we use the semantic embeddings of the object class names extracted by the existing language model [53], considering that our current goal is to show the usefulness of semantics-conditioned feature modeling in 3D scene segmentation, rather than to compare different semantics.
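As a rough sketch of how such class-name semantics can be obtained, the snippet below assumes a word2vec-style 300-D embedding model loaded through gensim; the specific language model of reference [53] may differ.

```python
# Sketch: 300-D language-level class semantics from pretrained word embeddings.
# The choice of "word2vec-google-news-300" is an assumption for illustration.
import numpy as np
import gensim.downloader as api

word_vectors = api.load("word2vec-google-news-300")

def class_semantic(name: str) -> np.ndarray:
    # Average word vectors for multi-word class names (e.g., "coffee table").
    tokens = [t for t in name.lower().split() if t in word_vectors]
    return np.mean([word_vectors[t] for t in tokens], axis=0)

class_names = ["wall", "floor", "chair", "sofa", "bookcase"]   # illustrative classes
semantic_features = {c: class_semantic(c) for c in class_names}
```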
The architecture of the feature conditional generative network is shown in Figure 2, where the input of the generator is the concatenation of a semantic feature of an object class and a Gaussian noise, the output of the generator is a synthesized point feature of the corresponding object class, and the discriminator discriminates the real point features extracted by the backbone network from those produced by the generator. Formally, suppose we are given a backbone network $F$, a generator $G$, and a discriminator $D$. For a given input 3D point with its corresponding class label $y$, we first extract the point feature $x$ of the input 3D point with the backbone network $F$. Then, the generator $G$ is used to generate a fake point feature $\tilde{x}$ conditioned on the corresponding semantic feature $e_y$, and the discriminator $D$ is used to discriminate the generated fake point feature $\tilde{x}$ from the real point feature $x$. In order to learn a point feature distribution that cannot be distinguished from the real one by the discriminator, the generator is adversarially trained against the discriminator under the framework of the Wasserstein generative adversarial network (WGAN) [56] as follows:

$$\min_{G}\max_{D}\;\mathbb{E}_{(x,y)\sim\mathcal{F}^{tr}}\big[D(x,e_y)\big]-\mathbb{E}\big[D(\tilde{x},e_y)\big]-\lambda\,\mathbb{E}\big[\big(\|\nabla_{\hat{x}}D(\hat{x},e_y)\|_{2}-1\big)^{2}\big],\qquad(1)$$

where $\mathcal{F}^{tr}$ is the labeled training point feature set extracted by the backbone network from the training 3D point clouds. The first two terms in (1) are the original objectives of WGAN, which aim to minimize the Wasserstein distance between the distribution of the generated point features and that of the real point features, and $\tilde{x}$ is generated by the generator $G$ conditioned on the corresponding semantic feature $e_y$ and a standard Gaussian noise $z$, i.e., $\tilde{x}=G(e_y,z)$. The third term in (1) is the gradient penalty for the discriminator, where $\hat{x}=\alpha x+(1-\alpha)\tilde{x}$, with $\alpha$ sampled from a uniform distribution, i.e., $\alpha\sim U(0,1)$, is used to estimate the gradients, and $\lambda$ is a hyper-parameter weighting the gradient penalty term, usually set to 10, as suggested in [56]. By optimizing this min–max objective, the generator finally learns the conditional distribution of the real point features of each class given its semantic feature. In other words, an arbitrary number of point features for each class can be synthesized by sampling from the learned conditional distribution.
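A minimal PyTorch-style sketch of one such adversarial update is given below, assuming simple fully connected `G` and `D` modules that take concatenated inputs (as described in Section 3.1.3) and a noise dimensionality equal to the semantic dimensionality; names such as `gp_weight` are illustrative, not the authors' implementation.

```python
import torch

def wgan_gp_step(G, D, x_real, sem, opt_G, opt_D, gp_weight=10.0):
    """One adversarial update for the semantics-conditioned feature generator.
    x_real: (B, d) real point features from the backbone (assumed detached);
    sem: (B, s) class semantic features."""
    z = torch.randn(x_real.size(0), sem.size(1))             # standard Gaussian noise
    x_fake = G(torch.cat([sem, z], dim=1))                   # fake features, x~ = G(e_y, z)

    # --- discriminator (critic) update ---
    alpha = torch.rand(x_real.size(0), 1)                    # alpha ~ U(0, 1)
    x_hat = (alpha * x_real.detach() + (1 - alpha) * x_fake.detach()).requires_grad_(True)
    d_hat = D(torch.cat([x_hat, sem], dim=1))
    grad = torch.autograd.grad(d_hat.sum(), x_hat, create_graph=True)[0]
    gp = ((grad.norm(2, dim=1) - 1) ** 2).mean()             # gradient penalty term of Eq. (1)
    d_loss = (D(torch.cat([x_fake.detach(), sem], dim=1)).mean()
              - D(torch.cat([x_real, sem], dim=1)).mean() + gp_weight * gp)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # --- generator update ---
    g_loss = -D(torch.cat([x_fake, sem], dim=1)).mean()      # push generated features toward real
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()
```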
To further improve the feature distribution modeling of the proposed generative network, we propose a feature-geometry-based Mixup training approach, which increases the discriminability between adjacent object classes in the feature space. Generally, in 3D scene semantic segmentation, feature learning is mainly driven by geometrical information; that is, geometrically adjacent points generally have similar features in the learned feature space. Due to this feature similarity between geometrically adjacent object classes, confusions often arise among such classes. Inspired by Mixup [16], we design a model-learning approach as follows. Given a point feature dataset produced by the backbone, we first compute the class feature centers $\{\mu_c\}_{c=1}^{C}$, where $\mu_c=\frac{1}{N_c}\sum_{y_i=c}x_i$ is the class feature center of class $c$, $N_c$ is the number of point features belonging to class $c$, and $C$ is the number of classes. Then, we compute the Euclidean similarity matrix $A$ between these feature centers, which represents the semantic similarity between classes. Next, we mix up semantically similar samples as follows: given a point feature $x_i$ from class $c$, we first randomly choose one of the closest $I$ classes to class $c$ according to the similarity matrix $A$, denoted by $c'$; then, we sample a point feature from class $c'$, denoted by $x_j$; finally, an intermediate feature sample is synthesized by interpolating between $x_i$ and $x_j$ with a scalar $\beta$ sampled from the uniform distribution $U(0,1)$ as:

$$x_{mix}=\beta x_i+(1-\beta)x_j,\qquad(2)$$

where $x_{mix}$ is the synthesized point feature, $e_c$ and $e_{c'}$ are the corresponding semantic features of $x_i$ and $x_j$, and the mixed semantic feature $e_{mix}$ is synthesized in the same way as $x_{mix}$. According to (2), we can mix up a large number of samples that are located between two geometrically adjacent classes in the feature space. Finally, we add these interpolated samples to the original training set for training the feature generative network. With such training, the model is expected to better discriminate adjacent object classes. Here, we use a hyper-parameter, defined as the ratio of interpolated samples to real samples, to control the scale of interpolation. This procedure is summarized in Algorithm 1.
Algorithm 1 Feature-geometry-based Mixup training
Input: training point feature set $\mathcal{F}^{tr}$; initialized $G$ and $D$.
Output: trained $G$ and $D$.
1: Compute the class feature centers $\{\mu_c\}_{c=1}^{C}$ with $\mathcal{F}^{tr}$;
2: Compute the similarity matrix $A$ between class feature centers;
3: for n = 1 to N do
4:   Randomly sample a feature $x_i$ from a class $c$;
5:   Randomly select a class $c'$ from the closest $I$ classes to class $c$ according to $A$;
6:   Randomly sample a feature $x_j$ from class $c'$;
7:   Mix up a sample $(x_{mix}, e_{mix})$ with $x_i$ and $x_j$ based on Equation (2);
8: end for
9: Train $G$ and $D$ with the $N$ mixed samples combined with $\mathcal{F}^{tr}$;
10: return trained $G$ and $D$.
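A compact NumPy sketch of this Mixup procedure follows, assuming point features `X`, integer labels `y`, and per-class semantic vectors `E` are already available (all names are illustrative):

```python
import numpy as np

def geometry_mixup(X, y, E, num_mix, I=3, rng=np.random.default_rng(0)):
    """X: (N, d) point features; y: (N,) labels; E: (C, s) class semantics.
    Returns num_mix interpolated (feature, semantic) pairs following Eq. (2)."""
    C = E.shape[0]
    centers = np.stack([X[y == c].mean(axis=0) for c in range(C)])    # class feature centers
    dist = np.linalg.norm(centers[:, None] - centers[None], axis=-1)  # Euclidean distances
    np.fill_diagonal(dist, np.inf)
    nearest = np.argsort(dist, axis=1)[:, :I]                         # I closest classes per class

    X_mix, E_mix = [], []
    for _ in range(num_mix):
        i = rng.integers(len(X))                                      # anchor feature from class c
        c = y[i]
        c2 = rng.choice(nearest[c])                                   # geometrically close class c'
        j = rng.choice(np.flatnonzero(y == c2))                       # partner feature from c'
        beta = rng.uniform()                                          # beta ~ U(0, 1)
        X_mix.append(beta * X[i] + (1 - beta) * X[j])                 # mixed point feature
        E_mix.append(beta * E[c] + (1 - beta) * E[c2])                # mixed semantic feature
    return np.stack(X_mix), np.stack(E_mix)
```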
3.1.2. Semantics-Enhanced Feature Classifier
Once the language-level semantics-conditioned feature-modeling network has been trained, a large number ($K$) of point features can be generated for each object class, conditioned on the corresponding semantic feature $e_y$ and $K$ different random noises $z$ sampled from the standard Gaussian distribution $\mathcal{N}(0,\mathbf{I})$. Specifically, we generate point features according to:

$$\tilde{x}=G(e_y,z),\quad z\sim\mathcal{N}(0,\mathbf{I}),\qquad(3)$$

where $\tilde{x}$ is the generated point feature. In the following, we describe the feature generation and classifier learning for the three different tasks, i.e., C3DS, Z3DS, and GZ3DS.
Conventional 3D scene semantic segmentation: In C3DS, the testing points are assumed to come only from seen classes. Hence, we generate a large number of point features for each seen class in $Y^s$, conditioned on the seen class semantic features, according to (3). The set of generated point features and corresponding labels is denoted by $\tilde{\mathcal{F}}^{s}$. Then, we train a semantics-enhanced classifier $f_{cls}$ (with parameters $\theta$) on $\tilde{\mathcal{F}}^{s}$ as follows:

$$\min_{\theta}\;\mathbb{E}_{(\tilde{x},y)\sim\tilde{\mathcal{F}}^{s}}\Big[L_{ce}\big(P(f_{cls}(\tilde{x};\theta)),y\big)\Big],\qquad(4)$$

where $L_{ce}$ and $P$ are a cross-entropy loss function and a softmax function, respectively. Note that $f_{cls}$ could be any classifier, not constrained to a linear classifier as in existing segmentation networks. After training the classifier, given a real testing point feature $x$, we predict its label $y$ by:

$$y=\arg\max_{c\in Y^s}\;P\big(f_{cls}(x;\theta)\big)_{c}.\qquad(5)$$

In sum, for a given testing 3D point cloud, we achieve point cloud semantic segmentation by classifying the feature of each 3D point extracted by the backbone network with the learned semantics-enhanced feature classifier.
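The following PyTorch-style sketch illustrates Equations (3)-(5) for the C3DS case, assuming a trained generator `G`, a matrix of seen-class semantics `E_seen`, and a linear classifier; the helper name, `K`, and the epoch count are illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_semantics_enhanced_classifier(G, E_seen, feat_dim, K=500, epochs=10):
    """E_seen: (C_s, s) seen-class semantic features; K features are generated per class, Eq. (3)."""
    C_s, s = E_seen.shape
    with torch.no_grad():
        sem = E_seen.repeat_interleave(K, dim=0)              # (C_s*K, s) repeated semantics
        z = torch.randn(C_s * K, s)                           # standard Gaussian noise
        X_gen = G(torch.cat([sem, z], dim=1))                 # generated point features
    y_gen = torch.arange(C_s).repeat_interleave(K)            # pseudo-labels for generated features

    clf = nn.Linear(feat_dim, C_s)                            # Eq. (4): softmax + cross-entropy
    opt = torch.optim.Adam(clf.parameters())
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(clf(X_gen), y_gen).backward()
        opt.step()
    return clf

# Eq. (5): prediction for real testing point features x (one backbone feature per row):
# pred = clf(x).argmax(dim=1)
```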
Zero-shot 3D scene semantic segmentation: Thanks to the language-level semantics-conditioned point feature modeling, the proposed framework has the flexibility to segment novel class objects in 3D point cloud scenes (covering both Z3DS and GZ3DS) whenever their corresponding semantics are available, which is impossible for existing segmentation networks. In Z3DS, the task is to classify only unseen class points. To this end, we sample a large number of point features for each unseen class in $Y^u$ from the learned conditional distributions, conditioned on the unseen class semantic features, according to (3). That is to say, the semantic features $e_y$ are taken from the unseen class label set $Y^u$, in contrast to C3DS, where the semantic features $e_y$ are taken from the seen class label set $Y^s$. Then, we train a semantics-enhanced classifier $f_{cls}$ in a similar way to (4) and classify the real testing unseen class points as in (5).

Generalized zero-shot 3D scene semantic segmentation: In GZ3DS, the testing points can come from either seen classes or unseen classes. Hence, according to (3), we generate a large number of point features for every class in $Y$, conditioned on all semantic features; the training of the semantics-enhanced classifier $f_{cls}$ and the classification of point features are the same as in C3DS and Z3DS. The only difference among C3DS, Z3DS, and GZ3DS is their conditioning classes and classification spaces, which in turn demonstrates the flexibility of the proposed framework.
3.1.3. Implementation Details
As for the backbone networks, in the C3DS tasks we validated the proposed SeCondPoint framework with three typical 3D point cloud segmentation backbones: DGCNN [45], RandLA-Net [9], and SCF-Net [36]. The architectures of these networks are exactly the same as in their original papers, since we directly used their public codes and models. We chose these three networks not only because their codes and models are publicly available and commonly used, but also because they are representative, classical networks. Specifically, DGCNN is a graph-based architecture that is usually used to process block-sized, small-scale point clouds, while RandLA-Net and SCF-Net are two different point-based networks that can directly handle large-scale point clouds. In the Z3DS tasks, we used DGCNN [45] and RandLA-Net [9] to validate the proposed SeCondPoint framework, given that their architectures are quite different. Other backbone networks can be adapted to Z3DS seamlessly in the same way.
In the feature generative network, the input of the feature generator is the concatenation of a semantic feature and a standard Gaussian noise, and its output is a synthesized point feature. It is implemented as a three-layer fully connected neural network (FCNN), whose input unit number and hidden unit number are 600 (300-D for the semantic feature and 300-D for the Gaussian noise) and 1024, respectively, in all tasks, and whose output unit number equals the point feature dimensionality of the corresponding backbone (DGCNN, RandLA-Net, or SCF-Net). The input of the feature discriminator is the concatenation of a semantic feature and a point feature, and its output is a real/fake discrimination score. It is also implemented as a three-layer FCNN, whose output unit number and hidden unit number are 1 and 1024, respectively, and whose input unit number equals the sum of the corresponding point feature dimensionality and the semantic feature dimensionality. The generator and the discriminator are adversarially trained for 20 epochs with a batch size of 32 and the Adam optimizer. After training the generative network, the generator is used to synthesize an arbitrarily large number of point features of a specified class from its semantic feature and standard Gaussian samples, which are then used for semantics-enhanced classifier training.
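A sketch of these two networks under the stated layer sizes is given below; the activation functions and the exact point feature dimensionality `feat_dim` depend on the backbone and are assumptions here.

```python
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Three-layer FCNN: (300-D semantics + 300-D noise) -> 1024 -> 1024 -> point feature."""
    def __init__(self, feat_dim, sem_dim=300, noise_dim=300, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, sem_and_noise):           # (B, 600) concatenated semantics and noise
        return self.net(sem_and_noise)

class FeatureDiscriminator(nn.Module):
    """Three-layer FCNN: (point feature + 300-D semantics) -> 1024 -> 1024 -> scalar score."""
    def __init__(self, feat_dim, sem_dim=300, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + sem_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_and_sem):            # (B, feat_dim + 300) concatenated input
        return self.net(feat_and_sem)
```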
As for the semantics-enhanced classifier, different tasks have different class spaces, e.g., GZ3DS classifies points over the whole class space, whereas Z3DS classifies points over the unseen class space only; however, they share the same architecture and implementation. The input and the output of the classifier are a point feature and the class logits, respectively. It is implemented as a one-layer FCNN, whose input unit number and output unit number are the point feature dimensionality of the corresponding backbone (DGCNN, RandLA-Net, or SCF-Net) and the class number determined by the task, respectively. The classifier is trained from scratch for 10 epochs with a batch size of 4096 and the Adam optimizer.