1. Introduction
In the wake of strikingly realistic randomly generated 2D images [1,2], there is a mounting expectation for generative models to replicate the same success when automatically synthesizing 3D objects. Demand for such models arises in various domains, ranging from rapid design and prototyping for the manufacturing sector to the entertainment industry [3,4,5,6,7]. Hence, the research focus is shifting towards 3D deep generative models, as rich and flexible 3D representations are rapidly emerging. Many of these applications benefit from or require quick generation of the desired 3D object. One example is a machine learning model that fabricates levels in a video game to grant the user a unique experience [5,6]. Real-time computation allows the random generation of items or characters during play. Tweaking the design of a 3D-printable object [7] may also require multiple iterations of a generative model and would also benefit from quick inference. Any application in extended reality, such as generating human models from motion capture data [4], requires the model to run at the frame rate. This requirement prevents 3D generative models from widespread use in the above applications because popular generative models, such as autoregressive transformers, cannot generate samples in parallel.
Point clouds are native to 3D laser scanning (LiDAR) and, therefore, are becoming ubiquitous, while their processing remains challenging. In the last decade, machine learning research has made considerable progress in processing point clouds for, among other things, classification and segmentation [8,9].
The point cloud representation is suitable for generating surfaces because it requires far fewer points than voxels for a comparable level of detail and, therefore, has a lower theoretical bound on complexity. It is especially promising for virtual and augmented reality applications (AR/VR), as they typically have real-time constraints.
Point cloud autoencoders compress a point cloud into a lower-dimensional encoding that we refer to as a codeword. Depending on the architecture, a decoder reconstructs from the codeword either the input points individually or the general shape by generating an arbitrarily dense point cloud; the latter decoders, usually called generators, suit downstream applications requiring flexible output. Autoencoders and, in particular, variational autoencoders (VAEs) [10] were the earliest methods proposed for point cloud generation [11,12]. Point cloud VAEs also found applications in shape interpolation [13], feature and transfer learning [14,15], and shape inference [16]. They offer great potential for prototyping. One reason is that the compressed codeword expresses the point cloud’s global semantic variables. This property allows for a compact data representation and facilitates data exploration and conditional generation. Another reason is that they are fully parallelizable and allow quick inference. Unfortunately, posterior collapse, a known issue for variational autoencoders, prevents them from competing with the state of the art in generation quality.
Recent works [17,18] have replicated the success of vector-quantized (VQ) variational autoencoders (VQVAEs) [19] in 3D shape generation, setting the new state of the art. VQVAEs proved very effective at encoding different types of grid-like data, such as images [19], into discrete latent variables whose values correspond to indices of a dictionary, named a codebook, of learned, semantically meaningful vectors. They are not affected by the blurriness and posterior collapse issues [19] typical of VAE models. Still, they need a second autoregressive model to approximate the complex discrete latent distribution in order to generate new samples. Because of that, they lose some advantages of variational autoencoders, such as parallel processing of the codes and data exploration through global semantic variables. For point clouds, the generation time is in the order of seconds [17], making these models impractical for AR/VR.
We propose a novel VQVAE-based point cloud generative model with an innovative sampling strategy that allows high-quality generation in real time thanks to parallel computing. Our main contribution is threefold: first, we introduce a new architecture for a decoder that processes the encoded shape in a radically new way, yielding superior reconstructions. Second, we propose a filter on the decoded point cloud that reduces the gaps naturally occurring when generating the points’ coordinates in parallel. Third, we introduce a VQVAE model with a new sampling technique for the discrete codes based on a second, smaller VAE model that allows parallel generation, reducing computation time. The following paragraphs give more context and detail for each contribution.
Our generator translates codewords into mappings from a sampling space to the Euclidean space and uses these mappings to process the points of the reconstructed shape in parallel. Differently from previous work, which folds [20], tears [21] or patches together [22,23] 2D structures, we sample the initial positions of the points from a higher-dimensional space. When ultimately projecting them into Euclidean space, our mapping describes surfaces as thin regions of space, letting the model learn to concentrate the resulting distribution along the desired shape. This solution sidesteps the topological problems and artefacts that complex shapes may cause.
A crucial difference in the architecture of our generator compared to previous work is how the codeword and the sampling are joined and, therefore, how the mapping is defined. The standard way of producing the mapping adds the codeword as semantic features to the initial spatial coordinates through concatenation [20,21,22,23].
Instead, we convert the initial sampling into a high-dimensional random generator with a learned covariance and compact support. We then formulate a tensor product between the output of the random generator and the codeword. Geometrically, the codeword defines a bounding box of the resulting distribution’s support, expanding the directions most relevant to the encoded input shape and contracting the irrelevant ones. Furthermore, we extend the model by reconstructing the point cloud as a learned interpolation of different point cloud representations with an attention mechanism. This solution trades computation time for performance while avoiding known problems with glueing different mappings [21]. We show experimentally that the proposed solutions improve the autoencoder reconstruction of CAD models on the ShapeNet dataset [24].
Next, we turn our attention to mitigating the effects of the sampling inconsistency, which may cause gaps in the decoded surface, increasing the loss and degrading its gradients. We introduce a new filtering approach based on Laplacian filtering and apply it to the recovered surface to avert this occurrence. Our filtering approach significantly improves the homogeneity of the recovered surface compared to the commonly employed Laplacian filtering [21,25]. Moreover, the proposed filter adds little overhead and is suitable for similar generation methods.
Finally, we address the main computational bottleneck of the VQVAE model. We replace the autoregressive sampling of the discrete latent space with a second variational autoencoder. By doing so, we generate discrete codes in parallel and reduce the overall inference time needed for generation. To make it even more robust, we use the vectors in the VQVAE codebooks as the latent variables’ embedding. The continuous latent space of the second encoding has a tighter bottleneck, which allows for data exploration techniques that are usually not possible in a VQVAE, such as disentanglement. Different metrics show that our random generation matches the standards of competing models at a much lower computational cost.
In summary, this paper introduces a fast VQVAE for point clouds. The three main contributions of this work are the following:
A flexible architecture for a point cloud generator that improves speed, memory and performance over comparable generators. The main novelties are how the codeword determines the reconstructed shape and the use of components as a more robust alternative to patches for describing complex shapes.
A filter acting on the reconstructed surface to mitigate sampling errors that affect generators. The proposed filter further boosts the reconstruction quality and can be used with other point cloud generators.
A novel double encoding system where we couple the VQVAE model with a second variational autoencoder that learns the distribution of the discrete codes. Together, the two autoencoders act as a single autoencoder, effectively combining the sharpness of the VQVAE with the fast generation and interpretability of the plain VAE.
A direct comparison with alternative generative models shows that our model is competitive with the state of the art in terms of standard metrics. Furthermore, we achieve realistic, diverse generated samples at a fraction of inference time compared to similarly performing models in the literature.
The paper is organized as follows. The next section lays out the background our model builds upon and reviews recent related works and the current state of the art. Section 3 presents the proposed model, with a dedicated subsection for each of the three main contributions: the decoder architecture (Section 3.1), the filter (Section 3.2) and the double encoding (Section 3.3). The last subsection (Section 3.4) describes the encoder and the general training scheme. Section 4 has two subsections dedicated to the experiments. Section 4.1 defines the metrics to test autoencoder reconstruction and realistic generation quality. Section 4.2 compares the results of the experiments with established and concurrent work. Conclusions are in Section 5.
2. Related Work
Current point cloud feature extraction and segmentation strategies often alternate order-invariant methods such as dense layers acting on channels, symmetric functions such as max pooling, and local operators defined on neighbourhoods calculated at runtime, effectively exploiting the flexibility of a high-dimensional channel space and the points’ local covariance [8,9]. On the other hand, strategies for point cloud generation are more diverse because they combine point cloud processing approaches with probabilistic modelling. In particular, the VAE and VQVAE probabilistic models assume a relevant role in this field thanks to their flexibility, allowing us to build on previous work with deterministic autoencoders.
2.1. Variational Autoencoders and Vector-Quantized Variational Autoencoders
Variational autoencoders (VAEs) [10] assume that a latent variable $z$ with a known distribution $p(z)$ governs the dataset distribution $p(x)$. A neural network functioning as a decoder approximates the conditional mean of $p(x \mid z)$, learning the parameters through backpropagation. A stochastic neural network functioning as an encoder approximates the intractable posterior $p(z \mid x)$ through amortized variational inference. The core architecture is thus an autoencoder that optimizes itself through self-learning, identifying the latent variables with the codes. Minimizing the Kullback–Leibler divergence between the approximate and true latent posterior and the reconstruction error increases a lower bound on the maximum likelihood (the ELBO). The prior distribution $p(z)$ may be fixed or learned. Ref. [26] introduces a variational mixture of posteriors (VAMP) prior. Their encoder learns $K$ representatives $u_k$ of the training dataset, called pseudo-inputs, and sets the mixture model of their posteriors $q(z \mid u_k)$ as the prior, optimizing the prior and the posterior in parallel.
VAEs retain some advantages compared to other generative models. Differently from adversarial models, they are guaranteed to cover the whole dataset distribution; in contrast to autoregressive models, they process global variables in parallel; and, unlike flow and diffusion methods, they have complete freedom in the design of the generative network. One main drawback is a tendency to experience posterior collapse in the latent space, which happens when the regularization of the latent space necessary for realistic generation prevents it from encoding high-frequency details. As a consequence, the decoder learns to produce overly smooth point clouds. SetVAE [27] introduced a hierarchical VAE model for point clouds, which is order-invariant with respect to the codes thanks to a suitable transformer module, and combined good reconstructions, diverse generation and efficient implementation.
Another solution to posterior collapse is adopting a discrete distribution. In particular, VQVAEs [19] proved highly successful in encoding grid-like data in codes with discrete delta distributions. The encoding is deterministic: it removes the stochastic noise that VAEs typically add to model distributions. Furthermore, it removes the Kullback–Leibler regularization term because the divergence between a posterior delta distribution and the uniform categorical prior is constant.
The posterior of a discrete code corresponds to an index from a “codebook”, a list of learned vectors called embeddings which store complex semantic information. The decoder takes in the embeddings corresponding to the predicted indices and processes them to estimate the initial input. Discrete sampling does not allow backpropagation. As an approximate solution, VQVAEs copy the loss gradients of the decoder’s input and apply them directly to the encoder output. This solution only works when the encoded features are similar to the retrieved embeddings. Therefore, a natural choice for inferring the latent code is selecting the index of the embedding closest to the encoder output. To tie the encoder to the existing codebook, VQVAEs also add a commitment loss given by the mean square distance between the encoded feature and the retrieved embedding. They use the same distance as an embedding loss to learn the embeddings or, alternatively, update each embedding with a moving average of the corresponding encoded features.
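For concreteness, the following is a minimal PyTorch sketch of a single vector-quantization step with the straight-through gradient copy and the commitment and embedding losses described above; shapes and the function name are illustrative assumptions, not the exact implementation of the cited works.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    """z_e: encoder output (B, D); codebook: (K, D) learned embeddings."""
    # Index of the nearest embedding for each encoded feature.
    dists = torch.cdist(z_e, codebook)            # (B, K) Euclidean distances
    idx = dists.argmin(dim=1)                     # (B,) discrete codes
    z_q = codebook[idx]                           # (B, D) retrieved embeddings

    # Straight-through estimator: the decoder receives z_q, but gradients
    # flow back to z_e as if the quantization step were the identity.
    z_q_st = z_e + (z_q - z_e).detach()

    # Embedding loss pulls the codebook towards the encoder features;
    # commitment loss ties the encoder to the existing codebook.
    embedding_loss = F.mse_loss(z_q, z_e.detach())
    commitment_loss = beta * F.mse_loss(z_e, z_q.detach())
    return z_q_st, idx, embedding_loss + commitment_loss
```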
Because the latent prior is not enforced, the aggregate latent posterior distribution is unknown. VQVAEs approximate this distribution with a second autoregressive model, allowing random generation. At the same time, the autoregressive model is not parallelizable and cannot capture different sources of variability in global variables as vanilla VAEs do. We show that a second VAE is a viable alternative to the autoregressive model that keeps all the advantages of the VAE model.
2.2. Generative Models for Point Clouds
By random generation, we refer to a model’s ability to approximate the dataset distribution and sample from it. In this field, standard datasets contain a specific class of objects, such as aeroplanes, chairs and cars. Representing very different objects is complicated because most semantics (for example, the ones differentiating an armchair from a seat) only make sense within a category. Therefore, most related work tests the generative method on these three classes separately by training a different model for each class.
Point cloud random generation often combines the maximum likelihood approximation of the dataset distribution with a probabilistic description of the point cloud itself, seen as a sparse distribution on $\mathbb{R}^3$ where the points likely lie on the surface of the represented object [28]. The advantage over early work [11] with a deterministic description of the point cloud given by a fully connected neural network is that it allows sampling an arbitrary number of points from each point cloud and recycling the weights without having to infer the position of every point separately. Most recent models use codes to define a transformation of samples from an initial shared distribution to the target shape. This idea is particularly suitable for probabilistic frameworks such as diffusion models, which first iteratively scatter the target shape to match a three-dimensional Gaussian and then learn to denoise it until recovering the target shape. Related to them, ShapeGF [28] is an energy-based model that moves samples from an arbitrary initial distribution in Euclidean space according to the gradients learned from the target shape using a denoising process. Flow models approximate a bijective transformation from a Gaussian to the desired shape. PointFlow [29] proposes a continuous normalizing flow whose theoretical and experimental estimation SoftFlow [30] improves by perturbing the target shape.
All the cited probabilistic models complement their generative part with a (variational) encoder that helps the diversity of the generated point clouds. Alternatively, SP-GAN [31] uses a generative adversarial model that learns a correspondence of realistic shapes to a 3D sphere. Autoregressive models on points [32] also exist but are currently very slow. Except for Discrete Point Flow, which takes only a few milliseconds to generate a point cloud [17,33], the generation times of all the models mentioned here are in the order of tenths of a second or more, and thus incompatible with real-time constraints [17].
2.3. VQVAE Models for Point Cloud Generation
Since VQVAEs do not have restrictions on the core autoencoder model, it is possible to consider various alternatives. Recent works [34] showed that learning a Signed Distance Function (SDF) leads to unparalleled point cloud reconstruction quality, although high resolution requires substantial computational time [34]. AutoSDF [18] introduces a VQVAE model for 3D-autoregressive SDF generation using Transformers. Similarly, ShapeFormer [35] trains a VQVAE model to generate a deep implicit function over the Euclidean space. The 3D formulation of these models inevitably impacts their inference time. Closer to our work, ref. [17] proposes a VQVAE model for point clouds. First, inspired by [36], they establish a bilinear map from a shape to a sphere divided into patches. They order the patches following a spiral and recover the equivalent patches on the input. Then, they encode each input patch separately and learn the distribution of the encodings with an autoregressive model. Serializing the patches poses some limitations: nearby patches may be indexed far away from each other, and the bilinear mapping is only possible for shapes topologically equivalent to a sphere. Our approach is to use a global encoding instead and separate the codeword into semantic parts (codes). Instead of a slow autoregressive model, we generate the codes with a small variational autoencoder.
2.4. Parallelizable Decoders and Laplacian Filtering
VAEs and VQVAEs do not use the encoder during generation. Therefore, the inference time is determined only by the decoding part of the autoencoder. Fast decoders for point cloud reconstruction process all points in parallel. They typically take a geometrical approach, modelling the shape as a predominantly smooth surface. FoldingNet [20] learns to fold a square grid around the shape, often running into topological constraints. TearingNet [21] proposes a solution that allows tearing the grid, which adds flexibility. AtlasNet [22] covers the target surface with multiple patches, i.e., mappings of points sampled from a plane. AtlasNetv2 [23] comes in two versions: AtlasNetv2 with patch deformations learns a high-dimensional embedding for the bi-dimensional patches; AtlasNetv2 with point translation learns patches from discrete elementary structures, learned sets of points which act like moulds and yield a fixed number of points. As in the previous version of AtlasNet, they process the patches independently and learn to estimate their relative position only indirectly from the reconstruction loss. This process may cause artefacts and misalignment along the borders of the patches [21] and is too unreliable for randomly generated point clouds.
TearingNet [21], following [25], also implements Laplacian filtering after the point cloud reconstruction because of its well-known smoothing effect, improving their model slightly. They capture the proximity of points $x_i$ and $x_j$ in an adjacency matrix with exponential weights

$$W_{ij} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right) \ \text{if } \|x_i - x_j\| < r, \qquad W_{ij} = 0 \ \text{otherwise}, \tag{1}$$

thresholded on an estimated value $r$, and select the unnormalized graph Laplacian

$$L = D - W, \qquad D_{ii} = \sum_j W_{ij}. \tag{2}$$

The filter acts on a point cloud $X$ by multiplying it by $(I - \lambda L)$ with $\lambda > 0$, spreading the curvatures along the surface. Such a filter may help the autoencoder reconstruct smooth shapes, but it is unsuitable for more complex ones. Indeed, the filter has a marginal impact on the overall metric [21]. Unlike previous work, we select a negative $\lambda$ to spread the close points along the surface and cover it more uniformly.
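As a reference point, the sketch below (PyTorch; the Gaussian-kernel weight, radius and coefficient values are assumptions) builds the thresholded exponential-weight adjacency, the unnormalized Laplacian L = D − W, and applies the filter (I − λL)X. A positive λ smooths towards neighbours; the negative λ we adopt later pushes points apart.

```python
import torch

def laplacian_filter(X, sigma=0.05, r=0.1, lam=0.3):
    """X: (N, 3) point cloud. Returns the filtered point cloud (I - lam*L) @ X."""
    D = torch.cdist(X, X)                          # (N, N) pairwise distances
    W = torch.exp(-D.pow(2) / sigma**2)            # exponential weights
    W = W * (D < r) * (D > 0)                      # threshold at radius r, no self-loops
    L = torch.diag(W.sum(dim=1)) - W               # unnormalized graph Laplacian
    return X - lam * (L @ X)                       # lam > 0 smooths, lam < 0 spreads
```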
We remark that Laplacian filtering does not remove points and has a different purpose from the filtering strategies used for noise reduction in point cloud acquisition. Reducing noise during the acquisition phase generally consists of removing isolated points or groups of sparse points (noise clusters) resulting from incorrect sensor measurements [37] or objects that are not relevant to the scan [38]. In this work, we seek to reduce noise from a source that requires a different approach.
The noise in an autoencoder’s reconstruction is caused by the degradation of the features during compression. While there could be isolated points, these, for the most part, are not mistakes but part of the risk-minimization strategy that the model uses when it is not confident that an area is empty. Attempting automatic point removal in these low-density areas carries the risk of removing part of the target shape. The most significant part of the noise in reconstruction results either from the inability to recover the correct shape of the surface, which is what traditional Laplacian filtering tries to mitigate by making the shape smoother, or from surface sampling discrepancies, which is what the proposed filtering approach addresses.
3. Methods
Our main contribution is a fully parallelizable VQVAE for real-time point cloud generation. To make parallelization possible while producing high-quality samples, we propose advancements of the VQVAE [19] design, both specific to this field and applicable to other data types. We focus on three main aspects: the core point cloud autoencoder architecture, the compression of the latent space into discrete variables (quantization), and the strategy for sampling the discrete variables that allows random generation. The corresponding contributions independently contribute to the full fruition of the proposed model. We refer to Figure 1 for a high-level view of the model.
We present the new architecture for the generator and the filtering method in Section 3.1 and Section 3.2. Section 3.3 explains our quantization procedure and introduces a variational autoencoder for discrete codewords (w-autoencoder). We propose its use as an effective sampling strategy for the complex distribution of the VQVAE latent space. Section 3.4 details our chosen training scheme and the architecture of the point cloud encoder (pc-encoder) used in the experiments.
3.1. Generator
Our generator models a point cloud as independent samples from a shape distribution, keeping the simplicity and the parallel processing of the geometrical approach [20,21,22,23] but using the probabilistic interpretation of other generative methods [28,29] to avoid topological and regularity issues. It takes in a quantized codeword $\hat{w}$ and maps an initial distribution into the codeword embedding space before projecting the distribution to the Euclidean space. We break it down into four blocks. In point sampling, we learn a starting distribution in a high-dimensional space. In code mixing, we propose a new layer that endows the starting distribution with the semantic meaning of the incoming codeword. We also introduce the attention and component blocks to extend the immediate projection to the Euclidean space by adding an attention mechanism. Figure 2 gives a detailed illustration of the autoencoder architecture.
Point sampling starts by sampling M points uniformly from the unit hypersphere in a higher-dimensional space. While other distributions are possible, we choose one defined on the hypersphere because it has compact support, which makes tracking down the mapping of the points (see Appendix A) easier. Then, it embeds the distribution in the codeword space with a point convolution network, letting the model learn a correlation between the features and obtaining the embedded samples.
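A minimal sketch of the point-sampling block as we read it: M points drawn uniformly from the unit hypersphere (by normalizing Gaussian samples) and embedded into the codeword dimension with a shared per-point MLP (1×1 convolutions). The class name, layer sizes and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PointSampling(nn.Module):
    def __init__(self, sphere_dim=16, codeword_dim=256):
        super().__init__()
        self.sphere_dim = sphere_dim
        # Per-point "point convolutions": the same dense layers applied to every point.
        self.embed = nn.Sequential(
            nn.Conv1d(sphere_dim, 64, kernel_size=1), nn.ReLU(),
            nn.Conv1d(64, codeword_dim, kernel_size=1),
        )

    def forward(self, batch_size, n_points):
        # Normalizing i.i.d. Gaussian samples yields points uniform on the unit hypersphere.
        s = torch.randn(batch_size, self.sphere_dim, n_points)
        s = s / s.norm(dim=1, keepdim=True)
        return self.embed(s)          # (B, codeword_dim, M) embedded samples
```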
Code mixing takes as inputs the codeword and the previous distribution. Instead of concatenating the codeword to the points as commonly performed in the related literature, we multiply the embedded samples term by term with the codeword: we first cap the obtained points between −1 and 1 (Hardtanh); then, we use the tensor product ⊙ to generate a high-dimensional representation of the point cloud:

$$\tilde{S} = \hat{w} \odot \mathrm{Hardtanh}(S'),$$

where $S'$ denotes the embedded samples and the codeword is broadcast over the M points. We merge the input codeword and the embedded samples in a non-linear way to enable a complex interaction between the two: the codeword stretches the marginal distributions of the relevant semantic dimensions and uses the learned correlation between the points as a base for the distribution of the resulting representation. We remark that the standard solution of concatenating the sample and the codeword determines a linear interaction in the following point convolution, which applies the non-linearity only after merging the two.
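A sketch of the code-mixing step under that reading: the embedded samples are capped with a Hardtanh and multiplied element-wise with the codeword, broadcast across the points. Names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def code_mixing(embedded_samples, codeword):
    """embedded_samples: (B, D, M); codeword: (B, D) quantized codeword."""
    capped = F.hardtanh(embedded_samples)        # values in [-1, 1]
    # The codeword rescales each channel of every point: it expands the
    # semantically relevant directions and contracts the irrelevant ones.
    return capped * codeword.unsqueeze(-1)       # (B, D, M)
```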
Before explaining the proposed component and attention blocks, let us consider a basic design of our architecture where we immediately obtain the output point cloud by projecting the mixed representation into the Euclidean space using a point convolution network. As we are going to explain, this design corresponds to setting the number of components to 1.
In this basic approach, the same weights describe all the parts of the target shape and struggle to represent local regions with very distinct geometries. Increasing the generator’s complexity, such as the number of convolutions and channels, is not enough to describe a complicated global geometry. The obvious solution is to process different regions of the target shape independently using different weights. At the same time, the intuitive patching approach proposed in [22] leads to the drawbacks explained in Section 2.4.
We introduce an alternative to the patching approach that ensures a smooth transition between local representations and can still run in parallel. We model the target point cloud as a mixture model of shape distributions that we call components. Each component is a Euclidean projection of the target shape that focuses on a different region (Figure 3). The weights of the mixture model indicate which components have the most influence on each point of the output cloud. The component and attention blocks encapsulate the proposed approach.
The component block forms C projections of the mixed representation into the Euclidean space using C point convolution networks. We refer to the projections as components and denote by $y_i^c$ the position of point $i$ in component $c$. The attention block yields the interpolating weights $a_i^c$, representing the probability that a point belongs to the component $c$. To calculate $a_i^c$, we proceed as follows (see also Figure 2). We extract local information by taking the features in the component block immediately before the final linear layer. For each component $c$, we obtain an unnormalized score $s_i^c$ using a point convolution with a single output channel. The scores $s_i^c$ assess how relevant the component is for each point, corresponding to a self-attention mechanism. We normalize the scores component-wise using the Gumbel-Softmax [39], which also adds noise to encourage a more uniform weight distribution:

$$a_i^c = \frac{\exp\!\left((s_i^c + g_i^c)/\tau\right)}{\sum_{c'} \exp\!\left((s_i^{c'} + g_i^{c'})/\tau\right)},$$

where the $g_i^c$ are i.i.d. Gumbel noise samples and $\tau$ is a temperature. The output position of a point with index $i$ is as follows:

$$y_i = \sum_{c=1}^{C} a_i^c \, y_i^c,$$

where $y_i$ is a 3D vector denoting the position of one point of the output point cloud. Note that we do not enforce the components to only focus on specific regions. Instead, this occurs naturally as the model learns to distribute the tasks between them from the early stages of training.
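The following is a compact sketch of how the component and attention blocks could fit together: C per-point projections to 3D, one relevance score per component and point, Gumbel-Softmax across components, and a weighted sum. The class name, module sizes and dimensions are illustrative, not the exact configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComponentDecoder(nn.Module):
    def __init__(self, feat_dim=256, n_components=4, hidden=128, tau=1.0):
        super().__init__()
        self.tau = tau
        self.trunks = nn.ModuleList(
            nn.Sequential(nn.Conv1d(feat_dim, hidden, 1), nn.ReLU()) for _ in range(n_components)
        )
        self.heads = nn.ModuleList(nn.Conv1d(hidden, 3, 1) for _ in range(n_components))
        self.scores = nn.ModuleList(nn.Conv1d(hidden, 1, 1) for _ in range(n_components))

    def forward(self, mixed):                        # mixed: (B, feat_dim, M)
        comps, scores = [], []
        for trunk, head, score in zip(self.trunks, self.heads, self.scores):
            h = trunk(mixed)                          # local features of one component
            comps.append(head(h))                     # (B, 3, M) Euclidean projection
            scores.append(score(h))                   # (B, 1, M) unnormalized relevance
        comps = torch.stack(comps, dim=1)             # (B, C, 3, M)
        scores = torch.cat(scores, dim=1)             # (B, C, M)
        # Gumbel-Softmax across components adds noise and normalizes the weights.
        weights = F.gumbel_softmax(scores, tau=self.tau, dim=1)   # (B, C, M)
        return (weights.unsqueeze(2) * comps).sum(dim=1)          # (B, 3, M) output points
```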
3.2. Laplacian Filtering
A generator defines a mapping that samples points from a target shape. As we compute the points in parallel, they are independent. A likely consequence is the formation of small gaps on the surface resulting from sampling errors. We design here a conceptually simple and elegant filter that mitigates this effect, filling in possible gaps on the recovered surface. It benefits both training and inference. During training, it improves the quality of the gradients from the reconstruction loss, letting our model focus on the shape and converge more easily. During inference, it provides a quick postprocessing tool for point cloud generators that enhances the visual rendition and the standard quantitative metrics. In line with the overall goal of this paper, our filter is parallelizable.
3.2.1. Spreading the Points along the Surface
We modify the adjacency matrix with exponential weights in Equation (1) by removing the threshold on the distance of the points. Instead, we consider only each point’s three closest neighbours. More precisely, let X be a point cloud and $\mathcal{N}_i$ be the indices of the three points closest to $x_i$. We define the weight matrix W as follows:

$$W_{ij} = \exp\!\left(-\frac{\|x_i - x_j\|^2}{\sigma^2}\right) \ \text{if } j \in \mathcal{N}_i, \qquad W_{ij} = 0 \ \text{otherwise.}$$

Note that this matrix is generally not symmetric but grants a flexible receptive field that grows when the points are more distant.
We use the weight matrix W above in Equation (2) to define the unnormalized Laplacian L. The proposed weight matrix provides its action with the following intuitive interpretation. Given a point $x_i$, $(LX)_i$ becomes a weighted mean value change

$$(LX)_i = \sum_{j \in \mathcal{N}_i} W_{ij}\,(x_i - x_j)$$

from its closest neighbours (see black arrows in Figure 4).
Traditionally, the Laplacian filter $(I - \lambda L)$ takes a positive value of $\lambda$ to displace a point in the direction of its neighbours and smoothen an irregular surface. At the same time, it also facilitates the formation of clusters. We take the opposite approach by selecting $\lambda < 0$, therefore pushing each point away from its closest neighbours (see red arrows in Figure 4).
Formally, the proposed filter nudges a point $x_i$ to the following:

$$x_i' = x_i - \lambda \sum_{j \in \mathcal{N}_i} W_{ij}\,(x_i - x_j), \qquad \lambda < 0.$$

Assuming a locally flat receptive field, the proposed filter spreads nearby points along the surface. By discouraging the formation of neighbourhoods of high density, our filter helps cover the target surface more homogeneously. Indeed, the filter performs well when varying the number of neighbours considered, as long as the receptive field does not grow too much.
Figure 5 shows the effect of the filter on a reconstructed sample.
The value of $\sigma$ has a substantial impact on the filter. Instead of assigning a fixed value, which would not be optimal for all training phases, we calculate it dynamically for each point cloud as half the mean distance between closest neighbours:

$$\sigma = \frac{1}{2M}\sum_{i=1}^{M} \min_{j \neq i} \|x_i - x_j\|.$$

Section 3.2.2 proves that this choice of $\sigma$ ensures that the displacement is always larger the closer a point is to its neighbours. In practice, we use a mean value over the point cloud to circumvent possible instabilities when the distance between closest neighbours approaches zero.
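Under this reading, a minimal sketch of the proposed filter: weights from the three nearest neighbours, a bandwidth set to half the mean nearest-neighbour distance, and a negative coefficient so that each point is nudged away from its closest neighbours. The exact weight formula and constant values are assumptions.

```python
import torch

def spread_filter(X, k=3, lam=-0.5):
    """X: (N, 3) reconstructed point cloud. Returns the homogenized point cloud."""
    D = torch.cdist(X, X)
    D.fill_diagonal_(float("inf"))                       # ignore self-distances
    nn_dist, nn_idx = D.topk(k, dim=1, largest=False)    # (N, k) nearest neighbours
    sigma = 0.5 * nn_dist[:, 0].mean()                   # half the mean closest-neighbour distance
    W = torch.exp(-nn_dist.pow(2) / sigma**2)            # (N, k) exponential weights
    # Weighted mean value change from the k closest neighbours.
    delta = (W.unsqueeze(-1) * (X[nn_idx] - X.unsqueeze(1))).sum(dim=1)
    return X + lam * delta                               # lam < 0 pushes points apart
```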
3.2.2. Mathematical Motivation
We provide a mathematical proof here as motivation for the homogenizing effect of the proposed filter. In particular, we prove that, as long as two points are farther apart than a small critical distance determined by $\sigma$, the closer they are, the farther our filter pushes them away. We remind the reader that W is a radial function centred in $x_0$ if and only if $W(x)$ solely depends on the distance $\|x - x_0\|$; that is, there exists a function $f$ such that $W(x) = f(\|x - x_0\|)$.
Definition 1 (Displacement). Let W be a radial function centred in $x_0$. We define the displacement as follows:

$$\delta(x) = W(x)\,(x - x_0).$$

Definition 2 (Rescaling radial function). We say that a radial function W centred in $x_0$ rescales a set X if the norm of the displacement

$$\|\delta(x)\| = W(x)\,\|x - x_0\|$$

is strictly decreasing in $d = \|x - x_0\|$ when restricted to the domain $\{\|x - x_0\| : x \in X\}$.

Proposition 1. Let X be finite and not include $x_0$. The function

$$W(x) = \exp\!\left(-\frac{\|x - x_0\|^2}{\sigma^2}\right),$$

where $\sigma \le \tfrac{1}{2}\min_{x \in X}\|x - x_0\|$, is a radial function centred in $x_0$ that rescales the set X. We refer to Appendix B for a proof of the above proposition.
3.3. Quantization and Sampling Strategy
We build on the previous sections by devising a quantization strategy tailored to the architecture in Section 3.1. The perks of the resulting VQVAE model grant us the opportunity for further improvement, especially in the context of parallel computing. In addition to the quantization strategy, we introduce a variational autoencoder whose loss and training procedure are specifically designed to sample the discrete codes of the VQVAE model.
The proposed model overcomes the typical limitations of VQVAEs by integrating a VAE to support its generative inference. The inclusion of the VAE model provides a continuous global latent space and a sampling method that enables parallel computation. Therefore, the compound of the two models acts similarly to a single improved VAE. Quantization replaces a continuous vector with a vector from a codebook, a finite set indexed by a discrete variable. When training the point cloud autoencoder, our model reshapes the codeword w, which is the output of the encoder, into chunks and applies quantization at the chunk level (see Figure 6). It then rearranges the quantized chunks into the quantized codeword $\hat{w}$ and relays it to the generator. Similarly, during inference, we quantize the chunks generated by the variational autoencoder (the w-autoencoder) and pass them to the generator to produce random point clouds.
As is common in VQVAEs [19], our codebook is a list of learned vectors with the same embedding dimension as the chunks, and we use the Euclidean distance as the embedding distance. Our model presents two differences when compared to [19]. First, the embedding space has a much lower dimension than in standard VQVAE models and can be covered by fewer learned vectors. The second difference is that the discrete variables represent semantic information that is separate and global instead of aggregate and localized. For these reasons, we keep a separate codebook for each code, as also proposed in [17] for related reasons. More precisely, let N be the number of codes, K the size of each codebook and $d_e$ the embedding dimension. For each code $n = 1, \dots, N$, the corresponding codebook is a set of K learned embeddings in $\mathbb{R}^{d_e}$.
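A sketch (with hypothetical shapes and function name) of chunk-level quantization with a separate codebook per code: the codeword is reshaped into N chunks of dimension d_e, and each chunk is matched against its own codebook.

```python
import torch

def quantize_codeword(w, codebooks):
    """w: (B, N*d_e) continuous codeword; codebooks: (N, K, d_e), one codebook per code."""
    N, K, d_e = codebooks.shape
    chunks = w.view(-1, N, d_e)                               # (B, N, d_e) one chunk per code
    dists = torch.cdist(chunks.transpose(0, 1), codebooks)    # (N, B, K) distances to embeddings
    idx = dists.argmin(dim=-1)                                # (N, B) discrete codes
    quantized = torch.stack([codebooks[n][idx[n]] for n in range(N)], dim=1)  # (B, N, d_e)
    return quantized.reshape(w.shape), idx                    # quantized codeword and indices
```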
During training, our model tends to rely on a few latent discrete codes, leaving others unused and limiting its expressiveness. To solve the issue, we redistribute the embeddings that the model did not use every few epochs. The redistribution replaces each unused embedding with another from the same codebook, sampled according to the latter’s usage percentage in the training dataset, and adds Gaussian noise. This solution makes the discrete embeddings more homogeneous and improves the reconstruction, as sketched below.
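A sketch of how the periodic redistribution could look: unused entries are overwritten with a used entry drawn according to its usage frequency, plus Gaussian noise. The resampling rule and noise scale are our assumptions based on the description above.

```python
import torch

def redistribute_unused(codebook, usage_counts, noise_std=0.01):
    """codebook: (K, d_e) embeddings of one code; usage_counts: (K,) counts over the training set."""
    unused = usage_counts == 0
    if unused.any():
        probs = usage_counts.float() / usage_counts.sum()          # usage percentages
        donors = torch.multinomial(probs, int(unused.sum()), replacement=True)
        codebook[unused] = codebook[donors] + noise_std * torch.randn_like(codebook[donors])
    return codebook
```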
In its original formulation, VQVAE [19] learns an ancestral sampling of the discrete latent space with an autoregressive model. The choice is intuitive, as it leverages the spatial consistency of an image array. There is no such advantage in our case because the quantized chunks carry global information.
We replace the autoregressive model on the discrete space with a variational autoencoder and a second, continuous, low-dimensional latent space governed by a VAMP prior distribution [26] with learned pseudo-inputs (see Figure 7). The proposed second autoencoder, to which we refer as the w-autoencoder, presents significant differences from the standard formulation of a VAE. Naively encoding the discrete distribution does not take advantage of how the pc-autoencoder uses it and can lead to instabilities. Instead, our w-autoencoder implements the following solutions.
We reuse the quantized vectors as an embedding of the latent variables in the VAE autoencoder so that discrete latent variables with a similar effect during the point cloud generation are also close in the w-autoencoder. In this way, we limit the propagation of the reconstruction error in the w-autoencoder and make the random generation more robust.
The continuous codeword contains more information than its quantized counterpart and may lead to better generation. Because of this, instead of encoding the quantized embedding, we use the continuous chunk output by the PC encoder. Therefore, the w-autoencoder learns to quantize the input rather than reconstruct it.
We improve the discrete variable prediction by replacing the reconstruction loss of the embeddings with a cross-entropy classification loss. To obtain the estimated probability of each index, we take the Euclidean distance between the output of the decoder and every embedding of the corresponding codebook. Then, we reshift and softmax all the distances along the possible values j, pushing the reconstructed embedding close to the correct one and away from the others. Formally,

$$p_j = \frac{\exp\!\left(-\|\hat{e} - e_j\|\right)}{\sum_{j'} \exp\!\left(-\|\hat{e} - e_{j'}\|\right)},$$

where $\hat{e}$ is the decoder output for the chunk and $e_j$ is the j-th embedding of the corresponding codebook.
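A sketch of this classification-style loss: negated Euclidean distances from the decoder output to every codebook entry act as logits for a cross-entropy against the true index. The use of plain (unsquared) distances is our assumption.

```python
import torch
import torch.nn.functional as F

def code_classification_loss(decoded, codebook, target_idx):
    """decoded: (B, d_e) decoder output; codebook: (K, d_e); target_idx: (B,) true discrete codes."""
    logits = -torch.cdist(decoded, codebook)       # closer embedding -> larger logit
    return F.cross_entropy(logits, target_idx)     # softmax over the K possible values
```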
The w-autoencoder architecture independently encodes and decodes the channels corresponding to the chunks into and from a shared latent variable z. Avoiding direct connections between the channels helps reduce the required parameters. To encode a codeword, we expand the input embeddings into a larger dimension using a convolution with a unitary kernel, i.e., a dense layer shared by all embeddings. We concatenate the outputs and infer the parameters of the continuous latent variable z with a linear layer. To decode z, we give it as input to a dense layer with separate weights for each reconstructed embedding. Note that we can still compute the decoding in parallel.
Figure 8 illustrates the architecture of the w-autoencoder, complementing Figure 2.
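A minimal sketch of the w-autoencoder layout we infer from the description: each chunk is expanded independently by a shared 1×1 convolution, the outputs are concatenated into a single latent Gaussian, and decoding uses a separate dense layer per reconstructed chunk. The class name and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class WAutoencoder(nn.Module):
    def __init__(self, n_codes=8, d_e=16, hidden=64, d_z=32):
        super().__init__()
        # Shared per-chunk expansion: a dense layer applied to every chunk (1x1 convolution).
        self.expand = nn.Conv1d(d_e, hidden, kernel_size=1)
        self.to_stats = nn.Linear(n_codes * hidden, 2 * d_z)      # mean and log-variance of z
        # Separate decoding weights for each chunk, still computed in parallel.
        self.decode = nn.ModuleList(nn.Linear(d_z, d_e) for _ in range(n_codes))

    def forward(self, chunks):                      # chunks: (B, d_e, n_codes) continuous chunks
        h = torch.relu(self.expand(chunks)).flatten(1)
        mu, logvar = self.to_stats(h).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()             # reparameterization
        recon = torch.stack([dec(z) for dec in self.decode], dim=-1)     # (B, d_e, n_codes)
        return recon, mu, logvar
```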
3.4. Encoder, Training and Final Loss
The PC encoder used in the experiments is a lighter version of DGCNN [9], where we replace the edge convolutions with point convolutions followed by local max-graph-pooling as in FoldingNet [20], except for the first edge convolution, which we keep. We calculate the nearest neighbours only for the edge convolution and reuse the same neighbours for the max-graph-pooling to reduce the computational burden. The resulting encoder is significantly faster and almost as performant as a full DGCNN. Notice that the encoder does not influence inference time during generation, but we choose the lighter version to reduce training time.
It is possible to train the two autoencoders together, provided that the gradients of the w-autoencoder are stopped from affecting the pc-autoencoder. In our experiments, we decided to train the w-autoencoder on an already trained pc-autoencoder. Therefore, we present the losses of the two autoencoders separately.
The loss for the pc-autoencoder includes, as reconstruction loss, the sum of the Chamfer distance (CD) and the Earth mover distance (EMD), both defined in Section 4.1.1. To these, it adds the embedding loss corresponding to the mean square distance between the features of the PC encoder and the embeddings of the discrete codes. This loss helps balance the training of the encoder and the generator and needs a coefficient, which we find experimentally, to allow for better reconstructions. In formulae,

$$\mathcal{L}_{pc} = \mathrm{CD} + \mathrm{EMD} + \gamma\, \mathcal{L}_{\mathrm{emb}},$$

where $\mathcal{L}_{\mathrm{emb}}$ is the embedding loss and $\gamma$ the experimentally found coefficient.
The loss for the w-autoencoder is the sum of the Kullback–Leibler divergence (KLD) and the cross-entropy loss (CE) defined in Section 3.3. When using a VAMP prior with pseudo-inputs $u_k$ and Gaussian posteriors, the KLD loss for a latent variable z given a codeword w has the following form:

$$\mathrm{KLD} = \mathbb{E}_{q(z \mid w)}\!\left[\log q(z \mid w) - \log \frac{1}{K}\sum_{k=1}^{K} q(z \mid u_k)\right].$$

We find experimentally a coefficient $\beta$ to balance the two losses. In formulae,

$$\mathcal{L}_{w} = \mathrm{KLD} + \beta\, \mathrm{CE}.$$
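A sketch of this objective with a single-sample Monte Carlo estimate of the KL divergence against the VAMP prior (the uniform mixture of posteriors of the learned pseudo-inputs); beta is a placeholder for the experimentally found coefficient, and the Gaussian log-density omits the constant term, which cancels in the difference.

```python
import math
import torch

def vamp_kld(z, mu, logvar, prior_mu, prior_logvar):
    """One-sample KL estimate: log q(z|w) minus log of the pseudo-input mixture prior.
    z, mu, logvar: (B, d_z) posterior sample and parameters;
    prior_mu, prior_logvar: (K, d_z) posterior parameters of the pseudo-inputs."""
    def log_normal(x, m, lv):
        return (-0.5 * (lv + (x - m) ** 2 / lv.exp())).sum(-1)
    log_q = log_normal(z, mu, logvar)                                          # (B,)
    log_p = torch.logsumexp(log_normal(z.unsqueeze(1), prior_mu, prior_logvar), dim=1)
    log_p = log_p - math.log(prior_mu.shape[0])                                # uniform mixture weights
    return (log_q - log_p).mean()

def w_loss(kld, cross_entropy, beta=1.0):
    return kld + beta * cross_entropy
```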