The proposed module is integrated into PointNet++ and DGCNN to evaluate its semantic segmentation performance. In addition, its applications in other deep learning tasks are tested to demonstrate its feasibility across various PointNet variants.
3.1. Dataset
All deep learning networks in this paper are evaluated on four datasets:
(i) ModelNet10 [20]: The ModelNet10 dataset is a collection of 3D mesh models used for the object classification task in computer vision and machine learning research. It contains 4899 CAD models from 10 categories (toilets, chairs, nightstands, bookshelves, dressers, beds, tables, sofas, desks and monitors). We utilize a subset containing four random categories of ModelNet10 to test the proposed module. In the subset, 1000 samples are used for training and 290 samples for testing.
(ii) Stanford 3D semantic parsing dataset [30]: The Stanford 3D semantic parsing dataset includes point clouds and corresponding semantic labels for 271 indoor scenes from six areas. The semantic labels of all these scenes are divided into 13 categories, such as chairs, walls, ceilings, tables, and other structural elements.
(iii) Our hull block dataset: 149 standard CAD models of ship hull blocks are gathered. They are divided into four categories (bottom blocks, side shell blocks, deck blocks, and fore-and-aft blocks; note that ‘fore-and-aft block’ is a single category) and split into 120 training samples and 29 testing samples.
(iv) Our hull block butt-joint part dataset: In the ship construction step, all hull blocks are joined together with butt joints; thus, the construction quality control of hull block butt-joints plays a significant role in shipyards. We gather 74 hull block butt-joint part point clouds from CAD models and realistic scanning scenes. Each point cloud contains semantic labels from four categories (hull plates, T-steels, flat steels and bulb flat steels). All point clouds are split into 60 training samples and 14 testing samples.
In the pre-processing step, the neighbor search radius is set to 0.2 after the point clouds are rescaled into a unit sphere. In the region growing configuration, the minimum and maximum point numbers of each region are set to 800 and 100,000, respectively; the neighbor point number is set to 30; the threshold on the angle between normal vectors, which determines whether a neighbor point is added to the region of the central point, is set to 15°; and the curvature threshold, which determines whether a point is considered in the region growing step, is set to 0.05. With these configurations, each input point cloud can be divided into sub-regions.
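For reference, the following is a minimal sketch of a region growing procedure with the configuration above, assuming the per-point normals and curvatures have already been estimated (the function and parameter names are ours, not taken from an existing library):

```python
import numpy as np
from scipy.spatial import cKDTree

def region_growing(points, normals, curvatures,
                   k_neighbors=30, angle_thresh_deg=15.0,
                   curvature_thresh=0.05,
                   min_region=800, max_region=100_000):
    """Greedy region growing over a point cloud (illustrative sketch).

    points:     (N, 3) coordinates rescaled into a unit sphere
    normals:    (N, 3) unit normal vectors (assumed precomputed)
    curvatures: (N,)   surface-variation values (assumed precomputed)
    """
    tree = cKDTree(points)
    cos_thresh = np.cos(np.deg2rad(angle_thresh_deg))
    unassigned = np.ones(len(points), dtype=bool)
    regions = []

    # Seeds are visited from the flattest points upwards.
    for seed in np.argsort(curvatures):
        if not unassigned[seed]:
            continue
        region, queue = [seed], [seed]
        unassigned[seed] = False
        while queue and len(region) < max_region:
            center = queue.pop()
            _, nbrs = tree.query(points[center], k=k_neighbors)
            for nbr in np.atleast_1d(nbrs):
                if not unassigned[nbr]:
                    continue
                # Add the neighbor if its normal deviates by less than the angle threshold.
                if abs(np.dot(normals[center], normals[nbr])) >= cos_thresh:
                    unassigned[nbr] = False
                    region.append(nbr)
                    # Only sufficiently smooth points keep growing the region.
                    if curvatures[nbr] < curvature_thresh:
                        queue.append(nbr)
        if len(region) >= min_region:
            regions.append(np.array(region))
    return regions
```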
The existing PointNet variants adopt their default configurations as presented in the original papers. When the proposed module is taken into consideration, its four variables are set to 5, 640, 256, and 50, respectively, which means that, for each model or point cloud in the four aforementioned datasets, 163,840 points are sampled to represent the input 3D geometry. Therefore, to ensure sampling quality, the point number of each instance in the datasets must be well above 163,840; this point number threshold is set to 300,000. For the Stanford 3D semantic parsing dataset, unfortunately, some scenes have to be discarded because their point numbers are less than 300,000, leaving 199 instances. These 199 scenes are split into 160 training samples and 39 testing samples. Readers may notice that the processing point number of the proposed module (163,840 points) is significantly larger than the point numbers in the original PointNet variants (1024 or 2048 points). Each sampled point cloud is rescaled into a unit sphere, and only the 3D coordinates are utilized as input data; the remaining information, such as normal vectors, colors, etc., is disregarded. In the training step, all data are augmented by randomly rotating, translating, scaling and perturbing the point coordinates, as sketched below.
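As a concrete illustration of the augmentation step, a minimal NumPy helper could look as follows (the rotation axis, scaling range and perturbation magnitudes are illustrative assumptions, not the paper's exact values):

```python
import numpy as np

def augment_point_cloud(points, rng=np.random.default_rng()):
    """Randomly rotate, translate, scale and perturb an (N, 3) point cloud."""
    # Random rotation about the vertical axis.
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    points = points @ rotation.T
    # Random scaling, translation and per-point jitter
    # (magnitudes below are illustrative).
    points = points * rng.uniform(0.8, 1.2)
    points = points + rng.uniform(-0.1, 0.1, size=(1, 3))
    points = points + np.clip(rng.normal(0.0, 0.01, size=points.shape), -0.05, 0.05)
    return points
```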
3.2. Semantic Segmentation
To test the semantic segmentation performance of the proposed module, it is embedded into PointNet++ and DGCNN, and evaluated on the Stanford 3D semantic parsing dataset and our hull block butt-joint part dataset. As Kernel Point Convolution (KPConv) [31] can process more points than the original PointNet++ and DGCNN, it is also evaluated in this section. The architecture of the proposed module is set to the configuration shown in Figure 10 and Figure 11 to match the format of the input tensor.
The set abstraction part of the proposed module is shown in Figure 10. Since the SA modules are utilized to generate the feature tensor of the seed points, the sample point number of the FPS method in the first SA module is set to 64, the local search number and radius are set to 32 and 0.3, and the layer sizes of its multi-layer perceptron are set to [64, 128], i.e., the multi-layer perceptron consists of two fully connected layers, the first with 64 units and the second with 128 units. The second SA module is similar to the first: the FPS sample point number is set to 32, the local search number to 16, the local search radius to 0.5, and the layer sizes of its multi-layer perceptron follow the configuration in Figure 10. In the third SA module, all local points and features are aggregated to compute the feature of the seed points, so the FPS and local neighbor search procedures are omitted, and the layer sizes of its multi-layer perceptron likewise follow Figure 10.
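The FPS and local (ball query) search operations referred to above follow the standard PointNet++ definitions; a compact NumPy sketch of both, written here for illustration rather than taken from the authors' implementation, is:

```python
import numpy as np

def farthest_point_sampling(points, n_samples):
    """Select n_samples indices from an (N, 3) array by farthest point sampling."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    distances = np.full(n, np.inf)
    selected[0] = 0                           # start from an arbitrary point
    for i in range(1, n_samples):
        diff = points - points[selected[i - 1]]
        distances = np.minimum(distances, np.einsum('ij,ij->i', diff, diff))
        selected[i] = int(np.argmax(distances))
    return selected

def ball_query(points, centers, radius, n_neighbors):
    """For each center, gather up to n_neighbors point indices within radius."""
    groups = []
    for c in centers:
        d2 = np.sum((points - c) ** 2, axis=1)
        idx = np.flatnonzero(d2 <= radius ** 2)[:n_neighbors]
        if idx.size == 0:                     # fall back to the nearest point
            idx = np.array([int(np.argmin(d2))])
        # pad by repeating indices so that every group has the same size
        idx = np.pad(idx, (0, n_neighbors - idx.size), mode='edge')
        groups.append(idx)
    return np.stack(groups)                   # (n_centers, n_neighbors)

# e.g. first SA module: 64 FPS seeds, radius 0.3, 32 neighbors per seed
# seeds = farthest_point_sampling(xyz, 64)
# neighborhoods = ball_query(xyz, xyz[seeds], radius=0.3, n_neighbors=32)
```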
The PointNet++ or DGCNN network is subsequently connected to the proposed module. Its input data are a 5 × 640 × 3 coordinate tensor and a 5 × 640 × C feature tensor, where C is the feature channel produced by the set abstraction part; these can be treated as the 3D coordinates and corresponding features of five mini point clouds, each containing 640 points. As this input point number differs from that of the original PointNet++ and DGCNN, some parameters of these two PointNet variants require modification. The parameters of the three SA modules in PointNet++ are set as shown in Figure 12; their meaning is analogous to that of the proposed module, so it is not repeated here. As for DGCNN, because its network architecture depends only loosely on the input point number, we only make slight modifications to its parameters: the local search number is set to 64, and the final embedding channel is set to 512.
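For DGCNN, the local search number corresponds to the k of its k-nearest-neighbor graph; the following PyTorch sketch shows how the k = 64 neighborhoods and the resulting edge features are typically built (our illustration of the standard EdgeConv input construction, not the authors' code):

```python
import torch

def edge_features(x, k=64):
    """Build DGCNN-style edge features for a batch of point features.

    x: (B, C, N) tensor of per-point features (C = 3 for raw coordinates).
    Returns a (B, 2 * C, N, k) tensor of [x_j - x_i, x_i] edge features.
    """
    inner = -2 * torch.matmul(x.transpose(2, 1), x)            # (B, N, N)
    xx = torch.sum(x ** 2, dim=1, keepdim=True)                # (B, 1, N)
    pairwise = -xx.transpose(2, 1) - inner - xx                # negative squared distances
    idx = pairwise.topk(k=k, dim=-1)[1]                        # (B, N, k) nearest neighbors

    b, c, n = x.shape
    idx_base = torch.arange(b, device=x.device).view(-1, 1, 1) * n
    flat_idx = (idx + idx_base).view(-1)
    feats = x.transpose(2, 1).reshape(b * n, c)[flat_idx].view(b, n, k, c)
    center = x.transpose(2, 1).unsqueeze(2).expand(-1, -1, k, -1)
    return torch.cat((feats - center, center), dim=3).permute(0, 3, 1, 2)
```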
The feature propagation part of the proposed module is shown in Figure 11. The feature propagation architecture contains three standard feature propagation (FP) modules, three fully connected layers and two dropout layers. During each iteration, if the preceding network is PointNet++, the feature propagation architecture receives the feature tensor output by PointNet++; if the preceding network is DGCNN, it receives a feature tensor with 512 channels, as DGCNN's final embedding channel is set to 512. The first FP module is denoted ‘Feature Propagation Module 3’ since it gathers the input points, input features, and output points of ‘Set Abstraction Module 3’ in Figure 10 through a skip-link concatenation procedure. The second FP module utilizes the input points, input features, and output points of ‘Set Abstraction Module 2’, together with the output features of the first FP module, to generate high-level features. As only the coordinates of each dataset are used, the final FP module gathers only the input points and output points of ‘Set Abstraction Module 1’. The output features are processed by ‘Full Connection Layer 1’, which consists of 64 units; ‘Drop Out Layer 1’, which sets the output value of each neuron in the preceding layer to zero with probability 0.4; ‘Full Connection Layer 2’; ‘Drop Out Layer 2’; and the output layer ‘Full Connection Layer 3’. If the whole network is evaluated on the Stanford 3D semantic parsing dataset, the output channel is 13; if it is tested on our hull block butt-joint part dataset, the output channel is 4.
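Each FP module upsamples features from a sparser point set back to a denser one by inverse-distance-weighted interpolation over the three nearest neighbors, followed by the skip-link concatenation described above; a brief NumPy sketch of this step (our own helper) is:

```python
import numpy as np

def propagate_features(dense_xyz, sparse_xyz, sparse_feats, skip_feats=None, eps=1e-8):
    """PointNet++-style feature propagation (sketch).

    dense_xyz:    (N, 3) points of the denser level (e.g. SA module input points)
    sparse_xyz:   (M, 3) points of the sparser level (e.g. SA module output points)
    sparse_feats: (M, C) features attached to sparse_xyz
    skip_feats:   optional (N, C_skip) features concatenated via the skip link
    """
    # Squared distances between every dense point and every sparse point.
    d2 = np.sum((dense_xyz[:, None, :] - sparse_xyz[None, :, :]) ** 2, axis=-1)
    nn = np.argsort(d2, axis=1)[:, :3]                    # three nearest sparse points
    w = 1.0 / (np.take_along_axis(d2, nn, axis=1) + eps)  # inverse-distance weights
    w = w / np.sum(w, axis=1, keepdims=True)
    interpolated = np.einsum('nk,nkc->nc', w, sparse_feats[nn])
    if skip_feats is not None:
        interpolated = np.concatenate([skip_feats, interpolated], axis=1)
    return interpolated                                    # fed to the FP module's MLP
```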
Similar to the configuration in Figure 12, the three feature propagation modules of the original PointNet++ are also modified: the first FP module uses a multi-layer perceptron with two fully connected layers, the first containing 512 neurons and the second 256 neurons, while the second and third FP modules use multi-layer perceptrons whose layer sizes are adjusted accordingly.
As for KPConv, each point cloud is down-sampled to 81,920 points. The hyperparameters of KPConv are kept consistent with those in its original work. Note that the original KPConv work achieves the segmentation of dense point clouds by dividing the whole point cloud into unrelated sub-point clouds within many random sphere regions. We could certainly adhere to this approach to deal with a dense or even unbounded number of points, but the maximum point number that KPConv can actually handle is then the maximum point number of the sub-point clouds.
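For clarity, a minimal sketch of the random sphere-region subdivision strategy described above is given below; it is our simplified illustration, whereas the original KPConv implementation additionally controls the choice of sphere centers according to point density:

```python
import numpy as np

def random_sphere_regions(points, radius, n_regions, rng=np.random.default_rng()):
    """Split a dense cloud into n_regions (possibly overlapping) sphere sub-clouds."""
    sub_clouds = []
    for _ in range(n_regions):
        center = points[rng.integers(len(points))]   # random existing point as center
        mask = np.sum((points - center) ** 2, axis=1) <= radius ** 2
        sub_clouds.append(points[mask])
    return sub_clouds
```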
In the training step, the Adam [32] algorithm is applied to optimize the networks containing the proposed modified PointNet++ module. The learning rate is set to 0.001, the weight decay rate to 0.0001, and the momentum value of batch normalization to 0.9; the Adam parameters β1 and β2 are set to their default values of 0.9 and 0.999, and the data batch size is 5.
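Expressed in PyTorch, this training configuration corresponds roughly to the following sketch (the model below is only a placeholder for PointNet++/DGCNN with the proposed module):

```python
import torch

# Placeholder network; in practice this is PointNet++ or DGCNN with the proposed module.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 64), torch.nn.BatchNorm1d(64), torch.nn.ReLU(),
    torch.nn.Linear(64, 13),
)

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,               # learning rate
    betas=(0.9, 0.999),    # Adam beta_1 and beta_2 (defaults)
    weight_decay=1e-4,     # weight decay rate
)

# The paper quotes a batch-normalization momentum of 0.9; PyTorch's `momentum`
# argument follows the opposite convention, so 0.9 corresponds to momentum=0.1 here.
for m in model.modules():
    if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d)):
        m.momentum = 0.1
```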
Table 1 and
Table 2 show the semantic segmentation results of seven methods (the original PointNet++, PointNet++ with our random RNN sampling, PointNet++ with our random RNN sampling based on region growing results, the original DGCNN, DGCNN with our random RNN sampling, DGCNN with our random RNN sampling based on region growing results, and KPConv) on the hull block butt-joint part dataset and the Stanford 3D semantic parsing dataset. Compared with the original PointNet++ and DGCNN, incorporating the proposed modules yields superior performance. For PointNet++ on the Stanford 3D semantic parsing dataset, the improvements from the random RNN sampling pre-processing and the region growing pre-processing are similar (86.7% and 86.2%, as shown in Table 2), while in the other cases the region growing pre-processing module outperforms the random RNN sampling pre-processing module. The advantage of the region growing pre-processing module may be attributed to its incorporation of implicit prior region segmentation knowledge, since the ground-truth semantic segmentation information is generally closely associated with the region segmentation information.
Figure 13 and
Figure 14 show four semantic segmentation instances from our hull block butt-joint part dataset and the Stanford 3D semantic parsing dataset, respectively. In the two figures, each column represents one instance, different semantic labels are represented by different colors, and the overall prediction accuracy of each instance is shown in the title of its corresponding subplot. Our pre-processing module outperforms KPConv on our hull block butt-joint part dataset and achieves segmentation accuracy similar to that of KPConv on the Stanford 3D semantic parsing dataset. Compared with the sparse representation of 2048 points in the original PointNet++ and DGCNN, the 163,840 points processed by the proposed modules preserve a more comprehensive range of the original point cloud information, and their geometric structure closely resembles that of the original point cloud, which may be an attractive advantage of our work.
To analyze the properties of our region growing pre-processing module, the point intersection over union (IoU) results on our hull block butt-joint part dataset are considered. As shown in
Table 3, the methods embedding our random RNN sampling based on region growing results outperform the methods embedding our random RNN sampling in the segmentation of the ‘hull plate’ and ‘T-steel’ classes. As most points in the hull block butt-joint part dataset belong to these two classes, the overall accuracies of the corresponding methods in
Table 1 are better. As for the Stanford 3D semantic parsing dataset, as shown in
Table 4, the methods embedding our random RNN sampling based on region growing results outperform the methods embedding our random RNN sampling in most classes. However, in some classes, such as ‘beam’, ‘ceiling’, ‘sofa’ and ‘window’, the random-RNN-sampling-based methods outperform the region-growing-based methods; since these classes account for a considerable share of the points in the whole point clouds, the overall accuracies of the region-growing-based methods in Table 2 are sometimes superior and sometimes inferior to those of the random-RNN-sampling-based methods.
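The point IoU reported in Table 3 and Table 4 is the standard per-class intersection over union computed on point labels; a brief helper of our own for computing it is:

```python
import numpy as np

def per_class_iou(pred, gt, n_classes):
    """Point-wise intersection over union for each semantic class.

    pred, gt: integer label arrays of equal length (one label per point).
    Returns an array of length n_classes; classes absent from both are NaN.
    """
    ious = np.full(n_classes, np.nan)
    for c in range(n_classes):
        intersection = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious[c] = intersection / union
    return ious

# Example with the four butt-joint part classes
# (0: hull plate, 1: T-steel, 2: flat steel, 3: bulb flat steel):
# ious = per_class_iou(predicted_labels, ground_truth_labels, n_classes=4)
```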
3.3. Other Applications
To show that the proposed module is applicable to different point-based deep learning tasks, this section applies it to two other deep learning applications, classification and 3D registration, both with dense input points.
In the classification task, the proposed module is also embedded into PointNet++ and DGCNN, and evaluated on the ModelNet10 dataset and our hull block dataset. The architecture of the classification networks is similar, as shown in
Figure 10 and
Figure 12. The training configuration is the same as that of the semantic segmentation experiments in
Section 3.2.
Table 5 and
Table 6 show the classification results of six methods on our hull block dataset and the ModelNet10 dataset. It can be found that four test accuracies in Table 5 are identical, which is caused by the small size of our hull block dataset. The proposed module outperforms the original PointNet++ and DGCNN in Table 5. On ModelNet10, the classification ability of the proposed module is comparable to that of the original PointNet++ and DGCNN (as shown in Table 6, PointNet++ with random RNN sampling achieves 95.1% test accuracy, close to the 95.2% of PointNet++, while DGCNN with region growing results achieves 98.3% test accuracy, outperforming the 96.0% of DGCNN).
Table 5 and
Table 6 illustrate the feasibility of the proposed module in the classification of dense input point clouds (each input point cloud contains 163,840 points).
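For the classification variant, the per-point features produced by the backbone are collapsed into a single global descriptor before the class scores are predicted; the exact head follows Figure 10 and Figure 12, while the PyTorch sketch below only illustrates a typical PointNet++-style head with layer sizes assumed by us:

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """Global max pooling over points followed by fully connected layers.

    Layer sizes are illustrative; the paper's configuration is in Figures 10 and 12.
    """
    def __init__(self, in_channels=512, n_classes=4, dropout=0.4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, 256), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(256, 64), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(64, n_classes),
        )

    def forward(self, point_features):
        # point_features: (batch, n_points, in_channels) from the backbone
        global_feature = point_features.max(dim=1).values
        return self.mlp(global_feature)

# head = ClassificationHead(in_channels=512, n_classes=4)  # hull block dataset: 4 classes
```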
As for the 3D registration task, the proposed module is embedded into PRNet and evaluated on our hull block dataset and the ModelNet10 dataset.
PRNet is a deep learning network designed for the partial-to-partial point cloud registration task. PRNet is self-supervised, jointly learning an appropriate geometric representation, a keypoint detector that finds points in common between partial views, and keypoint-to-keypoint correspondences [
28]. The architecture illustrated in
Figure 10 is used to replace the original point embedding head of PRNet, and the remaining parts of PRNet follow their original configurations.
Table 7 and
Table 8 show the 3D registration results of three methods (PRNet, PRNet with our random RNN sampling, and PRNet with our random RNN sampling based on region growing results) on our hull block dataset and ModelNet10.
Figure 15 and
Figure 16 show four registration instances from the two datasets. In the two figures, each row represents one instance, and the initial positions and final aligned poses of the fixed point clouds and floating point clouds are shown; in each subplot, the purple point cloud represents the fixed point cloud, while the green point cloud represents the floating point cloud. Similar to the classification results, the utilization of the proposed module empowers PRNet with the capability to process dense input point clouds.
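The aligned poses shown in the figures are obtained by applying the rigid transformation predicted by the registration network to the floating point cloud; in NumPy this amounts to the following trivial helper (ours, shown only to make the evaluation pipeline explicit):

```python
import numpy as np

def apply_rigid_transform(float_points, rotation, translation):
    """Apply a predicted rigid transform to the floating point cloud.

    float_points: (N, 3) floating point cloud
    rotation:     (3, 3) rotation matrix predicted by the registration network
    translation:  (3,)   translation vector predicted by the registration network
    """
    return float_points @ rotation.T + translation
```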