Article

A 3D Keypoints Voting Network for 6DoF Pose Estimation in Indoor Scene

1 Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
2 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China
3 Key Laboratory of Intelligent Infrared Perception, Chinese Academy of Sciences, Shanghai 200083, China
4 School of Information Science and Technology, ShanghaiTech University, Shanghai 201210, China
* Author to whom correspondence should be addressed.
Machines 2021, 9(10), 230; https://doi.org/10.3390/machines9100230
Submission received: 4 September 2021 / Revised: 3 October 2021 / Accepted: 3 October 2021 / Published: 8 October 2021
(This article belongs to the Section Robotics, Mechatronics and Intelligent Machines)

Abstract

This paper addresses instance-level 6DoF pose estimation from a single RGB-D image of an indoor scene. Many recent works have shown that a two-stage network, which first detects keypoints and then regresses the 6D pose from them, achieves remarkable performance. However, previous methods pay little attention to channel-wise attention, and their keypoints are not selected by making comprehensive use of the RGB-D information, which limits network performance. To enhance the representation ability of RGB features, a modular split-attention block that enables attention across feature-map groups is proposed. In addition, by combining Oriented FAST and Rotated BRIEF (ORB) keypoints with the Farthest Point Sampling (FPS) algorithm, a simple but effective keypoint selection method named ORB-FPS is presented to prevent keypoints from falling on non-salient regions. The proposed algorithm is evaluated on the Linemod and YCB-Video datasets; the experimental results demonstrate that our method outperforms current approaches, achieving ADD(S) accuracies of 94.5% on Linemod and 91.4% on YCB-Video.

1. Introduction

6D pose estimation is a fundamental task in many computer vision applications, such as augmented reality [1], autonomous navigation [2,3], robot grasping [4,5] and intelligent manufacturing. Its purpose is to obtain the rotation and translation from the object's coordinate system to the camera's coordinate system. In practical applications, the estimation process must be robust to noise, occlusion and varying lighting conditions, and it must run in real time. RGB images contain rich texture information, so the pose of an object can be estimated by detecting the object in the image. Traditional RGB-based methods [6] extract global or local features to match against the source model. With the rapid development of Convolutional Neural Networks (CNNs), image feature learning has improved significantly, and CNNs are also used in pose estimation: PVNet [7] regresses 2D keypoints with an end-to-end network and then applies the PnP algorithm, estimating the 6D pose from the 2D-3D correspondences of the object. RGB-based methods are usually computationally efficient, but they are sensitive to background, illumination and texture. In addition, because they work on the projection of the 3D model, they partly lose the geometric constraints. Depth images and point clouds contain rich geometric information; depth-based methods [8] extract point cloud features with hand-crafted descriptors or with PointNet and estimate the pose by feature matching or bounding-box regression. Compared with RGB images, the point cloud carries more geometric information, but it lacks texture and is sparse. Moreover, due to specular reflection, RGB-D sensors cannot obtain depth information for smooth or transparent surfaces.
Given the limitations of RGB-based and point-cloud-based methods outlined above, how to make full use of both the texture information of RGB images and the geometric information of point clouds becomes an important issue. PoseCNN [9] first uses the RGB image to estimate a coarse pose of the object and then refines the result with Iterative Closest Point (ICP) [10]. However, these two separate steps cannot be optimized jointly and are too time-consuming for real-time applications. Chi et al. [11] extract information from the depth image through a CNN and use it as a supplementary channel of the RGB image, but this requires complex preprocessing and does not make the best use of the RGB-D information. DenseFusion [12] is an innovative work: it extracts RGB and point cloud information for every pixel with a CNN and PointNet [13], respectively, and then embeds and fuses the two modalities through a per-pixel fusion scheme. The later method FFB6D [14] adds communication between the two branches. Full-pixel estimation greatly increases the computational complexity; PVN3D [15] uses full-pixel Hough voting to obtain 3D keypoints and then estimates the 6D pose with the least-squares method. Compared with 2D-keypoint-based methods, PVN3D significantly increases robustness. However, PVN3D selects its keypoints with Farthest Point Sampling (FPS), which considers only Euclidean distance and ignores texture information, so the selected keypooints may fall on smooth surfaces.
To make comprehensive use of image and point cloud information, we propose the 3D keypoint voting network (3DKV) for 6DoF pose estimation. As shown in Figure 1, 3DKV is a two-stage network: it first detects the keypoints and then uses the least-squares method to compute the 6D pose. To enhance the representation ability of RGB features, we present a feature-map split-attention block. More specifically, every block divides the feature map into several groups along the channel dimension and further into finer-grained subgroups; each group is computed as a weighted combination of its subgroups, where the weights are derived from contextual information. In addition, an ORB-FPS keypoint selection approach that considers both texture and geometry is proposed. We first detect ORB keypoints in the RGB images, calculate the corresponding 3D points in the point cloud using the camera parameters, and then obtain the final 3D ORB-FPS keypoints by applying Farthest Point Sampling (FPS) to the selected points. This method improves the ability of the keypoints to characterize objects and prevents them from falling on non-salient areas such as smooth surfaces, making the keypoints easier to locate and improving pose estimation. In general, the contributions of this paper can be summarized as follows:
(1) A split-attention block is introduced into the image feature extraction stage. It combines channel attention and group convolution: channel attention enhances feature fusion across the channel dimension, while group convolution reduces the number of parameters and improves computational efficiency.
(2) A simple and effective keypoint selection approach named ORB-FPS is proposed. It selects keypoints in two stages and prevents them from falling on non-salient areas such as smooth surfaces, making them easier to locate and improving the network's ability to estimate the pose.
(3) A thorough evaluation of the proposed algorithm is presented, including comparisons with current algorithms on the Linemod and YCB datasets. The experimental results show that the proposed method outperforms the other algorithms.
The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 describes the details of our proposed method. Section 4 provides the experimental results and analyses. Section 5 concludes with a summary and perspectives.

2. Related Work

2.1. Pose from RGB Images

RGB-based methods for 6DoF pose estimation can be roughly divided into three categories: template-based, correspondence-based and voting-based. Template-based approaches first learn object features through convolutional neural networks, then detect the objects in the image and calculate the pose. SSD6D [16] extends the SSD detection network to enable instance segmentation and 6D pose estimation. PoseCNN [9] proposes an end-to-end method based on RGB images; it includes three modules, semantic segmentation, 3D translation estimation and 3D rotation estimation, and uses VGG16 to extract features. Zeng et al. [17] propose a self-supervised pose estimation network based on multi-view images. Keypoint-based strategies first detect the keypoints of the corresponding object in the image and then calculate the pose from them. For example, YOLO6D [18] detects the projections of the center point and the 8 bounding box corners of the 3D object onto the image and then applies the PnP algorithm to obtain the final pose. DPOD [19] uses a dense UV map to directly obtain the correspondence between 2D pixels and the vertices of the 3D model. Other methods vote on image pixels or patches to obtain keypoints. PVNet [7] first votes for keypoints through RANSAC and then uses the 2D-3D correspondences to calculate the 6D pose. HybridPose [20] adds edge vectors and symmetry correspondences to the PVNet framework to enhance robustness for symmetric objects. Motivated by the shortcomings of the PnP algorithm and of direct regression, GDR-Net [21] presents a geometry-guided regression network. Furthermore, Stevsic et al. [22] add attention blocks to the feature extraction module to improve its representation ability. Some researchers use panoramic images for this task: Zhang et al. [23] show that panoramic images significantly improve the geometric analysis of the scene thanks to the large captured context, and Xu et al. [24] extend this analysis with object poses. Pintore et al. [25] combine modern CNNs, 3D scene information and parameter optimization to recover the oriented bounding boxes of the captured objects inside the 3D layout of the indoor environment. RGB-based approaches are usually efficient, but most of them are built on perspective projection, which causes a partial loss of geometric constraint information.

2.2. Pose from Point Cloud

With the development of terrestrial laser scanners (TLS) and low-cost 3D scanners such as the Kinect, point clouds can be obtained easily. There are many classical point-cloud-based algorithms for pose estimation, such as those computing FPFH [26], SHOT [27] and other local descriptors. PointNet [13] and its variants [28] are groundbreaking works that apply deep learning directly to point clouds and enable advanced applications such as object classification, semantic segmentation and object recognition. A variety of 6D pose estimation approaches have also been built on PointNet. VoteNet [29] uses Hough voting to generate points close to the object center, then groups and aggregates the points to obtain candidate boxes. PointNetLK [30] combines PointNet and the Lucas-Kanade (LK) algorithm into a single trainable recurrent deep neural network and achieves outstanding pose estimation performance. Weng et al. [31] propose category-level 9DoF pose estimation, which consists of 3D rotation, 3D translation and 3D size. Huang et al. [32] predict the pose by learning stable geometric features. PCRNet [33] uses PointNet as the backbone to extract global features for pose estimation and is more robust to noise. Gao et al. [34] propose a lightweight data synthesis pipeline to produce training data. However, point clouds are textureless and sparse. Meanwhile, due to specular reflection, depth cameras cannot obtain the depth of smooth or transparent surfaces, which limits performance.

2.3. Pose from RGBD Data

Given the shortcomings of RGB-based and point-cloud-based approaches discussed above, some researchers combine the two types of information. Traditional methods for pose estimation mainly rely on hand-crafted features; for example, Linemod [8] locates objects and estimates their poses by extracting gradient features from the RGB image and normal features from the depth image. Other methods use deep learning to extract RGB-D features. Shao et al. [35] propose two fusion strategies: the first concatenates the RGB and depth images into a raw input for the CNN, while the second, like [3,36,37], uses CNNs to extract RGB and depth features separately and then concatenates the features for object segmentation and pose estimation. However, these methods neglect the inner structure of the depth channel and treat depth features as a mere supplement to the RGB channels. DenseFusion [12] extracts RGB and depth features separately through a CNN and PointNet and designs a dense pixel-level fusion method that integrates the RGB and point cloud features more properly. To enhance the connection between the two branches, refs. [14,38] build full-flow bidirectional fusion and correlation fusion communication modules. PVN3D [15] proposes a new way to generate 3D keypoints: it obtains them through full-pixel voting and calculates the 6D pose using the least-squares method. 6-PACK [39] proposes an anchor-based attention network to generate an ordered set of keypoints. The 3D keypoint strategy improves pose estimation significantly; however, it relies only on the distance factor and does not effectively use the RGB texture information.

3. Methodology

Given an RGB-D image, 6D pose estimation can be described as finding the rigid transformation from the object's coordinate system to the camera's coordinate system, which consists of a rotation matrix and a translation vector. The entire process should be fast, accurate, and robust to noise and occlusion.
To tackle this problem, this paper proposes a 3D keypoint voting (3DKV) strategy for 6D pose estimation. As shown in Figure 2, 3DKV has a two-stage structure: it first detects the 3D ORB-FPS keypoints and then predicts the 6D pose. In detail, given an RGB-D image as input, we use an ImageNet-pretrained CNN with in-channel attention blocks to extract texture and other appearance features from the RGB image, and PointNet++ [28] to extract geometric features from the point cloud. We add split-attention blocks to every stage of the image feature extraction module to enhance channel-wise information fusion. 3DKV then uses DenseFusion to fuse the RGB and point cloud features of every pixel and feeds them to the keypoint detection module. To make better use of texture, we first detect 2D ORB keypoints in the RGB image and then obtain the 3D FPS keypoints from the 3D points corresponding to the 2D ORB keypoints. The keypoint detection module includes two parallel tasks: instance semantic segmentation and keypoint prediction. In addition, we select 12,888 points from each RGB-D image as seed points and use a shared Multi-Layer Perceptron (MLP) to share the training parameters between 3D keypoint prediction and instance semantic segmentation.
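To make the data flow concrete, the following is a minimal, hypothetical sketch of the two-stage pipeline described above; the pipeline object and all of its methods (extract_rgb_features, depth_to_cloud, extract_point_features, fuse_dense, segment_instances, vote_keypoints, least_squares_fit) are placeholders for illustration, not the authors' released code.

```python
# Hypothetical end-to-end sketch of the two-stage 3DKV pipeline described above.
# Every `pipeline.*` method name is a placeholder, not the authors' released API.

def estimate_pose(rgb, depth, camera_K, model_keypoints, pipeline):
    """Two-stage 6DoF estimation: (1) detect 3D keypoints, (2) fit the pose."""
    # Stage 1: per-pixel features from both modalities, fused densely.
    rgb_feat = pipeline.extract_rgb_features(rgb)            # split-attention CNN
    cloud = pipeline.depth_to_cloud(depth, camera_K)         # back-project depth pixels
    geo_feat = pipeline.extract_point_features(cloud)        # PointNet++ backbone
    fused = pipeline.fuse_dense(rgb_feat, geo_feat)          # DenseFusion-style per-pixel fusion

    # Stage 1 (cont.): segment instances and vote per-point offsets to the keypoints.
    seg = pipeline.segment_instances(fused)
    pred_keypoints = pipeline.vote_keypoints(fused, cloud, seg)   # (M, 3) in the camera frame

    # Stage 2: least-squares fit between model-frame and predicted camera-frame keypoints.
    R, t = pipeline.least_squares_fit(model_keypoints, pred_keypoints)
    return R, t
```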

3.1. Split-Attention for Image Feature Extraction

For the extraction of RGB image features for classification or instance segmentation, the receptive field and the interaction across channels are very important. The convolution operation is the core of a CNN; its essence is to perform feature fusion over a local region, both spatially (h, w) and across channels. Enhancing the network's feature extraction ability improves object classification, recognition, segmentation and other applications. Enlarging the receptive field is a common strategy, that is, fusing more spatial features or extracting multi-scale spatial information [40]. SENet [41] learns the importance of different channel features, which is equivalent to adding an attention operation along the channel dimension: it emphasizes the most informative channels while suppressing the less important ones. SKNet [42] designs a dynamic selection mechanism so that every neuron can choose a different receptive field according to the size of the target object. Building on these two improvements, ResNeSt [43] proposes split-attention, which combines the dynamic receptive field and the attention mechanism to improve the network's representation ability. Based on ResNeSt, we propose a multi-branch attention RGB feature extraction module. As shown in Figure 3, given an RGB image of size (h, w, c), where h and w are the height and width of the input image and c is the number of channels, we first divide the input channels into M cardinals, denoted cardinal 1, ..., M, and then divide each cardinal into N groups, so the total number of groups is M x N. Let the input of group j be I_j; the input of cardinal m is then
$$ I_m = \sum_{j=N(m-1)+1}^{Nm} I_j, \qquad m = 1, 2, \ldots, M $$
Since each cardinal contains multiple groups, an average pooling layer is added after each convolution operation, so the weight W_m of each cardinal can be calculated as:
$$ W_m = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} I_m(i, j) $$
According to the group weights, the output of cardinal m is:
$$ O_m = \sum_{i=1}^{N} a_i^m \, I_{N(m-1)+i} $$
where $a_i^m$ is the weight after the softmax:
$$ a_i^m = \begin{cases} \dfrac{\exp\!\left(\mathcal{W}_i^c(W_m)\right)}{\sum_{j=1}^{N} \exp\!\left(\mathcal{W}_j^c(W_m)\right)} & N > 1 \\[2ex] \dfrac{1}{1 + \exp\!\left(-\mathcal{W}_i^c(W_m)\right)} & N = 1 \end{cases} $$
where $\mathcal{W}^c$ is determined by the number of groups in the cardinal; N = 1 means that the cardinal is treated as a whole. After obtaining the output of each cardinal, the final output is obtained by concatenating the cardinal outputs.
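The following is a simplified sketch of such a split-attention block, assuming PyTorch; it follows the equations above (grouped convolution, pooled per-channel weights, softmax or sigmoid attention over the groups) but is not the exact ResNeSt or 3DKV implementation, and the layer sizes are illustrative.

```python
# Simplified split-attention block (sketch, assuming PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAttention(nn.Module):
    def __init__(self, channels, cardinality=1, splits=2, reduction=4):
        super().__init__()
        self.splits = splits
        # One grouped 3x3 conv produces cardinality * splits feature-map groups.
        self.conv = nn.Conv2d(channels, channels * splits, kernel_size=3, padding=1,
                              groups=cardinality * splits, bias=False)
        self.bn = nn.BatchNorm2d(channels * splits)
        inner = max(channels // reduction, 8)
        self.fc1 = nn.Conv2d(channels, inner, 1, groups=cardinality)
        self.fc2 = nn.Conv2d(inner, channels * splits, 1, groups=cardinality)

    def forward(self, x):
        b, c = x.shape[:2]
        feats = F.relu(self.bn(self.conv(x)))                    # (B, C*N, H, W)
        feats = feats.view(b, self.splits, c, *feats.shape[2:])  # split into N groups
        gap = feats.sum(dim=1).mean(dim=(2, 3), keepdim=True)    # pooled weights W_m
        attn = self.fc2(F.relu(self.fc1(gap)))                   # (B, C*N, 1, 1)
        attn = attn.view(b, self.splits, c, 1, 1)
        if self.splits > 1:
            attn = F.softmax(attn, dim=1)                        # a_i^m, N > 1
        else:
            attn = torch.sigmoid(attn)                           # a_i^m, N = 1
        return (attn * feats).sum(dim=1)                         # weighted sum O_m
```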

3.2. Pointcloud Feature Extraction

The 3D keypoint voting requires accurate geometric reasoning on the point cloud as well as contextual information. Traditional approaches generally use hand-crafted features such as the Fast Point Feature Histogram (FPFH), the Signature of Histograms of Orientations (SHOT), the Viewpoint Feature Histogram (VFH) and others. These features are time-consuming to compute and sensitive to noise, illumination, etc. PointNet++ [28] is an improved version of PointNet [13]: it proposes a hierarchical network to capture local features and uses PointNet to aggregate local neighborhood information. In addition, compared with more recent point cloud networks such as PointCNN [44] and RandLA-Net [45], PointNet++ performs better on point clouds with non-uniform density thanks to its Multi-Scale Grouping (MSG) and Multi-Resolution Grouping (MRG), so we choose it as the backbone. The proposed method handles the point cloud directly instead of transforming it into other representations, which avoids information loss during processing. Meanwhile, owing to the sparsity of the point cloud, the processing is applied only to the points of interest.
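As an illustration of the local aggregation that PointNet++-style set abstraction performs, the snippet below sketches a ball-query grouping step in NumPy; it is a simplified stand-in, not the backbone actually used, and the radius and sample counts are arbitrary.

```python
# Simplified stand-in for the ball-query grouping inside a PointNet++ set-abstraction
# layer (NumPy); the real backbone also applies a shared MLP and max-pooling per group,
# and MSG repeats this grouping at several radii and concatenates the results.
import numpy as np

def ball_query_group(points, centroids, radius=0.05, n_sample=32):
    """points: (N, 3), centroids: (S, 3) -> grouped local patches of shape (S, n_sample, 3)."""
    dists = np.linalg.norm(points[None, :, :] - centroids[:, None, :], axis=-1)   # (S, N)
    grouped = np.zeros((len(centroids), n_sample, 3), dtype=points.dtype)
    for i, d in enumerate(dists):
        idx = np.where(d < radius)[0]
        if len(idx) == 0:
            idx = np.array([np.argmin(d)])       # fall back to the nearest point
        idx = np.resize(idx, n_sample)           # repeat indices if too few neighbours
        grouped[i] = points[idx] - centroids[i]  # express neighbours in local coordinates
    return grouped
```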

3.3. ORB-FPS Keypoint Selection

3D keypoint voting mainly includes two steps: selecting keypoints on the 3D model and designing a keypoint prediction module to predict them. Many earlier 3D keypoint methods simply select the corner points of the target's bounding box. However, the corner points are virtual and do not lie on the object, which is detrimental to 6D pose estimation. PVN3D uses Farthest Point Sampling (FPS) to select the keypoints, but FPS relies only on Euclidean distance, so the keypoints may fall in areas without conspicuous features, such as smooth surfaces. To avoid this, this paper proposes ORB-FPS 3D keypoints: we first detect 2D ORB keypoints in the projected image of the 3D model, then calculate the corresponding 3D points on the model using the camera intrinsic parameters, and finally select the M ORB-FPS keypoints with the FPS algorithm. The ORB algorithm first builds a multi-scale image pyramid and then detects salient keypoints with the Features from Accelerated Segment Test (FAST) method; it is fast and scale invariant. The ORB keypoint selection process can be described as follows:
(1) Choose a pixel p from the image and let its gray value be G_p.
(2) Calculate the gray value difference between p and its neighboring points. Set a threshold T and consider two points different when the difference is greater than T.
(3) Compare p with its 16 neighboring points and consider p a keypoint if there are n (n = 12) consecutive points that differ from it.
After obtaining the ORB keypoints and their corresponding 3D points on the model, FPS is used to find the 3D keypoints. Specifically, we first select the model's center point as the first point and then repeatedly add the point farthest from all already selected keypoints until M (M = 8) keypoints are obtained. The procedure is summarized in Algorithm 1.
Algorithm 1: ORB-FPS keypoint algorithm.
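A sketch of the ORB-FPS selection procedure in Algorithm 1 is given below, assuming OpenCV and NumPy; back_project() is a hypothetical helper standing in for the 2D-to-3D lookup through the camera intrinsics described in the text, and excluding the center seed from the returned keypoints is our reading of the text.

```python
# Sketch of the ORB-FPS keypoint selection (Algorithm 1), assuming OpenCV and NumPy.
# back_project() is a hypothetical helper, not a library function.
import cv2
import numpy as np

def orb_fps_keypoints(rendered_rgb, depth, K, n_keypoints=8):
    # 1) Detect 2D ORB keypoints on a rendered view of the 3D model.
    gray = cv2.cvtColor(rendered_rgb, cv2.COLOR_RGB2GRAY)
    orb = cv2.ORB_create(nfeatures=500)
    kps = orb.detect(gray, None)
    uv = np.array([kp.pt for kp in kps])              # (K, 2) pixel coordinates

    # 2) Back-project the 2D keypoints to 3D points on the model surface.
    pts3d = back_project(uv, depth, K)                # hypothetical helper, returns (K, 3)

    # 3) Farthest point sampling over the candidates, seeded at the model centre.
    selected = [pts3d.mean(axis=0)]
    dist = np.full(len(pts3d), np.inf)
    for _ in range(n_keypoints):
        dist = np.minimum(dist, np.linalg.norm(pts3d - selected[-1], axis=1))
        selected.append(pts3d[np.argmax(dist)])
    return np.stack(selected[1:])                     # the n_keypoints ORB-FPS keypoints
```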

3.4. 3D Keypoint Voting

Through DenseFusion, the information from the image and the point cloud is closely combined to obtain semantic features for each pixel. Based on the RGB-D information and an offset strategy, we propose the 3D keypoint voting module. In detail, we calculate the offsets from each point to the predicted keypoints and to the ground-truth keypoints, and use their difference as the regression target. Let the total number of points be N and the number of keypoints be M; the loss can then be expressed as:
$$ L_{\mathrm{keypoint}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( \left\| \Delta v_{i,j}\big|_x \right\| + \left\| \Delta v_{i,j}\big|_y \right\| + \left\| \Delta v_{i,j}\big|_z \right\| + \left\| \Delta v_{i,j}\big|_f \right\| \right), \qquad \Delta v_{i,j} = v_{i,j} - \hat{v}_{i,j} $$
where $v_{i,j}$ and $\hat{v}_{i,j}$ are the offsets from the points to the predicted and ground-truth keypoints, respectively. $\cdot|_x$, $\cdot|_y$, $\cdot|_z$ and $\cdot|_f$ denote the components of the offset, where $\cdot|_x$, $\cdot|_y$, $\cdot|_z$ are the 3D coordinates and $\cdot|_f$ is the extracted feature.
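A minimal sketch of this offset loss, assuming PyTorch tensors and treating each of the x, y, z and feature components as a scalar per keypoint:

```python
# Sketch of the keypoint offset loss defined above (PyTorch assumed).
import torch

def keypoint_offset_loss(pred_offsets, gt_offsets):
    """pred_offsets, gt_offsets: (N, M, 4) with columns (x, y, z, f)."""
    delta = pred_offsets - gt_offsets                 # Δv_{i,j}
    # L1 norm of each component, summed over keypoints and components, averaged over points.
    return delta.abs().sum(dim=(1, 2)).mean()
```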

3.5. Instance Semantic Segmentation

To improve the framework's ability to handle multi-object scenes in 6D pose estimation, we add an instance semantic segmentation module to the framework. A common approach uses a semantic segmentation architecture to obtain regions of interest (RoIs) containing only a single object and then performs keypoint detection inside each RoI. Using predicted segmentation to extract global and local features for classifying objects is helpful for keypoint localization, and the added size information also helps the framework distinguish objects with similar appearance but different sizes. Based on this property, the proposed architecture performs the two tasks at the same time. Given the comprehensive features of each pixel, their semantic labels are predicted with the Focal loss [46]:
$$ L_{\mathrm{seg}} = -\alpha (1 - p_i)^{\gamma} \log(p_i), \qquad p_i = c_i \cdot h_i $$
where $\alpha$ is the balancing parameter used to balance the importance of positive and negative samples, and $\gamma$ is the focusing parameter that adjusts the down-weighting of easy samples. In this paper, we set $\alpha = 0.25$ and $\gamma = 2$. $c_i$ and $h_i$ are the predicted confidence and the ground-truth label of the i-th point.
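A short sketch of this focal loss on per-point class confidences, assuming PyTorch; the (N, C) confidence and one-hot label layout is an assumption about how $c_i$ and $h_i$ are stored.

```python
# Sketch of the per-point focal loss above (PyTorch assumed).
import torch

def focal_loss(confidence, one_hot_labels, alpha=0.25, gamma=2.0, eps=1e-7):
    """confidence: (N, C) predicted class probabilities, one_hot_labels: (N, C) ground truth."""
    p = (confidence * one_hot_labels).sum(dim=1).clamp(min=eps)   # p_i = c_i . h_i
    return (-alpha * (1.0 - p) ** gamma * torch.log(p)).mean()
```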

3.6. Center Point Voting

In addition to the keypoints, a center point voting module is also used. The 3D center point will not be occluded by other objects, so it can be treated as an aid to instance semantic segmentation. Following PVN3D, we use the center point module to distinguish different object instances in multi-object scenes. The 3D center point can be regarded as a special keypoint, so the center point loss $L_{\mathrm{centerpoint}}$ takes the same form as $L_{\mathrm{keypoint}}$ with M = 1.
The 3D keypoint prediction network, the instance semantic segmentation network and the center point voting network are trained jointly, with the loss function defined as:
$$ Loss = \lambda_1 L_{\mathrm{keypoint}} + \lambda_2 L_{\mathrm{seg}} + \lambda_3 L_{\mathrm{centerpoint}} $$
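The joint objective can be assembled as below; the λ weights are placeholders, since the text does not state the values used.

```python
# Assembling the joint training objective; the lambda weights are placeholders.
def total_loss(l_keypoint, l_seg, l_centerpoint, lambdas=(1.0, 1.0, 1.0)):
    return lambdas[0] * l_keypoint + lambdas[1] * l_seg + lambdas[2] * l_centerpoint
```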

3.7. Pose Calculation

Given two sets of keypoints, the M predicted keypoints $[k_i]_{i=1}^{M}$ in the camera coordinate system and the M ORB-FPS keypoints $[k_i^{*}]_{i=1}^{M}$ in the object coordinate system, the least-squares method is used to compute the 6D pose transformation, i.e., to find the rotation matrix R and translation vector T that bring the two sets of keypoints closest to each other:
$$ R, T = \arg\min_{R, T} \sum_{i=1}^{M} \left\| k_i - \left( R\, k_i^{*} + T \right) \right\|^2 $$
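This least-squares fit can be solved with the standard SVD-based (Kabsch/Umeyama-style) alignment; a NumPy sketch under that assumption:

```python
# SVD-based least-squares fit between the object-frame ORB-FPS keypoints (model_kps)
# and the voted camera-frame keypoints (pred_kps); a sketch, assuming the standard
# Kabsch/Umeyama solution.
import numpy as np

def fit_pose(model_kps, pred_kps):
    """model_kps, pred_kps: (M, 3). Returns R (3x3), T (3,) with pred ≈ R @ model + T."""
    mu_m, mu_p = model_kps.mean(axis=0), pred_kps.mean(axis=0)
    H = (model_kps - mu_m).T @ (pred_kps - mu_p)      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                          # avoid a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    T = mu_p - R @ mu_m
    return R, T
```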

4. Experiments

This section presents the experimental results of the proposed method. We compare its performance with other pose estimation algorithms on two datasets, the Linemod dataset [47] and the YCB-Video dataset [9].

4.1. Datasets

Linemod is a standard benchmark widely used in 6D pose estimation. It consists of 13 objects from 13 videos and contains several challenges for pose estimation: cluttered scenes, texture-less objects and varying lighting conditions.
The YCB-Video dataset is used to evaluate pose estimation performance under symmetry and severe occlusion. It contains 21 YCB objects of varying shape and texture, with 6D poses captured and annotated in 92 RGB-D videos. Following PoseCNN, we split the dataset into 80 videos for training and the remaining 12 videos for testing. In addition, we add synthetic images to the training set.

4.2. Evaluation Metrics

We evaluate our framework with the ADD and ADD-S metrics. ADD is the average Euclidean distance between the model points transformed with the predicted 6D pose $[R, t]$ and with the ground-truth pose $[\hat{R}, \hat{t}]$:
$$ \mathrm{ADD} = \frac{1}{N} \sum_{x \in O} \left\| (R x + t) - (\hat{R} x + \hat{t}) \right\| $$
where O is the set of object mesh points and N is their total number. For symmetric objects such as the eggbox and glue, the ADD-S metric is used instead; it computes the distance to the closest model point:
$$ \mathrm{ADD\text{-}S} = \frac{1}{N} \sum_{x_1 \in O} \min_{x_2 \in O} \left\| (R x_1 + t) - (\hat{R} x_2 + \hat{t}) \right\| $$
Following PoseCNN [9], we report the area under the ADD-S curve (AUC) obtained by varying the average distance threshold, and we set the maximum threshold to 2 cm.
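For reference, a NumPy sketch of the two metrics as defined above (the AUC computation with the 2 cm threshold is omitted):

```python
# NumPy sketch of the ADD and ADD-S metrics; model_pts are the object model points,
# (R, t) the predicted pose and (R_gt, t_gt) the ground-truth pose.
import numpy as np

def add_metric(model_pts, R, t, R_gt, t_gt):
    pred = model_pts @ R.T + t
    gt = model_pts @ R_gt.T + t_gt
    return np.linalg.norm(pred - gt, axis=1).mean()

def add_s_metric(model_pts, R, t, R_gt, t_gt):
    pred = model_pts @ R.T + t                        # (N, 3)
    gt = model_pts @ R_gt.T + t_gt                    # (N, 3)
    # For every transformed point, take the distance to the closest ground-truth point.
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)   # (N, N)
    return d.min(axis=1).mean()
```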

4.3. Implementation Details

During implementation, following PVN3D, we select eight ORB-FPS keypoints and one center point for the prediction network. During training and testing, we sample 12,888 points from each RGB-D frame. To enhance the framework's robustness to lighting conditions and background, we synthesize 70,000 rendered images and 10,000 fused multi-object images from the SUN2012pascalformat dataset [48]. We set the initial learning rate to 0.001 and decrease it by 0.00001 every 20,000 iterations.
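A small sketch of the stated learning-rate schedule (initial rate 0.001, reduced by 0.00001 every 20,000 iterations); the lower bound is an assumption added only to keep the rate positive.

```python
# Step-wise learning-rate schedule as stated in the text; the floor is an assumption.
def learning_rate(iteration, base_lr=1e-3, decay=1e-5, step=20000, floor=1e-5):
    return max(base_lr - decay * (iteration // step), floor)
```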

4.4. Results on the Benchmark Datasets

To evaluate the pose estimation performance of our algorithm, we design several groups of experiments and report the results. The ADD(S) metric uses the ADD metric for non-symmetric objects and ADD-S for symmetric objects.

4.4.1. Results on the Linemod Dataset

Table 1 shows the quantitative results of our 6D pose estimation experiments on the Linemod dataset. We compare our method with the RGB-based methods PoseCNN [9] and PVNet [7] and the RGB-D-based methods PointFusion [49], DenseFusion [12] and PVN3D [15]. PoseCNN, PVNet, PointFusion and DenseFusion serve as baselines. For PVN3D, the authors released pretrained models on the Linemod and YCB datasets, so we test these pretrained models on the datasets. All methods are single-view and evaluated without iterative refinement. Among keypoint schemes, BB8 (the 8 bounding-box corner points) yields the worst performance because its keypoints may not lie on the object's surface. In addition, the RGB-D-based methods perform better than the RGB-based ones, demonstrating that both RGB and point cloud features are necessary for this task. Furthermore, the proposed algorithm surpasses the other methods by more than 0.3 percentage points on the ADD(S) metric, owing to its stronger representation ability and the keypoint selection scheme. Figure 4 shows visualization results of our approach on the Linemod dataset; it can be seen that our method performs well on the Linemod objects.

4.4.2. Results on the YCB-Video Dataset

Table 2 shows the quantitative results of the proposed algorithm and other methods for 6D pose estimation on the YCB-Video dataset. The compared methods are PoseCNN, DenseFusion and PVN3D, all without iterative refinement. As Table 2 shows, our method surpasses the other approaches by 1.7 percentage points on the ADD-S metric and 0.8 percentage points on the ADD(S) metric. Our method also outperforms the others on the symmetric objects (marked with * in the table). Figure 5 shows visualization results of our method on the YCB-Video dataset; some objects are severely occluded, yet the proposed method still performs well. This is because our algorithm uses dense prediction: it computes every point/pixel's translation vector to the keypoints and votes for the predicted keypoints through these vectors. The voting scheme is motivated by a property of rigid objects: once we see the visible local parts, we can infer the relative directions to the invisible parts. Furthermore, the comprehensive use of RGB images and the point cloud leads to better performance on symmetric objects. It can be concluded that our algorithm works well in occluded scenes.

4.5. Ablation Study

A series of ablation studies is conducted to analyse the influence of different keypoint selection methods, the number of keypoints and the in-channel weight calculation.
Effect of the keypoint selection. In this part, we compare our ORB-FPS keypoints with the 8 corner points of the 3D bounding box (BB8) and with 8 Farthest Point Sampling (FPS) points. Different numbers of ORB-FPS keypoints are also included in the comparison. Table 3 shows the results of these schemes on the YCB dataset. Overall, the ORB-FPS scheme performs better than FPS when selecting the same number of keypoints, because ORB-FPS makes better use of the texture features of the RGB image. The BB8 keypoints perform worst because the bounding box corners are virtual points far away from the object. Moreover, the comparison of different numbers of keypoints shows that eight ORB-FPS keypoints is a good choice: fewer keypoints cannot fully express the complete shape of the object, while more keypoints enlarge the output space and make the network harder to train.
Effect of the channel attention. To verify the influence of the channel attention, we compare the experimental results with and without the channel-attention block. According to Table 4, adding the channel-attention block increases the 6D pose performance. The block computes the weight of each channel and uses these weights to integrate channel-wise information, which increases the network's representation ability.

5. Conclusions

This paper presents an accurate deep 3D keypoint voting network for 6DoF pose estimation. We propose a split-attention block to enhance the network's ability to learn features from the RGB image; thanks to the split channel attention, the network can selectively enhance useful channels and suppress less useful ones. In addition, we introduce the 3D ORB-FPS keypoint selection method, which first detects 2D ORB keypoints in the image, then finds the corresponding 3D points through the camera intrinsic matrix, and finally selects the 3D keypoints with the FPS algorithm. The proposed keypoint selection method leverages the texture information of the RGB image and the geometric information of the point cloud. Our algorithm outperforms previous methods on two benchmark datasets on the ADD-S and ADD(S) metrics. Overall, we believe our network can be applied in real applications such as autonomous driving and bin picking.

Author Contributions

Conceptualization, S.S.; methodology, S.S. and H.L.; software, H.L., Y.Z. and L.L.; validation, S.S., G.L. and H.L.; formal analysis, H.L. and H.X.; investigation, S.S.; resources, S.S.; data curation, H.L. and Y.L.; writing original draft preparation, H.L.; writing review and editing, H.L. and S.S.; visualization, H.L. and H.X.; supervision, S.S.; project administration, S.S. and H.L.; funding acquisition, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Innovation Project of Shanghai Institute of Technical Physics, Chinese Academy of Sciences (NO. X-209).

Data Availability Statement

Not applicable.

Acknowledgments

The authors would like to acknowledge the Linemod dataset and the YCB dataset for making their datasets available to us.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Marchand, E.; Uchiyama, H.; Spindler, F. Pose estimation for augmented reality: A hands-on survey. IEEE Trans. Vis. Comput. Graph. 2015, 22, 2633–2651.
  2. Chen, X.; Ma, H.; Wan, J.; Li, B.; Xia, T. Multi-view 3d object detection network for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1907–1915.
  3. El Madawi, K.; Rashed, H.; El Sallab, A.; Nasr, O.; Kamel, H.; Yogamani, S. Rgb and lidar fusion based 3d semantic segmentation for autonomous driving. In Proceedings of the 2019 IEEE Intelligent Transportation Systems Conference (ITSC), Auckland, New Zealand, 27–30 October 2019; pp. 7–12.
  4. Tremblay, J.; To, T.; Sundaralingam, B.; Xiang, Y.; Fox, D.; Birchfield, S. Deep object pose estimation for semantic robotic grasping of household objects. arXiv 2018, arXiv:1809.10790.
  5. Wu, Y.; Zhang, Y.; Zhu, D.; Chen, X.; Coleman, S.; Sun, W.; Hu, X.; Deng, Z. Object-Driven Active Mapping for More Accurate Object Pose Estimation and Robotic Grasping. arXiv 2020, arXiv:2012.01788.
  6. Mur-Artal, R.; Montiel, J.M.M.; Tardos, J.D. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Rob. 2015, 31, 1147–1163.
  7. Peng, S.; Liu, Y.; Huang, Q.; Zhou, X.; Bao, H. Pvnet: Pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4561–4570.
  8. Hinterstoisser, S.; Cagniart, C.; Ilic, S.; Sturm, P.; Navab, N.; Fua, P.; Lepetit, V. Gradient Response Maps for Real-Time Detection of Textureless Objects. IEEE Trans. Pattern Anal. Mach. Intell. 2012, 34, 876–888.
  9. Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv 2017, arXiv:1711.00199. Available online: https://arxiv.org/pdf/1711.00199.pdf (accessed on 26 May 2018).
  10. Besl, P.J.; McKay, N.D. Method for registration of 3-D shapes. Sensor Fusion IV: Control Paradigms and Data Structures. Int. Soc. Opt. Photonics 1992, 1611, 586–606.
  11. Chi, L.; Jin, B.; Hager, G.D. A Unified Framework for Multi-View Multi-Class Object Pose Estimation. arXiv 2018, arXiv:1803.08103. Available online: https://arxiv.org/abs/1803.08103 (accessed on 8 September 2018).
  12. Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3343–3352.
  13. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  14. He, Y.; Huang, H.; Fan, H.; Chen, Q.; Sun, J. FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. arXiv 2021, arXiv:2103.02242. Available online: https://arxiv.org/abs/2103.02242 (accessed on 24 June 2021).
  15. He, Y.; Sun, W.; Huang, H.; Liu, J.; Fan, H.; Sun, J. Pvn3d: A deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11632–11641.
  16. Kehl, W.; Manhardt, F.; Tombari, F.; Ilic, S.; Navab, N. Ssd-6d: Making rgb-based 3d detection and 6d pose estimation great again. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1521–1529.
  17. Zeng, A.; Yu, K.T.; Song, S.; Suo, D.; Walker, E.; Rodriguez, A.; Xiao, J. Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 1383–1386.
  18. Tekin, B.; Sinha, S.N.; Fua, P. Real-Time Seamless Single Shot 6D Object Pose Prediction. 2017. Available online: https://openaccess.thecvf.com/content_cvpr_2018/papers/Tekin_Real-Time_Seamless_Single_CVPR_2018_paper.pdf (accessed on 24 June 2021).
  19. Zakharov, S.; Shugurov, I.; Ilic, S. DPOD: 6D Pose Object Detector and Refiner. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019.
  20. Song, C.; Song, J.; Huang, Q. Hybridpose: 6d object pose estimation under hybrid representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 431–440.
  21. Wang, G.; Manhardt, F.; Tombari, F.; Ji, X. GDR-Net: Geometry-Guided Direct Regression Network for Monocular 6D Object Pose Estimation. arXiv 2021, arXiv:2102.12145.
  22. Stevsic, S.; Hilliges, O. Spatial Attention Improves Iterative 6D Object Pose Estimation. arXiv 2020, arXiv:2101.01659.
  23. Zhang, Y.; Song, S.; Tan, P.; Xiao, J. Panocontext: A whole-room 3d context model for panoramic scene understanding. In Proceedings of the 13th European Conference on Computer Vision (ECCV2014), Zurich, Switzerland, 6–12 September 2014; Springer: Cham, Switzerland, 2014; pp. 668–686.
  24. Xu, J.; Stenger, B.; Kerola, T.; Tung, T. Pano2cad: Room layout from a single panorama image. In Proceedings of the 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), Santa Rosa, CA, USA, 24–31 March 2017; pp. 354–362.
  25. Pintore, G.; Ganovelli, F.; Villanueva, A.J.; Gobbetti, E. Automatic modeling of cluttered multi-room floor plans from panoramic images. Comput. Graph. Forum 2019, 38, 347–358.
  26. Rusu, R.B.; Blodow, N.; Beetz, M. Fast point feature histograms (FPFH) for 3D registration. In Proceedings of the 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, 12–17 May 2009; pp. 3212–3217.
  27. Tombari, F.; Salti, S.; Di Stefano, L. Unique signatures of histograms for local surface description. In Proceedings of the European Conference on Computer Vision, Crete, Greece, 5–11 September 2010; Springer: Berlin, Germany, 2010; pp. 356–369.
  28. Qi, C.R.; Li, Y.; Hao, S.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. arXiv 2017, arXiv:1706.02413. Available online: https://arxiv.org/pdf/1706.02413.pdf (accessed on 7 June 2017).
  29. Qi, C.R.; Litany, O.; He, K.; Guibas, L.J. Deep Hough Voting for 3D Object Detection in Point Clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27–28 October 2019.
  30. Aoki, Y.; Goforth, H.; Srivatsan, R.A.; Lucey, S. PointNetLK: Robust & Efficient Point Cloud Registration Using PointNet. arXiv 2019, arXiv:1903.05711. Available online: https://arxiv.org/abs/1903.05711 (accessed on 15 June 2019).
  31. Weng, Y.; Wang, H.; Zhou, Q.; Qin, Y.; Duan, Y.; Fan, Q.; Chen, B.; Su, H.; Guibas, L.J. CAPTRA: Category-level Pose Tracking for Rigid and Articulated Objects from Point Clouds. arXiv 2021, arXiv:2104.03437.
  32. Huang, J.; Shi, Y.; Xu, X.; Zhang, Y.; Xu, K. StablePose: Learning 6D Object Poses from Geometrically Stable Patches. arXiv 2021, arXiv:2102.09334.
  33. Sarode, V.; Li, X.; Goforth, H.; Aoki, Y.; Srivatsan, R.A.; Lucey, S.; Choset, H. PCRNet: Point Cloud Registration Network Using PointNet Encoding. arXiv 2019, arXiv:1908.07906. Available online: https://arxiv.org/pdf/1908.07906.pdf (accessed on 4 November 2019).
  34. Gao, G.; Lauri, M.; Hu, X.; Zhang, J.; Frintrop, S. CloudAAE: Learning 6D Object Pose Regression with On-line Data Synthesis on Point Clouds. arXiv 2021, arXiv:2103.01977. Available online: https://arxiv.org/pdf/2103.01977.pdf (accessed on 2 March 2021).
  35. Shao, L.; Cai, Z.; Liu, L.; Lu, K. Performance evaluation of deep feature learning for RGB-D image/video classification. Inf. Sci. 2017, 385, 266–283.
  36. Li, C.; Bai, J.; Hager, G.D. A unified framework for multi-view multi-class object pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018.
  37. Joukovsky, B.; Hu, P.; Munteanu, A. Multi-modal deep network for RGB-D segmentation of clothes. Electron. Lett. 2020, 56, 432–435.
  38. Cheng, Y.; Zhu, H.; Acar, C.; Jing, W.; Wu, Y.; Li, L.; Tan, C.; Lim, J.H. 6d pose estimation with correlation fusion. arXiv 2019, arXiv:1909.12936.
  39. Wang, C.; Martín-Martín, R.; Xu, D.; Lv, J.; Lu, C.; Fei-Fei, L.; Savarese, S.; Zhu, Y. 6-pack: Category-level 6d pose tracker with anchor-based keypoints. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10059–10066.
  40. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
  41. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
  42. Li, X.; Wang, W.; Hu, X.; Yang, J. Selective kernel networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 510–519.
  43. Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R.; et al. Resnest: Split-attention networks. arXiv 2020, arXiv:2004.08955.
  44. Li, Y.; Bu, R.; Sun, M.; Wu, W.; Di, X.; Chen, B. Pointcnn: Convolution on x-transformed points. Adv. Neural Inf. Process. Syst. 2018, 31, 820–830.
  45. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117.
  46. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
  47. Hinterstoisser, S.; Lepetit, V.; Ilic, S.; Holzer, S.; Bradski, G.; Konolige, K.; Navab, N. Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Proceedings of the Asian Conference on Computer Vision, Daejeon, Korea, 5–9 November 2012; Springer: Berlin, Germany, 2012; pp. 548–562.
  48. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Available online: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html (accessed on 18 May 2012).
  49. Xu, D.; Anguelov, D.; Jain, A. Pointfusion: Deep sensor fusion for 3d bounding box estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 8–23 June 2018; pp. 244–253.
Figure 1. Pipeline of our proposed algorithm. (a) The RGB-D input data of our network; (b) the blue lines represent the direction vectors to the keypoints; (c) the 9 predicted points: 8 blue ORB-FPS keypoints and one red center point; (d) the 6DoF pose estimated by our algorithm.
Figure 2. Overview of our algorithm. Given the RGB image and point cloud, point-wise features are extracted by the attention CNN and PointNet++, respectively. The 3D keypoint prediction, center point prediction and instance semantic segmentation modules are trained jointly to obtain the keypoints. Finally, the 6DoF pose is estimated through the least-squares algorithm.
Figure 3. Illustration of the proposed in-channel attention block. Several channel groups are obtained by convolution; the weight of every group is then calculated and the weighted groups are concatenated together.
Figure 4. Some visual results on the Linemod dataset.
Figure 5. Some visualization results on the YCB dataset. The proposed method performs well even when the objects are seriously occluded.
Table 1. Accuracy of our method and the baseline methods on the Linemod dataset in terms of the ADD(S) metric; objects marked with * are considered symmetric.
Object        PoseCNN  PVNet  PointFusion  DenseFusion  PVN3D  Ours
ape           77.0     43.6   70.4         79.5         96.0   96.4
benchvise     97.5     99.9   80.7         84.2         94.2   94.5
cam           93.5     86.9   60.8         76.5         92.3   95.0
can           96.5     95.5   61.1         86.6         93.8   94.9
cat           82.1     79.3   79.1         88.8         95.4   95.8
driller       95.0     96.4   47.3         77.7         93.0   94.0
duck          77.7     52.6   63.0         76.3         93.7   92.9
eggbox *      97.1     99.2   99.9         99.9         96.2   96.7
glue *        99.4     95.7   99.3         99.4         96.4   97.4
holepuncher   52.8     81.9   71.8         79.0         94.5   96.3
iron          98.3     98.9   83.2         92.1         92.2   93.1
lamp          97.5     99.3   62.3         92.3         93.5   93.7
phone         87.7     92.4   78.8         88.0         93.5   93.6
average       88.6     86.3   73.7         86.2         94.2   94.5
Table 2. Quantitative evaluation of 6D pose (ADD-S, ADD(S)) on the YCB dataset. Symmetric objects are marked with *.
                          PoseCNN           DenseFusion       PVN3D             Ours
Object                    ADD-S   ADD(S)    ADD-S   ADD(S)    ADD-S   ADD(S)    ADD-S   ADD(S)
002 master chef can       83.9    50.2      95.3    70.7      95.8    79.6      96.4    81.0
003 cracker box           76.9    53.1      92.5    86.9      95.4    93.0      94.5    90.0
004 sugar box             84.2    68.4      95.1    90.8      97.2    95.9      97.1    95.6
005 tomato soup can       81.0    66.2      93.8    84.7      95.7    89.8      95.6    89.2
006 mustard bottle        90.4    81.0      95.8    90.9      97.6    96.5      97.6    95.4
007 tuna fish can         88.0    70.7      95.7    79.6      96.7    91.7      96.7    87.0
008 pudding box           79.1    62.7      94.3    89.3      96.5    93.6      95.1    90.6
009 gelatin box           87.2    75.2      97.2    95.8      97.4    95.1      97.2    95.6
010 potted meat can       78.5    59.5      89.3    79.6      92.1    84.4      91.2    87.1
011 banana                86.0    72.3      90.0    76.7      96.3    92.4      96.9    94.1
019 pitcher base          77.0    53.3      93.6    87.1      96.2    93.8      96.9    95.5
021 bleach cleanser       71.6    50.3      94.4    87.5      95.5    90.9      96.1    93.4
024 bowl *                69.6    69.6      86.0    86.0      85.5    85.5      95.4    95.4
025 mug                   78.2    58.5      95.3    83.8      97.1    94.0      97.3    92.8
035 power drill           72.7    55.3      92.1    83.7      96.6    95.0      96.7    94.9
036 wood block *          64.3    64.3      89.5    89.5      90.8    90.8      95.1    95.1
037 scissors              56.9    35.8      90.1    77.4      91.8    91.8      96.5    92.6
040 large marker          71.7    58.3      95.1    89.1      95.2    90.1      94.3    86.2
051 large clamp *         50.2    50.2      71.5    71.5      90.0    90.0      95.4    95.4
052 extra large clamp *   44.1    44.1      70.2    70.2      77.6    77.6      95.5    95.5
061 foam brick *          88.0    88.0      92.2    92.2      95.4    95.4      96.5    96.5
Average                   75.2    61.3      90.9    82.9      94.2    90.7      95.8    91.4
Table 3. Experimental results on the YCB dataset for different keypoint selection approaches.
          BB8    FPS8   ORB-FPS4  ORB-FPS8  ORB-FPS12
ADD-S     93.2   94.2   94.1      95.8      94.7
ADD(S)    89.4   90.7   90.5      91.4      91.0
Table 4. Experimental results on the YCB dataset with/without the channel-attention block.

          Without  With
ADD-S     95.6     95.8
ADD(S)    91.0     91.4
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
