1. Introduction
Today, flexible and scalable robotic systems are a key component of the new trend associated with Industry 4.0 and, more specifically, the new mass-customization production concept [1]. The evolution of visual perception with RGB-D sensors and of computational power has transformed autonomous and semi-autonomous tasks in industry. Tasks involving robot navigation [2], manufacturing operations [3], and grasping [4] have become more accessible and efficient over the years. From this perspective, machine learning technologies should be used to improve grasping systems for object recognition, localization, and handling [5].
Object detection networks, such as [6,7], can detect objects in the environment in robotic applications but lack depth information, which reduces their efficiency in tasks where scale or a more precise perception of the surroundings is needed. To address this, point clouds have been used in semantic segmentation tasks such as [8,9,10].
In robotic manipulators, point clouds have been used to estimate the best region to grasp without prior knowledge of the object [11]. The systems presented in [12,13] use deep learning networks to estimate the best grasping position, with some limitations, such as planar grasping, where the gripper is always perpendicular to the table, or the need for a heuristic to select one of the possible grasps returned by the network. Neural networks are also commonly applied to grasping from partial point cloud observations [14], pose refinement [15,16], and grasping with multiple robotic manipulators [17]. Recent grasping methods are evolving towards 6D pose estimation, such as [18], where a method and dataset use RGB or RGB-D data to segment, detect, and estimate an object's 6D pose. The segmentation and point cloud generation of objects from RGB and RGB-D images, and an improvement in the 6D estimation of moving objects, can be seen in [19] and [20,21], respectively.
This paper proposes a new selective grasping system in 6D using only point clouds from RGB-D sensors. The paper has four main contributions: (i) validation in a real scenario of a new grasping algorithm based on lateral curvatures and geometric primitives, which improves on the work proposed in [11,22]; (ii) a new deep learning network for object classification, named Point Encoder Convolution (PEC), which uses point clouds and enables selective grasping in an environment with several objects; (iii) a new point cloud classification dataset obtained in a simulator, named LARS Classification Dataset, which facilitates the manipulation of object classes in different poses; and (iv) extensive validation of the complete grasping system using a UR5 robotic manipulator, a Robotiq 2F-140 gripper, and an Intel RealSense D435 RGB-D sensor.
The remainder of this document is organized as follows. Section 2 presents an overview of the proposed grasping system, showing each step of the grasping process. Section 3 proposes a new deep learning network. In Section 4, the 6D grasping method is presented. Section 5 offers experimental results. Finally, the conclusions are drawn in Section 6.
2. Grasping System
The grasping system overview proposed in this paper can be seen in Figure 1, where the stages of the system are as follows:
Filtering stage:
- 1. Input point cloud: the initial point cloud returned by the RGB-D sensor.
- 2. Gripper and table removal: the point cloud is filtered by removing the gripper and the table, keeping only the objects on the table. The filtering process works as follows:
  - (a) The gripper is removed by limiting the z-axis (depth axis) to the interval $[z_{min}, z_{max}]$ meters, where $z_{min}$ and $z_{max}$ are parameters of the system setup. This also removes objects far away from the camera, such as the floor.
  - (b) The table is removed using plane segmentation and an outlier filter.
- 3. Clustering to separate the objects in the point cloud. Returns the point cloud of each object, $p_i$, with $i = 1, \dots, n$, where n is the number of objects.
Classification stage:
- 4. Preprocess the data by normalization, downsampling using farthest point sampling (FPS), and grouping the points. Returns the preprocessed point clouds $p_i$, with $i = 1, \dots, n$, where n is the number of objects.
- 5. The objects' point clouds are classified using the proposed classification network PEC. Returns the object classes $c_i$, with $i = 1, \dots, n$, where n is the number of objects.
- 6. Select the object with the desired class c to be grasped and return its point cloud p to the grasping algorithm. From this step on, selective grasping is possible, as only the point cloud of the desired object is considered in the grasping stage.
Grasping stage:
- 7. Receive the object's point cloud p and estimate its 6D pose using principal component analysis (PCA).
- 8. Estimate the grasping region of the selected object in the camera's field of view. The proposed algorithm uses lateral curvatures and geometric primitives to estimate the grasping region.
- 9. Return the 6D pose to grasp the desired object.
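To make the filtering and clustering stage concrete, a minimal sketch using Open3D (the point cloud library adopted in Section 5) is shown below; the depth limits, plane-segmentation thresholds, and DBSCAN parameters are illustrative assumptions, not the values used by the authors.

```python
import numpy as np
import open3d as o3d

def filter_and_cluster(pcd, z_min=0.2, z_max=1.0):
    """Filtering stage sketch: depth crop, table removal, and clustering."""
    # (a) Remove the gripper and far-away background by limiting the depth (z) axis.
    pts = np.asarray(pcd.points)
    keep = np.where((pts[:, 2] > z_min) & (pts[:, 2] < z_max))[0]
    pcd = pcd.select_by_index(keep)

    # (b) Remove the table with RANSAC plane segmentation, then drop sparse outliers.
    _, plane_idx = pcd.segment_plane(distance_threshold=0.01,
                                     ransac_n=3, num_iterations=1000)
    objects = pcd.select_by_index(plane_idx, invert=True)
    objects, _ = objects.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)

    # (3) Cluster what is left so each object gets its own point cloud p_i.
    labels = np.array(objects.cluster_dbscan(eps=0.02, min_points=20))
    n_clusters = labels.max() + 1 if labels.size > 0 else 0
    return [objects.select_by_index(np.where(labels == k)[0])
            for k in range(n_clusters)]
```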
3. Point Encoder Convolution
In this section, a new deep learning network called Point Encoder Convolution (PEC) is described, which was developed to classify objects using their respective point clouds as perceptual information.
The problem can be defined as follows. Given a group of point clouds $p_i$ as input, with $i = 1, \dots, n$, where n is the number of unknown segmented objects, the PEC predicts the object classes $c_i$. With those predictions, it returns the point cloud p of the desired class c.
3.1. Data Preprocessing
Before applying the point cloud p to the classification network, the data are preprocessed to improve the learning process and data generalization. First, the object centroid is defined as

$$\bar{p} = \frac{1}{N} \sum_{i=1}^{N} p_i, \quad (1)$$

where $\bar{p}$ is the point cloud centroid, N is the number of points of the point cloud, and $p_i$ is a point from the point cloud p, with $i = 1, \dots, N$. The point cloud translated to the system origin is then $p' = \{\, p_i - \bar{p},\ i = 1, \dots, N \,\}$. This way, the data are normalized around the origin, avoiding data overfitting during the training step. After that, the point cloud $p'$ is sub-sampled using FPS [23], as recommended by [8], to ensure that the data have the same size as the network input. Finally, a pseudo-organized point cloud is created by grouping the points following these steps:
- Select a number of points k, where the total number of points in the point cloud $p'$ has to be divisible by k.
- Use a k-d tree [24] to find the k points nearest to the target point.
- Group those points and remove them from the original point cloud.
- Repeat this process until all points are grouped.
- When all points are grouped, obtain each group's centroid using Equation (1) and sort the groups by their distance to the origin, as proposed by [8].
- Merge all the groups back into a single point cloud with its points reorganized.
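A minimal NumPy/SciPy sketch of this preprocessing is shown below. It assumes the input cloud has at least the required number of points, that the point count after FPS is a multiple of the group size k, and that each group is built around the first still-ungrouped point; these choices, and the function names, are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree

def farthest_point_sampling(points, n_samples):
    """Greedy FPS: repeatedly pick the point farthest from the selected set."""
    selected = [0]
    dists = np.linalg.norm(points - points[0], axis=1)
    for _ in range(n_samples - 1):
        idx = int(np.argmax(dists))
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(points - points[idx], axis=1))
    return points[selected]

def preprocess(points, n_input=64, k=16):
    """Normalize, downsample with FPS, and build a pseudo-organized point cloud."""
    points = points - points.mean(axis=0)               # translate centroid to origin
    points = farthest_point_sampling(points, n_input)   # fixed network input size

    groups, remaining = [], points.copy()
    while len(remaining) > 0:
        # Group the k nearest neighbors of a target point, then remove them
        # from the cloud (a new k-d tree is built each pass; fine for a sketch).
        _, idx = cKDTree(remaining).query(remaining[0], k=k)
        idx = np.atleast_1d(idx)
        groups.append(remaining[idx])
        remaining = np.delete(remaining, idx, axis=0)

    # Sort groups by the distance of their centroids to the origin, then merge.
    groups.sort(key=lambda g: np.linalg.norm(g.mean(axis=0)))
    return np.vstack(groups)
```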
3.2. Network Architecture
The chosen architecture features an autoencoder using 1D convolutions, 1D transposed convolutions, and a multilayer perceptron. The convolution layers use batch normalization to help stabilize the training and make it more consistent.
Figure 2 shows the complete PEC architecture, considering an input size of 64 points and a total of 5 classes, for a total of 890 K parameters. This parameterization comprises the convolution layers (input size, output size, kernel size, stride, padding), the linear layers (input size, output size), and the transposed convolution layers (input size, output size, kernel size, stride). Cross-entropy was used as the loss function due to its better generalization when classifying objects that differ from those used in training. The stochastic optimization method Adam [25] was chosen because it trains better and faster than other optimization methods. The network output is a class c and its confidence score, where the confidence score is obtained by applying the softmax function [26] to the output of the network. The hyperparameters of the network were obtained empirically.
3.3. LARS Classification Dataset
To validate the PEC network, a new dataset is proposed, which is one of the contributions of this paper. The Isaac Sim simulator [27] was used for this task since it makes it easy to manipulate objects through scripting and features good graphical fidelity. The graphical settings used were the defaults of the 2022 version. The generated dataset comprises five classes, each with two objects. Each object received one thousand samples generated from random poses, with positions ranging between [−2 m, 2 m] and orientations varying within a fixed range, where for each pose a simulated RGB-D sensor recorded the point cloud.
Figure 3 shows the simulator interface and the generation process, and Figure 4 shows the objects and classes used in the dataset. One challenge of this approach is that the data generated in simulation are ideal; that is, they do not exhibit most of the defects of a real point cloud. To address this and help generalization in real experiments, each data sample had several points removed from a randomly chosen region, with the number of removed points varying between 0% and 40% of the total. This removal, whether it emulates defects caused by the environment or by the filtering steps, did not affect the training accuracy.
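A minimal sketch of this degradation step is given below, assuming the removed region is the neighborhood of a randomly chosen seed point; the exact region definition used to build the dataset is not detailed in the text.

```python
import numpy as np

def drop_random_region(points, max_fraction=0.4, rng=np.random.default_rng()):
    """Simulate sensor defects: remove up to 40% of points around a random seed."""
    n_remove = int(rng.uniform(0.0, max_fraction) * len(points))
    if n_remove == 0:
        return points
    seed = points[rng.integers(len(points))]        # random region center
    dists = np.linalg.norm(points - seed, axis=1)
    keep = np.argsort(dists)[n_remove:]             # drop the points closest to the seed
    return points[keep]
```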
Figure 5 shows an example of the classification process, where the network receives the object already segmented. The segmentation was performed by removing the table using RANSAC to detect the plane [28]. Some defects in the point cloud are noticeable, such as missing spots, and the segmentation process causes others. When multiple instances of the same object are present, they can have different confidence scores, depending on their similarity to the objects used in training; for computer mice, for example, testing different models yielded confidence scores varying between 70% and 95%.
4. Lateral Curvatures and Geometric Primitives
This section proposes a new algorithm for 6D pose estimation using lateral curvatures and geometric primitives of point clouds. The new algorithm is based on the works of [11,22] and can work independently of PEC to grasp unknown objects, improving on them in two significant points: (i) the object is broken into smaller, more accessible regions to perform the grasp, so the entire object can be analyzed through less complex regions, improving the algorithm proposed in [22], which focuses its analysis near the center of mass of the object; this also makes it possible to identify regions that do not fit the gripper and discard them, avoiding unnecessary computation time on their analysis; (ii) the geometric primitive estimation performed in [11], where the author tries to reduce even complex objects to a single primitive, is improved, since by breaking the object into multiple regions we can check whether the object can in fact be reduced to a geometric primitive.
4.1. Problem Statement
The grasping algorithm receives a point cloud p as input, containing only the desired object to be grasped. We consider a two-finger gripper with gripper stroke $g_w$ and finger width $f_w$; see Figure 6. The proposed algorithm returns a 6D pose to grasp the object, with its position aimed at the center of the selected region.
The proposed grasping algorithm works as follows:
- Use PCA to establish the object's orientation with respect to the camera, as suggested in [11,22].
- Use the object orientation to align it to the origin axes.
- Generate the grasping regions based on the value of K (Section 4.2).
- Obtain the curvature of each region and check whether they have the same geometry, as seen in [11].
- If the object can be reduced to a primitive, by checking whether each region has the same geometric primitive and width, grasp the object by its centroid.
- If it cannot, obtain the lateral curvature of each region.
- Return the pose of the region with the smallest curvature that fits inside the gripper.
4.2. Region Generation
The number of regions K is defined using the object height and the gripper dimensions. If the object height is too small with respect to the gripper dimensions, the object is considered too small and grasping is performed at its center. Otherwise, the 6D grasping algorithm proposed in [29] is applied. Since some objects have regions that can be separated into multiple parts, a clustering algorithm is applied to each region found to separate them when applicable. This way, a complex object can be broken into smaller and more accessible regions, such as the joypad seen in Figure 7.
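One possible reading of this region generation is sketched below, assuming the aligned object is sliced along its height into bands of roughly the finger width and that each band is split with DBSCAN when it contains disconnected parts; the exact slicing rule of [29] is not reproduced here, and the parameters are illustrative.

```python
import numpy as np
import open3d as o3d

def generate_regions(aligned_points, finger_width=0.03, eps=0.01, min_points=10):
    """Slice an axis-aligned object into bands and split disconnected bands."""
    z = aligned_points[:, 2]          # height axis assumed after PCA alignment
    regions = []
    for lo in np.arange(z.min(), z.max(), finger_width):
        band = aligned_points[(z >= lo) & (z < lo + finger_width)]
        if len(band) < min_points:
            continue
        # A band may contain separate parts (e.g., the two handles of pliers):
        # cluster it so each connected part becomes its own candidate region.
        pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(band))
        labels = np.array(pcd.cluster_dbscan(eps=eps, min_points=min_points))
        regions.extend(band[labels == k] for k in range(labels.max() + 1))
    return regions
```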
4.3. Curvature Calculation
The point cloud surface curvature is obtained using the method proposed by [11,22]. First, the centroid of each region is calculated using the equation below, where $N_k$ is the number of points in the region, $q_i$ is a point of the region's point cloud, with $i = 1, \dots, N_k$, and $\bar{q}$ is the centroid of the region:

$$\bar{q} = \frac{1}{N_k} \sum_{i=1}^{N_k} q_i.$$

Then, the covariance matrix is obtained:

$$C = \frac{1}{N_k} \sum_{i=1}^{N_k} (q_i - \bar{q})(q_i - \bar{q})^{T},$$

and the eigenvalues $\lambda_j$ and eigenvectors $v_j$ for the three dimensions, $j \in \{0, 1, 2\}$, satisfy

$$C \, v_j = \lambda_j \, v_j, \quad \lambda_0 \leq \lambda_1 \leq \lambda_2.$$

By using PCA, the curvature $\sigma$ is

$$\sigma = \frac{\lambda_0}{\lambda_0 + \lambda_1 + \lambda_2}.$$
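The curvature estimate above translates into a few lines of NumPy, as sketched here (eigenvalues in ascending order, curvature equal to the smallest eigenvalue divided by their sum):

```python
import numpy as np

def surface_curvature(region_points):
    """PCA-based curvature of a region: lambda_0 / (lambda_0 + lambda_1 + lambda_2)."""
    centroid = region_points.mean(axis=0)           # region centroid
    diffs = region_points - centroid
    cov = diffs.T @ diffs / len(region_points)      # 3x3 covariance matrix
    eigvals = np.linalg.eigvalsh(cov)               # eigenvalues in ascending order
    return eigvals[0] / eigvals.sum()               # lies in [0, 1/3]
```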
4.4. Geometric Primitive Reduction
The geometric primitives, proposed by [11], are estimated through the curvature around each point of the region using Equations (7) and (8), where $M_c$ is the number of points $q_i$ of the region whose curvature is ≤ 0.01 and M is the total number of points of the point cloud. This way, the region's geometric primitive can be defined from the ratio $M_c/M$, as seen in Table 1:
4.5. Region Selection
The best region to grasp is defined by checking whether all regions have the same geometric primitive and a similar length, within a margin of error, due to noise and measurement errors in the point cloud. If they do, the entire object is considered a geometric primitive and is grasped at its centroid. The region length $L_k$ is defined by

$$L_k = x_{max,k} - x_{min,k},$$

where $x_{max,k}$ and $x_{min,k}$ are the maximum and minimum values of each generated region on the x-axis. The x-axis represents the horizontal axis, k represents the region, with $k = 1, \dots, K$, and K is the number of regions.

If this condition fails, the curvature around $x_{min,k}$ and $x_{max,k}$ is calculated, and a score $S_k$ is defined from these two lateral curvatures, where $\sigma_{min,k}$ is the curvature around $x_{min,k}$ and $\sigma_{max,k}$ is the curvature around $x_{max,k}$.

The best region to grasp is the one where $S_k$ is minimal and the condition $L_k \leq g_w$ holds, to avoid regions that do not fit inside the gripper. In the last step, PCA is applied to the selected region, as it may have a different orientation from the object, as seen in Figure 8.
The output of the algorithm is the 6D pose of the region to be grasped.
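The selection logic can be summarized in the sketch below; it assumes the lateral-curvature score $S_k$ is the sum of the curvatures computed around the two lateral extremes of each region, which is one plausible reading of the score defined above, and the fraction of points gathered near each lateral face is an illustrative parameter.

```python
import numpy as np

def pca_curvature(points):
    """Curvature of a point set from the eigenvalues of its covariance matrix."""
    d = points - points.mean(axis=0)
    eig = np.linalg.eigvalsh(d.T @ d / len(points))
    return eig[0] / (eig.sum() + 1e-12)

def select_region(regions, gripper_stroke, side_fraction=0.2):
    """Pick the region with the smallest lateral curvatures that fits the gripper."""
    best, best_score = None, np.inf
    for region in regions:
        x = region[:, 0]
        length = x.max() - x.min()          # region length along the x-axis (L_k)
        if length > gripper_stroke:         # discard regions the gripper cannot span
            continue
        # Curvature around x_min and x_max (the lateral faces touched by the fingers).
        near_min = region[x <= x.min() + side_fraction * length]
        near_max = region[x >= x.max() - side_fraction * length]
        score = pca_curvature(near_min) + pca_curvature(near_max)
        if score < best_score:
            best, best_score = region, score
    return best
```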
5. Experimental Results
This section presents experimental results for pick-and-place tasks using a UR5 robotic manipulator, a Robotiq 2F-140 gripper, and an Intel RealSense D435 RGB-D sensor. The system was developed in Python 3.7, using Open3D [30] for point cloud processing and PyTorch Lightning [31] for the classification neural network. The results are divided into two subsections: Object Classification, which presents the results of the PEC network in classification tasks, and Grasping Results, where the entire grasping system is tested. Two computational systems were used for development and testing, respectively: a desktop with a Ryzen 5 3600 CPU and an RTX 3060 Ti graphics card, and a laptop with an i5-4200M CPU and an Nvidia GT850M graphics card.
5.1. Object Classification
The PEC classification network was tested with two datasets: ModelNet10 [32] and the LARS Classification Dataset presented in Section 3.3. The results for ModelNet10 can be seen in Table 2, where the default size of the input point cloud is 64 points, except for the two configurations that used input sizes of 128 and 256 points, respectively. In order to test the network with different groupings of the sensor point cloud, group sizes of 4, 8, 16, 32, and 64 points were evaluated.
Table 2 presents the results comparing the PEC network to other classification networks. It can be verified that the group size influences the accuracy of the network and that more points in the input do not necessarily translate into better results. For this dataset, the PEC network was trained for 150 epochs with a fixed learning rate.
The accuracy of the network on the LARS Classification Dataset can be seen in Table 3, where the best result was achieved with an input size of 64 points and a group size of 16 points. The dataset was split into 75% for training and 25% for validation, and the network was trained for 150 epochs with a fixed learning rate. Figure 9 shows the validation accuracy and loss during the training stage. The validation loss starts high due to batch normalization; however, over time it normalizes, resulting in a faster and more stable training process.
For comparison purposes, using an RTX 3060 Ti graphics card, the PEC network with the LARS dataset can produce a prediction in around 0.002 s and takes 152 s in the training step with an input size of 64 points. With a GT850M graphics card, it takes around 0.009 s for prediction and 250 s for training. The network training step is performed only once for a set of object classes; given the low training and execution times, the network can be used on low-cost hardware and retrained as needed when the objects to be classified change.
5.2. Grasping Results
The objects were selected based on how challenging they are to grasp: a joypad (complex geometry), a plug adapter (simple geometry that can only stand on the table in a specific orientation), pliers (also complex geometry), a wire cutter (hard to grasp due to its low thickness and complex handle), and, finally, a cylinder with a simple geometry. Figure 10 shows an example of the sequence of tasks performed for a grasp. This section is divided into two experiments: Grasping Algorithm, where the proposed grasping algorithm is validated, and Selective Grasping, where the whole system is validated.
5.2.1. Grasping Algorithm
Table 4 shows the results of grasping each object twenty times in different poses to better demonstrate the algorithm's efficiency. The point clouds of the objects and their respective grasping regions can be seen in Figure 11, showing the performance of the grasping stage with objects of different complexities:
- The joypad is a complex object that the gripper has to grasp close to its center for better stability.
- In the case of the pliers, both the clusters and the original region are added as possible grasps, since some handles may not be ideal to grasp individually. The grasping algorithm chose to grasp both handles together instead of individually.
- The plug adapter is a rectangular object that uses the wall plug to stand up, giving it a particular orientation with respect to the table. Since it has a rectangular geometry, the gripper grasps it by its center.
- The wire cutter has a predominantly planar geometry and lacks depth; therefore, the plane segmentation considers only the handles.
- The cylinder is another object with a simple geometry, so it is grasped by its center.
To benchmark the proposed grasping system, a Generic Grasping Algorithm (GGA) is used for comparison; it grasps the object by its center while keeping the gripper perpendicular to the table. For simple objects, such as the cylinder and the plug adapter, the difference in success rate is minimal, and the grasp depends more on other factors, such as correct pose estimation and object dimensions. A more significant gap between the proposed grasping algorithm and the GGA is noticed for more complex objects, as seen in Table 4. It is essential to emphasize the low computational cost of the proposed system, which takes on average around 0.0018 s to generate a grasp for the objects seen in these experiments on a Ryzen 5 3600 CPU.
5.2.2. Selective Grasping
Selective grasping allows the user to choose an object of interest in an environment containing other objects. Figure 12 and Figure 13 show the system grasping the joypad and the staples, respectively. It is important to point out that the network was tested with objects similar to those used in the simulation training stage. In the experimental results, it was noted that the quality of the objects' point clouds is essential for good system performance. The point clouds obtained by the sensor have noise and defective regions, reducing the confidence score of the classification, which does not happen with the simulated point clouds. Even so, the system proved to be robust enough for a diverse set of objects. A video with further experiments can be seen in the Supplementary Materials.
6. Conclusions
This article proposes a new visual selective grasping system using only point clouds of objects, which can be executed in around 0.004 s (0.002 s for the classification of objects plus 0.0018 s for the generation of feasible grasp poses). The proposed system was validated using a UR5 robotic manipulator, a Robotiq 2F-140 gripper, and an Intel RealSense D435 sensor. The grasping method analyzes the point cloud of the object and generates a region to grasp based on a single view. For this, the grasping algorithm breaks a complex object into smaller, easier-to-analyze regions and uses curvatures and geometric primitives to estimate the best region to grasp.
A new deep learning network called Point Encoder Convolution (PEC), which applies 1D convolutions to a segmented object's point cloud, was introduced to provide the ability to select objects of interest, along with a new point cloud classification dataset obtained in a simulator. The experimental results showed an average success rate of 94% on objects with different geometries, with a feasible individual grasp generation time of around 0.0018 s.
The PEC neural network proved to be simple and effective for selective grasping. Using the LARS dataset, it achieved an accuracy of 92.5% and an object classification time of 0.002 s. It is essential to point out that the objects used in the network training stage were simulated objects similar to the real ones, and, even so, the PEC obtained correct classifications with good precision. Using a public dataset, ModelNet10, it obtained an accuracy of 92.24%, proving itself to be general and applicable in several scenarios.
The system is highly dependent on the point cloud quality since it works with a single view. Because of its simplicity, it does not scale well to more complex classification tasks involving many classes. On the other hand, it presents low processing times for a small set of object classes, which is feasible for robotic tasks with high repeatability rates. Improvements using multiple views, and the handling of noise, defective point clouds, and occlusion, can be addressed in future work.
Author Contributions
Conceptualization, D.M.d.O. and A.G.S.C.; methodology, A.G.S.C.; software, D.M.d.O.; validation, D.M.d.O. and A.G.S.C.; formal analysis, D.M.d.O. and A.G.S.C.; investigation, D.M.d.O.; resources, A.G.S.C.; data curation, D.M.d.O. and A.G.S.C.; writing—original draft preparation, D.M.d.O. and A.G.S.C.; writing—review and editing, D.M.d.O. and A.G.S.C.; visualization, D.M.d.O. and A.G.S.C.; supervision, A.G.S.C.; project administration, A.G.S.C.; funding acquisition, A.G.S.C. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by SEPIN/MCTI and the European Union’s Horizon 2020 Research and Innovation Programme through the Grant Agreement No. 777096, the Brazilian funding agency (CNPq) Grant Numbers [311029/2020-5 and 407163/2022-0], and CAPES.
Data Availability Statement
The data presented in this study are available on request from the corresponding author.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
GGA | Generic Grasping Algorithm |
PEC | Point Encoder Convolution |
PCA | Principal Component Analysis |
FPS | Farthest Point Sampling |
6DOF | 6 Degrees of Freedom |
References
- Costa, F.S.; Nassar, S.M.; Gusmeroli, S.; Schultz, R.; Conceição, A.G.S.; Xavier, M.; Hessel, F.; Dantas, M.A.R. Fasten iiot: An open real-time platform for vertical, horizontal and end-to-end integration. Sensors 2020, 20, 5499. [Google Scholar] [CrossRef] [PubMed]
- Ferreira Neto, N.A.; Ruiz, M.; Reis, M.; Cajahyba, T.; Oliveira, D.; Barreto, A.C.; Simas Filho, E.F.; de Oliveira, W.L.; Schnitman, L.; Monteiro, R.L. Low-latency perception in off-road dynamical low visibility environments. Expert Syst. Appl. 2022, 201, 117010. [Google Scholar] [CrossRef]
- Arrais, R.; Veiga, G.; Ribeiro, T.T.; Oliveira, D.; Fernandes, R.; Conceição, A.G.S.; Farias, P. Application of the Open Scalable Production System to Machine Tending of Additive Manufacturing Operations by a Mobile Manipulator. In Progress in Artificial Intelligence. EPIA 2019. Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2019; Volume 11805, pp. 345–356. [Google Scholar] [CrossRef]
- Carvalho de Souza, J.P.; Costa, C.M.; Rocha, L.F.; Arrais, R.; Moreira, A.P.; Pires, E.S.; Boaventura-Cunha, J. Reconfigurable Grasp Planning Pipeline with Grasp Synthesis and Selection Applied to Picking Operations in Aerospace Factories. Robot. Comput.-Integr. Manuf. 2021, 67, 102032. [Google Scholar] [CrossRef]
- de Souza, J.P.C.; Rocha, L.F.; Oliveira, P.M.; Moreira, A.P.; Boaventura-Cunha, J. Robotic grasping: From wrench space heuristics to deep learning policies. Robot. Comput.-Integr. Manuf. 2021, 71, 102176. [Google Scholar] [CrossRef]
- Qu, Z.; Gao, L.-Y.; Wang, S.-Y.; Yin, H.-N.; Yi, T.-M. An improved YOLOv5 method for large objects detection with multi-scale feature cross-layer fusion network. Image Vis. Comput. 2022, 125, 104518. [Google Scholar] [CrossRef]
- Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision–ECCV 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
- Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 5105–5114. [Google Scholar]
- Rao, Y.; Zhang, M.; Cheng, Z.; Xue, J.; Pu, J.; Wang, Z. Semantic Point Cloud Segmentation Using Fast Deep Neural Network and DCRF. Sensors 2021, 21, 2731. [Google Scholar] [CrossRef] [PubMed]
- Sirohi, K.; Mohan, R.; Buscher, D.; Burgard, W.; Valada, A. EfficientLPS: Efficient LiDAR Panoptic Segmentation. IEEE Trans. Robot. 2022, 38, 1894–1914. [Google Scholar] [CrossRef]
- Jain, S.; Argall, B. Grasp detection for assistive robotic manipulation. In Proceedings of the 2016 IEEE International Conference on Robotics and Automation (ICRA), Stockholm, Sweden, 16–21 May 2016; pp. 2015–2021. [Google Scholar] [CrossRef]
- Mahler, J.; Matl, M.; Liu, X.; Li, A.; Gealy, D.V.; Goldberg, K. Dex-Net 3.0: Computing Robust Vacuum Suction Grasp Targets in Point Clouds Using a New Analytic Model and Deep Learning. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA), Brisbane, Australia, 21–25 May 2018; pp. 1–8. [Google Scholar]
- Mousavian, A.; Eppner, C.; Fox, D. 6-DOF GraspNet: Variational Grasp Generation for Object Manipulation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 2901–2910. [Google Scholar]
- Wang, L.; Meng, X.; Xiang, Y.; Fox, D. Hierarchical Policies for Cluttered-Scene Grasping with Latent Plans. IEEE Robot. Autom. Lett. 2022, 7, 2883–2890. [Google Scholar] [CrossRef]
- Wang, C.; Xu, D.; Zhu, Y.; Martín-Martín, R.; Lu, C.; Fei-Fei, L.; Savarese, S. Densefusion: 6d object pose estimation by iterative dense fusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3343–3352. [Google Scholar]
- He, Y.; Huang, H.; Fan, H.; Chen, Q.; Sun, J. FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation. arXiv 2021, arXiv:2103.02242. [Google Scholar] [CrossRef]
- Kitagawa, S.; Wada, K.; Hasegawa, S.; Okada, K.; Inaba, M. Few-experiential learning system of robotic picking task with selective dual-arm grasping. Adv. Robot. 2020, 34, 1171–1189. [Google Scholar] [CrossRef]
- Xiang, Y.; Schmidt, T.; Narayanan, V.; Fox, D. PoseCNN: A Convolutional Neural Network for 6D Object Pose Estimation in Cluttered Scenes. arXiv 2017, arXiv:1711.00199. [Google Scholar] [CrossRef]
- Wang, J.; Li, S. Grasp detection via visual rotation object detection and point cloud spatial feature scoring. Int. J. Adv. Robot. Syst. 2021, 18, 17298814211055577. [Google Scholar] [CrossRef]
- Hu, Y.; Hugonot, J.; Fua, P.V.; Salzmann, M. Segmentation-Driven 6D Object Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 3380–3389. [Google Scholar]
- Wang, C.; Martín-Martín, R.; Xu, D.; Lv, J.; Lu, C.; Fei-Fei, L.; Savarese, S.; Zhu, Y. 6-PACK: Category-level 6D Pose Tracker with Anchor-Based Keypoints. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10059–10066. [Google Scholar] [CrossRef]
- Zapata-Impata, B.; Gil, P.; Pomares, J.; Medina, F. Fast Geometry-based Computation of Grasping Points on Three-dimensional Point Clouds. Int. J. Adv. Robot. Syst. 2019, 16. [Google Scholar] [CrossRef]
- Moenning, C.; Dodgson, N.A. Fast Marching Farthest Point Sampling; Technical Report; University of Cambridge, Computer Laboratory: Cambridge, UK, 2003. [Google Scholar]
- Bentley, J.L. Multidimensional Binary Search Trees Used for Associative Searching. Commun. ACM 1975, 18, 509–517. [Google Scholar] [CrossRef]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
- Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016; Available online: http://www.deeplearningbook.org (accessed on 1 April 2023).
- Monteiro, F.; Vieira e Silva, A.L.; Teixeira, J.M.; Teichrieb, V. Simulating Real Robots in Virtual Environments Using NVIDIA’s Isaac SDK. In Proceedings of the XXI Symposium on Virtual and Augmented Reality, Rio de Janeiro, Brazil, 28–31 October 2019; pp. 47–48. [Google Scholar] [CrossRef]
- Rusu, R.B.; Cousins, S. 3D is here: Point Cloud Library (PCL). In Proceedings of the 2011 IEEE International Conference on Robotics and Automation, Shanghai, China, 9–13 May 2011; pp. 1–4. [Google Scholar] [CrossRef]
- de Oliveira, D.M.; Viturino, C.C.B.; Conceicao, A.G.S. 6D Grasping Based On Lateral Curvatures and Geometric Primitives. In Proceedings of the 2021 Latin American Robotics Symposium (LARS), Natal, Brazil, 11–15 October 2021; pp. 138–143. [Google Scholar] [CrossRef]
- Zhou, Q.Y.; Park, J.; Koltun, V. Open3D: A Modern Library for 3D Data Processing. arXiv 2018, arXiv:1801.09847. [Google Scholar]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; Volume 32, pp. 8024–8035. [Google Scholar]
- Wu, Z.; Song, S.; Khosla, A.; Yu, F.; Zhang, L.; Tang, X.; Xiao, J. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1912–1920. [Google Scholar]
- Kasaei, S.H.M. OrthographicNet: A Deep Learning Approach for 3D Object Recognition in Open-Ended Domains. arXiv 2019, arXiv:1902.03057. [Google Scholar]
- Liu, S.; Giles, L.; Ororbia, A. Learning a Hierarchical Latent-Variable Model of 3D Shapes. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 542–551. [Google Scholar] [CrossRef]
- Gomez-Donoso, F.; Garcia-Garcia, A.; Garcia-Rodriguez, J.; Orts-Escolano, S.; Cazorla, M. LonchaNet: A sliced-based CNN architecture for real-time 3D object recognition. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 412–418. [Google Scholar] [CrossRef]
- Yavartanoo, M.; Kim, E.Y.; Lee, K.M. SPNet: Deep 3D Object Classification and Retrieval Using Stereographic Projection. In Computer Vision–ACCV 2018; Jawahar, C., Li, H., Mori, G., Schindler, K., Eds.; Springer International Publishing: Cham, Switzerland, 2019; pp. 691–706. [Google Scholar]
- Kanezaki, A.; Matsushita, Y.; Nishida, Y. RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5010–5019. [Google Scholar] [CrossRef]
- Brock, A.; Lim, T.; Ritchie, J.M.; Weston, N. Generative and Discriminative Voxel Modeling with Convolutional Neural Networks. arXiv 2016, arXiv:1608.04236. [Google Scholar]
- Sfikas, K.; Pratikakis, I.; Theoharis, T. Ensemble of PANORAMA-based convolutional neural networks for 3D model classification and retrieval. Comput. Graph. 2018, 71, 208–218. [Google Scholar] [CrossRef]
- Li, J.; Chen, B.M.; Lee, G.H. SO-Net: Self-Organizing Network for Point Cloud Analysis. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 9397–9406. [Google Scholar]
- Xiang, T.; Zhang, C.; Song, Y.; Yu, J.; Cai, W. Walk in the Cloud: Learning Curves for Point Clouds Shape Analysis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 915–924. [Google Scholar]
- Gezawa, A.; Bello, Z.; Wang, Q.; Yunqi, L. A voxelized point clouds representation for object classification and segmentation on 3D data. J. Supercomput. 2022, 78, 1479–1500. [Google Scholar] [CrossRef]
- Chen, Y.; Liu, J.; Ni, B.; Wang, H.; Yang, J.; Liu, N.; Li, T.; Tian, Q. Shape Self-Correction for Unsupervised Point Cloud Understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 8382–8391. [Google Scholar]
Figure 1.
Grasping system overview. Filtering Stage: returns the point cloud of each object. Classification Stage: returns the desired object’s point cloud, allowing selective grasping. Grasping Stage: returns the 6D pose to grasp the desired object.
Figure 2.
PEC’s network architecture.
Figure 3.
Dataset generation through Isaac Sim.
Figure 4.
Classes and objects of the LARS Classification Dataset.
Figure 5.
Classification process.
Figure 6.
Robotiq 2F-140 general dimensions: gripper stroke $g_w$ and finger width $f_w$ (in mm).
Figure 7.
A joypad broken into regions.
Figure 8.
An object with regions that have a different orientation than the complete object orientation.
Figure 9.
Training process of PEC on the LARS Classification Dataset.
Figure 10.
Sequence of stages to perform a grasp.
Figure 11.
RGB images and their respective point clouds. Objects with simple geometry are grasped at their centroids.
Figure 12.
Selective Grasping. The environment has two objects (pliers and a joypad), and the objective is to grasp the joypad.
Figure 13.
Selective Grasping. The environment has three objects (pliers, staples, and a joypad), and the objective is to grasp the staples.
Table 1.
Geometric primitive classification.
Primitive | $M_c/M$ |
---|---|
Sphere | 0.0 ≤ $M_c/M$ ≤ 0.10 |
Cylinder | 0.10 < $M_c/M$ ≤ 0.40 |
Box | 0.40 < $M_c/M$ ≤ 1.0 |
Table 2.
Table comparing neural networks on ModelNet10.
Network | Accuracy |
---|---|
| 78.8% |
| 88.74% |
| 89.5% |
| 69.8% |
| 79.4% |
| 92.24% |
| 82.9% |
3DShapeNets [32] | 83.5% |
OrthographicNet [33] | 88.56% |
VSL [34] | 91.0% |
LonchaNet [35] | 94.37% |
SPNet [36] | 97.25% |
RotationN [37] | 98.46% |
VRN Ensemble [38] | 97.14% |
Panorama-ENN [39] | 96.85% |
SO-Net [40] | 95.7% |
CurveNet [41] | 96.3% |
Voxelized Point Clouds [42] | 93.4% |
Shape Self-Correction [43] | 95.5% |
Table 3.
PEC accuracy on LARS Classification Dataset.
Network | Accuracy |
---|---|
| 89.2% |
| 92.5% |
| 89.1% |
| 67% |
| 91% |
| 91.5% |
Table 4.
Table comparing the proposed grasping system vs. a Generic Grasping Algorithm (GGA).
Object | Proposed | GGA |
---|---|---|
Joypad | 90% | 40% |
Cylinder | 95% | 85% |
Plug Adapter | 100% | 100% |
Pliers | 100% | 65% |
Wire cutter | 95% | 70% |
Wire cutter (Open) | 85% | 0% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).