1. Introduction
Achieving dexterous object manipulation with an anthropomorphic robotic hand that moves much like a human hand remains a challenge in robotics. There is growing interest in adopting anthropomorphic robotic hands for applications in logistics and production lines [1] and in the emerging field of service robotics [2]. Several anthropomorphic robotic hands are available, such as the Shadow Hand [3], ADROIT [4], and RBO Hand 2 [5], which allow highly dexterous movements. Controlling these hands is challenging because of their high number of degrees of freedom (DoF) [6] compared with under-actuated manipulators such as parallel grippers and vacuum grippers [7,8]. With five independent fingers and many DoFs per finger, the complexity of control increases accordingly, especially for human-like grasping, which requires placing the robotic fingers around the object while satisfying the constraints of a stable grasp, much as human fingers do. Moreover, relocation demands a stable grasp so that the object can be moved to a target position without being dropped. In this work, we propose a method for natural grasping and object manipulation based on a novel synergy space.
Recent works have applied deep reinforcement learning (DRL) to object manipulation. A DRL agent (i.e., a policy) learns optimal actions in an environment by trial and error to maximize a cumulative reward [9,10]. For instance, a DRL policy was trained using a non-parametric Relative Entropy Policy method to roll an object in a real-world environment using only images from a camera sensor and a three-fingered robotic hand [9]. More recently, a DRL policy trained with Proximal Policy Optimization achieved complex in-hand manipulation, solving a Rubik's Cube with a five-fingered anthropomorphic robotic hand and RGB sensor data [6]. DRL approaches without a model of the environment are challenging because they must collect large amounts of interactions (i.e., actions and states) to learn a task. Moreover, the tasks learned by DRL are difficult to generalize to different manipulation tasks (e.g., pushing and grasping). As a result, other approaches train a DRL policy with human demonstrations to provide prior knowledge about the task (i.e., behavior cloning) [11,12]. In this work, a DRL policy learns object manipulation using a latent representation of natural hand poses derived from human grasping information.
Some approaches use human demonstrations to train hand-manipulation skills with supervised learning, thereby improving the DRL optimization [13,14]. For instance, in [14], human demonstrations were used to pre-train a Q-function approximator that guided policy exploration by minimizing the error between the demonstrations and the policy experience with a prioritized replay mechanism. In [15], the DRL policy was trained with a cost function augmented by demonstrations using a natural policy-gradient search algorithm. DRL policies often rely on expert knowledge, such as grasping locations with gripper poses over the object or predefined hand poses from a grasp taxonomy, to achieve a successful grasp; this approach has proven successful for grasping trained objects [16,17]. Few works use high-dimensional data such as 3D point clouds as priors to compute grasp movements for object manipulation with robotic hands. Ji et al. [18] took a partial point cloud of an object and fit a bounding box to identify its shape, pose, and orientation; based on this box, a policy chose between two grasp strategies (i.e., top-down or side grasp), planned the final pose for the robotic hand, and then switched to position-based force control to grasp the object. Osa et al. [19] proposed a hierarchical DRL that uses features extracted from potential grasping areas in a given point cloud to teach the policies proper grasp planning; the lower-level policies learned different poses to grasp an object, whereas the upper-level policy learned to select a specific grasp for each object. Learning to grasp using such grasping priors allows the policies to propose novel hand poses for various objects that would otherwise be difficult to model. However, simultaneously learning several policies in a hierarchical DRL is not trivial and still requires high-quality human demonstrations. Because the complexity of control increases with the number of DoFs in an anthropomorphic robotic hand, it is hard to identify and control each DoF simultaneously. A robust controller is needed that can exploit pose information from human demonstrations and grasp priors within DRL. In this context, a prior-based latent space built from 3D point clouds of haptic information could improve the ability of a DRL policy to control robotic hands.
For anthropomorphic robotic hands with many DoFs, it is crucial to derive a feasible space of optimal grasping poses from which to control the finger movements (e.g., the finger joint angles). Earlier works reduce the dimension of this space using Principal Component Analysis (PCA), simplifying the planning and control of a fully actuated robotic hand by finding the correlations between finger and joint movements. For instance, Santello et al. [20] investigated the patterns of covariation of 15 joint sensors across 57 hand postures using PCA. The principal components, computed from the eigenvalues and eigenvectors of the covariance matrix, showed that the first three components account for more than 80% of the grasp variance. In addition, including some low-order PCA components obtained from features computed from images of 3D object models improved the control for grasping objects [21,22]. Synergies represent, in a low-dimensional space, the variance of various grasping hand poses and the correlation between the fingers' joint movements. These PCA-based approaches, however, restricted the hand motion to a limited number of static hand postures defined at the end of the task. This implies that improving DRL policies to find the optimal grasp configuration requires better ways to approximate the relationship between proper hand poses and the synergy space [22]. Synergies built from the lower-order principal components capture coarse hand opening and closing, and adding a few higher-order components provides finer control [23,24,25]. To ensure fast convergence of the RL algorithm, the search space must be reduced. For this purpose, an efficient synergy space can be used as a tool for planning and controlling a fully actuated, high-DoF anthropomorphic robotic hand for complex manipulation.
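For illustration, the following minimal sketch (in Python with scikit-learn, on a synthetic posture matrix) shows how such a classical PCA-based synergy analysis is typically computed; it is not part of the method proposed in this paper.

```python
# Illustrative sketch of the classical PCA-based synergy analysis described
# above. Assumes a matrix of recorded hand postures: one row per grasp,
# one column per joint-angle sensor (here filled with synthetic data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
postures = rng.normal(size=(57, 15))        # e.g., 57 postures x 15 joint sensors

pca = PCA()
pca.fit(postures)

# Cumulative variance explained by the first k principal components (synergies).
cumvar = np.cumsum(pca.explained_variance_ratio_)
print("variance explained by first 3 components: %.1f%%" % (100 * cumvar[2]))

# Project a posture into the low-dimensional synergy space and reconstruct it.
synergy_coords = pca.transform(postures[:1])[:, :3]          # coarse synergies
reconstruction = (synergy_coords @ pca.components_[:3]) + pca.mean_
```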
Our proposed DRL policy generates natural hand poses for object manipulation based on a novel low-dimensional synergy space. First, we obtained high-dimensional point clouds of the haptic contact areas of various objects from human demonstrations of successful grasping. Then, we generated a novel low-dimensional synergy space from this haptic information via Continuous Normalizing Flow (CNF) using a deep autoencoder network. The latent space obtained via CNF contains a rich representation of haptic information, which a deep neural network regressor uses to estimate the joint angles of an anthropomorphic hand. The proposed synergy-based DRL performs dexterous grasping and relocation of objects with an anthropomorphic robotic hand. Through various experiments, our results demonstrate that the proposed DRL policy and regressor can generate natural hand poses for grasping and relocating variously shaped test objects. The remainder of this paper is organized as follows.
Section 2 provides an overview of the database, the proposed DRL structure, and the CNF synergy space modeling.
Section 3 presents qualitative and quantitative results for the synergy-based DRL and a standard DRL on grasping and relocation tasks.
Section 4 discusses the results, and Section 5 concludes the work.
2. Methods
The proposed DRL method and the simulation environment are illustrated in
Figure 1. The DRL policy is instantiated as a deep neural network trained to control the robotic hand movements to grasp an object and relocate it to a target location. Formally, at each timestep $t$, the action vector $a_t$ controls the displacement of the joint angles of the hand and the position of the robotic hand as it navigates within the environment. The robotic hand interacts with the object in the environment, which generates the next state $s_{t+1}$ of the object and robotic hand (i.e., joint angles and Cartesian positions); this state is sent to the DRL agent $\pi$ (i.e., the policy). The policy generates two outputs: an object-specific sample $z_{t+1}$ in the synergy space of haptic information and a vector $h_{t+1}$ that controls the robotic hand position for the next timestep $t+1$. The optimal natural hand pose (i.e., joint angles) $q_{t+1}$ is obtained via a pretrained regressor $R$ using the synergistic information in $z_{t+1}$. The action vector for the next step, $a_{t+1} = [q_{t+1}, h_{t+1}]$, is sent to the robot, which interacts with the environment again. By iterating this process, the DRL algorithm learns to derive an optimal hand pose for natural manipulation of objects with the anthropomorphic robotic hand.
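For clarity, the interaction loop described above can be sketched as follows; the names `env`, `policy`, and `regressor` are placeholders for the simulation environment, the DRL agent, and the pretrained synergy-to-joint-angle regressor, and do not refer to the authors' actual implementation.

```python
# Schematic sketch of one rollout of the proposed DRL interaction loop.
def rollout(env, policy, regressor, horizon):
    state = env.reset()
    total_reward = 0.0
    for t in range(horizon):
        # The policy outputs a synergy-space sample z and a hand-position command h.
        z, h = policy.act(state)
        # The pretrained regressor maps the synergy sample to 22 joint angles.
        q = regressor.predict(z)
        # The action applied to the robot combines joint angles and hand position.
        action = {"joint_angles": q, "hand_position": h}
        state, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```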
2.1. Simulation Environment
The simulation environment is set up with the MuJoCo physics engine [26]. MuJoCo is a robotics simulation package providing robot kinematics, object models, and manipulation tasks. The simulated anthropomorphic robotic hand is modeled after the ADROIT hand [4], which has 24-DoF across five fingers for dynamic and dexterous manipulation. The first, middle, and ring fingers have 4-DoF each; the little finger and thumb have 5-DoF each, while the wrist has 2-DoF. Each joint is equipped with a joint angle sensor and a haptic sensor, and the movements are controlled via position-control actuators.
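As a minimal illustration of this setup, the following sketch drives a position-actuated hand model with the official `mujoco` Python bindings; the model file name is hypothetical, and the ADROIT hand model itself is not included here.

```python
# Minimal sketch of stepping a position-actuated hand model in MuJoCo.
import numpy as np
import mujoco

model = mujoco.MjModel.from_xml_path("adroit_hand.xml")   # hypothetical model file
data = mujoco.MjData(model)

# Command all position actuators to their mid-range and advance the simulation.
ctrl_range = model.actuator_ctrlrange                     # (nu, 2) array of [lo, hi]
data.ctrl[:] = ctrl_range.mean(axis=1)
for _ in range(100):
    mujoco.mj_step(model, data)

print("joint angles:", data.qpos)
print("touch sensor readings:", data.sensordata)
```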
2.2. Synergy Space of Haptic Information
The proposed low-dimensional synergy space is created by encoding high-dimensional 3D haptic point clouds of variously shaped objects obtained from successful grasps. This low-dimensional representation of haptic information helps to reduce the action space that the DRL policy explores when learning how to manipulate objects. An encoder-decoder architecture is trained on a database of grasping demonstrations to obtain the latent representation.
2.2.1. PCA-Based Synergy Space
Previous works that derived a latent representation for grasping poses [20,21,27,28] showed that the joint angles of the hand are not controlled independently and that grasping poses can be represented in a low-dimensional synergy space. Consequently, reducing the complexity of the many DoFs of anthropomorphic hands allows grasp poses to be controlled more effectively. However, the low-dimensional representation of grasps was obtained by applying PCA to a set of handcrafted features, namely the hand joint angles of hand poses predefined according to the object shape. In contrast to [20,21,23], our proposed method replaces the predefined set of hand joint angles with higher-dimensional 3D point clouds of haptic information. The haptic information covers variously shaped objects (an apple, camera, flashlight, hammer, banana, and lightbulb) extracted from a public database of human demonstrations with a large variety of successful grasps, which yields a dataset containing a rich variety of natural hand poses. Nevertheless, obtaining a linear combination of components that covers most of the variance in point clouds of 3D object shapes using PCA can be limited: PCA assumes normally distributed data and may fail to approximate the complex manifold of the natural hand poses in the database. Therefore, in this work, we use a deep neural network based on flow models that learns a non-linear parametric function to capture the underlying variability of such high-dimensional data. We expect this approach to better exploit the latent representation of the point clouds for generating natural hand poses.
2.2.2. Contact Map Dataset
The ContactDB database contains data for various household objects, including 3D point clouds of hand-contact information from grasping demonstrations by human hands. A thermal camera recorded the heat prints left by a human hand after grasping each object [29,30]. The thermal images represent the hand-object contact map, and the contact points are recorded as continuous values defined over the mesh surface of each object. The point clouds include samples of six objects, each grasped by 50 participants with the intent of either using the object or handing it off after grasping. We normalized the contact values to the range of 0 to 1 and sampled a fixed number of points $N$ from the values above a threshold of 0.75, which indicates a high grasping probability, yielding the 3D point clouds of haptic contact points $X \in \mathbb{R}^{N \times 3}$. After sampling, each point cloud was augmented with random rotations until 162 point clouds were obtained for each of the six objects used in this study.
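The preprocessing described above can be sketched as follows; the contact-map and vertex arrays, the loading step, and the helper names are placeholders for illustration rather than the code used to build the dataset.

```python
# Sketch of the contact-map preprocessing: normalize, threshold, sample, rotate.
import numpy as np
from scipy.spatial.transform import Rotation

def sample_contact_points(vertices, contact, n_points, threshold=0.75, seed=0):
    """Normalize contact values to [0, 1], keep high-probability contact
    vertices, and sample a fixed-size 3D point cloud from them."""
    rng = np.random.default_rng(seed)
    c = (contact - contact.min()) / (contact.max() - contact.min() + 1e-8)
    candidates = vertices[c > threshold]
    idx = rng.choice(len(candidates), size=n_points, replace=True)
    return candidates[idx]                                   # (n_points, 3)

def augment_with_rotations(points, n_rotations, seed=0):
    """Augment one point cloud with random 3D rotations."""
    rots = Rotation.random(n_rotations, random_state=seed)
    return np.stack([r.apply(points) for r in rots])          # (n_rotations, n_points, 3)
```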
2.2.3. Synergy Space via Continuous Normalizing Flow
We propose a framework that condenses high-dimensional 3D point clouds into a latent space capturing the local variations and complicated distribution of haptic information from human grasping demonstrations. For this purpose, an encoder-decoder network using a flow-based model is proposed. Flow-based models efficiently map a complex distribution (e.g., 3D point clouds of natural hand poses) into a compact space through a sequence of invertible transformation functions [31]. In particular, we can think of a point cloud from the haptic maps, $X$, as being drawn from a larger distribution of haptic information $P(X)$. The continuous normalizing flow can sample points from a generic prior distribution (e.g., a Gaussian) and induce a transformation to match the distribution of $X$ [31,32]. The latent space derived from the database of natural hand poses using the proposed encoder-decoder model is named the synergy space.
The encoder network $Q_\phi$, based on the PointNet architecture [33], infers a posterior $z$ (i.e., a sample of the synergy space) given the input haptic point cloud $X$. The encoder is a neural network with four 1D convolutional layers with feature maps of 128, 128, 256, and 512, respectively, followed by three fully connected layers with 256, 64, and 4 units. The encoder architecture includes max-pooling layers as the symmetric function and T-Net modules for the input and feature alignment networks, as introduced in PointNet. The decoder $P_\psi$ is based on the continuous normalizing flow (CNF) architecture in [34]. It computes the reconstruction of the point cloud $\hat{X}$ conditioned on the latent variable $z$ by evaluating the likelihood of $X$ through the inverse CNF, $f^{-1}$. The decoder is instantiated as a three-layer fully connected network with 512 units in each layer.
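A simplified sketch of such a PointNet-style encoder, with the layer sizes listed above but without the T-Net alignment modules, is shown below; it is an approximation of the described architecture, not the exact network.

```python
# Simplified PointNet-style encoder: per-point Conv1d features (128-128-256-512),
# symmetric max-pooling over points, and FC layers (256-64-latent_dim).
import torch
import torch.nn as nn

class HapticEncoder(nn.Module):
    def __init__(self, latent_dim=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(3, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 512, 1), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, points):               # points: (batch, n_points, 3)
        x = points.transpose(1, 2)            # -> (batch, 3, n_points)
        x = self.conv(x)
        x = x.max(dim=2).values                # symmetric max-pooling over points
        return self.fc(x)                      # (batch, latent_dim) synergy sample

# Example usage: z = HapticEncoder()(torch.randn(8, 1024, 3))
```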
The CNF employs a sequence of invertible transformations, defined by neural networks, that learns a mapping from the synergy space $z$ to the input $X$ such that the embedding $z = f^{-1}(X)$ corresponds to $X = f(z)$. The normalizing flow $f$ is a parametrized function that models the probability density of $X$ to estimate the contact locations by minimizing the negative log-likelihood:
$$ \mathcal{L} = -\mathbb{E}_{X \sim P(X)}\big[\log P_\psi(X \mid z)\big], \qquad z = Q_\phi(X). $$
For a single deterministic sample $x$, using the change of variables, the inverse function theorem, and the properties of logarithms, the log-density can be written as [32]:
$$ \log p(x) = \log p(z_0) - \sum_{k=1}^{K} \log \left| \det \frac{\partial f_k}{\partial z_{k-1}} \right|, $$
where $z_0$ is a sample from the prior, $x = f_K \circ \cdots \circ f_1(z_0)$, the partial derivative is the Jacobian matrix of the function $f_k$ at step $k$, and the log-determinant measures the change in log-density produced by the transformation $f_k$. In practice, choosing transformations whose Jacobian is a triangular matrix allows a fast calculation and ensures invertibility. The density function of the synergy space $p(z)$ can be defined as a spherical multivariate Gaussian distribution $\mathcal{N}(0, I)$. In the case of continuous 3D point clouds with a continuous density $p(x(t))$ and a synergy space prior distribution $p(z)$, the CNF can be written as [34]:
$$ \log p(x(t_1)) = \log p(x(t_0)) - \int_{t_0}^{t_1} \operatorname{Tr}\!\left( \frac{\partial f}{\partial x(t)} \right) dt, $$
where $x(t_0)$ is a sample from the prior and $x(t_1)$ is the corresponding point of the reconstructed cloud. Here, $x(t_0)$ can be calculated using the inverse flow of the transformation function $f$. A black-box ordinary differential equation (ODE) solver can be applied to estimate the outputs and the input gradients of the continuous normalizing flow [35].
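As a worked numerical check of the continuous change-of-variables formula (not the CNF decoder itself), the following example uses linear dynamics $dx/dt = Ax$, for which both the Jacobian trace and the flow are available in closed form.

```python
# Numerical check of log p(x(t1)) = log p(x(t0)) - \int Tr(df/dx) dt for the
# linear flow dx/dt = A x, where Tr(df/dx) = tr(A) is constant and the unit-time
# flow map is expm(A). This illustrates the formula only; it is not the paper's CNF.
import numpy as np
from scipy.linalg import expm
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
A = rng.normal(scale=0.3, size=(3, 3))           # defines the flow dx/dt = A x
x0 = rng.normal(size=3)                           # a sample from the prior N(0, I)

prior = multivariate_normal(mean=np.zeros(3), cov=np.eye(3))
log_p0 = prior.logpdf(x0)

# Integrate the flow for unit time: x(1) = expm(A) x(0).
x1 = expm(A) @ x0

# Continuous change of variables: subtract the integrated Jacobian trace.
log_p1_cnf = log_p0 - np.trace(A)

# Reference: density of the pushforward of N(0, I) through the linear map expm(A).
pushforward = multivariate_normal(mean=np.zeros(3), cov=expm(A) @ expm(A).T)
log_p1_ref = pushforward.logpdf(x1)

assert np.isclose(log_p1_cnf, log_p1_ref)
```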
The synergy space was built by considering the haptic contact map dataset to be part of a larger distribution of human-like finger placements for natural hand poses. By learning a synergy space of this larger distribution via CNF, new haptic contact points can be proposed that are similar to those in the contact map dataset.
2.2.4. Training and Validation of the Synergy Space via CNF
Previous works that derived a synergy space for high-DoF robotic hands selected the leading principal components from PCA to represent the variability of grasp poses [20,21,22,23,25]. Although the dimensions differ between these works, a minimum of three or four principal components is commonly used (in some cases, additional dimensions are added for finer control). Therefore, the CNF latent space is trained and tested to find the optimal dimension, starting with 4 and increasing to 16 and 32.
The point clouds from the contact map dataset were randomly split into 90% training and 10% testing sets. The CNF is trained using a black-box ODE solver to compute the gradients of the normalizing flows and update the parameters of the deep network. During testing, we computed the Chamfer distance (CD) and the earth mover's distance (EMD) between the ground-truth point clouds and the reconstructed output of the CNF decoder. Lower values of these metrics indicate a better reconstruction capability of the decoder. They are defined as follows:
$$ \mathrm{CD}(X_1, X_2) = \sum_{x \in X_1} \min_{y \in X_2} \lVert x - y \rVert_2^2 + \sum_{y \in X_2} \min_{x \in X_1} \lVert x - y \rVert_2^2, $$
$$ \mathrm{EMD}(X_1, X_2) = \min_{\phi : X_1 \to X_2} \sum_{x \in X_1} \lVert x - \phi(x) \rVert_2, $$
where $X_1$ and $X_2$ are two point clouds with the same number of points and $\phi$ is a bijective function between them.
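For reference, the two metrics can be computed directly as follows (EMD via exact linear assignment, which is practical only for small point clouds); this is an illustrative implementation, not the evaluation code used in this work.

```python
# Chamfer distance and exact-assignment EMD between two small point clouds.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def chamfer_distance(p1, p2):
    """Sum of squared nearest-neighbour distances in both directions."""
    d = cdist(p1, p2, metric="sqeuclidean")            # (|p1|, |p2|) pairwise distances
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def earth_movers_distance(p1, p2):
    """Minimum-cost bijection between two equally sized point clouds."""
    assert len(p1) == len(p2)
    d = cdist(p1, p2, metric="euclidean")
    row, col = linear_sum_assignment(d)                 # optimal bijective matching
    return d[row, col].sum()
```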
2.3. Synergy-Based Regressor
The regressor estimates the corresponding joint angles for the hand pose sampled from the low-dimensional synergy space.
2.3.1. Haptic Map Dataset
A dataset of natural hand poses from human demonstrations and 3D point clouds of object contact points was created. By retargeting hand poses obtained from human demonstrations into our simulated environment, the haptic sensors on the simulated anthropomorphic robotic hand provided the approximate locations of the contact points over the object mesh. Then, a normal distribution of points around these contact locations was sampled from the object surface to form the 3D point clouds $X$. Complementarily, the joint angles $q$ corresponding to the 22-DoF hand pose while grasping the object were recorded. The generated dataset contains 2730 pairs of 3D contact point clouds and 22-DoF joint angles for each object. This dataset was named the haptic map dataset and was used to train the regressor.
2.3.2. Deep Regressor Using Synergy Information
The 3D point clouds from the haptic map dataset $X$ are mapped into synergy representations using the pre-trained encoder $Q_\phi$. Then, the regressor $R$ uses these latent representations to estimate the joint angles $\hat{q}$, generating a natural hand pose of the anthropomorphic hand. The regressor is parameterized by a neural network with four layers of 256 neurons each; every layer uses a ReLU activation function and batch normalization. The parameters are optimized by minimizing the mean squared error (MSE) between the regressor output and the corresponding joint angles $q$ of the 22-DoF fingers of the anthropomorphic hand.
Figure 2 illustrates the regressor under a DRL policy $\pi$, using a synergy representation of the point cloud to estimate the finger joint angles for the corresponding natural hand poses over time.
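A sketch of such a regressor is given below, with the layer sizes stated above; the exact layer ordering and training details are assumptions where the text does not specify them.

```python
# Sketch of the synergy-based regressor: a four-layer MLP with 256 units,
# batch normalization, and ReLU, mapping a 4-D synergy sample to 22 joint angles.
import torch
import torch.nn as nn

class SynergyRegressor(nn.Module):
    def __init__(self, latent_dim=4, n_joints=22, hidden=256):
        super().__init__()
        layers, in_dim = [], latent_dim
        for _ in range(4):
            layers += [nn.Linear(in_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU()]
            in_dim = hidden
        layers.append(nn.Linear(hidden, n_joints))
        self.net = nn.Sequential(*layers)

    def forward(self, z):                  # z: (batch, latent_dim) synergy samples
        return self.net(z)                 # (batch, n_joints) predicted joint angles

# Training would minimize nn.MSELoss() between predictions and recorded joint angles.
```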
2.3.3. Training and Validation of the Synergy-Based Regressor
The haptic map dataset was randomly split into 80% for training and 20% for testing. The regressor is trained by minimizing the MSE loss with the Adam optimizer with a learning rate of . For testing, the overall MSE was computed between the regressor prediction and each joint angle of the anthropomorphic hand. The regressor was trained until the error for every joint was less than 5% for each corresponding DoF.
2.4. Synergy-Based Deep Reinforcement Learning
In our system, the DRL policy explores potential solutions for natural object grasping and relocation using the anthropomorphic robotic hand. The policy accesses the synergy space corresponding to each object and obtains low-dimensional synergistic haptic information. Finally, the pre-trained regressor $R$ estimates a natural hand pose based on this synergistic information. The goal of the policy is to maximize the sum of expected rewards at the end of each training episode, using a reward function as an indicator of task completion.
A model-free DRL policy $\pi_\theta(a_t \mid s_t)$ is proposed, where the action vector $a_t$ is an element of the synergy space and represents the haptic maps in that space, and $s_t$ is the current state of the simulation, containing the Cartesian coordinates of the object, hand, and target location within the environment. A reward function $r_t$ provides a measure of the completeness of the task; it measures how close the object is to the target location at the end of an episode. For training, an on-policy gradient method optimizes the parameters $\theta$ of the policy by maximizing the objective $J(\theta)$ using gradient ascent. Our implementation follows the natural policy gradient (NPG) implementation of Kakade [36], which computes the vanilla policy gradient:
$$ g = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta\!\left(a_t^i \mid s_t^i\right) \hat{A}^{\pi}\!\left(s_t^i, a_t^i\right). $$
Then, it pre-conditions this gradient with the Fisher information matrix $F_\theta$ and makes the following normalized gradient ascent update:
$$ \theta_{k+1} = \theta_k + \sqrt{\frac{\delta}{g^{\top} F_{\theta_k}^{-1} g}}\, F_{\theta_k}^{-1} g, $$
where $\delta$ is the step size of choice. The advantage function $A^{\pi}(s_t, a_t)$ is the difference between the value of the given state-action pair and the value function of the state; the NPG implementation in [36] uses the generalized advantage estimator (GAE), defined as:
$$ \hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l}\, \delta_{t+l}^{V}, \qquad \delta_t^{V} = r_t + \gamma V(s_{t+1}) - V(s_t), $$
where $\delta_t^{V}$ is the temporal difference residual between consecutive predictions, assuming a value function $V$ that approximates the true value function $V^{\pi}$. Both the policy and value networks share the same architecture, defined by a three-layer neural network with 64 units in the hidden layers.
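For completeness, the GAE computation can be sketched as follows; the discount and smoothing parameters shown are illustrative defaults, not necessarily the values used in this work.

```python
# Generalized advantage estimation over one episode, computed backwards in time.
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: (T,) array; values: (T + 1,) array including the bootstrap value."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```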
2.4.1. DRL Reward for Task Completion
The shaping of the reward function for the manipulation tasks must reflect the progress of the learning problem and guide the exploration of the DRL. In our work, the reward is defined as a combination of the Cartesian distances between the anthropomorphic hand, the object, and the target location. The following distances are used: $d_{ho}$ is the distance from the robotic hand to the object, $d_{ht}$ is the distance from the robotic hand to the relocation target, and $d_{ot}$ is the distance from the object to the relocation target. The total reward $r_t$ combines these terms. During training, the first term aims to reduce the distance between the hand and the object. The second term evaluates whether the object has been lifted from the table above a threshold $\epsilon$. Once the object is lifted, minimizing the distances from the hand and the object to the target location yields a higher reward. The total reward $r_t$ measures the completion of the grasping and relocation tasks: the greater the sum of rewards at the end of an episode, the faster the policy solves the tasks.
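An illustrative sketch of this staged, distance-based reward is given below; the weighting and the lift threshold are placeholders, not the values used in the experiments.

```python
# Illustrative staged reward: approach the object, then (once lifted) move it
# toward the target. Coefficients and threshold are placeholders.
import numpy as np

def task_reward(hand_pos, obj_pos, target_pos, table_height, lift_eps=0.04):
    d_ho = np.linalg.norm(hand_pos - obj_pos)       # hand-to-object distance
    d_ht = np.linalg.norm(hand_pos - target_pos)    # hand-to-target distance
    d_ot = np.linalg.norm(obj_pos - target_pos)     # object-to-target distance

    reward = -d_ho                                  # always: approach the object
    if obj_pos[2] > table_height + lift_eps:        # object lifted above the threshold
        reward += 1.0 - d_ht - d_ot                 # then: move it to the target
    return reward
```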
2.4.2. Training and Validation for Synergy-Based DRL
Using our proposed approach, an individual policy, implemented as a multilayer perceptron with two hidden layers of 64 nodes, was trained three times per object. All policies were trained for 5000 iterations, each iteration with episodes and a time horizon of time steps. The parameters of each policy were initialized with a different random seed every time. The proposed method is compared against a standard DRL trained per object without the synergy space to solve the same grasping and relocation tasks, using the NPG algorithm and the same reward function $r_t$; this training also used the same number of iterations, episodes, and time steps. We evaluate the success rate of both methods, defined as the ratio of successful grasping and relocation trials among all generated trajectories. We compare the success rate of the standard DRL using NPG without the synergy space and the proposed DRL using the synergy space over 100 generated trajectories.
4. Discussion
A robust DRL system with anthropomorphic robotic hands requires learning an effective policy for controlling a robot hand with many DoFs. By learning from human grasping demonstrations, we can leverage the prior knowledge needed to learn various robotic manipulation tasks with DRL. In this work, we have shown that integrating a synergy space derived from haptic information into DRL improves the learning of object manipulation. First, deriving the latent space via CNF showed promising results, reconstructing the complex distribution of high-dimensional point clouds, as shown in
Figure 3. We empirically explored the dimension of the latent space and found that a 4-dimensional synergy space can encode as much information as 16- and 32-dimensional synergy spaces. Further analysis of the synergy dimension could clarify when a more expressive (i.e., larger) latent dimension is beneficial.
The policies trained via the proposed method outperform the baseline on five of the six objects tested, as shown in
Figure 5. The hand poses estimated by the proposed regressor reflect the variety of natural grasps in the grasping demonstration database used for deriving the synergy space. The proposed DRL could therefore be used to grasp objects with shapes similar to those in the training database; for more complex or novel objects, our method could be fine-tuned with an additional synergy space for them.

The DRL manipulation system controls the actions of a robotic hand with high dexterity for grasping and relocating various objects. The manipulation results show the hand extending the fingers before reaching the object and adapting the finger angles for grasping, as seen in
Figure 5. Furthermore, the abnormal behaviors in object manipulation observed with the standard DRL were avoided with our proposed DRL.
5. Conclusions
This work proposed a synergy-based DRL policy for generating natural hand poses for grasping and relocating variously shaped objects, including an apple, hammer, banana, lightbulb, flashlight, and camera. The average success rate achieved by the proposed DRL policy was 88.38%, outperforming the 50.66% success rate of a standard DRL without synergy. We trained an encoder-decoder network via CNF to reduce high-dimensional 3D point clouds of hand poses into a 4-dimensional synergy space. A synergy-based regressor then estimated the joint angles of the 22 DoFs of the anthropomorphic finger joints with less than 2% MSE. Finally, the proposed method for grasping and relocation with the anthropomorphic hand showed qualitative and quantitative results indicating its ability to produce natural movements for manipulating variously shaped objects.
Effectively reducing the action space of the DRL via synergies helped to learn a policy for natural grasping with a 22-DoF anthropomorphic hand. Learning an individual synergy space for each object allows the proposed DRL to produce natural hand poses and grasp the objects in a human-like manner (i.e., opening the hand before grasping and placing the fingers on the object naturally). An extension of this work would be to learn a general synergy space for various objects by conditioning the latent space on the corresponding object shape. To learn a non-trivial grasping and relocation policy with natural hand poses for an anthropomorphic hand, future work includes experiments with more complex objects and improving the synergy space representation for such objects.