3.2. Topological Skeleton Extraction
The proposed method has been developed for objects that:
- are rigid, as it is not applicable to deformable objects;
- are not perfectly symmetrical since, although a non-symmetric topological skeleton can still be defined, the detector may become confused during the extraction process by symmetrical features.
In this work, we focus on real automotive parts, namely two crankcase oil separator covers, made of cast iron and plastic, respectively, and an air pipe. The selected objects have an increasing level of difficulty. The first object, the cast iron crankcase oil separator cover, exhibits a high degree of symmetry with multiple grasping points and can be grasped by a cylindrical part, thereby reducing the impact of robot orientation errors around the axis of the pin. The second object, the plastic crankcase oil separator cover, also exhibits a high degree of symmetry with various grasping points but must be grasped with a specific orientation. Finally, the air pipe has a complex shape, lacks symmetry, and offers only two available grasping points, making it the most challenging object for the robot.
We decided to model their skeletons considering a few keypoints, some of which correspond to the potential grasping points for lifting the object with the robot manipulator (see the upper right part of Figure 1). To detect the Topological Skeleton (TS) of the objects to be grasped, we consider Lightweight OpenPose [19], whose architecture consists of three main components: a feature extractor, a TS estimator, and a Part Affinity Fields (PAF) network.
In comparison to the original OpenPose [20], we chose Lightweight OpenPose because the high computational demand of the former makes it less suitable for real-time applications on devices with little processing power. OpenPose employs a two-branch, multi-stage CNN architecture: the first branch predicts part confidence maps (PCMs) for body parts, and the second branch predicts part affinity fields (PAFs) to model the connections between body parts. The architecture involves several stages of convolutions that refine these predictions iteratively, resulting in high accuracy at the cost of increased computational load. Lightweight OpenPose, on the other hand, modifies the original architecture to reduce complexity and improve efficiency. The approach reduces the number of convolutional layers and stages, uses depthwise separable convolutions in place of standard convolutions to reduce the number of parameters and operations, replaces the heavier VGG19 or ResNet backbone used in the original OpenPose with MobileNet or ShuffleNet, and optimizes the computation of the part affinity fields to strike a balance between accuracy and efficiency.
Feature extraction. The original Lightweight OpenPose uses a MobileNetV1 network optimized for real-time feature extraction. MobileNet is a family of neural network architectures designed for efficient deployment on mobile and embedded devices with limited computational resources. The key feature of MobileNet is its use of depthwise separable convolutions, which significantly reduce the number of parameters and computations required while maintaining high accuracy.
While MobileNetV1 is a highly effective neural network, it does have some limitations and drawbacks that should be considered. For instance, it has limited accuracy because it is designed to balance model size and accuracy. It may not achieve the same level of accuracy as larger and more complex neural networks, especially on challenging objects where the keypoints (joints) are not evident. The depthwise separable convolution operation used in MobileNetV1 can be less expressive than traditional convolutional operations and may not be able to capture all the important features of an image.
For the above reasons, in this work, we propose to replace MobileNetV1 with MobileNetV3 [21] for the feature extraction step. MobileNetV3 has been designed to address the limitations of MobileNetV1 while maintaining efficiency. The architecture of the MobileNetV3 network used in this work is shown in Figure 3.
MobileNetV3 has two main variants: (1) MobileNetV3-Large, designed for higher-accuracy applications, with more layers and channels, and (2) MobileNetV3-Small, optimized for resource-constrained environments, trading off some accuracy for reduced computational demand. Our choice fell on the latter. MobileNetV3 introduces several new components, such as inverted residual blocks to maintain a high degree of efficiency, Squeeze-and-Excitation (SE) modules to improve the representational power of the model by recalibrating channel-wise feature responses, and the hard-swish (H-Swish) activation function.
The hard-swish function is a non-linear activation function that is designed to be more efficient than traditional activation functions such as ReLU. The hard-swish function is defined as
\[
\text{h-swish}(x) = x \, \frac{\text{ReLU6}(x + 3)}{6},
\]
where $\text{ReLU6}(x) = \min(\max(x, 0), 6)$ is a clipped ReLU function that outputs values between 0 and 6, still providing a non-linear behavior while increasing the computational speed with respect to the standard swish function.
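As an illustrative PyTorch-style sketch (the module name is chosen here for clarity, not taken from the original implementation), the activation can be written as:

```python
import torch
import torch.nn as nn

class HSwish(nn.Module):
    """Hard-swish activation: x * ReLU6(x + 3) / 6, a piecewise-linear
    approximation of the swish function used in MobileNetV3."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * nn.functional.relu6(x + 3.0) / 6.0
```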
Another important feature of MobileNetV3 is the use of a squeeze-and-excitation (SE) module. The SE module is a simple and efficient way to improve the representational power of the network (i.e., the ability to learn and represent complex patterns and features in the input data). It works by learning channel-wise scaling factors that are used to selectively enhance informative features in the network. The SE module is added to each bottleneck block in the MobileNetV3 architecture, contributing to increasing the accuracy with respect to MobileNetV1.
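A minimal sketch of such an SE block, assuming a reduction ratio of 4 and using a plain sigmoid gate in place of the hard-sigmoid adopted in MobileNetV3, could be:

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Squeeze-and-Excitation block: global average pooling followed by a
    two-layer bottleneck that produces per-channel scaling factors."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        squeezed = max(1, channels // reduction)
        self.fc1 = nn.Conv2d(channels, squeezed, kernel_size=1)
        self.fc2 = nn.Conv2d(squeezed, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Squeeze: summarize each channel with its spatial mean.
        scale = x.mean(dim=(2, 3), keepdim=True)
        # Excitation: learn channel-wise gates in [0, 1].
        scale = torch.relu(self.fc1(scale))
        scale = torch.sigmoid(self.fc2(scale))  # hard-sigmoid in MobileNetV3; plain sigmoid kept for brevity
        return x * scale
```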
MobileNetV3 also introduces a new technique called the mobile inverted bottleneck convolution (MBConv), which is a modified form of the depthwise separable convolution used in MobileNetV1. The MBConv block consists of three types of convolutions: a 1 × 1 convolution to expand the number of channels, a depthwise convolution to perform spatial filtering, and a 1 × 1 convolution to reduce the number of channels back to the original size. The MBConv block also includes a shortcut connection that allows the gradient to flow directly from the input to the output. MBConv block helps in increasing the expressiveness of the model with respect to MobileNetV1.
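Building on the HSwish and SqueezeExcitation modules sketched above, an MBConv block along these lines could look as follows; the expansion factor and kernel size are assumptions for illustration, not the exact values of the network in Table 2:

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Mobile inverted bottleneck (sketch): 1x1 expansion, 3x3 depthwise
    convolution, squeeze-and-excitation, 1x1 projection, and a shortcut
    connection when input and output shapes match."""
    def __init__(self, in_ch: int, out_ch: int, expand: int = 4, stride: int = 1):
        super().__init__()
        mid = in_ch * expand
        self.use_shortcut = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), HSwish(),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),
            nn.BatchNorm2d(mid), HSwish(),
            SqueezeExcitation(mid),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.block(x)
        return x + y if self.use_shortcut else y
```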
Finally, MobileNetV3 includes a middle-flow block that is used to maintain a high level of accuracy while minimizing the number of computations required. It also uses a dynamic convolution operation that adapts to the input data. The details of the parameters used for each block are described in
Table 2. The input image is resized to 384 × 384, and the output is a set of 24 × 24 × 96 feature maps, one for each keypoint and one for the background.
TS estimation. The feature maps from MobileNetV3 are the input used to generate a set of candidate keypoints for each object part in the image. In fact, the feature maps capture the spatial information in the input image and provide a rich representation that can be used to detect keypoints. A custom head is added on top of MobileNetV3 to predict keypoint locations; it consists of a series of convolutional layers that refine the features extracted by the backbone and generate a heatmap for each keypoint.
Figure 4 shows an example of the TS estimator output for the cast iron crankcase oil separator cover, which consists of five heatmaps, one for each considered keypoint. Each heatmap has the same spatial resolution as the feature maps and is normalized to have values between 0 and 1. Each pixel in the heatmap indicates the likelihood that the corresponding keypoint is present at that location in the image.
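As an illustrative post-processing sketch, the pixel location of each keypoint can be taken as the maximum of its heatmap, discarding detections below a confidence threshold (function name and threshold value are assumptions):

```python
import numpy as np

def keypoints_from_heatmaps(heatmaps: np.ndarray, threshold: float = 0.3):
    """Select the most likely pixel for each keypoint from its heatmap.

    heatmaps: array of shape (K, H, W) with values normalized to [0, 1].
    Returns one (row, col, score) tuple per keypoint, or None when the
    peak score falls below the confidence threshold."""
    keypoints = []
    for hm in heatmaps:
        row, col = np.unravel_index(np.argmax(hm), hm.shape)
        score = float(hm[row, col])
        keypoints.append((int(row), int(col), score) if score >= threshold else None)
    return keypoints
```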
PAF network. It takes the feature maps generated by the feature extractor as input and outputs a set of PAF feature maps, one for each pair of the detected keypoints. The PAF feature maps encode the direction and strength of the connections between keypoints using a two-channel representation, where each channel encodes a different aspect of the connection. Specifically, one channel encodes the unit vector that represents the direction of the connection, while the other channel encodes the confidence score that represents the strength of the connection.
Final TS computation. Once the PAF and heatmaps are generated, they are used together to group the individual keypoints into the final TS. The final TS is obtained by first identifying the candidate connections using the PAFs and then scoring the connections based on the likelihood that they form a valid connection. The connections are then used to construct the final TS by connecting the individual keypoints into a complete object TS.
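A minimal sketch of how a candidate connection between two keypoints can be scored by sampling the PAF along the segment joining them (array names and the number of samples are assumptions) is reported below:

```python
import numpy as np

def connection_score(paf_x, paf_y, p1, p2, n_samples=10):
    """Score a candidate connection between keypoints p1 and p2 (row, col)
    by averaging the alignment between the sampled PAF vectors and the
    unit vector pointing from p1 to p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    direction = p2 - p1
    norm = np.linalg.norm(direction)
    if norm < 1e-6:
        return 0.0
    direction /= norm
    scores = []
    for t in np.linspace(0.0, 1.0, n_samples):
        r, c = np.round(p1 + t * (p2 - p1)).astype(int)
        # The PAF x-channel holds the column (horizontal) component,
        # the y-channel the row (vertical) component of the connection direction.
        scores.append(paf_x[r, c] * direction[1] + paf_y[r, c] * direction[0])
    return float(np.mean(scores))
```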
Once the keypoints are calculated by Lightweight OpenPose, we use the depth information to build the final 3D TS.
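As an illustrative sketch, each 2D keypoint can be lifted to 3D with the pinhole camera model, assuming that the depth value at the keypoint pixel and the camera intrinsics are available:

```python
import numpy as np

def backproject_keypoint(u, v, depth, fx, fy, cx, cy):
    """Lift a 2D keypoint (u, v) with its measured depth to a 3D point in
    the camera frame using the pinhole model; fx, fy, cx, cy are the
    camera intrinsics obtained from calibration."""
    z = float(depth)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])
```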
Figure 5 shows some examples of final TSs for the three objects considered, where it is possible to note the robustness of the proposed TS extraction approach with respect to different views of the object, photometric changes, and partial occlusions. The approach also works with multiple instances of the object.
3.3. Grasping Pose Selection
After the selection of the $n_k$ keypoints for the TS extraction, these keypoints are also identified within the CAD model of the object through 3D modeling software. This results in the generation of a nominal three-dimensional representation of the TS, $\mathcal{S}_{CAD}$, in the CAD coordinate system, $\Sigma_{CAD}$. Moreover, the poses in $\Sigma_{CAD}$ of all possible $n_g$ grasping reference frames (see Figure 6), expressed via the (4 × 4) homogeneous transformation matrices [22], $T^{CAD}_{g_j}$, $j = 1, \dots, n_g$, can be localized on the model. For the sake of clarity, let us assume that each grasping point coincides with a keypoint.
Then, given all possible combinations of three keypoints, for each triple $i$, with $i = 1, \dots, n_\Pi$, a plane is identified via a coordinate frame attached to it, whose pose is denoted by the homogeneous transformation matrix $T^{CAD}_{\Pi_i}$. For each grasping reference frame, the relative pose with respect to the $i$th plane can be determined as
\[
T^{\Pi_i}_{g_j} = \left( T^{CAD}_{\Pi_i} \right)^{-1} T^{CAD}_{g_j}.
\]
Thus, for each grasping point, a list of $n_\Pi$ transformation matrices, $T^{\Pi_i}_{g_j}$, $i = 1, \dots, n_\Pi$, representing the grasping frame poses in the plane frames, can be computed. This set of operations, summarized in Algorithm 1, is performed only once.
Algorithm 1: Pre-processing algorithm.
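For illustration, the pre-processing step could be sketched as follows for a single grasping frame; the plane-frame convention (origin at the first keypoint, x axis toward the second, z axis along the plane normal) is an assumption made here for concreteness, not the authors' exact implementation:

```python
import itertools
import numpy as np

def plane_frame(k1, k2, k3):
    """Frame attached to the plane through three keypoints: origin at k1,
    x axis toward k2, z axis along the plane normal (assumed convention)."""
    k1, k2, k3 = (np.asarray(k, float) for k in (k1, k2, k3))
    x = (k2 - k1) / np.linalg.norm(k2 - k1)
    n = np.cross(k2 - k1, k3 - k1)
    z = n / np.linalg.norm(n)
    y = np.cross(z, x)
    T = np.eye(4)
    T[:3, 0], T[:3, 1], T[:3, 2], T[:3, 3] = x, y, z, k1
    return T

def precompute_grasp_in_plane(cad_keypoints, T_cad_grasp):
    """For every triple of CAD keypoints, store the grasping pose expressed
    in the corresponding plane frame (performed once, offline)."""
    grasp_in_plane = {}
    for triple in itertools.combinations(range(len(cad_keypoints)), 3):
        pts = [np.asarray(cad_keypoints[i], float) for i in triple]
        if np.linalg.norm(np.cross(pts[1] - pts[0], pts[2] - pts[0])) < 1e-9:
            continue  # (nearly) collinear keypoints do not define a plane
        T_cad_plane = plane_frame(*pts)
        grasp_in_plane[triple] = np.linalg.inv(T_cad_plane) @ T_cad_grasp
    return grasp_in_plane
```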
At runtime, the following steps are executed:
- 1.
A YOLO detector [23] is adopted to distinguish between the objects. YOLO has been chosen since it is faster than classifier-based systems while achieving comparable accuracy: it treats object detection as a single regression problem and makes its predictions with a single network evaluation. Moreover, YOLO can detect and classify multiple objects simultaneously within an image.
- 2.
The current 3D TS, $\mathcal{S}_c$, is extracted.
- 3.
The grasping point closest to the camera, $g^{\star}$, is selected as the best one.
- 4a.
If at least 3 keypoints are visible, a set of three keypoints identifying the $i$th plane is used to compute the corresponding plane pose in the camera frame, $T^{c}_{\Pi_i}$, and to select the homogeneous transformation matrix, $T^{\Pi_i}_{g^{\star}}$, that identifies the grasping pose in the plane frame. Then, the procedure continues with step 5.
- 4b.
If only 2 or fewer keypoints are visible, the robot starts moving in a circle around the center of the object bounding box to acquire a new image from a different point of view. Then, the procedure returns to step 1.
- 5.
The grasping pose in the camera frame is computed as
\[
T^{c}_{g^{\star}} = T^{c}_{\Pi_i} \, T^{\Pi_i}_{g^{\star}}.
\]
This procedure is summarized in Algorithm 2.
Let us define the homogeneous transformation matrix $T^{e}_{c}$, i.e., the constant homogeneous matrix performing the transformation between the camera frame and the end-effector frame, obtained via the calibration method described in Section 3.1.
Algorithm 2: Runtime algorithm.
To capture the grasping pose in the inertial frame, $T^{c}_{g^{\star}}$ is transformed as follows
\[
T^{w}_{g^{\star}} = T^{w}_{e} \, T^{e}_{c} \, T^{c}_{g^{\star}},
\]
where $T^{w}_{e}$ is the homogeneous matrix representing the pose of the end-effector in the inertial frame.
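A compact sketch of this chain of transformations, with illustrative argument names for the matrices defined above, could be:

```python
import numpy as np

def grasp_pose_in_world(T_w_e, T_e_c, T_c_plane, T_plane_grasp):
    """Chain the homogeneous transforms so that the selected grasping pose
    is expressed in the inertial (world) frame."""
    T_c_g = T_c_plane @ T_plane_grasp   # grasping pose in the camera frame
    return T_w_e @ T_e_c @ T_c_g        # grasping pose in the inertial frame
```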
Remark 1. It is worth noting that if the grasping point is not coincident with a keypoint, the above procedure is still applicable, but a further constant transformation needs to be applied to link the grasping point to one of the keypoints belonging to the plane.
3.4. Robot Grasping
To perform the grasp, the end-effector must be commanded to align its reference frame with the grasping reference frame. The trajectory is planned by assigning a sequence of three points: the first one is the view pose of the robot, the intermediate one is the approach point, i.e., a point positioned along the z axis of the grasping reference frame at a distance of 10 cm from the origin, and the last one is the estimated grasping position, $p_g$. More in detail, the end-effector desired position, $p_d(t)$, is defined as
\[
p_d(t) =
\begin{cases}
p_v + \dfrac{s_1(t)}{\| p_a - p_v \|} \, (p_a - p_v), & 0 \le t \le t_a, \\[2mm]
p_a + \dfrac{s_2(t)}{\| p_g - p_a \|} \, (p_g - p_a), & t_a < t \le t_f,
\end{cases}
\]
where $p_v$ is the view position, $p_a$ is the approach point position, and $s_1(t)$ ($s_2(t)$) is the arc length from $p_v$ to $p_a$ (from $p_a$ to $p_g$). To ensure continuous accelerations and velocities at the path points, both for $s_1(t)$ and $s_2(t)$, the time law can be designed as a quintic polynomial. Regarding the time instants, $t_f$ is the duration of the motion, and $t_a$ is the intermediate time instant at the approach point, which is chosen to have a fast motion until the approach point and a slow motion in the object's proximity.
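As an illustrative sketch of this position time law, assuming rest-to-rest quintic profiles for $s_1(t)$ and $s_2(t)$ and the straight-line segments defined above (function names are hypothetical):

```python
import numpy as np

def quintic_arc_length(t, t0, t1, length):
    """Rest-to-rest quintic time law for the arc length on [t0, t1]:
    zero velocity and acceleration at both endpoints."""
    tau = np.clip((t - t0) / (t1 - t0), 0.0, 1.0)
    return length * (10 * tau**3 - 15 * tau**4 + 6 * tau**5)

def desired_position(t, p_v, p_a, p_g, t_a, t_f):
    """Piecewise straight-line path view -> approach -> grasp, each segment
    driven by its own quintic arc length."""
    p_v, p_a, p_g = (np.asarray(p, float) for p in (p_v, p_a, p_g))
    if t <= t_a:
        d = p_a - p_v
        s = quintic_arc_length(t, 0.0, t_a, np.linalg.norm(d))
        return p_v + s * d / np.linalg.norm(d)
    d = p_g - p_a
    s = quintic_arc_length(t, t_a, t_f, np.linalg.norm(d))
    return p_a + s * d / np.linalg.norm(d)
```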
Regarding the end-effector orientation, it is planned to reach the orientation of the estimated grasping pose, $T^{w}_{g^{\star}}$, at the approach point and to keep such orientation constant during the last part of the path.
The planned trajectory in terms of position and orientation is the input of the closed-loop inverse kinematics algorithm [22] aimed at computing the reference values of the joint positions and velocities. Let us denote with $p_e$ and $R_e$ the end-effector position and orientation, respectively, and with $R_d$ the end-effector desired orientation. The robot joint velocity references, $\dot{q}_r$, are computed as
\[
\dot{q}_r = J^{\dagger}(q) \, (v_d + K \, e),
\]
where $J^{\dagger}$ denotes the right pseudo-inverse of the robot Jacobian matrix, $K$ is a positive definite matrix gain, $v_d$ is the desired end-effector linear and angular velocity, and $e$ is the tracking error defined as
\[
e = \begin{bmatrix} p_d - p_e \\ \eta_e \, \epsilon_d - \eta_d \, \epsilon_e - S(\epsilon_d) \, \epsilon_e \end{bmatrix},
\]
where $\{\eta_e, \epsilon_e\}$ and $\{\eta_d, \epsilon_d\}$ are the unit quaternions extracted from $R_e$ and $R_d$, respectively, and $S(\cdot)$ is the skew-symmetric matrix operator performing the cross product [22].
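For clarity, a minimal NumPy sketch of one step of this control law, with each quaternion passed as an $(\eta, \epsilon)$ pair and illustrative argument names, could be:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix S(v) such that S(v) @ u = v x u."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def clik_joint_velocities(J, p_d, p_e, quat_d, quat_e, v_d, K):
    """One step of the closed-loop inverse kinematics law
    dq_r = pinv(J) @ (v_d + K @ e), with the quaternion-based orientation
    error; quaternions are given as (eta, eps_x, eps_y, eps_z)."""
    eta_d, eps_d = quat_d[0], np.asarray(quat_d[1:], float)
    eta_e, eps_e = quat_e[0], np.asarray(quat_e[1:], float)
    e_pos = np.asarray(p_d, float) - np.asarray(p_e, float)
    e_ori = eta_e * eps_d - eta_d * eps_e - skew(eps_d) @ eps_e
    e = np.concatenate([e_pos, e_ori])
    return np.linalg.pinv(J) @ (np.asarray(v_d, float) + K @ e)
```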
A flowchart representation highlighting the whole process is given in
Figure 7.