Article

A Light-Weight Grasping Pose Estimation Method for Mobile Robotic Arms Based on Depthwise Separable Convolution

1 China Institute of FTZ Supply Chain, Shanghai Maritime University, Shanghai 201306, China
2 Logistics Engineering College, Shanghai Maritime University, Shanghai 201306, China
3 Economics and Management College, Shanghai Maritime University, Shanghai 201306, China
* Author to whom correspondence should be addressed.
Actuators 2025, 14(2), 50; https://doi.org/10.3390/act14020050
Submission received: 21 November 2024 / Revised: 19 January 2025 / Accepted: 21 January 2025 / Published: 24 January 2025
(This article belongs to the Section Actuators for Robotics)

Abstract

The robotic arm frequently performs grasping tasks in unstructured environments. However, owing to complex network architectures and constantly changing operational environments, balancing grasping accuracy and speed poses significant challenges. Unlike fixed robotic arms, mobile robotic arms offer flexibility but suffer from relatively unstable bases, necessitating improvements in disturbance resistance for grasping tasks. To address these issues, this paper proposes a light-weight grasping pose estimation method called Grasp-DSC, specifically tailored for mobile robotic arms. This method integrates the deep residual shrinkage network and depthwise separable convolution. Attention mechanisms and soft thresholding are employed to improve the arm’s ability to filter out interference, while parallel convolutions enhance computational efficiency. These innovations collectively enhance the grasping decision accuracy and efficiency of mobile robotic arms in complex environments. Grasp-DSC is evaluated using the Cornell Grasp Dataset and Jacquard Grasp Dataset, achieving 96.6% accuracy and an inference time of 14.4 ms on the former. Finally, grasping experiments conducted on the MR2000-UR5 validate the applicability of Grasp-DSC in real-world scenarios, achieving an average grasping success rate of 96%.

1. Introduction

With the rapid advancement of artificial intelligence technology, robotic arms have found extensive applications in industries, agriculture, and household domains. Among the various functionalities of robotic arms, grasping is considered fundamental and crucial. The success of a grasp hinges on swiftly and accurately determining the grasping pose of the robotic arm. Grasping pose estimation for robotic arms has garnered significant international research attention due to its potential to enhance work environments and living conditions. In recent years, there has been a notable shift in research focus from fixed robotic arms to mobile robotic arms. Compared to traditional industrial robotic arms that repetitively grasp fixed objects within fixed workspaces, mobile robotic arms face more challenges when performing grasping tasks in unstructured environments. These challenges include variations in lighting, environmental noise, base instability, and human–machine interactions [1]. Therefore, addressing grasping challenges in unstructured environments with mobile robotic arms requires mitigating environmental disturbances and generating appropriate grasping configurations based on the characteristics of the object to be grasped, including the arm’s pose and the gripper aperture width.
Over the past few decades, many scholars have extensively researched methods for estimating the grasping poses of robotic arms. In terms of information input, these methods can be categorized into two types: two-dimensional image information and three-dimensional model information [2]. Some scholars use RGB images and point clouds as input and estimate grasping poses using manually annotated grasping features. However, the effectiveness in practical scenarios is frequently constrained by the substantial limitations associated with manual labeling [3]. Also, some scholars first reconstruct the 3D model of objects and use the reconstruction information as input to estimate the grasping poses of robotic arms. Agnew et al. utilized depth images to reconstruct the three-dimensional models of objects and subsequently detected the optimal grasping poses for robotic arms. However, the process of re-modeling for each grasp operation is time-consuming [4]. In terms of pose estimation, methods can generally be divided into two main categories: formula-based computational methods and deep learning-based model methods. During the grasping process, the core idea of formula-based computational methods is to perceive the local features of the target object using sensors. These methods utilize mathematical and physical formulas from geometry, kinematics, and mechanics to design various constraints that ensure the stability and task compatibility of the robotic arm during grasping [5]. However, the computational complexity of this approach increases with the addition of constraints, and the design of these constraints often requires specific conditions to be met by the formulas. Therefore, it cannot effectively handle complex environments, often leading to poor practical application outcomes [6]. Currently, with the continuous advancement of sensors and deep learning technology, deep learning has shown outstanding performance in generating grasping configurations for robotic arms. However, in deep learning models, the scale and structure of neural networks significantly impact overall performance, often exhibiting contradictory tendencies between performance and efficiency. Larger networks increase computation time and costs, while smaller networks may reduce performance [7]. Nevertheless, achieving high-performance and high-efficiency robotic arm grasping is crucial in practical applications. During the grasping process, it is essential not only to generate accurate grasping configurations but also to minimize the computation time required for grasping actions [8].
Therefore, this paper proposes a light-weight grasping pose estimation method called Grasp-DSC for mobile robotic arms to strike a balance between grasping performance and efficiency. The model takes RGB images and depth images as input and outputs parameters including grasping quality, grasp angle, and opening width, from which a grasping rectangle is computed. Unlike previous methods, Grasp-DSC introduces both the deep residual shrinkage network (DRSN) and depthwise separable convolution (DSC) in its model structure. This approach enhances grasping accuracy while concurrently performing light-weight processing of parallel convolutional kernels to control computational costs and improve grasping efficiency. Additionally, mobile robotic arms, compared to fixed ones, need to handle interference during actual grasping tasks. To address this issue, this paper employs soft thresholding to enhance the mobile robotic arm’s ability to filter out interference, thereby further improving grasping accuracy.
In summary, the primary contributions of this study are outlined as follows:
(1)
This paper introduces Grasp-DSC, a light-weight grasping pose estimation method tailored for mobile robotic arms. Grasp-DSC achieves a tradeoff between grasping speed and accuracy.
(2)
Grasp-DSC utilizes adaptive soft thresholding in the DRSN to effectively mitigate the impact of background noise, overlapping objects, and relative instability of robotic arm bases. This enhancement improves the model’s robustness against issues such as insufficient feature extraction and interference.
(3)
Grasp-DSC introduces DSC, which partitions spatial dimensions and channel dimensions to parallelize convolutional kernels. This approach reduces network complexity, thereby optimizing the balance between grasping accuracy and efficiency for robotic arms.
(4)
Experimental evaluations using the Cornell Grasp Dataset and Jacquard Grasp Dataset validate the effectiveness of Grasp-DSC. Comparative analyses with multiple algorithms demonstrate its superior performance.
(5)
The practical applicability of Grasp-DSC is validated through grasping experiments conducted on the MR2000-UR5 platform. These experiments underscore Grasp-DSC’s efficacy in real-world applications.
The rest of this paper is organized as follows: Section 2 provides a comprehensive review of related works; Section 3 presents a detailed description of our method; Section 4 discusses the comparative experiments and results of Grasp-DSC using the Cornell Grasp Dataset and Jacquard Grasp Dataset; Section 5 outlines real-world grasping experiments with Grasp-DSC; and Section 6 provides the conclusion.

2. Related Works

2.1. Deep Learning Methods for Robotic Grasping

Grasping pose estimation for robotic arms entails integrating visual data with algorithmic models to determine optimal positions and orientations for completing gripping tasks. Recent advancements in deep learning have significantly bolstered these methodologies, enabling the training of algorithmic models on extensive datasets. Leveraging their training, these algorithms demonstrate effective performance in grasping tasks. Consequently, deep learning has emerged as the predominant research approach for grasping pose estimation in robotic arms [9].
Yan et al. [10] introduced an actor–critic method for grasping, where the actor algorithm samples grasping instances and the critic algorithm evaluates the outcomes to determine the optimal grasping poses. Antanas et al. [11] utilized a probabilistic logic approach to deduce pre-grasp configurations based on the intended task, utilizing semantic object parts. Their methodology also tackled uncertainties in visual perception and grasp planning, thereby improving the grasping efficacy of robotic arms. Nevertheless, its applicability is constrained to particular scenarios. To further improve grasping accuracy, Zhang et al. [12] proposed a grasp region network named ROI-GD. This approach utilizes features from ROIs to generate graspable regions, which are subsequently evaluated for optimal grasping positions. However, it faces challenges in establishing a universal quantitative metric applicable across diverse objects, relying solely on classification functions that may result in selecting suboptimal grasp regions [3]. To address this issue, Morrison et al. [13] proposed the GG-CNN model, which employs an encoder–decoder structure. It takes depth images as input and uses three convolutions and three deconvolutions to extract object features, ultimately outputting grasping quality, rotation angle, and gripper width. This generative algorithmic structure for generating grasping rectangles can transform input data into pixel-wise grasp configurations, facilitating the deployment of grasping control on robotic arms, and has become a foundational research focus in recent years [14]. Wang et al. [15], building upon Morrison et al.’s work, proposed an improvement aimed at generating grasp configurations for parallel grippers and three-finger grippers. However, their complex algorithm structure may compromise grasping efficiency. Yu et al. [16] integrated the Squeeze-and-Excitation Networks into the SE-ResUNet model, enhancing grasping accuracy with attention mechanisms but potentially overlooking multi-scale object features. Kumra et al. [14] proposed a novel generative residual convolutional neural network achieving excellent results, albeit with extensive training parameters exceeding one million. To achieve reduced model training parameters while maintaining excellent grasping accuracy, Shukla et al. proposed a generative thresholding neural network. The proposed model achieved good results in grasping accuracy and efficiency; however, its generalization ability is limited and requires further improvement [17]. Among the recent advancements in the field, AnyGrasp proposed by Fang et al. achieves superior accuracy in grasp pose estimation by employing a multi-task deep learning framework, which concurrently predicts grasp poses and extracts object features from RGB-D images. This method demonstrates robust generalization across various object types and environments, proving particularly effective in controlled scenarios with static objects [18]. GraspNet-1Billion introduced by Fang et al. is distinguished by its utilization of a large-scale dataset comprising over one billion grasp instances, facilitating exceptional generalization across a diverse array of objects and complex grasping scenarios. This extensive training allows the model to exhibit strong adaptability in dynamic and varied environments. However, the high computational cost associated with both training and inference limits its practical deployment in real-time applications with constrained computational resources [19].
These two approaches, though highly effective, pose challenges when deployed on resource-constrained mobile devices due to their high computational demands.

2.2. Deep Residual Shrinkage Network (DRSN)

The DRSN, a contraction variant of the deep residual network, has been widely adopted across various domains to mitigate interference and enhance information processing. Chen et al. [20] employed the DRSN in automatic modulation recognition, augmenting accuracy by incorporating soft thresholding to eliminate insignificant features after convolutional neural network (CNN) processing, thereby substantially reducing parameters while maintaining performance integrity. In earthquake assessment, Su et al. [21] employed the DRSN to optimize feature map channels via independent thresholds within residual networks, thereby improving assessment accuracy. In image recognition, the DRSN’s soft thresholding capability has been instrumental. Wang et al. [22] utilized deep networks and residual shrinkage blocks for semantic information extraction and image fusion, leveraging residual contraction modules to focus feature map channels separately on objects and backgrounds. Zhang et al. [23] enhanced the subnetwork for threshold acquisition within the DRSN in their study on rock image recognition. Their approach included augmenting the network with the global maximum pooling of features as information representation and introducing attention mechanism-based weight coefficients to differentiate the importance of different features. These modifications enhanced the soft thresholding function, thereby improving recognition accuracy under microscopic conditions.
In this study, Grasp-DSC leverages the soft thresholding feature of the DRSN to effectively suppress interference for mobile robotic arms. By incorporating attention mechanisms, it enhances feature emphasis on objects to be grasped, resulting in the precise generation of the grasping rectangles and thereby improving the overall grasping accuracy of the model.

2.3. Depthwise Separable Convolution (DSC)

DSC partitions spatial and channel dimensions, processing convolutions independently to decrease convolutional calculation parameters [24]. Widely adopted in research, DSC acts as a light-weight module enhancing convolutional kernel efficiency. Zheng et al. [25] addressed CNN suitability for embedded devices by integrating DSC with field-programmable gate arrays, proposing an efficient inference accelerator. Due to its light-weight nature, DSC finds application in mobile devices with limited computational capabilities. Li et al. [26] applied DSC to mobile underwater equipment, augmenting the AdamW optimizer with multi-dimensional channel modules to expedite training and enhance real-time underwater image restoration. Yi et al. [27] introduced a light-weight multi-class classification deep model for wearable devices, utilizing separable convolutions across multiple channels to strike an optimal balance between computational complexity and recognition accuracy. Li et al. [28] enhanced 3D object recognition in mobile networks like robotic and autonomous driving systems with LVNet, a light-weight network substituting standard 3D convolutions with DSC to reduce model size and computational demands. LVNet also integrates an attention mechanism to mitigate parameter reduction effects on performance.
Inspired by the methodology of Li et al., this paper introduces Grasp-DSC for deployment on mobile robotic arms. Grasp-DSC integrates DSC in place of standard convolutions within the DRSN, facilitating synergy between the two networks. This substitution aims to optimize recognition performance while alleviating computational demands, thereby achieving an equilibrium between grasping accuracy and efficiency.

3. Methodology

3.1. Grasp-DSC Framework

Figure 1 illustrates the architectural comparison among the GG-CNN, GG-CNN 2, and Grasp-DSC for robotic grasping systems. Grasp-DSC, introduced in this paper, is a light-weight grasping pose estimation model designed for mobile robotic arms.
The GG-CNN (Generative Grasping Convolutional Neural Network), introduced by Morrison et al. [13], encompasses two primary variants: the GG-CNN and the GG-CNN 2. This model is inspired by semantic segmentation and predicts grasp quality, opening width, and grasp angle for every pixel in the image. Known for its low parameter count, robust real-time capabilities, and effectiveness in challenging environments, the GG-CNN has emerged as a pivotal model in robotic grasping.
Figure 1 illustrates that the GG-CNN employs a classic encoder–decoder architecture. During the encoding phase, multiple standard convolutional layers sequentially extract features from depth images to generate a low-resolution feature map. In the decoding phase, feature maps are effectively reconstructed to their original size using stacked transposed convolutional layers. However, due to its relatively simple network structure and limited parameter count, the GG-CNN may face constraints in learning capacity when applied to large-scale datasets. To enhance its architecture, the GG-CNN 2 introduced dilated convolutions (layers L5 and L6) to bolster the model’s expressive power. Despite these advancements, the GG-CNN 2 achieved an accuracy of 65% on the Cornell Grasp Dataset and 84% on the Jacquard Grasp Dataset, indicating ongoing challenges in achieving optimal performance in complex environments.
Therefore, enhancing grasping prediction performance in complex environments requires bolstering the model’s deep learning capabilities. This entails not only deepening network layers to capture comprehensive global features but also effectively utilizing fine-grained information from shallow features to refine prediction outcomes. Moreover, compared to stationary robotic arms, mobile counterparts face increased operational interference factors, demanding robustness enhancements. However, due to the high real-time requirements of mobile robotic arms and the increased computation time and cost associated with overly large networks, it is crucial to consider light-weight network optimization to mitigate these issues. Figure 2 depicts the architecture of Grasp-DSC.
Figure 2 illustrates the optimization of Grasp-DSC based on the GG-CNN 2:
  • Grasp-DSC integrates the DRSN as the core residual unit within the encoder–decoder framework of the GG-CNN 2. This integration utilizes attention mechanisms and soft thresholding to enhance grasping accuracy and resistance to disturbances.
  • Grasp-DSC introduces parallel DSC to lighten the model, enabling simultaneous processing of convolutional kernels within the DRSN.
  • Grasp-DSC adopts Smooth L1 Loss as its loss function, incorporating weight coefficients to adjust training emphasis on different grasp parameters. This approach mitigates the potential training instability associated with traditional mean square error loss.

3.2. Deep Residual Shrinkage Network

The DRSN, introduced by Zhao et al. [29], integrates the deep residual network with attention mechanisms and soft thresholding. This integration is suitable for processing data that include noise, which refers to irrelevant interference information unrelated to the current task. Grasp-DSC utilizes attention mechanisms and soft thresholding capabilities from the DRSN to effectively handle interference factors in complex operational environments, thereby enhancing model robustness and improving grasping accuracy.
Figure 3 depicts the model architecture schematic of the DRSN. The DRSN extends the deep residual network by incorporating a subnetwork that learns thresholds applied via soft thresholding to each channel of the feature map. The green dashed box illustrates the overarching framework of the DRSN. Prior convolutional layers convert significant features into values with large absolute magnitudes while reducing the absolute magnitudes of features associated with redundant information. A subnetwork delineates the boundary between these feature types and utilizes soft thresholding to zero out redundant features while maintaining non-zero outputs for important features. Within the blue dashed subnetwork, the input feature map undergoes absolute value transformation for all elements. Subsequently, global average pooling extracts a set of features denoted as A. In a parallel pathway, these features are fed into a compact, fully connected network. Here, a sigmoid activation function adjusts the output to a range between 0 and 1, represented as α. The final threshold is computed as α × A, ensuring it remains positive and appropriately scaled. This method guarantees that the threshold is not excessively large. Finally, the DRSN achieves its complete architecture by stacking residual shrinkage building units (RSBUs), convolutional layers, batch normalization (BN), ReLU activation functions, global average pooling (GAP), and fully connected (FC) layers.
The expression for soft thresholding is given by Equation (1):
$$d_{out} = \begin{cases} d_{in} - \tau, & d_{in} > \tau \\ 0, & -\tau \le d_{in} \le \tau \\ d_{in} + \tau, & d_{in} < -\tau \end{cases}$$
In this context, $d_{in}$ represents the input, $d_{out}$ denotes the output, and $\tau$ stands for the threshold, which is a constant positive value. The fundamental principle involves decomposing the input signal, applying a soft thresholding function to filter the decomposed elements, and finally reconstructing the signal. From the above equation, it can be seen that soft thresholding sets features with absolute values lower than $\tau$ to 0 and shrinks other features towards 0, achieving a ‘shrinkage’ effect. Furthermore, the soft thresholding function is a flexible nonlinear mapping. From Equation (1), it can be observed that its gradient only takes values of 0 and 1. This property effectively mitigates issues such as gradient vanishing and gradient exploding.
Applying the DRSN to the GG-CNN 2 assigns varying thresholds to different samples, facilitating the removal of unimportant features while preserving crucial ones. Despite identity shortcuts across layers potentially passing unimportant features to higher levels, the accumulation of residual modules gradually diminishes their impact. This systematic approach effectively eliminates irrelevant features, thereby managing interference factors like background noise and overlapping objects. As a result, it enhances the operational stability of mobile robotic arms during grasping tasks.
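To make the threshold computation concrete, the following is a minimal PyTorch sketch of a channel-wise soft-thresholding block in the spirit of the DRSN described above. It assumes global average pooling of the absolute feature values and a small two-layer fully connected subnetwork; the layer sizes, reduction ratio, and all names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class SoftThresholdBlock(nn.Module):
    """Channel-wise adaptive soft thresholding in the spirit of the DRSN:
    threshold = alpha * A, with alpha in (0, 1) from a small FC subnetwork
    and A the global average of |x| per channel."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.BatchNorm1d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),               # alpha in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W)
        abs_mean = x.abs().mean(dim=(2, 3))                   # A, per channel
        alpha = self.fc(abs_mean)                             # scaling factor
        tau = (alpha * abs_mean).unsqueeze(-1).unsqueeze(-1)  # threshold
        # Soft thresholding: zero out |x| <= tau, shrink the rest towards zero
        return torch.sign(x) * torch.clamp(x.abs() - tau, min=0.0)


if __name__ == "__main__":
    block = SoftThresholdBlock(channels=16)
    feats = torch.randn(2, 16, 32, 32)
    print(block(feats).shape)   # torch.Size([2, 16, 32, 32])
```

In an RSBU, this thresholding would be applied to the residual branch before the identity shortcut is added back, so that suppressed features can still be compensated by later layers.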

3.3. Depthwise Separable Convolution

The GG-CNN employs a standard CNN architecture that typically operates in single-channel mode for image feature extraction. However, the use of a single convolutional channel often leads to insufficient feature extraction. To address this, parallel-structured CNNs are employed, in which two separate channels perform independent convolution operations on the upper-level feature maps. The resulting feature maps from the two channels are subsequently fused together. As depicted in Figure 4, this parallel structure utilizes convolutional kernels of different scales, M × M and N × N, across the two channels to comprehensively extract diverse features. This approach enhances feature diversity and robustness, thereby improving overall network accuracy and ultimately enhancing the grasping accuracy of the model.
The output of the convolutional layer in parallel convolutions can be computed using the system of Equation (2):
$$\begin{cases} x_{j,1}^{l} = f_1\left(\sum_{i \in N} x_i^{l-1} \times a_{i,j}^{l,1} + b_j^{l}\right) \\ x_{j,2}^{l} = f_1\left(\sum_{i \in N} x_i^{l-1} \times a_{i,j}^{l,2} + b_j^{l}\right) \\ x_j^{l} = f_2\left(x_{j,1}^{l},\ x_{j,2}^{l}\right) \end{cases}$$
where N is a subset of the upper-level feature map, $f_1(\cdot)$ denotes the activation function, $x_{j,1}^{l}$ and $x_{j,2}^{l}$ are the outputs of the convolutions for the first and second channels, respectively, $f_2(\cdot)$ represents the method of merging the feature maps from the two channels, and $x_j^{l}$ represents the fused output after merging.
DSC was initially introduced by Laurent Sifre in his doctoral thesis [30] and was subsequently developed by the Google team into two prominent models, Xception [24] and MobileNet [31]. Convolution across feature maps typically utilizes 3D kernels, which simultaneously learn correlations on both the spatial dimension and channel dimension. DSC breaks down the conventional convolution operation into two stages: depthwise convolution (DW) and 1 × 1 pointwise convolution (PW). In DW, each kernel independently processes each channel of input from the preceding layer. PW aggregates information and interacts across channels from the results of the preceding convolutional layer. Figure 5 illustrates the architecture of DSC. Compared to standard convolutions, DSC reduces the number of parameters required. Unlike standard convolutions, which consider spatial regions and channels concurrently, DSC first addresses spatial regions independently and subsequently processes channels, thereby separating spatial and channel operations. During DW processing, each channel of the input feature map undergoes individual processing using independent 3 × 3 convolutional kernels, resulting in new spatial feature maps and achieving channel-wise feature processing. The PW process, depicted in Figure 6, uses 1 × 1 convolutional kernels to linearly combine features across all channels, thereby altering the depth of the feature maps. Subsequently, a ReLU activation function introduces nonlinearity, generating the final new feature maps.
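As a concrete illustration of the DW + PW factorization described above, the following is a minimal PyTorch sketch of a depthwise separable convolution block; the channel counts, activation placement, and omission of batch normalization are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A standard 3x3 convolution factored into a depthwise (per-channel)
    convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # DW: one 3x3 kernel per input channel (groups = in_ch)
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch, bias=False)
        # PW: 1x1 convolution mixes information across channels
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.pointwise(self.depthwise(x)))


if __name__ == "__main__":
    x = torch.randn(1, 256, 64, 64)
    print(DepthwiseSeparableConv(256, 256)(x).shape)  # (1, 256, 64, 64)
```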
This paper integrates a parallel structure with DSC to propose a light-weight variant. As depicted in Figure 7, the process begins with feeding feature maps into two branches: one branch employs a 3 × 3 convolutional kernel, while the other utilizes a 5 × 5 convolutional kernel. Both branches execute depthwise convolution concurrently to extract features. Subsequently, 1 × 1 convolutional kernels are applied in each branch to consolidate channels and generate new sets of feature maps. Finally, another layer of 1 × 1 convolution ensures comprehensive information representation in the resulting feature maps. When the CNN exhibits a shallow hierarchy, employing parallel multi-scale convolutional kernels effectively extracts features across different scales, thereby enhancing network accuracy.
Assuming N = 128, with 256 input and 256 output feature maps in the network, the parameter count for the traditional 3 × 3 convolutional structure, P1, can be calculated as follows:
P1 = 256 × 3 × 3 × 256 = 589,824
The parameter count P2 for the proposed light-weight parallel DSC can be represented as follows:
P2 = 256 × 3 × 3 + 256 × 1 × 1 × 256 + 256 × 5 × 5 + 256 × 1 × 1 × 256 + 256 × 1 × 1 × 256 = 205,312
This demonstrates that integrating the light-weight parallel DSC structure into the GG-CNN 2 significantly reduces network parameters. Furthermore, the use of parallel convolutional kernels enhances the diversity of feature extraction, striking a balance between efficiency and speed.
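The following sketch builds the parallel light-weight block and reproduces the parameter counts above with PyTorch. It assumes that the two branches are fused by element-wise addition before the final 1 × 1 convolution and that biases are omitted, an assumption under which the stated P2 = 205,312 is recovered exactly; treat the module as an illustrative reconstruction rather than the exact Grasp-DSC layer.

```python
import torch
import torch.nn as nn

class ParallelDSCBlock(nn.Module):
    """Parallel depthwise separable branches with 3x3 and 5x5 kernels,
    fused (assumed element-wise addition) and consolidated by a final
    1x1 convolution; biases omitted to match the counts in the text."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.dw3 = nn.Conv2d(channels, channels, 3, padding=1,
                             groups=channels, bias=False)        # 256*3*3
        self.pw3 = nn.Conv2d(channels, channels, 1, bias=False)  # 256*1*1*256
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2,
                             groups=channels, bias=False)        # 256*5*5
        self.pw5 = nn.Conv2d(channels, channels, 1, bias=False)  # 256*1*1*256
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False) # 256*1*1*256
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b3 = self.pw3(self.dw3(x))
        b5 = self.pw5(self.dw5(x))
        return self.act(self.fuse(b3 + b5))   # fusion by addition (assumed)


def count_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())


if __name__ == "__main__":
    standard = nn.Conv2d(256, 256, 3, padding=1, bias=False)
    parallel = ParallelDSCBlock(256)
    print(count_params(standard))   # 589824 -> P1
    print(count_params(parallel))   # 205312 -> P2
```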

3.4. Loss Function

In Grasp-DSC, Smooth L1 Loss is selected as the loss function for the thorough evaluation of prediction errors. Smooth L1 Loss achieves a more favorable trade-off between robustness and sensitivity than traditional loss functions, which are prone to generating sparse gradients that can hinder the optimization process. In contrast, Smooth L1 Loss provides smoother gradients, thereby facilitating more stable and efficient training of deep learning models, particularly in the context of grasp pose estimation. This combination of robustness and gradient smoothness is well suited to the challenges of grasp pose prediction, especially in unstructured environments where noise and outliers are common. Equation (3) defines the loss function as the average error between predicted and true values in the sample data of Grasp-DSC:
$$\mathrm{Loss}(m, \hat{o}) = \frac{1}{n}\sum_{i} z_i$$
where m and $\hat{o}$ represent, respectively, the grasping pose estimated by Grasp-DSC within the set of target object grasp poses G and the ground-truth grasping pose from the dataset, and n denotes the number of samples in the dataset. Equation (4) defines the Smooth L1 Loss as a piecewise function: it uses a squared error when the absolute error is less than 1 and switches to an absolute error otherwise, ensuring a smooth transition. The formulation is specifically detailed as follows:
$$z_i = \begin{cases} 0.5 \times (m - \hat{o})^2, & \text{if } \left|m - \hat{o}\right| < 1 \\ \left|m - \hat{o}\right| - 0.5, & \text{otherwise} \end{cases}$$
Finally, the total loss function $\mathrm{Loss}_{all}$ for Grasp-DSC comprises three components: $\mathrm{Loss}_{\theta}$, $\mathrm{Loss}_{\delta}$, and $\mathrm{Loss}_{\mu}$. These components correspond to the losses for the three predicted grasp parameters, namely grasping quality, grasping angle, and grasping width. Equation (5) formulates it as follows:
$$\mathrm{Loss}_{all} = \mathrm{Loss}_{\theta} + \mathrm{Loss}_{\delta} + \mathrm{Loss}_{\mu}$$
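A minimal PyTorch sketch of this composite loss is given below. The mapping of the three terms to the grasp quality, angle, and width heads, as well as the per-term weights and tensor shapes, are assumptions made for illustration; the paper's exact weighting coefficients are not specified here.

```python
import torch
import torch.nn as nn

# Smooth L1 (Huber with beta = 1) applied to each prediction head,
# then summed into the total loss as in Equation (5).
smooth_l1 = nn.SmoothL1Loss()

def grasp_loss(pred_q, pred_angle, pred_width,
               gt_q, gt_angle, gt_width,
               w_q=1.0, w_angle=1.0, w_width=1.0):
    # Illustrative weights; Grasp-DSC uses weight coefficients to adjust
    # the training emphasis on the different grasp parameters.
    loss_q = smooth_l1(pred_q, gt_q)
    loss_angle = smooth_l1(pred_angle, gt_angle)
    loss_width = smooth_l1(pred_width, gt_width)
    return w_q * loss_q + w_angle * loss_angle + w_width * loss_width


if __name__ == "__main__":
    preds = [torch.rand(2, 1, 300, 300, requires_grad=True) for _ in range(3)]
    gts = [torch.rand(2, 1, 300, 300) for _ in range(3)]
    total = grasp_loss(*preds, *gts)
    total.backward()
    print(total.item())
```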

4. Experimentation and Results from the Dataset

4.1. Dataset

This paper leverages the publicly available Cornell Grasp Dataset [32] and Jacquard Grasp Dataset [33] for training and evaluating the performance of Grasp-DSC. These datasets are recognized benchmarks for training deep networks designed for robotic arm grasping tasks.
The Cornell Grasp Dataset comprises 1035 images featuring 244 different objects, each appearing in various positions and orientations within the images. It includes 8019 manually annotated grasping rectangles, detailing both successful and unsuccessful grasps based on the coordinates of their four vertices. Specifically, there are 5110 successful grasps, and 2909 unsuccessful grasps annotated in the dataset. However, due to its relatively small size, the Cornell Grasp Dataset is susceptible to overfitting, which restricts the generalization ability of models. Figure 8 illustrates an example from the Cornell Grasp Dataset.
The Jacquard Grasp Dataset encompasses 11,619 unique objects, 54,485 high-resolution RGB-D images, and over 1.1 million grasp annotations. In contrast to the Cornell Grasp Dataset, Jacquard offers more extensive and realistic annotations for grasps. Figure 9 illustrates some examples from the Jacquard Grasp Dataset. Jacquard excels over Cornell in dataset size, diversity of target objects, and complexity of grasp configurations. These comparative advantages significantly bolster the validation of deep learning models for grasping pose estimation, rendering it more compelling and valuable for research purposes.

4.2. Evaluation Metrics

As per the GG-CNN architecture, robotic grasping problems leverage grasping rectangle representations to derive pixel-level grasp configurations [13].
As depicted in Figure 10, with the robotic gripper perpendicular to the xy plane, a single grasp can be represented by Equation (6):
$$g = (p, \varphi, \omega, q)$$
where q represents the grasping quality, indicating the success probability normalized within the range [0, 1]. A higher q value signifies a higher probability of successful grasping. $\omega$ denotes the gripper’s opening width required for the grasp. p denotes the gripper’s position, where p = (x, y, z) represents the Cartesian coordinates of the gripper’s center point. $\varphi$ represents the rotation angle around the z-axis during the grasping process. Due to the symmetry of robotic grasping, the range of $\varphi$ is $\left[-\frac{\pi}{2}, \frac{\pi}{2}\right]$.
Assuming the grasp position and pose are detected from an n-channel image $I \in \mathbb{R}^{n \times w \times h}$ with height h and width w, the grasp in image coordinates can be defined as:
$$g_i = (p_i, \varphi_i, \omega_i, q)$$
where $p_i$ represents p converted into pixel indices at a certain depth H, given by $p_i = (u, v)$. $\varphi_i$ denotes the rotation in the camera reference coordinate system, while $\omega_i$ represents the required width in image coordinates, ranging from 0 to $\omega_{max}$, which denotes the maximum width of the gripper.
Next, utilizing Equation (8), we convert the image coordinates into the robot’s reference frame for the purpose of robot control and execution of grasping actions derived from the image:
$$G_r = T_{rc}\left(T_{ci}\left(g_i\right)\right)$$
where $T_{rc}$ represents the transformation parameters converting camera space to robot space using calibrated camera poses, and $T_{ci}$ represents the transformation parameters converting image space to 3D camera space using the intrinsic parameters of the depth camera.
Therefore, the collection of all grasp points for an object can be represented as:
$$G = (A, W, Q) \in \mathbb{R}^{3 \times w \times h}$$
where A, W, and Q denote the grasping angle, width, and quality score maps, respectively, from which the grasp points are determined.
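To illustrate the coordinate chain of Equation (8), the following sketch back-projects an image-space grasp point into the robot base frame, assuming a pinhole camera model described by intrinsics (fx, fy, cx, cy) for $T_{ci}$ and a calibrated 4 × 4 homogeneous transform for $T_{rc}$; all numerical values and names are illustrative.

```python
import numpy as np

def pixel_grasp_to_robot(u, v, depth, intrinsics, T_rc):
    """Convert an image-space grasp point p_i = (u, v) with depth value
    `depth` into a 3D point in the robot base frame (Equation (8))."""
    fx, fy, cx, cy = intrinsics
    # Image space -> 3D camera space (pinhole back-projection, T_ci)
    x_c = (u - cx) * depth / fx
    y_c = (v - cy) * depth / fy
    p_cam = np.array([x_c, y_c, depth, 1.0])
    # Camera space -> robot space (T_rc from hand-eye calibration)
    p_robot = T_rc @ p_cam
    return p_robot[:3]


if __name__ == "__main__":
    intr = (615.0, 615.0, 320.0, 240.0)   # illustrative intrinsics
    T_rc = np.eye(4)                      # placeholder calibration result
    print(pixel_grasp_to_robot(330, 250, 0.45, intr, T_rc))
```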
To ensure fair comparison with results from other algorithms, this paper utilizes the Intersection over Union (IoU) and the Jaccard index as evaluation metrics.
(1)
IoU: This metric measures the angular difference between the predicted grasping rectangle G and the ground truth grasping rectangle g. The IoU is expressed by the following formula:
$$\mathrm{IoU} = \left|\theta_G - \theta_g\right|$$
(2)
The Jaccard index: The area overlap index between the predicted grasping rectangle G and the ground truth grasping rectangle g. The Jaccard index is expressed by the following formula:
$$\mathrm{Jaccard} = \frac{\mathrm{area}(G \cap g)}{\mathrm{area}(G \cup g)}$$
When the grasp configuration of an object satisfies an IoU of less than 30 ° and a Jaccard index greater than 0.25, it is considered a successful grasp.
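The success test can be sketched as follows, using the shapely geometry library for the rectangle overlap; the 30° and 0.25 thresholds come from the text, while the angle folding and the use of shapely are illustrative implementation choices.

```python
import numpy as np
from shapely.geometry import Polygon

def is_successful_grasp(pred_corners, gt_corners, pred_angle, gt_angle,
                        angle_thresh_deg=30.0, jaccard_thresh=0.25):
    """Rectangle-metric success test: angular difference below 30 degrees
    and Jaccard index (area overlap) above 0.25."""
    # Angular difference, folded to respect the gripper's 180-degree symmetry
    diff = abs(pred_angle - gt_angle) % np.pi
    diff = min(diff, np.pi - diff)
    angle_ok = np.degrees(diff) < angle_thresh_deg

    pred_poly = Polygon(pred_corners)   # 4 (x, y) vertices, given in order
    gt_poly = Polygon(gt_corners)
    inter = pred_poly.intersection(gt_poly).area
    union = pred_poly.union(gt_poly).area
    jaccard = inter / union if union > 0 else 0.0
    return angle_ok and jaccard > jaccard_thresh


if __name__ == "__main__":
    gt = [(0, 0), (40, 0), (40, 20), (0, 20)]
    pred = [(5, 2), (45, 2), (45, 22), (5, 22)]
    print(is_successful_grasp(pred, gt, pred_angle=0.1, gt_angle=0.0))  # True
```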

4.3. Training Process

The training of Grasp-DSC is conducted using the PyTorch 2.0 deep learning framework with Python 3.8 on Ubuntu 18.04. The hardware setup is an NVIDIA TITAN XP GPU (NVIDIA Corporation, Santa Clara, CA, USA).
During the training process, the Cornell Grasp Dataset is divided into training and validation sets using a 9:1 ratio. The Xavier initialization method is applied to keep the variance of each layer’s outputs consistent with that of its inputs. Batch normalization layers are incorporated within convolutional layers to enhance spatial representation capability, mitigate gradient vanishing, and expedite training. Figure 11 illustrates the training loss curves for grasping quality, grasping angle, and grasping width. These curves depict rapid initial decreases in loss within the first 20 epochs, followed by stabilization, indicating the network’s effective extraction of pertinent information from multi-modal inputs and its progressive convergence towards optimal grasping strategies.
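A brief sketch of this training setup is shown below; the dummy tensor shapes stand in for the actual Cornell samples, and applying Xavier initialization per layer type is an assumed but typical realization of the description above.

```python
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, random_split

# Placeholder dataset standing in for Cornell samples (shapes illustrative),
# split 9:1 into training and validation sets.
dataset = TensorDataset(torch.randn(100, 4, 32, 32), torch.randn(100, 3, 32, 32))
n_train = int(0.9 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

def init_weights(module: nn.Module) -> None:
    # Xavier initialization keeps the output variance of each layer in line
    # with that of its inputs, stabilizing early training.
    if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.xavier_uniform_(module.weight)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

if __name__ == "__main__":
    model = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.ReLU())
    model.apply(init_weights)              # applied before training starts
    print(len(train_set), len(val_set))    # 90 10
```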
Figure 12 illustrates the overall loss trends for the training and validation sets. As depicted, beyond 30 iterations, the training set’s convergence rate slows down, while the validation set’s loss starts to stabilize. This observation indicates that although Grasp-DSC continues to enhance its performance on the training data, there is a growing risk of overfitting. Therefore, timely adjustments to hyperparameters or the implementation of regularization techniques are necessary to optimize performance on the validation set.

4.4. Results Comparison

To optimize the balance between gripping speed and accuracy of the mobile robotic arm, Grasp-DSC incorporates the DRSN and DSC into the GG-CNN 2 framework. To assess the performance of different modules combined with the GG-CNN 2, this study conducted ablation experiments using the Cornell Grasp Dataset. The experimental outcomes are summarized in Table 1.
The DRSN architecture incorporates residual connections, which enable deeper feature extraction and consequently enhance accuracy. However, this improvement in accuracy is accompanied by an increase in the computational cost and processing time. Conversely, the DSC module is designed to mitigate the computational burden by employing depthwise separable convolutions, thereby accelerating model inference at the cost of a slight reduction in accuracy. The trade-off between these two strategies, as demonstrated in Table 1, reflects the critical balance between maximizing accuracy and maintaining computational efficiency, a consideration that is particularly important for real-time applications in mobile robotic arms.
Compared to the GG-CNN 2, the model incorporating the DRSN demonstrates a significant improvement in accuracy, achieving a 33.9% increase. However, its relatively intricate structure leads to a slower processing speed of 35.1 ms. In contrast, the model integrating DSC alongside the GG-CNN 2 benefits from light-weight parallel convolutional processing, resulting in a faster speed of 9.7 ms, albeit with a lower accuracy improvement of 24.2% compared to the DRSN-integrated model. Integrating both the DRSN and DSC networks into the GG-CNN 2 architecture effectively harnesses their respective strengths. Therefore, this paper introduces Grasp-DSC, which integrates all three networks to mitigate the performance trade-offs observed in light-weight models due to parameter reduction. Experimental results demonstrate an accuracy of 96.6% and a processing speed of 14.4 ms, highlighting a notable enhancement in overall performance compared to the GG-CNN 2.
To enable a more holistic assessment of the model’s performance, this study introduces the Final Score equation, which integrates both speed and accuracy into a unified metric. The Final Score equation functions as a comparative tool, grounded in the principles of multi-criteria decision-making (MCDM), which is widely utilized in performance evaluations of robotic and optimization tasks [34]. The weights assigned in the Final Score equation are primarily contingent on the specific application scenario under consideration, derived empirically through experimental analysis. In the context of mobile platforms, speed serves as a critical indicator of the algorithm’s efficiency, particularly in terms of its light-weight nature when deployed on platforms with comparable computational capabilities. A light-weight algorithm facilitates faster processing, thereby reducing task completion time and enhancing operational efficiency. In robotic arm grasping tasks, many operations necessitate rapid object recognition and execution within a constrained time frame. When the system exhibits slow response times, real-time responsiveness is degraded, and the advantages of high accuracy are often insufficient to offset the grasping failures induced by processing delays [15]. Thus, although accuracy is a pivotal factor in grasping success, speed must not be disregarded. The 60% accuracy and 40% speed weighting strikes an optimal balance between efficient execution and the preservation of accuracy, where the 40% speed component encapsulates the system’s sensitivity to computational delays. This weighting ensures that the system maintains sufficient flexibility and responsiveness, making it suitable for real-time tasks in dynamic and complex environments. The Final Score formula is as follows:
$$\mathrm{Final\ Score} = 60\% \times \mathrm{Accuracy} + 40\% \times \frac{1}{\mathrm{Speed}}$$
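For illustration, the score can be computed directly from the reported Cornell results (96.6% accuracy, 14.4 ms); whether the paper additionally normalizes the speed term before weighting is not stated, so the snippet below applies the equation literally.

```python
def final_score(accuracy: float, speed_ms: float) -> float:
    # 60% weight on accuracy, 40% on the reciprocal of inference time.
    return 0.6 * accuracy + 0.4 * (1.0 / speed_ms)

# Example with Grasp-DSC's reported Cornell results (96.6%, 14.4 ms)
print(final_score(0.966, 14.4))   # ~0.607
```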
This paper initially evaluates the performance of Grasp-DSC on the Cornell Grasp Dataset. Released prior to the Jacquard Grasp Dataset, the Cornell Grasp Dataset has undergone extensive scrutiny with various algorithms, facilitating comparisons across a broad spectrum of grasp detection methods. This paper juxtaposes Grasp-DSC against prominent algorithms from the past five years, including Multi-grasp, GraspNet, ROI-GD, GRPN, Multilevel CNNs, GK-Net, GPWRG, GGS-CNN, Confinet + BEM, FANet-CPU, UPG, and Faster R-CNN + CBAM. The comparison specifically centers on detecting grasping rectangles for identical objects, depicted in Figure 13.
To verify the light-weight nature of Grasp-DSC, the inference time per single image serves as a metric for speed comparison. Faster inference times indicate greater light-weight efficiency, stronger real-time capabilities, and better suitability for the real-time grasping tasks of mobile robotic arms. The comparative outcomes are presented in Table 2.
As illustrated in Figure 14 and Figure 15, it is evident that the top three algorithms in terms of accuracy are GK-Net, Grasp-DSC, and Multilevel CNNs, while GPWRG, Grasp-DSC, and Confinet + BEM lead in terms of speed. Regarding accuracy, Grasp-DSC closely follows GK-Net by a margin of 0.3%, yet it boasts a significant speed advantage, being 27.3 ms faster than GK-Net. In terms of speed, Grasp-DSC trails the top-performing GPWRG by only 1.5 ms, while surpassing GPWRG in accuracy by 2.2%. Consequently, Grasp-DSC secures the second position both in accuracy and speed among the 15 compared algorithms. However, it excels in achieving the best balance between speed and accuracy, demonstrating its capability for effective real-time grasping performance. As shown in Table 2, Grasp-DSC ranks first in the Final Score.
Due to the relatively limited size of the Cornell Grasp Dataset, comparative experiments were conducted to assess the generalization ability of the Grasp-DSC model on the Jacquard Grasp Dataset. The findings are detailed in Table 3.
Table 3 illustrates that Grasp-DSC achieves an accuracy of 92.2% on the Jacquard Grasp Dataset, showcasing superior performance relative to other algorithms. This reinforces Grasp-DSC’s proficiency in handling larger datasets, underscoring its adaptability and reliability.
From the comprehensive comparison results provided earlier, it is clear that Grasp-DSC consistently delivers outstanding accuracy alongside efficient grasping speeds. This balanced blend of speed and precision underscores its effectiveness in practical applications.

5. Experiments and Results in Real-World Scenarios

To validate the applicability of Grasp-DSC in real-world scenarios, grasping experiments were conducted using the MR2000-UR5 mobile robotic arm (the UR5 robotic arm is manufactured by Universal Robots, based in Odense, Denmark; the MR2000 mobile platform is manufactured by Mobile Robots, based in Odense, Denmark). The setup for grasping with the MR2000-UR5 mobile robotic arm is depicted in Figure 16.
The MR2000-UR5 mobile robotic arm comprises three main components: the MR2000 mobile base, the navigation module, and the robot arm module. The robot arm module includes the UR5 robot arm, the robot arm controller, an Intel RealSense D435i camera (Intel Corporation, Santa Clara, CA, USA), and a gripper. The process of conducting grasping experiments with the MR2000-UR5 is illustrated in Figure 17. During the experiment, the robot arm first identifies the target position. Upon initiating the grasping node, the robot arm moves from its initial position to a predefined home state where the depth camera is positioned perpendicularly to the grasping surface. Images captured by the depth camera undergo cropping and normalization within a specified region (300 × 300) before being input into Grasp-DSC for analysis. Utilizing the sigmoid function, Grasp-DSC computes and outputs grasping decision parameters, which include the high-confidence coordinates of the grasping point, the optimal gripper rotation angle, and the appropriate grasping width.
The grasping process commences with Grasp-DSC predicting the grasping point in the pixel coordinate system. Subsequently, the depth value of this point in the depth image is retrieved. Using the camera’s intrinsic parameters and hand–eye calibration, the depth value is then converted into the grasping pose within the world coordinate system and packaged into a message sent to the robot arm controller. Upon receiving the message, the robot arm utilizes the MoveIt! plugin for motion planning to guide the gripper center to the predicted optimal grasping point. It then executes the grasping operation to securely grasp the object. Finally, the robot arm moves to place the grasped object at a designated location before returning to the home position in preparation for the subsequent grasping task. This process ensures the robot arm efficiently and accurately performs grasping tasks. The hardware and communication workflow is summarized in the following steps, with a simplified sketch of the pipeline given after the list:
(1)
The RGB-D camera (Intel RealSense D435i) connects to the laptop using a Type-C data cable, while the laptop interfaces with the robot arm controller via an Ethernet cable.
(2)
The Intel RealSense D435i camera captures RGB and depth images of the object intended for grasping.
(3)
These captured images are fed into the Grasp-DSC network running on the laptop to determine a grasping rectangle.
(4)
Subsequently, the laptop transmits the identified grasping parameters to the UR5 robot arm controller.
(5)
Upon receiving the signal, the UR5 robot arm controller processes the command and directs the UR5 robot arm to execute the grasping motion.
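The following sketch summarizes this capture–predict–execute loop in simplified form. The camera frames and the three Grasp-DSC output maps are replaced by placeholder arrays, and the conversion to the robot frame and the MoveIt! execution are only indicated in comments; function names and image sizes other than the 300 × 300 crop are illustrative.

```python
import numpy as np

CROP = 300  # input region used by Grasp-DSC (300 x 300)

def crop_and_normalize(rgb, depth, size=CROP):
    """Center-crop both images to size x size and scale depth to [0, 1]."""
    h, w = depth.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    rgb_c = rgb[top:top + size, left:left + size]
    depth_c = depth[top:top + size, left:left + size]
    depth_c = (depth_c - depth_c.min()) / (depth_c.max() - depth_c.min() + 1e-6)
    return rgb_c, depth_c

def select_grasp(quality, angle, width):
    """Pick the pixel with the highest grasp-quality score."""
    v, u = np.unravel_index(np.argmax(quality), quality.shape)
    return (u, v), angle[v, u], width[v, u]

if __name__ == "__main__":
    # Placeholder data standing in for a RealSense capture and the
    # three Grasp-DSC output maps (quality, angle, width).
    rgb = np.zeros((480, 640, 3), dtype=np.uint8)
    depth = np.random.rand(480, 640).astype(np.float32)
    rgb_c, depth_c = crop_and_normalize(rgb, depth)

    quality = np.random.rand(CROP, CROP)
    angle = np.random.uniform(-np.pi / 2, np.pi / 2, (CROP, CROP))
    width = np.random.rand(CROP, CROP)
    (u, v), ang, w = select_grasp(quality, angle, width)
    print(f"grasp pixel=({u}, {v}), angle={ang:.2f} rad, width={w:.2f}")
    # The (u, v) point would then be converted to the robot frame
    # (Equation (8)) and sent to the UR5 controller via MoveIt!.
```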
In this grasping experiment, we compared Grasp-DSC with two models: UPG [44] and Faster R-CNN + CBAM [45], which were proposed in 2023 and 2024, respectively. These models represent the latest advancements in robotic arm grasping pose estimation. Initially, we trained these models and subsequently evaluated their performance in both single-object and multi-object grasping scenarios.
First, single-object grasping experiments were conducted where RGB-D camera images of the target object were captured to obtain both RGB and depth images. These images were subsequently fed into the three algorithms: Grasp-DSC, UPG, and Faster R-CNN + CBAM. The algorithms processed the images to generate parameters such as grasping quality, grasping angle, and grasping width. Utilizing specific formulas and calculations, each algorithm determined the correct grasping rectangle, depicted in Figure 18.
In the single-object grasping experiment, we conducted 100 grasping attempts using different algorithm models, and the results are summarized in Table 4. The experiments confirm that Grasp-DSC can effectively perform grasping tasks in real-world environments, achieving an accuracy of 98% for single-object grasping. Furthermore, Grasp-DSC demonstrates efficient performance with an average grasping time of 4.08 s per attempt. In terms of the Final Score ranking, Grasp-DSC takes first place, followed by UPG in second, and Faster R-CNN + CBAM in third.
To further evaluate the performance of Grasp-DSC in complex scenarios, we conducted a multi-object grasping experiment involving 100 grasping attempts using various algorithm models. The process of grasping is illustrated in Figure 19.
Table 5 compares the results of three algorithm models in the multi-object grasping experiment. It is apparent from the table that Grasp-DSC achieves a relatively low accuracy in multi-object grasping compared to UPG. Since UPG [44] is specifically designed to tackle the grasping challenges presented by multi-object scenes, it demonstrates excellent performance in multi-object grasping experiments. Despite Grasp-DSC’s lower accuracy compared to UPG, it achieves a respectable accuracy rate of 94% and completes grasping tasks faster than UPG. This underscores Grasp-DSC’s capability to achieve effective results in multi-object grasping scenarios. In terms of the Final Score ranking, Grasp-DSC takes first place, followed by UPG in second, and Faster R-CNN + CBAM in third.

6. Conclusions

In this paper, we introduce Grasp-DSC, a light-weight grasping pose estimation method designed for mobile robotic arms. Implemented on the GG-CNN 2 and integrating the DRSN and DSC networks, Grasp-DSC utilizes RGB images and depth images as input to output parameters including grasping quality, grasping angle, and gripper width for the target object, along with generating pixel-level grasping rectangles. This paper examines Grasp-DSC training and testing using the Cornell and Jacquard Grasp Datasets. Comparative studies are conducted to illustrate its performance. Grasp-DSC’s light-weight design reduces computational costs and improves speed compared to major grasping algorithms developed in the last five years. Experimental findings indicate that Grasp-DSC achieves a harmonious balance between grasping speed and accuracy. In grasping experiments conducted with the MR2000-UR5 mobile robotic arm, Grasp-DSC demonstrates robust generalization across various scenarios, underscoring its practical applicability for executing grasping tasks in real-world environments. Consequently, the light-weight grasping capability of Grasp-DSC makes it well-suited for deployment on mobile devices with constrained computational resources. Although our method demonstrates an effective balance between grasping speed and accuracy in real-world scenarios, the current experimental setting remains within a controlled environment, which limits its generalizability. In subsequent research, we intend to deploy Grasp-DSC in dynamic environments to conduct more challenging real-world experiments, thereby further validating its applicability and robustness.

Author Contributions

Conceptualization, C.Y.; methodology, J.D. and C.Y.; software, Q.Z. and C.Y.; validation, Q.W. and C.Y.; formal analysis, J.D. and C.Y.; investigation, J.D. and C.Y.; resources, C.Y.; data curation, C.Y.; writing—original draft preparation, C.Y.; writing—review and editing, C.Y.; visualization, C.Y., Q.W. and Q.Z.; supervision, J.D.; project administration, C.Y. All authors have read and agreed to the published version of the manuscript.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Data Availability Statement

The data presented in this study are available on request from the corresponding author due to laboratory research privacy policy.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zou, M.; Li, X.; Yuan, Q.; Xiong, T.; Zhang, Y.; Han, J.; Xiao, Z. Robotic grasp detection network based on improved deformable convolution and spatial feature center mechanism. Biomimetics 2023, 8, 403. [Google Scholar] [CrossRef]
  2. Fang, H.; Gou, M.; Wang, C.; Lu, C. Robust grasping across diverse sensor qualities: The GraspNet-1Billion dataset. Int. J. Robot. Res. 2023, 42, 1094–1103. [Google Scholar] [CrossRef]
  3. Wang, S.; Jiang, X.; Zhao, J.; Wang, X.; Zhou, W.; Liu, Y. Efficient fully convolution neural network for generating pixel wise robotic grasps with high resolution images. In Proceedings of the 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), Dali, China, 6–8 December 2019; pp. 474–480. [Google Scholar] [CrossRef]
  4. William, A.; Christopher, X.; Aaron, W.; Caelen, W.; Pedro, D.; Siddhartha, S. Amodal 3D reconstruction for robotic manipulation via stability and connectivity. In Proceedings of the 4th Conference on Robot Learning, (CoRL 2020), Cambridge, MA, USA, 16–18 November 2020; pp. 1498–1508. [Google Scholar] [CrossRef]
  5. Wang, C.; Zang, X.; Zhang, X.; Liu, Y.; Zhao, J. Parameter estimation and object gripping based on fingertip force/torque sensors. Measurement 2021, 179, 109479. [Google Scholar] [CrossRef]
  6. Hong, Q.; Yang, L.; Zeng, B. RANET: A grasp generative residual attention network for robotic grasping detection. Int. J. Control Autom. Syst. 2022, 20, 3996–4004. [Google Scholar] [CrossRef]
  7. Zhai, D.; Yu, S.; Xia, Y. FANet: Fast and accurate robotic grasp detection based on keypoints. IEEE Trans. Autom. Sci. Eng. 2024, 21, 2974–2986. [Google Scholar] [CrossRef]
  8. Sulabh, K.; Shirin, J.; Ferat, S. GR-ConvNet v2: A real-time multi-grasp detection network for robotic grasping. Sensors 2022, 22, 6208. [Google Scholar] [CrossRef] [PubMed]
  9. Huang, Z.; Wang, L.; An, Q.; Zhou, Q.; Hong, H. Learning a contrast enhancer for intensity correction of remotely sensed images. IEEE Signal Process. Lett. 2022, 29, 394–398. [Google Scholar] [CrossRef]
  10. Yan, M.; Adrian, L.; Mrinal, K.; Peter, P. Learning probabilistic multi-modal actor models for vision-based robotic grasping. In Proceedings of the 2019 International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 4804–4810. [Google Scholar] [CrossRef]
  11. Antanas, L.; Moreno, P.; Neumann, M.; de Figueiredo, R.P.; Kersting, K.; De Raedt, L. Semantic and geometric reasoning for robotic grasping: A probabilistic logic approach. Auton. Robot. 2019, 43, 1393–1418. [Google Scholar] [CrossRef]
  12. Zhang, H.; Lan, X.; Bai, S.; Zhou, X.; Tian, Z.; Zheng, N. ROI-based robotic grasp detection for object overlapping scenes. In Proceedings of the International Conference on Intelligent Robots and Systems, Macau, China, 3–8 November 2019; pp. 4768–4775. [Google Scholar] [CrossRef]
  13. Douglas, M.; Peter, C.; Jürgen, L. Learning robust, real-time, reactive robotic grasping. Int. J. Robot. Res. 2020, 39, 183–201. [Google Scholar] [CrossRef]
  14. Sulabh, K.; Shirin, J.; Ferat, S. Antipodal robotic grasping using generative residual convolutional neural network. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 9626–9633. [Google Scholar] [CrossRef]
  15. Wang, D.; Liu, C.; Chang, F.; Li, N.; Li, G. High-performance pixel-level grasp detection based on adaptive grasping and grasp-aware network. IEEE Trans. Ind. Electron. 2022, 69, 1161–11621. [Google Scholar] [CrossRef]
  16. Yu, S.; Zhai, D.; Xia, Y.; Wu, H.; Liao, J. SE-ResUNet: A novel robotic grasp detection method. IEEE Robot. Autom. Lett. 2022, 7, 5238–5245. [Google Scholar] [CrossRef]
  17. Priya, S.; Nilotpal, P.; Deepesh, M.G.C. Nandi Generative model based robotic grasp pose prediction with limited dataset. Appl. Intell. 2022, 52, 9952–9966. [Google Scholar] [CrossRef]
  18. Fang, H.; Wang, C.; Fang, H.; Gou, M.; Yan, H.; Liu, W.; Xie, Y.; Lu, C. AnyGrasp: Robust and efficient grasp perception in spatial and temporal domains. IEEE Trans. Robot. 2023, 39, 1552–3098. [Google Scholar] [CrossRef]
  19. Fang, H.; Wang, C.; Gou, M.; Lu, C. GraspNet-1Billion: A large-scale benchmark for general object grasping. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11441–11450. [Google Scholar] [CrossRef]
  20. Chen, H.; Guo, W.; Kang, K.; Hu, G. Automatic modulation recognition method based on phase transformation and deep residual shrinkage network. Electronics 2024, 13, 2141. [Google Scholar] [CrossRef]
  21. Su, Z.; Yu, J.; Xiao, X.; Wang, J.; Wang, X. Deep learning seismic damage assessment with embedded signal denoising considering three-dimensional time-frequency feature correlation. Eng. Struct. 2023, 286, 116148. [Google Scholar] [CrossRef]
  22. Wang, H.; Wang, J.; Xu, H.; Sun, Y.; Yu, Z. DRSNFuse: Deep residual shrinkage network for infrared and visible image fusion. Sensors 2022, 22, 5149. [Google Scholar] [CrossRef] [PubMed]
  23. Zhang, X.; Wang, Z.; Liu, D.; Sun, Q.; Wang, J. Rock thin section image classification based on depth residuals shrinkage network and attention mechanism. Earth Sci. Inform. 2023, 16, 1449–1457. [Google Scholar] [CrossRef]
  24. François, C. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  25. Liu, Z.; Liu, Q.; Yan, S.; Ray, C.C.C. An efficient FPGA-based depthwise separable convolutional neural network accelerator with hardware pruning. ACM Trans. Reconfigurable Technol. Syst. 2024, 17, 1–20. [Google Scholar] [CrossRef]
  26. Li, S.; Liu, Z.; Gao, M.; Bai, Y.; Yin, H. MDSCN: Multiscale depthwise separable convolutional network for underwater graphics restoration. Vis. Comput. 2024. [Google Scholar] [CrossRef]
  27. Yi, S.; Mei, Z.; Kamen, I.; Mei, Z.; He, T.; Zeng, H. Gait-based identification using wearable multimodal sensing and attention neural networks. Sens. Actuators A Phys. 2024, 374, 115478. [Google Scholar] [CrossRef]
  28. Li, L.; Qin, S.; Yang, N.; Hong, L.; Dai, Y.; Wang, Z. LVNet: A lightweight volumetric convolutional neural network for real-time and high-performance recognition of 3D objects. Multimed. Tools Appl. 2024, 83, 61047–61063. [Google Scholar] [CrossRef]
  29. Zhao, M.; Zhong, S.; Fu, X.; Tang, B.; Pecht, M. Deep residual shrinkage networks for fault diagnosis. IEEE Trans. Ind. Inform. 2020, 16, 4681–4690. [Google Scholar] [CrossRef]
  30. Sifre, L.; Mallat, S. Rigid-motion scattering for image classification. arXiv 2014, arXiv:1403.1687. [Google Scholar] [CrossRef]
  31. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  32. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724. [Google Scholar] [CrossRef]
  33. Depierre, A.; Dellandréa, E.; Chen, L. Jacquard: A large scale dataset for robotic grasp detection. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3511–3516. [Google Scholar] [CrossRef]
  34. Zhao, J.; Sui, Y.; Xu, Y.; Lai, K. Industrial robot selection using a multiple criteria group decision making method with individual preferences. PLoS ONE 2021, 16, e0259354. [Google Scholar] [CrossRef] [PubMed]
  35. Chen, L.; Huang, P.; Meng, Z. Convolutional multi-grasp detection using grasp path for RGBD images. Robot. Auton. Syst. 2019, 113, 94–103. [Google Scholar] [CrossRef]
  36. Asif, U.; Tang, J.; Harrer, S. GraspNet: An efficient convolutional neural network for real-time grasp detection for low-powered devices. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI), Stockholm, Sweden, 13–19 July 2018; pp. 4875–4882. [Google Scholar] [CrossRef]
  37. Karaoguz, H.; Jensfelt, P. Object detection approach for robot grasp detection. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 4953–4959. [Google Scholar] [CrossRef]
  38. Shao, Z.; Qu, Y.; Ren, G.; Wang, G.; Guan, Y.; Shi, Z.; Tan, J. Batch normalization masked sparse autoencoder for robotic grasping detection. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 27 October 2020; pp. 9614–9619. [Google Scholar] [CrossRef]
  39. Yu, Q.; Shang, W.; Zhao, Z.; Cong, S.; Li, Z. Robotic grasping of unknown objects using novel multilevel convolutional neural networks: From parallel gripper to dexterous hand. IEEE Trans. Autom. Sci. Eng. 2021, 18, 1730–1741. [Google Scholar] [CrossRef]
  40. Xu, R.; Chu, F.; Vela, P.A. GKNet: Grasp keypoint network for grasp candidates detection. arXiv 2021, arXiv:2106.08497. [Google Scholar] [CrossRef]
  41. Hu, W.; Wang, C.; Liu, F.; Peng, X.; Sun, P.; Tan, J. A grasps-generation-and-selection convolutional neural network for a digital twin of intelligent robotic grasping. Robot. Comput.-Integr. Manuf. 2022, 77, 102371. [Google Scholar] [CrossRef]
  42. Song, Y.; Wen, J.; Liu, D.; Yu, C. Deep robotic grasping prediction with hierarchical RGB-D fusion. Int. J. Control Autom. Syst. 2022, 20, 243–254. [Google Scholar] [CrossRef]
  43. Yu, Y.; Cao, Z.; Liu, Z.; Geng, W.; Yu, J.; Zhang, W. A two-stream CNN with simultaneous detection and segmentation for robotic grasping. IEEE Trans. Syst. Man Cybern. Syst. 2022, 52, 1167–1181. [Google Scholar] [CrossRef]
  44. Li, X.; Zhang, X.; Zhou, X.; Chen, I. UPG: 3D vision-based prediction framework for robotic grasping in multi-object scenes. Knowl.-Based Syst. 2023, 270, 110491. [Google Scholar] [CrossRef]
  45. Duan, J.; Zhuang, L.; Zhang, Q.; Qin, J.; Zhou, Y. Vision-based robotic grasping using faster R-CNN-GRCNN dual-layer detection mechanism. Proc. Inst. Mech. Eng. Part B J. Eng. Manuf. 2024. [Google Scholar] [CrossRef]
  46. Depierre, A.; Dellandréa, E.; Chen, L. Scoring graspability based on grasp regression for better grasp prediction. In Proceedings of the 2021 International Conference on Robotics and Automation (ICRA), Xi'an, China, 30 May–5 June 2021; pp. 4370–4376. [Google Scholar] [CrossRef]
Figure 1. Network architecture comparison of robotic grasping systems (The left column depicts GG-CNN, the middle column shows GG-CNN 2, an improved variant of GG-CNN, and the right column presents Grasp-DSC, a light-weight grasping pose estimation model proposed in this paper for mobile robotic arms).
Figure 2. The architecture of Grasp-DSC. (Firstly, each image undergoes normalization and data augmentation preprocessing. Next, the preprocessed images are input into the residual units of the DRSN for deep feature extraction and upsampling. Finally, based on the network outputs, the loss function is computed to determine the grasping pose parameters).
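To make the pipeline summarized in Figure 2 easier to follow, the snippet below gives a minimal PyTorch-style sketch of the two steps that bracket the network: a per-head regression loss over the predicted grasp maps and the decoding of a grasping pose from those maps. It assumes a GG-CNN-style output of four maps (quality, cos 2θ, sin 2θ, and width), which is consistent with the two angle-loss curves in Figure 11; the function and key names are illustrative and are not taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def grasp_loss(pred, target):
    """Sum of per-head regression losses over the quality (pos), cos(2θ),
    sin(2θ), and width maps; the head names are illustrative."""
    return sum(F.smooth_l1_loss(pred[k], target[k]) for k in ("pos", "cos", "sin", "width"))

def decode_grasp(pos_map, cos_map, sin_map, width_map):
    """Pick the pixel with the highest predicted grasp quality and decode
    (u, v, theta, width); each map is a tensor of shape (1, 1, H, W)."""
    idx = torch.argmax(pos_map.view(-1))
    v, u = divmod(idx.item(), pos_map.shape[-1])   # row (v) and column (u) of the best pixel
    theta = 0.5 * torch.atan2(sin_map.view(-1)[idx], cos_map.view(-1)[idx])
    width = width_map.view(-1)[idx]
    return u, v, theta.item(), width.item()
```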
Figure 3. The architecture of the DRSN (The green dashed box outlines the overall framework, the blue dashed box represents each repeated residual shrinkage module, and the yellow dashed box indicates the thresholding process).
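As a complement to Figure 3, the following is a minimal PyTorch sketch of a residual shrinkage unit in the channel-wise style of Zhao et al. [29]: a small fully connected attention branch turns the average absolute feature value of each channel into a learned soft threshold, which suppresses low-magnitude, noise-like activations before the identity shortcut is added. The layer sizes and block layout are illustrative assumptions, not the configuration used in Grasp-DSC.

```python
import torch
import torch.nn as nn

class ResidualShrinkageBlock(nn.Module):
    """Residual unit with attention-driven channel-wise soft thresholding."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Attention branch: maps per-channel statistics to a scaling factor in (0, 1).
        self.attn = nn.Sequential(
            nn.Linear(channels, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        f = self.body(x)
        abs_mean = f.abs().mean(dim=(2, 3))                       # (N, C) average magnitude per channel
        tau = (abs_mean * self.attn(abs_mean)).unsqueeze(-1).unsqueeze(-1)  # learned thresholds
        f = torch.sign(f) * torch.clamp(f.abs() - tau, min=0.0)   # soft thresholding
        return x + f                                              # identity shortcut
```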
Figure 4. Two different CNN structures.
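The efficiency gap between the two structures compared in Figure 4 is usually summarized with the standard cost analysis from the MobileNets literature [31]; the symbols below follow that reference (kernel size D_K, feature-map size D_F, M input channels, N output channels) and are not values reported in this paper.

```latex
% Multiply-accumulate cost of a standard convolution vs. a depthwise separable one (cf. [31]).
C_{\mathrm{std}} = D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F , \qquad
C_{\mathrm{dsc}} = D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F ,
\qquad
\frac{C_{\mathrm{dsc}}}{C_{\mathrm{std}}} = \frac{1}{N} + \frac{1}{D_K^{2}} .
```

For a 3 × 3 kernel this ratio is roughly 1/9 plus a small channel-dependent term, i.e., about an 8–9× reduction in computation, which is the usual motivation for replacing standard convolutions with DSC in light-weight models.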
Figure 5. The architecture of DSC. (Each channel is processed individually before linear combination).
Figure 6. The PW process of DSC. (Altering the depth of feature maps via linear combination).
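Figures 5 and 6 together describe the depthwise (DW) filtering stage and the pointwise (PW) 1 × 1 combination stage of DSC. A minimal PyTorch sketch of such a block is shown below; the channel counts and kernel size are placeholders rather than the values used in Grasp-DSC.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise filtering of each channel followed by a pointwise 1x1 convolution
    that linearly recombines channels (cf. Figures 5 and 6)."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        # groups=in_ch gives one spatial filter per input channel (the DW stage).
        self.dw = nn.Conv2d(in_ch, in_ch, kernel_size,
                            padding=kernel_size // 2, groups=in_ch)
        # The 1x1 convolution mixes channels and sets the output depth (the PW stage).
        self.pw = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pw(self.dw(x))
```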
Figure 7. Light-weight parallel DSC architecture (Extracting features at different scales using parallel multi-scale convolutional kernels).
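The light-weight parallel structure in Figure 7 can be sketched by running several DSC branches with different kernel sizes side by side and concatenating their outputs along the channel axis. The kernel sizes (3, 5, 7) and channel widths below are illustrative assumptions, and the sketch reuses the DepthwiseSeparableConv class defined above.

```python
import torch
import torch.nn as nn

class ParallelDSC(nn.Module):
    """Parallel multi-scale depthwise separable branches whose outputs are concatenated."""
    def __init__(self, in_ch, branch_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [DepthwiseSeparableConv(in_ch, branch_ch, k) for k in kernel_sizes]
        )

    def forward(self, x):
        # Each branch extracts features at its own receptive-field scale.
        return torch.cat([branch(x) for branch in self.branches], dim=1)
```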
Figure 8. Examples from the Cornell Grasp Dataset.
Figure 9. Examples from the Jacquard Grasp Dataset.
Figure 10. Coordinate system for evaluation metrics (The green box marks the depth and RGB images captured by the depth camera).
Figure 11. Training loss variation in different predicted parameters (Panels (a,b) show the training loss curves for the grasping angle, (c) for the grasping quality, and (d) for the grasping width).
Figure 12. Overall loss trends.
Figure 13. Comparison of grasping rectangles predicted by different algorithms on the Cornell Grasp Dataset.
Figure 14. Comparison of grasping success rates of different algorithms on the Cornell Grasp Dataset.
Figure 15. Comparison of speed of different algorithms on the Cornell Grasp Dataset. (Note: since the speed of GRPN is 189.6 ms, which differs significantly from that of the other algorithms, it is excluded from the chart).
Figure 16. MR2000-UR5.
Figure 17. Experimental process of grasping with MR2000-UR5 mobile robotic arm.
Figure 18. Comparison of algorithm results from single-object experiment.
Figure 19. Process of multi-object grasping experiment.
Table 1. Ablation study of Grasp-DSC.
Algorithm | Accuracy (%) | Speed (ms)
GG-CNN 2 [13] | 65 | 20
GG-CNN 2 + DRSN | 98.9 | 35.1
GG-CNN 2 + DSC | 89.2 | 9.7
GG-CNN 2 + DRSN + DSC | 96.6 | 14.4
Table 2. Comparison of results on the Cornell Grasp Dataset.
Authors | Algorithm | Accuracy (%) | Speed (ms) | Final Score
Chen et al. [35] | Multi-grasp | 86.4 | 25.7 | 53.4
Asif et al. [36] | GraspNet | 90.5 | 24.4 | 55.9
Zhang et al. [12] | ROI-GD | 92.3 | 40.1 | 56.4
Karaoguz et al. [37] | GRPN | 88.7 | 189.6 | 53.4
Shao et al. [38] | SAE + BN + SAE | 95.5 | - | -
Yu et al. [39] | Multilevel CNNs | 95.3 | 6.5 | 58.6
Xu et al. [40] | GK-Net | 96.9 | 41.7 | 59.1
Wang et al. [3] | GPWRG | 94.4 | 12.9 | 59.7
Hu et al. [41] | GGS-CNN | 95.5 | 43.5 | 58.2
Song et al. [42] | Confinet + BEM | 92.3 | 18.7 | 57.5
Yu et al. [43] | TsGNet | 93.1 | - | -
Zhai et al. [7] | FANet-CPU | 95.3 | 25.6 | 58.7
Li et al. [44] | UPG | 93.7 | 21.1 | 58.1
Duan et al. [45] | Faster R-CNN + CBAM | 94.6 | 42.9 | 57.7
Ours | Grasp-DSC | 96.6 | 14.4 | 60.7
Table 3. Comparison of results on the Jacquard Grasp Dataset.
Authors | Algorithm | Accuracy (%)
Morrison et al. [13] | GG-CNN 2 | 84
Depierre et al. [46] | Grasp Regression | 85.7
Kumra et al. [14] | GR-ConvNet | 89.5
Kumra et al. [8] | GR-ConvNet v2 | 91.4
Ours | Grasp-DSC | 92.2
Table 4. Comparison of results from single-object grasping experiment.
Algorithm | Physical Grasp | Accuracy (%) | Time (s) | Final Score
UPG [44] | 91/100 | 91 | 4.83 | 8.8
Faster R-CNN + CBAM [45] | 94/100 | 94 | 4.96 | 8.6
Grasp-DSC | 98/100 | 98 | 4.08 | 10.4
Time indicates the average time taken by the mobile robotic arm to complete one grasping task, from leaving the home position until returning to it.
Table 5. Comparison of results from multi-object grasping experiment.
Algorithm | Physical Grasp | Accuracy (%) | Time (s) | Final Score
UPG [44] | 96/100 | 96 | 4.83 | 8.9
Faster R-CNN + CBAM [45] | 90/100 | 90 | 4.98 | 8.6
Grasp-DSC | 94/100 | 94 | 4.09 | 10.3
Time indicates the average time taken by the mobile robotic arm to complete one grasping task, from leaving the home position until returning to it.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
