Article

A Lightweight Hand Attitude Estimation Method Based on GCN Feature Enhancement

1 School of Architecture, Tianjin University, Tianjin 300073, China
2 China Construction Engineering Design & Research Institute Co., Ltd., Beijing 100037, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(22), 4424; https://doi.org/10.3390/electronics13224424
Submission received: 30 September 2024 / Revised: 3 November 2024 / Accepted: 4 November 2024 / Published: 12 November 2024

Abstract

In this study, a hand pose estimation method based on GCN feature enhancement is proposed to address the time-consuming nature of existing methods and their neglect of the internal relationships between hand joint points, which results in low estimation accuracy. Firstly, the lightweight feature extraction network RexNet is used, with depthwise separable convolutions replacing ordinary convolutions to reduce the model parameters and computational complexity. Secondly, deconvolution is added to the backend of the network to obtain preliminary estimates of the joint points. Finally, a GCN feature enhancement module refines the preliminary estimates to improve the accuracy of hand pose estimation. The proposed method is tested for accuracy on the CMU-Hand and RHD datasets. The results show that it achieves an AUC of 80.1% on the CMU-Hand dataset and 97.0% on the RHD dataset, demonstrating high hand pose estimation accuracy.

1. Introduction

With the development of computer technology, new and convenient methods of interaction between users and computers are receiving increasing attention. Gestures, as one of the main human–computer interaction modalities in natural user interfaces, carry fine-grained information about human actions: they can not only handle more delicate tasks but also convey rich and diverse information [1]. Based on the coordinate information of hand joints, gestures required in business scenarios can be flexibly defined to achieve an immersive interactive experience; this is widely used in virtual reality, automotive user interfaces, biomedicine, and other fields [2].
Hand posture estimation refers to accurately locating the positions of hand joint points in videos or images and inferring the corresponding hand posture from their positional relationships. Existing approaches can be roughly divided into wearable-device-based methods [3] and computer-vision-based methods. Wearable-device-based methods extract hand joint information through auxiliary tools [4] such as data gloves and colored gloves. Although their results are relatively accurate, the hardware is expensive and long-term wear is uncomfortable, which runs against the concept of natural interaction.
Computer-vision-based pose estimation methods can be divided into traditional machine-learning-based methods [5] and deep-learning-based methods. Traditional machine-learning-based methods mainly include filter-based methods [6], random-forest-based methods [7], and random forest variants [8]. Li et al. [9] proposed a head pose estimation method based on Kalman filtering and random forest: Kinect, a 3D somatosensory camera developed by Microsoft that integrates a depth sensor, a color camera, and other components, acquires scene depth information in a non-contact manner to obtain the target depth map; a Kalman filter then predicts the head's position region in the depth map; and finally, random forest estimates the head pose from depth blocks sampled within the predicted region. Owing to the limitations of hand-crafted features, traditional machine-learning-based methods struggle to outperform deep-learning-based pose estimation methods.
Deep-learning-based pose estimation methods can be divided into top–down methods and bottom–up methods. The bottom–up approach first detects the coordinates of the relevant joint points and then connects them into a skeleton. Santavas et al. [10] proposed a lightweight 2D hand pose estimation method that utilizes a self-attention module to enhance the network's feature extraction capability, with only 8.61 M network parameters. Qiao et al. [11] used the OpenPose algorithm for real-time estimation of human 2D poses from monocular images. Cheng et al. [12] proposed the human pose estimation method HigherHRNet, which assigns training targets of different resolutions to the corresponding feature pyramid levels through a new multi-resolution supervision strategy, generating scale-aware high-resolution heat maps that can accurately locate joint points in small-scale human images. Zhang et al. [13] proposed an action recognition method that combines self-attention and OpenPose: it combines adjacency matrices of different scales through the self-attention mechanism, guiding the graph convolutional network to strengthen the dependencies of related joint points, which improves the action recognition rate. Papandreou et al. [14] proposed an anchor-free bottom–up pose estimation method that uses ResNet to directly learn the 2D offset fields of each pair of joint points for grouping; it is used for person pose estimation and instance segmentation in multi-person images. Bottom–up methods achieve higher pose estimation accuracy but slower inference speed.
The top–down approach first detects the hand and then performs pose estimation. Newell et al. [15] proposed the Hourglass algorithm, which uses stacked Hourglass modules to estimate the positions of 2D joint points of the human body; it captures various spatial relationships of the body well and has achieved good results in human pose estimation. Xiao et al. [16] proposed the Simple Baseline (SBL) pose estimation algorithm, which uses ResNet50 as the backbone network and constructs a deconvolution module to replace the network's backend structure; after the hand joint heat maps are estimated, they are converted into coordinates through a regression operation. Its pose estimation accuracy is higher than that of Hourglass networks. Doosti et al. proposed HOPE-Net [17], a hand-object pose estimation method based on a graph convolutional neural network (GCN); it uses ResNet10 as the backbone network to obtain the 2D joint coordinates of the hand, and the 3D hand pose is then obtained through a constructed Graph U-Net module. Lin et al. [18] proposed a hand joint point estimation method based on cascaded features and graph convolutional neural networks, which feeds the preliminary joint estimates into an adaptive graph convolution feature enhancement module and corrects the joint coordinates to obtain the three-dimensional hand posture. Top–down methods have fast inference speed but lower accuracy.
To address the above issues, this paper first uses the lightweight feature extraction network RexNet [19] for end-to-end feature extraction, with a parameter size of only 13.9 M. Secondly, deconvolution is added to the backend of RexNet to estimate heat maps from the feature maps in the simplest way. Finally, a GCN feature enhancement module is constructed to refine the heat map estimates and improve the accuracy of hand joint point estimation. The main contributions of this paper are as follows: (1) the lightweight feature extraction network RexNet replaces the original ResNet50 as the backbone network for feature extraction, reducing model parameters; (2) the last three convolutional layers of the network are replaced with a deconvolution module (Deconv) to obtain the heat maps of the 21 hand joint points in the simplest way; (3) a GCN feature enhancement module is built to refine the preliminarily predicted joint points and improve the accuracy of joint point estimation.

2. Methods

Hand pose estimation aims to accurately obtain hand joint point coordinates from videos or images by constructing a feature extraction network. Significant progress has been made in this field in recent years, and algorithms such as Convolutional Gesture Machine, Hourglass, SBL, and HRNet have achieved good results, but their model complexity grows with the accuracy of hand pose estimation.

2.1. Related Theoretical Foundations

The SBL pose estimation algorithm provides a relatively simple baseline, achieves good results in human pose estimation, and offers an architectural design reference for subsequent research; the overall structure of the algorithm is shown in Figure 1.
The algorithm uses ResNet50 as the backbone feature extraction network, where each convolutional stage consists of consecutive residual blocks that fuse low-level and high-level features through skip connections. Three Deconv modules, each containing batch normalization (BatchNorm), ReLU activation, and deconvolution, are added after the C5 stage of ResNet50 to estimate heat maps from the multi-resolution feature maps. Finally, a 1 × 1 convolutional layer generates the predicted heat maps for k joints, and the maximum response value of each heat map gives the final predicted joint coordinates. Compared to the Hourglass network, the SBL algorithm simplifies the ResNet50 backend structure by removing the skip connections and multiple branches, making the heat map prediction network more concise.
When the number of network layers increases, more complex features can be extracted; theoretically, deeper models should achieve better results, but in practice deep networks are prone to degradation, as too many layers lead to saturated or declining accuracy. ResNet, the residual network, solves this deep-CNN training problem through residual learning. The input image size is 224 × 224 × 3; the convolutional layer Conv1 consists of convolution, normalization, and the ReLU activation function; and the convolutional stages Conv2_x to Conv5_x contain 3, 4, 6, and 3 consecutive residual blocks, respectively. The residual structure, shown in Figure 2, consists of convolutions and ReLU activations, with a skip connection fusing the low-level features with the high-level features; the output layer then uses a Softmax classifier to output the classification results. A minimal sketch of such a residual block is given below.
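As a reference point, the following is a minimal PyTorch sketch of the identity-mapping residual block described above (an illustrative assumption; the paper provides no code, and the downsampling variant is omitted):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions fused with the input via a skip connection."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                                  # low-level features carried by the skip connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)              # fuse low- and high-level features
```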

2.2. The Proposed Method

In this paper, firstly, the lightweight feature extraction network RexNet is used to replace the original ResNet50 as the backbone network for hand feature extraction, reducing the number of model parameters. Secondly, the backend convolution and fully connected layer of the network are replaced with three deconvolutions with batch normalization and ReLU activation, restoring the features to pixel space. Then, the estimated heat maps {H1… H21} of the 21 hand joint points are obtained through 1 × 1 convolution (the color of the highlighted regions reflects the confidence of a pixel being identified as a key point), and preliminary estimates of the hand joint points are obtained through heat map regression. Finally, the features produced by the RexNet convolutions are concatenated with the preliminarily estimated hand joint coordinates as input to the graph convolution (GCN) feature enhancement module, which corrects the 2D joint points preliminarily predicted by the Deconv module, improving the accuracy of hand joint point estimation and yielding the final predicted hand joint coordinates. The overall process is shown in Figure 3.

2.2.1. Backbone

In this paper, ResNet is replaced with RexNet as the backbone feature extraction network, with a network parameter size of only 13.9 M, achieving end-to-end feature extraction. RexNet is an improvement on MobileNetV2 [20], preserving MobileNetV2's depthwise separable convolution (DSC) and inverted residual block: the depthwise separable convolution reduces computational and parameter complexity, while the inverted residual block retains more feature information. The overall network structure is shown in Figure 4. It consists mainly of 16 stacked LinearBottleneck modules. Unlike ordinary network structures, where dimensionality is first reduced and then increased, the LinearBottleneck module adopts an inverted residual structure: it first uses 1 × 1 convolution to increase the dimensionality, then performs a 3 × 3 depthwise separable convolution, and finally uses 1 × 1 convolution to reduce the dimensionality, preserving more shallow features. When the stride is 1, residual connections fuse features at different levels.
In conventional convolution, each kernel performs a convolution operation over all input channels, as shown in Figure 5. The core idea of depthwise separable convolution is to decompose a complete convolution operation into two steps, depthwise convolution (DW) and pointwise convolution (PW), which reduces network parameters and improves computational efficiency.
Firstly, DW convolution is applied: each DW convolution kernel is responsible for exactly one channel, and each channel is convolved by only one kernel, so each channel yields a single-channel output. The outputs of all kernels are concatenated before PW convolution is performed. The DW convolution process is shown in Figure 6.
The PW convolution kernel has size 1 × 1 × M, where M is the number of channels in the previous layer. This convolution performs a weighted combination of the previous feature maps in the depth direction to generate new feature maps, with one output channel per 1 × 1 kernel, greatly reducing the model parameters and computational complexity. The PW convolution process is shown in Figure 7.
Assuming the input feature map has size $h_i \times w_i \times d_i$ and the output feature map has size $h_0 \times w_0 \times d_0$, the computational cost $S_0$ of a standard convolution with kernel size $k \times k$ is given in Equation (1), and the cost $S_1$ of a depthwise separable convolution is given in Equation (2); the latter requires substantially less computation than standard convolution.

$S_0 = h_0 \times w_0 \times d_i \times k \times k \times d_0$ (1)

$S_1 = h_0 \times w_0 \times d_i \times (k^2 + d_0)$ (2)
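A small numerical sketch of Equations (1) and (2) in Python; the feature map sizes below are illustrative, not values taken from the paper:

```python
def standard_conv_cost(h_o, w_o, d_i, d_o, k):
    return h_o * w_o * d_i * k * k * d_o          # Equation (1)

def separable_conv_cost(h_o, w_o, d_i, d_o, k):
    return h_o * w_o * d_i * (k * k + d_o)        # Equation (2): DW cost + PW cost

# Illustrative sizes: a 56x56 map with 128 input/output channels, 3x3 kernel.
h_o = w_o = 56; d_i = d_o = 128; k = 3
s0 = standard_conv_cost(h_o, w_o, d_i, d_o, k)
s1 = separable_conv_cost(h_o, w_o, d_i, d_o, k)
print(f"standard: {s0:,}  separable: {s1:,}  ratio: {s0 / s1:.1f}x")
# For a 3x3 kernel the saving approaches k^2 = 9x as d_0 grows.
```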
The LinearBottleneck module of RexNet is built around an inverted residual block, as shown in Figure 8: Figure 8a shows the structure without a skip connection, and Figure 8b the structure with one. The inverted residual block first uses 1 × 1 convolution to increase the dimensionality, then applies a 3 × 3 depthwise separable convolution, and finally uses 1 × 1 convolution to reduce the dimensionality. When the stride is 1, residual connections fuse features at different levels. The characteristics of the inverted residual block mainly include the following parts.
The change in the number of channels is as follows. In the residual structure, the dimensionality is first reduced and then increased, an hourglass-shaped structure with two wide ends and a narrow middle. In the inverted residual block, by contrast, 1 × 1 convolution is first used to increase the dimensionality, features are then extracted by 3 × 3 DW convolution, and finally 1 × 1 convolution reduces the dimensionality, a shuttle-shaped structure with narrow ends and a wide center.
The change in the convolution operation is achieved by replacing the standard convolution with DW convolution in the inverted residual block.
The change in the activation function is as follows. The residual block uniformly uses the ReLU activation function, while the inverted residual block uses the ReLU6 activation function in the first two layers to ensure good numerical resolution even at low precision on mobile devices. The last convolution uses a linear activation function: when information is non-linearly mapped from a high-dimensional space to a low-dimensional one, information collapse can occur, so a linear activation function is used for the dimensionality-reduction step.
The skip connections fuse high-level feature information with low-level features to obtain more information. A hedged sketch of such a block is given below.
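The following PyTorch sketch shows an inverted residual (LinearBottleneck-style) block as described above: a 1 × 1 expansion with ReLU6, a 3 × 3 depthwise convolution, and a linear 1 × 1 projection, with the skip connection applied only at stride 1. The expansion factor of 6 follows MobileNetV2 and is an assumption here, not a value stated in the paper:

```python
import torch.nn as nn

class InvertedResidual(nn.Module):
    def __init__(self, c_in: int, c_out: int, stride: int = 1, expand: int = 6):
        super().__init__()
        hidden = c_in * expand
        self.use_skip = stride == 1 and c_in == c_out
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),          # 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride, 1,
                      groups=hidden, bias=False),            # 3x3 depthwise convolution
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),         # linear 1x1 projection
            nn.BatchNorm2d(c_out),                           # no activation: avoids information collapse
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out             # skip connection only at stride 1
```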

2.2.2. Heat Map Estimation Based on Deconv

After an image passes through a convolutional neural network for feature extraction, the size of the output feature map decreases, and it must be restored to the original size for further computation. This operation of mapping an image from low resolution to high resolution is called upsampling, which is very common in network structures such as those for image segmentation. Bilinear interpolation and deconvolution, among others, implement upsampling. Deconvolution, also known as transposed convolution, is essentially a convolution process; unlike ordinary convolution, it performs a zero-insertion step before the convolution so that the output matrix has the required output shape.
The deconvolution process is shown in Figure 9. First, a convolution matrix and a transposed convolution kernel matrix are defined: the convolution matrix is a rearrangement of the convolution kernel, and the input image is enlarged by inserting zeros at a fixed ratio, yielding a matrix on which the convolution can be performed as ordinary matrix multiplication. In the deconvolution process, the kernel that actually performs the convolution is the transposed convolution kernel; it is convolved with the zero-inserted input image to produce the output image. Deconvolution is a technique used for upsampling in convolutional neural networks, and the parameters of these convolution operations are learned during training rather than predefined by hand. A sketch of such an upsampling head is given below.
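To make the upsampling concrete, the following is a minimal PyTorch sketch of a transposed-convolution head, anticipating the configuration described in the next paragraph (three deconvolution modules with 256 channels, 4 × 4 kernels, stride 2, and padding 1, followed by a 1 × 1 convolution over 21 joints); the backbone output width of 1280 channels and the 8 × 8 input map are illustrative assumptions:

```python
import torch
import torch.nn as nn

def make_deconv_head(in_channels: int, num_joints: int = 21) -> nn.Sequential:
    layers, c = [], in_channels
    for _ in range(3):                    # each module: deconv + BatchNorm + ReLU
        layers += [
            nn.ConvTranspose2d(c, 256, kernel_size=4, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
        ]
        c = 256
    layers.append(nn.Conv2d(256, num_joints, kernel_size=1))  # one heat map per joint: {H1 ... H21}
    return nn.Sequential(*layers)

head = make_deconv_head(in_channels=1280)        # backbone output width assumed
feats = torch.randn(1, 1280, 8, 8)               # illustrative backbone feature map
print(head(feats).shape)                         # torch.Size([1, 21, 64, 64]): 8x upsampling
```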
In this paper, based on the RexNet described above, the last three layers of the structure are replaced with three deconvolution modules, so that heat maps are estimated jointly from high- and low-resolution feature maps in a direct and efficient way. Each deconvolution module includes deconvolution, batch normalization, and the ReLU activation function (with 256 output channels, a 4 × 4 kernel, a stride of 2, and a padding of 1), restoring the convolutional feature map to pixel-space size; a 1 × 1 convolution then generates the predicted heat maps {H1, H2… H21} of the 21 hand joint points. The heat map computation takes the coordinates of the 21 annotated hand joint points as centers and uses a two-dimensional Gaussian function to compute the confidence of each pixel covered by each joint point, that is, the predicted heat map $H_i$ of joint point i. The formula is shown in Equation (3), where $(x, y)$ denotes the coordinates of a heat map pixel and $(u, v)$ denotes the true coordinates of the joint point.
$H_i(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{(x - u)^2 + (y - v)^2}{2\sigma^2}}$ (3)
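As an illustration of Equation (3), the following sketch renders the target heat map of one joint as a 2D Gaussian centered on its annotated coordinates (u, v); the 64 × 64 map size and σ = 2 are illustrative choices, not values stated in the paper:

```python
import numpy as np

def gaussian_heatmap(u: float, v: float, height: int = 64, width: int = 64,
                     sigma: float = 2.0) -> np.ndarray:
    """Equation (3): confidence of each pixel (x, y) given joint center (u, v)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return (1.0 / (2.0 * np.pi * sigma ** 2)) * np.exp(
        -((xs - u) ** 2 + (ys - v) ** 2) / (2.0 * sigma ** 2))

# One map per annotated joint; a 21-joint hand yields a (21, 64, 64) target.
target = np.stack([gaussian_heatmap(u, v) for u, v in [(20.0, 30.0)] * 21])
print(target.shape)  # (21, 64, 64)
```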

2.2.3. Hand Pose Estimation Based on GCN Feature Enhancement

The hand skeleton is itself a natural graph structure. In this paper, the constructed GCN feature enhancement module is used to capture the implicit relationships among hand joint points and to correct the preliminary joint estimates produced by the preceding deconvolution module. Firstly, the heat maps of the hand joint points from the Deconv module are transformed into coordinate form by integral regression. The resulting 2D coordinates are concatenated with the features obtained from the RexNet convolutions and fed into a three-layer adaptive GCN, where adjacency information is used to correct the 2D joint coordinates. The GCN convolutional process is shown in Figure 10.
Firstly, the heat maps of the hand joints obtained from the preceding deconvolution module are converted into coordinate form: each heat map is first normalized so that its pixel values lie in [0, 1], and an integration (summation) over the likelihood map then estimates the joint position. The calculation is shown in Equation (4), where $J_k$ denotes the position estimate of the k-th joint, $A$ denotes the likelihood region, and $H_k(p)$ denotes the likelihood value at point $p$.
$J_k = \sum_{p \in A} p \cdot H_k(p)$ (4)
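The following is a minimal sketch of this integral regression (a soft-argmax). Softmax normalization is a common implementation choice and an assumption here; the paper states only that the heat map is normalized to [0, 1]:

```python
import torch

def integral_regression(heatmaps: torch.Tensor) -> torch.Tensor:
    """heatmaps: (B, K, H, W) -> joint coordinates (B, K, 2) as (x, y)."""
    b, k, h, w = heatmaps.shape
    probs = torch.softmax(heatmaps.reshape(b, k, -1), dim=-1).reshape(b, k, h, w)
    ys = torch.arange(h, dtype=probs.dtype).view(1, 1, h, 1)
    xs = torch.arange(w, dtype=probs.dtype).view(1, 1, 1, w)
    x = (probs * xs).sum(dim=(2, 3))     # Equation (4): sum over p of p_x * H_k(p)
    y = (probs * ys).sum(dim=(2, 3))     # Equation (4): sum over p of p_y * H_k(p)
    return torch.stack([x, y], dim=-1)

coords = integral_regression(torch.randn(1, 21, 64, 64))
print(coords.shape)  # torch.Size([1, 21, 2]); differentiable, unlike a hard argmax
```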
Then, the 2D hand joint coordinates regressed by the deconvolution module are concatenated with the feature vectors generated by the RexNet convolutions and input into the 3-layer adaptive GCN feature enhancement module, where adjacency information is used to correct the two-dimensional joint coordinates. The output features of each node are computed as in Equation (5), with the normalized adjacency matrix $\hat{A}$ defined in Equations (6) and (7), where $A$ is the adjacency matrix of the graph, $\tilde{D}$ is the degree matrix of $\tilde{A}$, $X$ is the input feature matrix, and $W$ is the weight matrix.
$Y = \mathrm{ReLU}(\hat{A} X W)$ (5)

$\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ (6)

$\tilde{A} = A + I$ (7)
The hand pose estimation method based on GCN feature enhancement couples the heat map module with the coordinate transformation, enabling the GCN feature enhancement network to obtain more accurate poses and to learn the optimal adjacency matrix. Meanwhile, nodes can be connected to other nodes in the graph through weighted edges. The adaptive graph convolution operation updates the adjacency matrix $A$ and the weight matrix $W$ in the backpropagation step, improving the accuracy of joint point estimation. A sketch of one such layer is given below.
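As a sketch of Equations (5)–(7), the layer below symmetrically normalizes a learnable adjacency matrix with self-loops and applies it to the node features; the thumb-chain edges used to build A are an illustrative subset of the 21-joint hand skeleton, not the paper's exact graph, and the batch dimension is omitted for brevity:

```python
import torch
import torch.nn as nn

class AdaptiveGraphConv(nn.Module):
    def __init__(self, in_feats: int, out_feats: int, adjacency: torch.Tensor):
        super().__init__()
        self.A = nn.Parameter(adjacency.clone())     # learnable adjacency, updated by backprop
        self.W = nn.Linear(in_feats, out_feats, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a_tilde = self.A + torch.eye(self.A.size(0), device=x.device)  # Eq. (7): A~ = A + I
        d = a_tilde.sum(dim=-1)
        d_inv_sqrt = torch.diag(d.clamp(min=1e-6).pow(-0.5))
        a_hat = d_inv_sqrt @ a_tilde @ d_inv_sqrt    # Eq. (6): A^ = D~^-1/2 A~ D~^-1/2
        return torch.relu(a_hat @ self.W(x))         # Eq. (5): Y = ReLU(A^ X W)

A = torch.zeros(21, 21)
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:        # thumb chain, for illustration only
    A[i, j] = A[j, i] = 1.0
layer = AdaptiveGraphConv(in_feats=2, out_feats=64, adjacency=A)
print(layer(torch.randn(21, 2)).shape)               # torch.Size([21, 64])
```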

3. Experimental Results and Analysis

3.1. Experimental Setup

The experiments in this paper are trained and tested on the Linux operating system, with an NVIDIA GeForce RTX 3080 GPU and 40 GB of RAM. The RHD dataset [21], the FreiHAND dataset [22], and the CMU-Hand dataset [23] provided by Carnegie Mellon University are used for the hand pose estimation experiments. RHD is a synthetic dataset covering 39 different hand gestures performed by 20 individuals, containing 41,258 training samples and 2728 test samples; each sample provides an RGB image of the hand, a depth image of the hand, the 3D coordinates of the joint points of both hands, and the camera intrinsic matrix. The FreiHAND dataset, released by the University of Freiburg, is the first multi-view hand pose estimation dataset and consists of 130,240 images. The CMU-Hand dataset consists of hands annotated with joint positions in real images, in synthetic images, and in multi-camera recordings, for a total of 14,817 images. A total of 100 epochs are trained in this experiment, with a batch size of 2.

3.2. Evaluating Indicators

In this paper, PCK (Percentage of Correct Keypoints), AUC, E_mean (mean error), E_median (median error), FPS, Params, and FLOPs are used as evaluation metrics. PCK denotes the average accuracy, i.e., the proportion of joints whose normalized distance to the corresponding ground truth is less than a set threshold; the formula is shown in Equation (8). AUC denotes the area enclosed by the PCK curve and the X-axis. FPS denotes the number of frames per second, which measures the inference speed of the model. Params is the number of model parameters, which measures the complexity and scale of the model. FLOPs denotes the number of floating-point operations, which measures the amount of computation.
$\mathrm{PCK} = \frac{\sum_{i}^{n} \sigma\left(\frac{d_i}{d} \le T\right)}{n}$ (8)
where $n$ denotes the number of hand key points, $d_i$ denotes the distance between the predicted position and the annotated ground truth of the i-th hand joint, and $d$ is the normalization factor, taken in this paper as the Euclidean distance from the center of the palm to the tip of the middle finger. $T$ is the agreed threshold, set to 30 mm in the experiments, and $\sigma(\cdot)$ indicates whether the normalized prediction error of the key point falls within the threshold. A sketch of this computation is given below.
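A minimal NumPy sketch of the PCK computation in Equation (8); the palm-to-middle-fingertip normalization distance is passed in as `norm`:

```python
import numpy as np

def pck(pred: np.ndarray, gt: np.ndarray, norm: float, threshold: float) -> float:
    """pred, gt: (n, 2) joint coordinates; norm: normalization factor d; threshold: T."""
    dists = np.linalg.norm(pred - gt, axis=-1) / norm    # d_i / d for each joint
    return float(np.mean(dists <= threshold))            # sigma(.) averaged over n joints

# AUC is then the normalized area under PCK evaluated over a range of thresholds, e.g.:
# ts = np.linspace(0.0, 1.0, 50)
# auc = np.trapz([pck(pred, gt, d, t) for t in ts], ts) / (ts[-1] - ts[0])
```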

3.3. Comparative Analysis of Experimental Results

(1)
Visualization experiment results.
In this paper, a test is conducted on the CMU-Hand dataset, and the visualization results are shown in Figure 11. Figure 11a,e show the ground-truth joint annotations on the dataset, Figure 11b,f the joint points predicted by the algorithm [22], Figure 11c,g the heat map estimates corresponding to the ground truth, and Figure 11d,h the heat map estimates corresponding to the predictions. As Figure 11 shows, the gap between the overall estimates and the ground truth is small, indicating that the proposed algorithm performs relatively well.
(2)
Comparative analysis of experimental results.
To verify the effectiveness of our algorithm in hand pose estimation, we compared it with other 2D joint point estimation methods; the PCK curves are shown in Figure 12. As the figure shows, the SBL algorithm has the highest estimation accuracy, while the lightweight CNN algorithm has the lowest [24]. This paper uses RexNet as the backbone network, which achieves high accuracy while keeping the parameter count low.
Detailed comparison results of the proposed algorithm against other 2D pose estimation algorithms on the CMU-Hand dataset are shown in Table 1. Compared with the SBL algorithm, the proposed method uses the lightweight RexNet as the backbone network to reduce the number of parameters, and its AUC value drops by only 0.025; compared with the lightweight CNN algorithm, its AUC value improves by 0.251. Overall, the proposed algorithm achieves higher hand pose estimation accuracy while keeping the parameter count lower. We also compare the mean error (E_mean) and median error (E_median) as evaluation indexes. E_mean reflects the overall accuracy of the model: the lower its value, the closer the model's predictions are to the ground-truth labels. The results show that the proposed algorithm improves accuracy by nearly a factor of 2.5 over the lightweight CNN algorithm. Although its E_mean is 0.738 higher than the SBL's, the proposed method is lighter, and this value remains within an acceptable range. E_median reveals the stability of the model error and provides an intuitive indicator of the sample error level; the experiments show that the proposed method outperforms most of the compared methods. The proposed method takes 0.7 × 10³ ms per training iteration and about 0.3 × 10² ms per inference iteration.
The comparison of the proposed method with recent hand posture estimation methods on the RHD dataset is shown in Table 2. The experiments show that the proposed method outperforms the methods of [25,26,27] in mean joint error, median joint error, and AUC value, which verifies its validity; owing to differences in dataset quality and other factors, the results of the proposed method on the RHD dataset are markedly better than those on the CMU-Hand dataset.
In order to comprehensively evaluate the performance of the proposed hand pose estimation model, three evaluation metrics, FPS, Params, and FLOPs, are used for comparison with two mainstream methods, METRO [28] and FastMETRO [29], on the FreiHAND dataset; the experimental results are shown in Table 3.
The experimental results show that, compared with other mainstream methods, the proposed method achieves a higher frame rate (FPS) and exhibits good real-time performance. Its overall number of parameters is significantly reduced, which lowers the model's storage requirements and computational complexity. A comprehensive analysis across FPS, Params, and FLOPs shows that the proposed method maintains high-precision hand pose estimation while offering faster inference, fewer parameters, and moderate computation, making it suitable for real-time application scenarios.
(3)
Ablation experiment.
In order to verify the effectiveness of the algorithm when different backbone feature extraction networks are used, ablation experiments were designed for comparison. The PCK curves are shown in Figure 13. As the figure shows, estimation accuracy is highest when ResNet is used as the backbone network and lowest when the lightweight Squeezenet is used. This paper uses RexNet as the backbone network, which maintains fewer parameters while keeping estimation accuracy high.
Several sets of images were randomly selected to test the effectiveness of the algorithm with different backbone networks; detailed results are shown in Figure 14a–i. The first, second, and third columns show the estimation results using ResNet, RexNet, and Squeezenet as the backbone networks, respectively. The algorithm accurately estimates the positions of the hand joint points with ResNet and RexNet; with Squeezenet, however, the joint estimates can deviate and become inaccurate, as shown in Figure 14c,f. This paper uses RexNet as the backbone network, with a parameter size of only 13.9 M, achieving higher estimation accuracy while maintaining a lower number of parameters.
In order to verify the improvement effect of the deconvolution module and GCN feature enhancement module added to the model structure in this paper, ablation experiments were designed to test the effectiveness of the module. The detailed experimental comparison results are shown in Table 4. From the table, it can be seen that after adding the improved module to different backbone networks, the AUC value of hand pose estimation has been improved.
The Backbone column in the table indicates the backbone network adopted by each algorithm. As the table shows, adding the Deconv module to the Squeezenet, ResNet50, and RexNet backbones increased the AUC of hand pose estimation by 0.030, 0.080, and 0.038, respectively, which proves the effectiveness of the Deconv module. After further adding the GCN feature enhancement module to the Squeezenet, ResNet50, and RexNet backbones, the AUC values increased by 0.048, 0.111, and 0.066 over the respective baselines, which proves the effectiveness of the improved modules in this paper.

4. Discussion

This paper makes notable progress on the task of hand pose estimation: it significantly reduces the number of model parameters, lowering computation and storage requirements and making the model lighter and more efficient, while providing more accurate and robust feature representations as a foundation for subsequent hand pose estimation. The accuracy and efficiency of hand pose estimation are both improved, so the model performs well in complex hand posture estimation tasks and achieves a good balance between lightness and accuracy.
Some potential limitations and challenges in hand pose estimation still need to be explored and addressed in future work. The computational intensity of depthwise convolution relative to hardware utilization may make efficient real-time processing difficult on certain hardware platforms. To address this challenge, algorithm optimization strategies and recent advances in dedicated hardware can be explored to further improve computational efficiency and resource utilization for more efficient real-time processing.

5. Conclusions

This paper proposes a hand pose estimation method based on GCN feature enhancement to address the low accuracy caused by neglecting the internal relationships between hand joint points in existing hand pose estimation methods. Firstly, the lightweight feature extraction network RexNet replaces the original ResNet50 as the backbone network for feature extraction. Secondly, the last three convolutional layers of the network are replaced by a deconvolution module (Deconv), and the heat maps of the 21 hand joint points are obtained through 1 × 1 convolution. Finally, a GCN feature enhancement module is constructed to refine the preliminarily predicted joint points and improve the accuracy of hand joint point estimation. The proposed method is tested on the datasets described above, and the experimental results show that the algorithm maintains fewer computational parameters while achieving higher estimation accuracy.

Author Contributions

Conceptualization, D.R.; methodology, D.R.; software, D.R.; validation, D.R.; formal analysis, F.G.; data curation, F.G.; writing—original draft preparation, D.R.; writing—review and editing, D.R. and F.G.; visualization, D.R. and F.G.; supervision, F.G. All authors have read and agreed to the published version of the manuscript.

Funding

This project was supported by the Key Laboratory of Tourism Information Fusion Processing and Data Ownership Protection, Ministry of Culture and Tourism (Project No. 2024TPDP01).

Data Availability Statement

The data are contained within this article.

Conflicts of Interest

Author Dang Rong was employed by the company “China Construction Engineering Design & Research Institute Co., Ltd.”. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Zhou, X.; Chen, J.; Yang, Z.; Liu, W. Manipulation Action Recognition Based on Gesture Feature Fusion. Comput. Eng. Appl. 2021, 57, 169–175. [Google Scholar]
  2. Zhang, W.; Lin, Z.; Cheng, J.; Ke, M.; Deng, X.; Wang, H. Survey of Dynamic Hand Gesture Understanding and Interaction. J. Softw. 2021, 32, 3051–3067. [Google Scholar]
  3. Wang, R.; Popovic, J. Real-time hand-tracking with a color glove. ACM Trans. Graph. 2009, 28, 1–8. [Google Scholar]
  4. Xu, C.; Nanjappa, A.; Zhang, X.; Cheng, L. Estimate Hand Poses Efficiently from Single Depth Images. Int. J. Comput. Vis. 2016, 116, 21–45. [Google Scholar] [CrossRef]
  5. Guo, X.; Quan, T.; Pan, Y. Position Inferring of Hand Joints Based on Kinect. Comput. Appl. Softw. 2020, 37, 5. [Google Scholar]
  6. Yu, H.; Tang, X.; Liu, J.; Chen, Y.; Huang, C. Robust Single Fingertip Tracking Method Based on Palm Posture Self-adaption. J. Comput.-Aided Des. Comput. Graph. 2013, 25, 1793–1800. [Google Scholar]
  7. Sun, X.; Wei, Y.; Liang, S.; Tang, X.; Sun, J. Cascaded hand pose regression. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 824–832. [Google Scholar]
  8. Tang, D.; Chang, H.; Tejani, A.; Kim, T. Latent Regression Forest: Structured Estimation of 3D Hand Poses. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1374–1387. [Google Scholar] [CrossRef]
  9. Li, C.; Zhong, F.; Ma, X.; Qin, X. Real-Time Head Pose Estimation Based on Kalman Filter and Random Regression Forest. J. Comput.-Aided Des. Comput. Graph. 2017, 29, 2309–2316. [Google Scholar] [CrossRef]
  10. Santavas, N.; Kansizoglou, I.; Bampis, L.; Karakasis, E.; Gasteratos, A. Attention! A Lightweight 2D Hand Pose Estimation Approach. IEEE Sens. J. 2021, 21, 11488–11496. [Google Scholar] [CrossRef]
  11. Qiao, S.; Wang, Y.; Li, J. Real-time human gesture grading based on OpenPose. In Proceedings of the 2017 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Shanghai, China, 14–16 October 2017; pp. 1–6. [Google Scholar]
  12. Cheng, B.; Xiao, B.; Wang, J.; Shi, H.; Huang, T.; Zhang, L. HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5385–5394. [Google Scholar]
  13. Zhang, F.; He, T. Action Recognition Combined with Lightweight OpenPose and Attention-Guided Graph Convolution. Comput. Eng. Appl. 2022, 58, 8. [Google Scholar] [CrossRef]
  14. Papandreou, G.; Zhu, T.; Chen, L.; Gidaris, S.; Tompson, J.; Murphy, K. Personlab: Person pose estimation and instance segmentation with a part-based geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 282–299. [Google Scholar]
  15. Newell, A.; Yang, K.; Jia, D. Stacked Hourglass Networks for Human Pose Estimation. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Berlin/Heidelberg, Germany, 2016; Volume 9912, pp. 483–499. [Google Scholar]
  16. Xiao, B.; Wu, H.; Wei, Y. Simple Baselines for Human Pose Estimation and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Volume 11210, pp. 472–487. [Google Scholar]
  17. Doosti, B.; Naha, S.; Mirbagheri, M.; Crandall, D. HOPE-Net: A Graph-Based Model for Hand-Object Pose Estimation. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 6607–6616. [Google Scholar]
  18. Lin, Y.; Lin, S.; Lin, Z. 3D Hand Pose Estimation Algorithm Based on Cascaded Features and Graph Convolution. Chin. J. Liq. Cryst. Disp. 2022, 37, 736–745. [Google Scholar] [CrossRef]
  19. Ma, S.; Zhang, Q.; Li, T.; Song, H. Basic motion behavior recognition of single dairy cow based on improved Rexnet 3D network. Comput. Electron. Agric. 2022, 194, 0168–1699. [Google Scholar] [CrossRef]
  20. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  21. Zimmermann, C.; Brox, T. Learning to estimate 3d hand pose from single rgb images. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4903–4911. [Google Scholar]
  22. Zimmermann, C.; Ceylan, D.; Yang, J. FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images. In Proceedings of the Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 813–822. [Google Scholar]
  23. Simon, T.; Joo, H.; Matthews, I.; Sheikh, Y. Hand Keypoint Detection in Single Images Using Multiview Bootstrapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4645–4653. [Google Scholar]
  24. Ge, L.; Liang, H.; Yuan, J.; Thalmann, D. 3D Convolutional Neural Networks for Efficient and Robust Hand Pose Estimation from Single Depth Images. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1991–2000. [Google Scholar]
  25. Chen, Z.; Sun, Y. Joint-wise 2D to 3D lifting for hand pose estimation from a single RGB image. Appl. Intell. 2023, 53, 6421–6431. [Google Scholar] [CrossRef]
  26. Lin, F.; Wilhelm, C.; Martinez, T. Two-hand global 3D pose estimation using monocular rgb. In Proceedings of the IEEE CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2373–2381. [Google Scholar]
  27. Guo, S.; Cai, Q.; Qi, L. CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 4896–4907. [Google Scholar]
  28. Lin, K.; Wang, L.; Liu, Z. End-to-end human pose and mesh reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 1954–1963. [Google Scholar]
  29. Cho, J.; Kim, Y.; Oh, T. Cross-attention of disentangled modalities for 3D human mesh recovery with transformers. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 342–359. [Google Scholar]
Figure 1. SBL algorithm process.
Figure 2. Residual structure.
Figure 3. Overall process.
Figure 4. RexNet structure.
Figure 5. Conventional convolutional structures.
Figure 6. Depthwise convolution.
Figure 7. Pointwise convolution.
Figure 8. Inverted residual structure.
Figure 9. Deconvolution process.
Figure 10. GCN convolutional process.
Figure 11. Joint point estimation results. (The dots represent the positions of the hand joints, and the lines connecting them represent the skeletal structure of the hand.)
Figure 12. Experimental results of PCK curve comparison on the CMU-Hand dataset.
Figure 13. Results of PCK curve ablation experiments on the CMU-Hand dataset.
Figure 14. Joint point estimation results of different backbone networks. (The colored lines are the skeletal structure of the hand, the blue circles are the hand joints, and the red boxes mark the locations where offsets occurred.)
Table 1. Comparison results with other algorithms on the CMU-Hand dataset.

| Method | Dataset | E_Mean | E_Median | AUC |
|---|---|---|---|---|
| SBL | CMU-Hand | 5.937 | 3.474 | 0.826 |
| LearnableGroups-Hand | CMU-Hand | 6.237 | 4.767 | 0.816 |
| Hourglass | CMU-Hand | 8.340 | 5.283 | 0.759 |
| Lightweight CNN | CMU-Hand | 16.1 | 12.85 | 0.55 |
| Our method | CMU-Hand | 6.675 | 4.228 | 0.801 |
Table 2. Comparison results with other algorithms on the RHD dataset.

| Method | Dataset | E_Mean | E_Median | AUC |
|---|---|---|---|---|
| Chen [25] | RHD | 10.49 | 8.69 | 0.962 |
| Lin [26] | RHD | 11.14 | 12.47 | 0.942 |
| Guo [27] | RHD | - | 10.58 | 0.965 |
| Our method | RHD | 10.21 | 8.34 | 0.970 |
Table 3. Comparison results of overall model performance.

| Method | FPS | Params | FLOPs |
|---|---|---|---|
| METRO [28] | 19.55 | 183.80 M | 41.47 G |
| FastMETRO [29] | 21.88 | 133.90 M | 30.56 G |
| Our method | 38.62 | 41.80 M | 26.17 G |
Table 4. Comparison results of ablation experiments.

| Backbone | Method | E_Mean | E_Median | AUC |
|---|---|---|---|---|
| Squeezenet | Squeezenet | 11.883 | 9.029 | 0.673 |
| Squeezenet | Squeezenet + Deconv | 9.386 | 8.162 | 0.703 |
| Squeezenet | Squeezenet + Deconv + GCN | 9.012 | 6.847 | 0.721 |
| ResNet50 | ResNet | 8.360 | 5.691 | 0.746 |
| ResNet50 | ResNet50 + Deconv | 5.937 | 3.474 | 0.826 |
| ResNet50 | ResNet50 + Deconv + GCN | 3.378 | 1.621 | 0.857 |
| RexNet | RexNet | 8.759 | 5.970 | 0.735 |
| RexNet | RexNet + Deconv | 7.275 | 4.628 | 0.773 |
| RexNet | RexNet + Deconv + GCN (this paper) | 6.675 | 4.228 | 0.801 |