
Spatio-Temporal Dynamic Attention Graph Convolutional Network Based on Skeleton Gesture Recognition

1 School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, China
2 Postdoctoral Research Workstation of Northeast Asia Service Outsourcing Research Center, Harbin 150028, China
3 Post-Doctoral Flow Station of Applied Economics, Harbin 150028, China
4 Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin 150028, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(18), 3733; https://doi.org/10.3390/electronics13183733
Submission received: 19 August 2024 / Revised: 12 September 2024 / Accepted: 18 September 2024 / Published: 20 September 2024
(This article belongs to the Topic Computer Vision and Image Processing, 2nd Edition)

Abstract

Dynamic gesture recognition based on skeletal data has garnered significant attention with the rise of graph convolutional networks (GCNs). Existing methods typically calculate dependencies between joints and utilize spatio-temporal attention features. However, they often rely on joint topological features of limited spatial extent and short-time features, making it challenging to extract intra-frame spatial features and long-term inter-frame temporal features. To address this, we propose a new GCN architecture for dynamic hand gesture recognition, called a spatio-temporal dynamic attention graph convolutional network (STDA-GCN). This model employs dynamic attention spatial graph convolution, enhancing spatial feature extraction capabilities while reducing computational complexity through improved cross-channel information interaction. Additionally, a salient location channel attention mechanism is integrated between spatio-temporal convolutions to extract useful spatial features and avoid redundancy. Finally, dynamic multi-scale temporal convolution is used to extract richer inter-frame gesture features, effectively capturing information across various time scales. Evaluations on the SHREC’17 Track and DHG-14/28 benchmark datasets show that our model achieves 97.14% and 95.84% accuracy, respectively. These results demonstrate the superior performance of STDA-GCN in dynamic gesture recognition tasks.

1. Introduction

Gestures are a form of body language that enable natural communication through the position of the palm and the placement of the fingers, and they are categorized into static and dynamic gestures [1,2]. Static gestures are represented by a single image, while dynamic gestures encompass a series of actions, such as finger slides and whole-palm movements. With the growing importance of dynamic gestures in applications like sign language communication, robot control, and intelligent driving [3], their recognition techniques have become a research hotspot in computer vision and human–computer interaction.
Gesture recognition methods are mainly divided into two categories: image-based methods, which use RGB or RGB-D image sequences, and skeleton-based methods, which use 2D or 3D coordinate sequences of hand joints. The former relies on visual information to extract gesture features, while the latter recognizes gestures by analyzing geometric changes in joint positions. Despite the remarkable achievements of research relying on RGB video data [4], recognition accuracy suffers from feature extraction errors caused by background clutter, illumination changes, and occlusion. These challenges limit the use of RGB video for gesture recognition. Compared with the RGB video modality, hand skeleton encoding represents the key features of human posture and joint action in a concise and abstract way. Moreover, it is robust to viewpoint changes, action speed, external appearance, and body scale. With the advent of depth cameras (such as Microsoft Kinect [5] and Intel RealSense [6]) and continuous improvements in pose estimation algorithms (such as OpenPose [7]), researchers are now able to obtain skeletal data quickly and accurately, which has greatly promoted the development of gesture recognition research.
Human skeletal data, whether body or hand bones, naturally constitute topological structures. For example, the sequence diagram of the hand skeleton in Figure 1 illustrates a structure in which the joints are the nodes and the bones are the edges. Traditional convolutional neural networks (CNNs) [8,9,10,11] process the spatial information in each action pose by reorganizing the joint coordinates into a two-dimensional planar map, while some methods [12,13,14] use recurrent neural networks (RNNs) to represent skeletal data as a sequence of joint-coordinate vectors, which captures the temporal dependence of actions. However, compared with conventional data, skeleton data exhibit disorder and dimensional variability: the nodes have no fixed order, and the number of neighbors per node is uncertain, which increases the complexity of data processing. Conventional processing methods that assume fixed dimensions and a fixed order are therefore difficult to apply to skeletal data. Graph convolutional networks (GCNs) [15,16,17,18,19], by contrast, can directly handle such non-Euclidean data and have become a powerful tool for skeletal data by aggregating features over the adjacency and feature information of the nodes.
However, compared with general action recognition, gesture recognition relies more heavily on precise joint trajectories and their temporal order. For example, recognizing an action like “waving” may simply involve analyzing the trajectory of the arm, whereas recognizing an action like “typing” requires capturing the subtle movements of each finger and their chronological order. This strong dependence on temporal information makes feature extraction and long-term modeling particularly challenging for gesture recognition. To address these problems, this paper proposes a spatio-temporal dynamic attention graph convolutional network based on skeleton data (STDA-GCN). The method enhances the interaction between channels through dynamic attention graph convolution and captures action changes at different time scales through multi-scale temporal dynamic convolution, making it well suited to the subtle movements and temporal dependencies involved in gesture recognition. In addition, STDA-GCN reduces model complexity while improving spatial feature extraction by aggregating multiple dynamic convolution kernels. To obtain more comprehensive context information, a salient location channel attention mechanism (SLCAM) is also introduced to further strengthen the dependencies between channels.
In this paper, a novel improved spatiotemporal graph convolutional network for dynamic gesture recognition is proposed. The main contributions of this paper are as follows:
  • A dynamic attention graph convolutional layer is designed to enhance the expressive power of spatial graph convolution by aggregating multiple convolution kernels through an attention mechanism. This improves the model’s flexibility in recognizing hand actions by strengthening cross-channel information interaction.
  • A salient location channel attention mechanism is proposed. By computing the salient locations of each channel from the query matrix, it captures key contextual information more effectively. A graph-convolution-based method then represents the relationships between the feature vertices with an adjacency matrix and performs a non-local operation to recalibrate the channel features.
  • A dynamic-convolution multi-scale temporal module is designed to extract temporal features from gestures of different lengths through dynamic temporal convolutions with different dilation rates, which improves the recognition of both short and long hand actions.
  • Experimental results on the SHREC’17 Track and DHG-14/28 datasets show that STDA-GCN delivers highly competitive performance. In the ablation and comparison experiments, STDA-GCN achieves 97.14% (14 gestures) and 95.84% (28 gestures) accuracy on the SHREC’17 Track dataset, and 94.2% (14 gestures) and 92.1% (28 gestures) on the DHG-14/28 dataset.

2. Related Work

2.1. Skeleton-Based Dynamic Hand Gesture Recognition

In recent years, deep learning methods have enabled quick and reliable recognition of skeletal data, making deep learning the mainstream approach for dynamic gesture recognition based on skeletal data. CNN-based skeletal gesture recognition methods typically use specific transformations to convert gesture skeleton data into pseudo-images, which are then processed by the network. Devineau et al. [8] proposed a CNN architecture that classifies gestures by processing hand-skeletal joint position sequences both intra-frame and inter-frame in parallel. Inspired by temporal convolutional networks (TCNs) [9], Hou et al. [10] proposed an end-to-end spatio-temporal attention residual temporal convolutional network (STA-Res-TCN). The first use of 3D CNNs for skeleton-based action and gesture recognition was introduced in [11].
RNNs typically use the output of the previous state as the input for the current state, forming a recursive structure that is particularly effective for sequential data. Wang et al. [12] proposed a two-stream RNN architecture to model the spatial configurations and temporal dynamics of skeletal data. Nunez et al. [13] introduced a lightweight CNN + LSTM model to process skeletal time series, with CNNs extracting features at each time step and LSTMs handling temporal evolution. Chen et al. [14] proposed a motion feature augmentation network (MFA-Net) to enhance the usability of noisy gesture recognition data.
However, CNNs and RNNs are limited in non-Euclidean spaces [20,21]. To capture subtle behavioral differences in complex scenarios within skeleton datasets and improve model accuracy and robustness, scholars have explored the application of GCN models. Yan et al. [15] first proposed a spatio-temporal graph convolutional network (ST-GCN), which treats body joints as vertices and body skeletons as edges in a spatio-temporal graph. However, the fixed parameters of each layer limit the flexibility of the graph network. Building on this, Li et al. [16] proposed a graph convolutional network (HG-GCN) for skeleton gesture recognition. They constructed a hand skeleton structure with 21 joint points and added three types of extra edges: connecting the tip of a finger to the base of an adjacent finger, the third joint of a finger to the second joint of the pinky finger, and the tip of a finger to its third joint. Song et al. [17] introduced a multi-stream improved spatio-temporal graph convolutional network (MS-ISTGCN) for dynamic hand gesture recognition, which learns distant hand–joint relationships through adaptive spatial graph convolution and extends spatio-temporal graph convolution to extract temporal features. However, the model shares the same adjacency matrix across frames, limiting its ability to capture temporal information. To address the issue of sharing the same topology structure across different frames, Liu et al. [19] proposed a spatio-temporal decoupled graph convolutional network (TD-GCN) based on CTR-GCN [18], enabling the network to extract deeper information.

2.2. Spatio-Temporal Graph Convolution Network

In spatio-temporal graph convolution, gesture skeleton data can naturally be regarded as a graph. A gesture video sequence containing T frames, with N joints per frame, is represented as a graph G = (V, E), where V is the set of n hand joints and E is the set of m bones; the joints are the vertices and their natural connections in the hand are the edges. The GCN model is written as f(X, A), where $X \in \mathbb{R}^{N \times D}$ is the feature representation of the nodes and $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix of the undirected graph, which effectively expresses the graph structure: $A_{i,j} = 1$ when there is an edge from node i to node j, and $A_{i,j} = 0$ otherwise. Graph convolution uses the features of adjacent nodes to update the feature representation of each node:
$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right) \quad (1)$$
where $H^{(l)} \in \mathbb{R}^{N \times D}$ is the feature representation at layer l, with the initial input $H^{(0)} = X$; $\tilde{A} = A + I_N$ is the adjacency matrix with added self-connections, where $I_N$ adds a self-loop to each node; $\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ is the symmetric normalization of $\tilde{A}$ by the degree matrix $\tilde{D}$, which prevents numerical instability; $W^{(l)}$ is the trainable weight matrix of layer l; and $\sigma$ is the activation function.
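To make the propagation rule in Equation (1) concrete, the following is a minimal PyTorch sketch; the class and helper names (GraphConvLayer, normalize_adjacency) are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn


def normalize_adjacency(A: torch.Tensor) -> torch.Tensor:
    """Return D~^(-1/2) (A + I) D~^(-1/2), the symmetrically normalized adjacency."""
    A_tilde = A + torch.eye(A.size(0), device=A.device)
    d_inv_sqrt = A_tilde.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * A_tilde * d_inv_sqrt.unsqueeze(0)


class GraphConvLayer(nn.Module):
    """One layer of H^(l+1) = sigma(D~^(-1/2) A~ D~^(-1/2) H^(l) W^(l))."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features, bias=False)  # W^(l)

    def forward(self, H: torch.Tensor, A_norm: torch.Tensor) -> torch.Tensor:
        # H: (num_nodes, in_features), A_norm: (num_nodes, num_nodes)
        return torch.relu(A_norm @ self.linear(H))


# Example: 22 hand joints with 3D coordinates as initial node features.
A = torch.zeros(22, 22)                                   # fill with the hand-skeleton edges
layer = GraphConvLayer(3, 64)
H1 = layer(torch.randn(22, 3), normalize_adjacency(A))    # (22, 64)
```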

2.3. Attention Mechanism

In recent years, attention mechanisms have been applied to various tasks with remarkable results. References [22,23,24] applied attention mechanisms to dynamic hand gesture recognition. Yang et al. [22] added a self-attention mechanism to ST-GCN, assigning different weights to different joints and temporal frames so that the network pays more attention to the important ones, thereby enhancing the global correlation of spatial features and addressing the weak connection between the root joint and the distal joints. Slama et al. [23] introduced a Transformer mechanism that captures inter-frame relationships in gesture sequences through a temporal self-attention module, enhancing the model’s understanding of the temporal dynamics of gesture actions; because the spatial and temporal features of gestures are decoupled, the model performs better in complex gesture recognition tasks. Miah et al. [24] combined graph convolutional networks with deep neural networks to propose a multi-branch attention model. One graph-convolution branch applies a spatial attention module followed by a temporal attention module to generate spatio-temporal features, a second branch generates spatio-temporal features in the reverse order, and a final branch uses a general deep neural network module to extract generic deep learning features, making the model both versatile and efficient. Compared with previous studies, our proposed salient location channel attention mechanism can effectively extract important spatial information and capture global features well.

3. Proposed Methodologies

3.1. Network Architecture

In skeleton-based hand gesture recognition, graph convolutional networks (GCNs) typically consist of spatial graph convolutions and temporal convolutions. The input data for the STDA-GCN model comprises the coordinates of hand joints, represented by dimensions (N, C, T, V, M), where N denotes the batch size, C denotes the three-dimensional coordinates of the joints, T represents the number of frames in the gesture video, V indicates the 22 hand joints, and M represents the number of hands involved in the video. The basic structure of the STDA-GCN model is designed according to literature [15], and the overall framework is shown in Figure 2.
The proposed model architecture comprises 10 layers of spatial dynamic attention graph convolutions, salient location channel attention layers, and dynamic attention temporal convolution layers. In the salient location channel attention layer, SLCAM first selects the top k most important skeletal locations from the 3D coordinate features of the input bones by computing a saliency measure over the query matrix. This salient location information is then treated as the vertices of a graph, and the relationships between the feature points are represented by an adaptively learned adjacency matrix, further strengthening the interaction between the features of different channels. Following the parameter settings in [19], the numbers of input channels are set to 64, 64, 64, 128, 128, 256, and 256. The channel width increases from 64 to 128 to 256 so that more complex features can be captured layer by layer. By setting the convolution stride to 2, the model can cover a wider range of skeletal nodes and expand its receptive field. The fifth and eighth layers reduce the length of the frame sequence by factors of 2 and 4, respectively, through strided convolution, which lowers the computational complexity while ensuring that the model can still learn and extract features effectively.
Finally, after processing through these layers, the skeletal data undergo global spatial pooling along the channels and global temporal pooling along the frames in the tenth layer. Global pooling aggregates the information of the entire skeletal sequence, retaining crucial information while reducing the computational load, thereby enhancing the training and inference efficiency of the model. A fully connected layer and softmax are then used to predict the classification results of the joint stream, joint motion stream, and skeleton stream. The results of these three streams are fused to predict the final hand gesture recognition outcome.
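As a rough illustration of this final stage, the sketch below shows global pooling, a fully connected classifier per stream, and score-level fusion of the joint, joint-motion, and bone streams; the softmax-sum fusion rule is an assumption, since the paper does not specify how the three streams are combined.

```python
import torch
import torch.nn as nn


class StreamHead(nn.Module):
    """Global spatio-temporal pooling followed by a fully connected classifier
    for one data stream (joint, joint motion, or bone)."""

    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V) features from the last STDA-GCN block
        x = x.mean(dim=(2, 3))      # global temporal and spatial pooling
        return self.fc(x)           # per-class scores


def fuse_streams(joint, joint_motion, bone):
    """Fuse the three streams by summing their softmax scores (assumed rule)."""
    probs = [torch.softmax(s, dim=1) for s in (joint, joint_motion, bone)]
    return torch.stack(probs).sum(dim=0).argmax(dim=1)   # predicted class per sample
```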

3.2. Spatial Graph Convolution Layer

In previously studied GCNs, the graph topology for gesture recognition tasks is typically fixed and relies solely on physical connections. This approach does not adequately capture the underlying relationships between the skeleton joints. To enhance the performance of graph convolutional networks, a common strategy is to allow the adjacency matrix to be updated during training, thereby adaptively learning the relationships between nodes.
In this paper, we propose Dynamic Attention Graph Convolution (DAGC), which neither increases the width nor the depth of the network. Instead, it enhances the expressive power of spatial graph convolution by aggregating multiple convolution kernels through an attention mechanism. This approach better learns the intra-frame spatial relationships within the skeleton data and reduces the model complexity.
The core idea of dynamic convolution is to utilize a set of parallel convolution kernels in each layer instead of relying on a single kernel. These kernels are combined through an input-dependent attention mechanism, with attention weights computed similarly to the squeeze-and-excitation (SE) mechanism [25]. Initially, global average pooling compresses the spatial information of each channel of the input skeletal features into a single value. Then, the normalized attention weights for the K convolutional kernels are produced by two fully connected layers (with a ReLU activation in between) and a softmax layer. Following the optimal setting reported in [26] for classification tasks, this paper also sets K to 4. Finally, the convolutional kernels are dynamically weighted and aggregated according to these weights, as illustrated in Figure 3. This method is computationally efficient and provides stronger representation capability, as the aggregation of convolutional kernels is performed in a nonlinear manner.
Specifically, dynamic convolution employs a set of K parallel convolutional kernels $\tilde{w}_k$ with biases $\tilde{b}_k$ that are dynamically weighted and aggregated into a new kernel $\tilde{w}(x)$ and bias $\tilde{b}(x)$ through input-dependent attention weights $\pi_k(x)$ (Equations (2) and (3)). Equation (4) constrains the attention outputs to sum to 1, which facilitates learning of the attention model. Because the attention weights depend on the input, the parallel convolutional kernels are combined differently for each sample. This approach is computationally efficient, since the individual kernels are small, and it provides stronger representational power because the kernels are aggregated in a nonlinear manner. In essence, K linear functions are aggregated, and the output y of the convolutional layer is given by Equation (5), where g is the activation function.
$$\tilde{w}(x) = \sum_{k=1}^{K} \pi_k(x)\, \tilde{w}_k \quad (2)$$
$$\tilde{b}(x) = \sum_{k=1}^{K} \pi_k(x)\, \tilde{b}_k \quad (3)$$
$$0 \le \pi_k(x) \le 1, \qquad \sum_{k=1}^{K} \pi_k(x) = 1 \quad (4)$$
$$y = g\!\left(\tilde{w}(x)^{T} x + \tilde{b}(x)\right) \quad (5)$$
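A minimal sketch of Equations (2)-(5) is given below, assuming PyTorch and 2D convolutions; the attention branch follows the SE-style GAP → FC → ReLU → FC → softmax pipeline described above, and the grouped-convolution trick for per-sample kernels is an implementation choice, not a detail reported in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConv2d(nn.Module):
    """Aggregates K parallel convolution kernels with input-dependent attention
    weights pi_k(x), as in Equations (2)-(5); a sketch, not the authors' exact layer."""

    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 1, K: int = 4, reduction: int = 4):
        super().__init__()
        self.K, self.ks, self.out_ch = K, kernel_size, out_ch
        self.weight = nn.Parameter(torch.randn(K, out_ch, in_ch, kernel_size, kernel_size) * 0.01)
        self.bias = nn.Parameter(torch.zeros(K, out_ch))
        # SE-style attention: GAP -> FC -> ReLU -> FC -> softmax over the K kernels
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, max(in_ch // reduction, 1)), nn.ReLU(inplace=True),
            nn.Linear(max(in_ch // reduction, 1), K),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        N, C, H, W = x.shape
        pi = F.softmax(self.attn(x), dim=1)                    # (N, K), Equation (4)
        w = torch.einsum('nk,koihw->noihw', pi, self.weight)   # aggregated kernels, Equation (2)
        b = torch.einsum('nk,ko->no', pi, self.bias)           # aggregated biases, Equation (3)
        # Grouped-convolution trick: apply each sample's own aggregated kernel.
        x = x.reshape(1, N * C, H, W)
        w = w.reshape(N * self.out_ch, C, self.ks, self.ks)
        out = F.conv2d(x, w, b.reshape(-1), padding=self.ks // 2, groups=N)
        out = out.reshape(N, self.out_ch, out.shape[-2], out.shape[-1])
        return torch.relu(out)                                 # activation g, Equation (5)
```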
The DAGC proposed in this study aggregates three parallel time-dependent and channel-dependent adjacency matrices, as described in Equation (6). The DAGC updates the feature $f_i$ of a vertex $v_i$ by aggregating the features of its neighboring vertices $v_j$ with the dynamically aggregated weights $\tilde{w}(x)$. Each DAGC module comprises three dynamic convolutional layers: the first transforms the inputs into a number of spatial-relation channels used to compute the various spatial relations; the second converts the input into the number of output channels needed for feature computation; and the third maps the spatial-relation channels to the number of output channels, aligning the spatial relations with the features. These dynamic convolutional layers provide a richer representation of each input, effectively capturing the complexity and dynamic characteristics of the skeletal data and enabling flexible adaptation to different action patterns.
$$f_i = \sum_{v_j \in N(v_i)} a_{ij}\, x_j\, \tilde{w}(x) \quad (6)$$
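The sketch below illustrates how such a DAGC branch could refine a shared adjacency matrix and aggregate neighbour features as in Equation (6). Plain 1 × 1 convolutions stand in for the dynamic convolutions above, and the tanh-based refinement with a pair of relation projections is an assumption borrowed from the CTR-GCN family that the model builds on, not the authors' exact design.

```python
import torch
import torch.nn as nn


class DAGCBranch(nn.Module):
    """One DAGC branch: relation convolutions refine a channel-wise adjacency,
    then neighbour features are aggregated as in Equation (6). A hedged sketch."""

    def __init__(self, in_ch: int, out_ch: int, rel_ch: int = 8):
        super().__init__()
        self.conv_rel_a = nn.Conv2d(in_ch, rel_ch, 1)   # spatial-relation channels
        self.conv_rel_b = nn.Conv2d(in_ch, rel_ch, 1)   # spatial-relation channels
        self.conv_feat = nn.Conv2d(in_ch, out_ch, 1)    # output feature channels
        self.conv_map = nn.Conv2d(rel_ch, out_ch, 1)    # relation -> output channels

    def forward(self, x: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V) skeleton features; A: (V, V) physical hand adjacency
        rel = self.conv_rel_a(x).mean(2).unsqueeze(-1) - self.conv_rel_b(x).mean(2).unsqueeze(-2)
        rel = self.conv_map(torch.tanh(rel))             # (N, out_ch, V, V) learned refinement
        adj = A + rel                                    # dynamic, channel-wise a_ij
        feat = self.conv_feat(x)                         # (N, out_ch, T, V)
        return torch.einsum('nctv,ncvw->nctw', feat, adj)
```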

3.3. Channel Attention Mechanism

In recent years, the attention mechanism has garnered considerable interest across various types of neural networks, leading to numerous studies that have successfully utilized it to achieve significant results. For instance, references [22,23] applied the attention mechanism to dynamic gesture recognition, which indeed improved performance. However, this approach also resulted in a significant increase in the number of parameters.
In this study, we introduce the salient location channel attention mechanism (SLCAM) between the spatial graph convolution and temporal convolution blocks. First, the SLCAM captures more comprehensive contextual information by selecting salient location information within the channels. Subsequently, the graph convolution method is applied to the channel attention to enhance inter-channel dependencies. This is achieved by treating the feature information from the salient locations as vertices in the graph and performing non-local operations on these features through an adaptively learned adjacency matrix.
The skeletal data output from the DAGC module are first used as the input feature matrix X. This matrix is passed through two 2D convolutional layers to generate the query matrix Q and the value matrix V (Equation (7)). The saliency measure is computed as the sum of squares of the query matrix; aggregating these values gives the saliency score $Q_{pow}$, from which the top k salient locations are selected to generate the key matrix K.
The attention matrix A is computed from the key matrix K derived from the salient locations. The product $K^{T}K$ computes the similarity between each query location and all key locations, capturing the global dependencies between features, and is normalized with the softmax function (Equation (8)). The value matrix V is multiplied by the attention matrix A to generate the output skeletal feature Y (Equation (9)). This output is then transformed by a 1 × 1 convolution and finally summed with the input X to further process the features, as illustrated in Figure 4.
$$Q = h(X), \quad V = \theta(X), \quad K = S(Q) \quad (7)$$
$$Q \in \mathbb{R}^{n \times c}, \quad V \in \mathbb{R}^{n \times c}, \quad K \in \mathbb{R}^{k \times c}$$
$$A = \mathrm{softmax}\!\left(\frac{K^{T} K}{\sqrt{c}}\right) \quad (8)$$
$$Y = V A \quad (9)$$
$$A \in \mathbb{R}^{c \times c}; \quad X, Y \in \mathbb{R}^{n \times c}$$
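The following sketch follows the description of Equations (7)-(9): Q and V come from 1 × 1 convolutions, the key matrix K gathers the top-k locations ranked by the sum of squares of Q, and a scaled softmax over K^T K re-weights the value features. The value of k, the √c scaling, and the residual connection are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


class SLCAM(nn.Module):
    """Salient location channel attention sketch (Equations (7)-(9)); the top-k
    value and several layout details are assumptions based on the text."""

    def __init__(self, channels: int, k: int = 16):
        super().__init__()
        self.k = k
        self.to_q = nn.Conv2d(channels, channels, 1)    # h(X)     -> Q
        self.to_v = nn.Conv2d(channels, channels, 1)    # theta(X) -> V
        self.proj = nn.Conv2d(channels, channels, 1)    # final 1 x 1 convolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        N, C, T, V = x.shape
        q = self.to_q(x).flatten(2).transpose(1, 2)     # (N, n, c), n = T * V
        v = self.to_v(x).flatten(2).transpose(1, 2)     # (N, n, c)
        saliency = q.pow(2).sum(dim=2)                  # Q_pow: sum of squares per location
        idx = saliency.topk(self.k, dim=1).indices      # top-k salient locations
        k_mat = torch.gather(q, 1, idx.unsqueeze(-1).expand(-1, -1, C))          # K: (N, k, c)
        attn = torch.softmax(k_mat.transpose(1, 2) @ k_mat / C ** 0.5, dim=-1)   # (N, c, c), Eq. (8)
        y = (v @ attn).transpose(1, 2).reshape(N, C, T, V)                       # Eq. (9)
        return x + self.proj(y)                         # residual with the input X
```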
Subsequently, graph convolution theory is introduced into the channel attention mechanism, enabling the direct modeling and optimization of relationships between skeletal points and capturing non-local dependencies among them, as illustrated in Figure 5. This method employs GCNs to perform non-local (NL) operations directly on the features, thereby capturing and representing relationships between features more efficiently. In graph convolutional networks, the convolution operation on a vertex $v_i$ can be expressed as shown in Equation (10), where $B_i$ denotes the set of vertices neighboring the output vertex $v_i$ and $c(v_j)$ is the normalization function. The weight of each feature point is then obtained by the Adaptive Graph Channel Module (AGCM) and mapped back to the original feature map (Equation (11)).
The AGCM adjacency matrix consists of three parts: a constant matrix (A0) that represents the feature vertices themselves; a self-attention graph (A1), obtained through a one-dimensional convolution and a softmax layer (Equation (12)), that represents the weights of the feature vertices; and a learnable adjacency matrix (A2) that generates dependencies between any two feature vertices through back-propagation, capturing feature relationships at different levels (Equation (13)).
The AGCM dynamically adjusts the weights of each feature point, allowing the model to adaptively emphasize or suppress different features. This is particularly important for handling complex skeletal data, as different actions may exhibit varying saliency at different skeletal points. In skeletal data, different points may contain similar or redundant information. AGCM addresses this by introducing a bottleneck structure, which reduces feature redundancy and enhances the independence and effectiveness of each skeletal point’s feature representation. This, in turn, improves the model’s discriminative ability.
$$f_{out}(v_i) = \sum_{v_j \in B_i} \frac{1}{c(v_j)}\, w_{ij}\, f_{in}(v_j) \quad (10)$$
$$f_{out} = w\, f_{in}\, \tilde{A} \quad (11)$$
$$A_1 = \left(\mathrm{softmax}\!\left(W f_{in}\right)\right)^{T} \quad (12)$$
$$f_{out} = w\, f_{in}\left(A_0 \times A_1 + A_2\right) \quad (13)$$
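A compact sketch of the AGCM is given below, under the reading that A0 is the identity, A1 comes from a one-dimensional convolution plus softmax (Equation (12)), and A2 is a freely learnable matrix (Equation (13)); the bottleneck width and the exact way the three graphs are combined are assumptions.

```python
import torch
import torch.nn as nn


class AGCM(nn.Module):
    """Adaptive graph channel module sketch (Equations (11)-(13))."""

    def __init__(self, channels: int, num_vertices: int, bottleneck: int = 4):
        super().__init__()
        self.A2 = nn.Parameter(torch.zeros(num_vertices, num_vertices))   # learnable graph
        self.attn = nn.Conv1d(channels, num_vertices, 1)                  # logits for A1
        self.w = nn.Sequential(                                           # bottleneck transform w
            nn.Conv1d(channels, max(channels // bottleneck, 1), 1), nn.ReLU(inplace=True),
            nn.Conv1d(max(channels // bottleneck, 1), channels, 1),
        )

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        # f_in: (N, C, V) features of the salient feature vertices
        A0 = torch.eye(f_in.size(-1), device=f_in.device)      # constant graph (identity)
        A1 = torch.softmax(self.attn(f_in), dim=-1)            # (N, V, V), Equation (12)
        A = A0 * A1 + self.A2                                   # combined adjacency, Equation (13)
        return torch.einsum('ncv,nvw->ncw', self.w(f_in), A)   # f_out = w(f_in) A
```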

3.4. Temporal Convolution Layer

Dynamic attention temporal convolution (DATC), analogous to the dynamic attention spatial graph convolution, is also employed in the temporal convolution of this study to enhance the learning capability across frames of skeletal data. To capture gesture information of varying lengths, we utilize a multi-scale temporal aggregation structure, as shown in Figure 6. This structure comprises three branches designed to extract inter-frame features of the skeletal data through dilated temporal convolution.
First Branch: This branch consists of two convolutional layers. The output of the first dynamic convolutional layer, after normalization and nonlinear activation, is fed into a second, dilated K × 1 convolutional layer, where K denotes the size of the convolutional kernel. The introduction of dynamic convolution facilitates the fusion of features across different time steps, and the dilated convolution in particular captures long-term dependencies more effectively through dynamic weight adjustment.
Second Branch: This branch includes a dynamic time convolution layer followed by a normalization layer. The dynamic time convolution better fuses features from different time steps, accommodating the varying significance of different time steps in time series data. By dynamically adjusting the weights of the convolutional kernel, the model can more effectively fuse information from the gesture skeletal data capturing subtle differences between time steps. The normalization layer ensures feature stability and consistency.
Third Branch: This branch combines convolution, normalization, activation, and pooling operations. Maximal pooling is used to capture the primary temporal features, while convolution and normalization maintain a normalized and nonlinear representation of the features, capturing richer information about the actions.
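The sketch below mirrors the three branches described above, with plain Conv2d layers standing in for the dynamic convolutions; the kernel size, dilation rate, and the summation-based fusion of the branch outputs are assumptions.

```python
import torch
import torch.nn as nn


class MultiScaleTemporalConv(nn.Module):
    """Three-branch temporal module sketch: dilated temporal conv, plain temporal
    conv, and max pooling over time, fused by summation (assumed)."""

    def __init__(self, channels: int, kernel_size: int = 5, dilation: int = 2, stride: int = 1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2
        # Branch 1: 1x1 conv -> BN -> ReLU -> dilated K x 1 temporal conv -> BN
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, (kernel_size, 1), (stride, 1), (pad, 0), (dilation, 1)),
            nn.BatchNorm2d(channels),
        )
        # Branch 2: temporal conv -> BN
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, channels, (kernel_size, 1), (stride, 1), ((kernel_size - 1) // 2, 0)),
            nn.BatchNorm2d(channels),
        )
        # Branch 3: 1x1 conv -> BN -> ReLU -> max pooling over time -> BN
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.MaxPool2d((3, 1), (stride, 1), (1, 0)), nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, T, V); every branch keeps the same (possibly strided) length
        return self.branch1(x) + self.branch2(x) + self.branch3(x)
```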

4. Experimental Evaluation

We conducted experiments on two widely used public dynamic gesture datasets, SHREC’17 Track and DHG-14/28, to evaluate the performance of the proposed STDA-GCN model (as illustrated in Figure 7). The model comprises three parallel dynamic attention convolutions, a salient location channel attention mechanism, and a multi-scale dynamic attention temporal convolution. Additionally, we performed ablation experiments to verify the effectiveness of each module within the model.

4.1. Datasets

SHREC’17 Track dataset [27]: This features gestures performed by 28 right-handed participants, captured using an Intel RealSense short-range depth camera. Each gesture was performed 1 to 10 times, resulting in 2800 sequences. Each frame includes a 640 × 480 depth image and the coordinates of 22 joints in both 2D and 3D spaces. The dataset comprises 14 distinct gestures categorized into coarse (whole-hand actions) and fine (hand shape changes) gestures. The dataset contains the following 14 gestures: Grab (G), Tap (T), Expand (E), Pinch (P), Rotation Clockwise (R-CW), Rotation Counter-Clockwise (R-CCW), Swipe Right (S-R), Swipe Left (S-L), Swipe Up (S-U), Swipe Down (S-D), Swipe X (S-X), Swipe + (S+), Swipe V (S-V) and Shake (Sh).
The DHG-14/28 dataset [28]: This used the same data collection methodology as the SHREC’17 Track dataset. The Dynamic Hand Gesture 14/28 dataset contains 14 gestures, each executed in two ways: using one finger and using the whole hand. Each gesture was performed five times by 20 participants, generating 2800 sequences. All participants were right-handed.

4.2. Evaluation Metrics

Top-1 accuracy (Acc (Top-1)) is a common metric used to measure the performance of models in skeleton data recognition tasks [29,30]. Top-1 accuracy represents the proportion of the highest probability class predicted by the model that agrees with the actual class. In this study, Acc (Top-1) was used to evaluate the performance of the STDA-GCN model in the gesture recognition task (Equation (14)).
$$\mathrm{top1} = \frac{\sum_{i \in V} \varphi\!\left(class_i^{true} = \mathrm{rank}_1\!\left(class_i^{pred}\right)\right)}{|V|} \quad (14)$$
In this context, φ is an indicator function that takes the value 1 if the predicted class matches the actual class and 0 otherwise; $class_i^{true}$ represents the true class of the i-th gesture sample, $\mathrm{rank}_1(class_i^{pred})$ represents the predicted class with the highest probability score for the i-th sample, and |V| represents the number of gesture samples evaluated.
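As a minimal illustration of Equation (14), assuming class scores and integer labels stored as tensors:

```python
import torch


def top1_accuracy(scores: torch.Tensor, labels: torch.Tensor) -> float:
    """Equation (14): fraction of samples whose rank-1 predicted class equals the
    ground-truth class. scores: (num_samples, num_classes), labels: (num_samples,)."""
    preds = scores.argmax(dim=1)                    # rank_1(class_pred) per sample
    return (preds == labels).float().mean().item()
```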

4.3. Training Details

All experiments were conducted using the PyTorch framework. After extensive comparisons, we employed the Adam optimizer for the SHREC’17 Track dataset with a batch size of 32 and a maximum of 100 training epochs, reducing the learning rate at the 30th and 70th epochs by a factor of 10. For the DHG-14/28 dataset, we used the SGD optimizer with a batch size of 32 and a maximum of 150 training epochs.
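For reference, the sketch below encodes the SHREC’17 Track schedule described above; the model here is a stand-in module and the base learning rate is an assumed value, since it is not reported in this section.

```python
import torch
import torch.nn as nn

model = nn.Linear(66, 14)    # stand-in for the STDA-GCN network (22 joints x 3 coords -> 14 classes)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)    # assumed base learning rate
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[30, 70], gamma=0.1)

for epoch in range(100):
    # ... one training pass over the SHREC'17 Track training split with batch size 32 ...
    scheduler.step()         # learning rate drops by a factor of 10 at epochs 30 and 70
```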

4.4. Ablation Study and Discussion

The benchmark model [19] provided detailed metrics for each gesture type in the training on the DHG-14/28 dataset but did not include overall metrics for the 14-gesture and 28-gesture datasets. Therefore, we supplemented the experimental results for the 14-gesture and 28-gesture datasets in the ablation study, as shown in Table 1, to facilitate comparison with the STDA-GCN model in both the ablation and comparison experiments of this study.
We conducted ablation experiments on the SHREC’17 Track and DHG-14/28 datasets by removing each added component in three action modes. The three action modes are as follows: joint flow (J), representing joint coordinates; joint motion flow (JM), representing coordinate differences between temporally neighboring coordinates; and bone flow (B), representing coordinate differences between spatially connected joints. Experiments evaluated the model using the J, JM, and combined 3s (J + JM + B) modes.
In the ablation study, we evaluated the effectiveness of each module in the STDA-GCN model using two public dynamic gesture datasets, SHREC’17 Track and DHG-14/28. The original model includes a dynamic attention convolution module and a salient location channel attention mechanism. As shown in Table 2, removing the dynamic attention convolution module resulted in a 0.24% decrease in the gesture metrics for the 14-gesture data and a 0.33% decrease for the 28-gesture data. This demonstrates that the dynamic attention convolution effectively aggregates multiple parallel convolution kernels based on input attention, improving the learning of intra- and inter-frame relationships within the skeletal data and enhancing feature representation.
Further, removing the salient location channel attention module led to an additional 0.12% decrease in accuracy for the 14-gesture data and a 0.38% decrease for the 28-gesture data in the SHREC’17 Track dataset. This indicates the SLCAM module’s effectiveness in capturing global gesture features and weighting different channels, enabling the model to focus on important feature channels.
Overall, these results suggest that the modules in the STDA-GCN model complement each other through different mechanisms, collectively enhancing the model’s performance.
As shown in Table 3, removing the dynamic attention convolution module led to a 0.48% decrease for the 14-gesture data and a 0.24% decrease for the 28-gesture data. Further removal of the salient location channel attention module resulted in an additional 0.35% decrease in accuracy for the 14-gesture data and a 0.95% decrease for the 28-gesture data on the SHREC’17 Track dataset.
For the DHG-14/28 dataset, as shown in Table 4, removing the dynamic attention convolution module resulted in a 0.71% decrease in metrics for the 14-gesture data and a 2.14% decrease for the 28-gesture data. Further removal of the salient location channel attention module led to an additional 0.71% decrease in accuracy for the 14-gesture data and a 0.72% decrease for the 28-gesture data.
As shown in Table 5, removing the dynamic attention convolution module resulted in a 0.72% decrease for the 14-gesture class and a 1.42% decrease for the 28-gesture class. Further removal of the salient location channel attention module led to an additional 1.42% decrease in accuracy for the 14-gesture class and a 0.72% decrease for the 28-gesture class.

4.5. Comparison with Other GCN and Discussion

Subsequently, we compared our proposed STDA-GCN model with previously developed GCN models for 3D skeleton-based gesture recognition. As shown in Table 6, on the SHREC’17 TRACK dataset, our model achieved an accuracy of 97.14% for the 14 gestures and 95.84% for the 28 gestures, demonstrating a significant improvement over current skeleton-based dynamic gesture models, which proves the effectiveness of our model.
The ST-GCN model in reference [15] did not conduct experiments specifically for gesture recognition tasks, while the later study in reference [23] applied the ST-GCN model to gesture recognition experiments. Therefore, this research refers to the experimental results of the ST-GCN model from [23] on gesture recognition tasks and compares them with the experimental results of the model proposed in this paper. ST-GCN was the first to apply graph convolutional networks to human action recognition, laying the foundation for spatio-temporal feature extraction using graph structures. The model achieved an accuracy of 92.7% for 14 gestures and 87.7% for 28 gestures. The HG-GCN model [16] was the first to establish a skeletal structure for gesture recognition, adding three types of edges to the 22 hand joints to describe joint linkage actions. It achieved an accuracy of 92.8% for 14 gestures and 88.3% for 28 gestures. The TD-GCN model [19] enhanced the temporal frame representation by using different adjacency matrices for different time frames, achieving an accuracy of 97.02% for 14 gestures and 95.36% for 28 gestures. The STr-GCN [23], STA-GCN [31], and ResGCNeXt [32] models are newer approaches that incorporate attention mechanisms into spatio-temporal graph convolution networks. STr-GCN combined spatial graph convolution with Transformers, achieving an accuracy of 93.39% for 14 gestures and 89.2% for 28 gestures. STA-GCN introduced dual-stream spatio-temporal attention, achieving an accuracy of 95.4% for 14 gestures and 91.8% for 28 gestures. The ResGCNeXt model incorporated attention mechanisms after spatio-temporal graph convolution to enhance spatial and channel feature learning, achieving an accuracy of 95.36% for 14 gestures and 93.1% for 28 gestures.
Figure 8 and Figure 9 show the confusion matrices for 14 and 28 gestures on the SHREC’17 TRACK dataset. We can see that for gestures like Swipe X (S-X) and Swipe + (S+), which mainly rely on spatial transformations, the model achieved an accuracy of 1. This indicates that the introduction of dynamic attention graph convolution in the STDA-GCN model has enhanced cross-channel information interaction, allowing the model to more effectively capture subtle spatial variations in hand actions. For gestures like Swipe Right (S-R) and Swipe Left (S-L), which are more dependent on temporal changes, the model also performed exceptionally well, demonstrating the effectiveness of the proposed multi-scale temporal dynamic attention convolution. This mechanism better captures the transition of hand actions from short-term to long-term variations, improving the model’s ability to detect inter-frame action changes, and ultimately significantly enhancing the recognition accuracy of these dynamic gestures.
Table 7 shows that our model also demonstrates excellent performance on the DHG-14/28 dataset, achieving accuracies of 94.2% for 14 gestures and 92.1% for 28 gestures. Figure 10 and Figure 11 show the confusion matrices for 14 and 28 gestures on the DHG-14/28 dataset. We found that gestures involving spatial transformations, such as Grab (G) and Expand (E), and inter-frame actions like Rotation Clockwise (RC) and Rotation Counter-clockwise (RCC), performed well. However, due to the highly similar spatial actions of Grab (G) and Pinch (P), these actions are often misclassified as one another. Therefore, further optimization is needed to improve the model’s ability to distinguish between complex or similar gesture classes.

5. Conclusions

We propose a spatio-temporal dynamic attention graph convolutional network (STDA-GCN) for dynamic gesture recognition. Unlike previous approaches, STDA-GCN employs dynamic attention convolution in the spatial graph convolution module to enhance cross-channel information interaction by aggregating multiple dynamic convolution kernels, thereby improving the learning of hand action changes. By incorporating a salient location channel attention mechanism between the spatio-temporal graph convolutions, the model effectively suppresses redundant features and increases efficiency. Additionally, we utilize multi-scale temporal dynamic attention convolution to better capture short- and long-term hand action information and inter-frame changes. Experimental results on the SHREC’17 Track and DHG-14/28 datasets show that STDA-GCN exhibits highly competitive performance: in the ablation and comparison experiments it achieves 97.14% (14 gestures) and 95.84% (28 gestures) accuracy on the SHREC’17 Track dataset, and 94.2% (14 gestures) and 92.1% (28 gestures) on the DHG-14/28 dataset. However, for the discrimination of certain similar gestures, such as Grab (G) and Pinch (P), there is still room to further optimize the model’s ability to extract spatial features. In the future, we plan to continue improving the model’s ability to distinguish similar hand actions and to combine semantic analysis to improve its performance on human-interaction hand action datasets.

Author Contributions

Conceptualization, Y.C., X.H. and X.C.; methodology, Y.C., X.H. and X.C.; validation, Y.C. and Y.L.; investigation, Y.C., X.H. and Y.L.; resources, Y.C.; data curation, Y.C. and X.H.; writing—original draft preparation, Y.C., X.H. and X.C.; writing—review and editing, Y.L., X.H. and X.C.; visualization, Y.C. and X.H.; supervision, X.H. and W.H.; project administration, X.H.; funding acquisition, X.H. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the Heilongjiang Postdoctoral Fund for scientific research in Heilongjiang Province in 2019 (assistance level II, RMB 70,000; serial number LBH-Z19072).

Data Availability Statement

The data provided in this study is available upon request by contacting the corresponding author. This study utilized two publicly available datasets, which can be downloaded from the following links: http://www-rech.telecom-lille.fr/shrec2017-hand/ (accessed on 17 September 2024) and http://www-rech.telecom-lille.fr/shrec2017-hand/ (accessed on 17 September 2024).

Acknowledgments

The Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing and the Postdoctoral Research Workstation of the Northeast Asia Service Outsourcing Research Center provided academic support.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Rautaray, S.S.; Agrawal, A. Vision based hand gesture recognition for human computer interaction: A survey. Artif. Intell. Rev. 2015, 43, 1–54. [Google Scholar] [CrossRef]
  2. Cheng, H.; Yang, L.; Liu, Z. Survey on 3D hand gesture recognition. IEEE Trans. Circuits Syst. Video Technol. 2016, 26, 1659–1673. [Google Scholar] [CrossRef]
  3. Dabwan, B.A.; Jadhav, M.E. A review of sign language and hand motion recognition techniques. Int. J. Adv. Sci. Technol. 2020, 29, 4621–4635. [Google Scholar]
  4. Hussain, A.; Khan, S.U.; Rida, I.; Khan, N.; Baik, S.W. Human centric attention with deep multiscale feature fusion framework for activity recognition in Internet of Medical Things. Inf. Fusion 2024, 106, 102211. [Google Scholar] [CrossRef]
  5. Zhang, Z. Microsoft kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10. [Google Scholar] [CrossRef]
  6. Keselman, L.; Iselin Woodfill, J.; Grunnet-Jepsen, A.; Bhowmik, A. Intel realsense stereoscopic depth cameras. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  7. Guo, H.; Wang, G.; Chen, X.; Zhang, C. Towards good practices for deep 3D hand pose estimation. arXiv 2017, arXiv:1707.07248. [Google Scholar]
  8. Devineau, G.; Moutarde, F.; Xi, W.; Yang, J. Deep learning for hand gesture recognition on skeletal data. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; IEEE: Piscataway, NJ, USA. [Google Scholar]
  9. Soo Kim, T.; Reiter, A. Interpretable 3D human action analysis with temporal convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  10. Hou, J.; Wang, G.; Chen, X.; Xue, J.H.; Zhu, R.; Yang, H. Spatial-temporal attention res-TCN for skeleton-based dynamic hand gesture recognition. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018; pp. 273–286. [Google Scholar]
  11. Tu, J.; Liu, M.; Liu, H. Skeleton-based human action recognition using spatial temporal 3D convolutional neural networks. In Proceedings of the 2018 IEEE International Conference on Multimedia and Expo (ICME), San Diego, CA, USA, 23–27 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
  12. Wang, H.; Wang, L. Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  13. Núñez, J.C.; Cabido, R.; Pantrigo, J.J.; Montemayor, A.S.; Vélez, J.F. Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognit. 2018, 76, 80–94. [Google Scholar] [CrossRef]
  14. Chen, X.; Wang, G.; Guo, H.; Zhang, C.; Wang, H.; Zhang, L. Mfa-net: Motion feature augmented network for dynamic hand gesture recognition from skeletal data. Sensors 2019, 19, 239. [Google Scholar] [CrossRef] [PubMed]
  15. Yan, S.; Xiong, Y.; Lin, D. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  16. Li, Y.; He, Z.; Ye, X.; He, Z.; Han, K. Spatial temporal graph convolutional networks for skeleton-based dynamic hand gesture recognition. EURASIP J. Image Video Process. 2019, 2019, 78. [Google Scholar] [CrossRef]
  17. Song, J.-H.; Kong, K.; Kang, S.-J. Dynamic hand gesture recognition using improved spatio-temporal graph convolutional network. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6227–6239. [Google Scholar] [CrossRef]
  18. Chen, Y.; Zhang, Z.; Yuan, C.; Li, B.; Deng, Y.; Hu, W. Channel-wise topology refinement graph convolution for skeleton-based action recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021. [Google Scholar]
  19. Liu, J.; Wang, X.; Wang, C.; Gao, Y.; Liu, M. Temporal decoupling graph convolutional network for skeleton-based gesture recognition. IEEE Trans. Multimed. 2023, 26, 811–823. [Google Scholar] [CrossRef]
  20. Bruna, J.; Zaremba, W.; Szlam, A.; LeCun, Y. Spectral networks and deep locally connected networks on graphs. arXiv 2014, arXiv:1312.6203. [Google Scholar]
  21. Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4–24. [Google Scholar] [CrossRef] [PubMed]
  22. Yang, S.; Li, Q.; He, D.; Wang, J.; Li, D. Global Correlation Enhanced Hand Action Recognition Based on NST-GCN. Electronics 2022, 11, 2518. [Google Scholar] [CrossRef]
  23. Slama, R.; Rabah, W.; Wannous, H. Str-gcn: Dual spatial graph convolutional network and transformer graph encoder for 3D hand gesture recognition. In Proceedings of the 2023 IEEE 17th International Conference on Automatic Face and Gesture Recognition (FG), Waikoloa Beach, HI, USA, 5–8 January 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
  24. Miah, A.S.M.; Hasan, A.M.; Shin, J. Dynamic hand gesture recognition using multi-branch attention based graph and general deep learning model. IEEE Access 2023, 11, 4703–4716. [Google Scholar] [CrossRef]
  25. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  26. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  27. De Smedt, Q.; Wannous, H.; Vandeborre, J.P.; Guerry, J.; Le Saux, B.; Filliat, D. Shrec’17 track: 3D hand gesture recognition using a depth and skeletal dataset. In Proceedings of the 3DOR-10th Eurographics Workshop on 3D Object Retrieval, Lyon, France, 23–24 April 2017. [Google Scholar]
  28. De Smedt, Q.; Wannous, H.; Vandeborre, J.P. Skeleton-based dynamic hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  29. Liu, Y.; Yuan, J.; Tu, Z. Motion-driven visual tempo learning for video-based action recognition. IEEE Trans. Image Process. 2022, 31, 4104–4116. [Google Scholar] [CrossRef] [PubMed]
  30. Xie, Z.; Chen, J.; Wu, K.; Guo, D.; Hong, R. Global Temporal Difference Network for Action Recognition. IEEE Trans. Multimed. 2023, 25, 7594–7606. [Google Scholar] [CrossRef]
  31. Zhang, W.; Lin, Z.; Cheng, J.; Ma, C.; Deng, X.; Wang, H. Sta-gcn: Two-stream graph convolutional network with spatial–temporal attention for hand gesture recognition. Vis. Comput. 2020, 36, 2433–2444. [Google Scholar] [CrossRef]
  32. Peng, S.-H.; Tsai, P.-H. An efficient graph convolution network for skeleton-based dynamic hand gesture recognition. IEEE Trans. Cogn. Dev. Syst. 2023, 15, 2179–2189. [Google Scholar] [CrossRef]
Figure 1. Sequence diagram of the hand skeleton [17].
Figure 2. The architecture consists of the STDA-GCN block structure, which includes a dynamic attention graph convolution layer, a salient location channel attention layer, and a dynamic attention temporal convolution layer.
Figure 3. Dynamic attention convolutional layer.
Figure 4. Obtaining salient location information.
Figure 5. Adaptive graph channel module flow diagram.
Figure 6. Multi-branch dynamic temporal convolution module.
Figure 7. The basic block of our STDA-GCN. The graph convolution of the basic block combines the output of the three DA-GC blocks, after which SLCAM is added, and finally, dynamic temporal convolution is used for gesture recognition.
Figure 8. Confusion matrix of the proposed method on SHREC’17 Track dataset with 14 gestures.
Figure 9. Confusion matrix of the proposed method on SHREC’17 Track dataset with 28 gestures.
Figure 10. Confusion matrix of the proposed method on DHG14-28 dataset with 14 gestures.
Figure 11. Confusion matrix of the proposed method on DHG14-28 dataset with 28 gestures.
Table 1. Supplement experimental results.
Setting (DHG-14/28) | 14 G (%) | 28 G (%)
J                   | 90       | 88.57
JM                  | 89.29    | 88.57
J + JM + B          | 93.9     | 91.4
Table 2. Accuracy comparisons of the different data streams and combinations on SHREC’17 TRACK dataset (J).
Setting (SHREC’17 Track)  | 14 G (%) | 28 G (%)
baseline                  | 96.31    | 93.69
STDA-GCN w/o SLCAM        | 96.43    | 94.07
STDA-GCN w/ SLCAM + DAC   | 96.67    | 94.40
Table 3. Accuracy comparisons of the different data streams and combinations on SHREC’17 TRACK dataset (JM).
Setting (SHREC’17 Track)  | 14 G (%) | 28 G (%)
baseline                  | 95.36    | 91.19
STDA-GCN w/o SLCAM        | 95.71    | 92.14
STDA-GCN w/ SLCAM + DAC   | 96.19    | 92.38
Table 4. Accuracy comparisons of the different data streams and combinations on DHG14/28 dataset (J).
Setting (DHG-14/28)       | 14 G (%) | 28 G (%)
baseline                  | 89.29    | 88.57
STDA-GCN w/o SLCAM        | 90       | 89.29
STDA-GCN w/ SLCAM + DAC   | 90.71    | 91.43
Table 5. Accuracy comparisons of the different data streams and combinations on DHG14/28 dataset (JM).
Setting (DHG-14/28)       | 14 G (%) | 28 G (%)
baseline                  | 89.29    | 88.57
STDA-GCN w/o SLCAM        | 90.71    | 89.29
STDA-GCN w/ SLCAM + DAC   | 91.43    | 90.71
Table 6. Accuracy comparison with state-of-the-art methods on SHREC’17 TRACK dataset.
Method (SHREC’17 Track)   | 14 G (%) | 28 G (%)
ST-GCN [15]               | 92.7     | 87.7
HG-GCN [16]               | 92.8     | 88.3
STr-GCN [23]              | 93.39    | 89.2
STA-GCN [31]              | 95.4     | 91.8
ResGCNeXt [32]            | 95.36    | 93.1
TD-GCN [19]               | 97.02    | 95.36
ours                      | 97.14    | 95.84
Table 7. Accuracy comparison with state-of-the-art methods on DHG-14/28 dataset.
Method (DHG-14/28)        | 14 G (%) | 28 G (%)
HG-GCN [16]               | 89.2     | 85.3
ST-GCN [15]               | 91.2     | 87.1
STA-GCN [31]              | 91.5     | 87.7
TD-GCN [19]               | 93.9     | 91.4
ours                      | 94.2     | 92.1
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
