1. Introduction
Gesture is a form of body language that enables natural communication through the positioning of the palm and fingers, and gestures are commonly categorized as static or dynamic [1,2]. A static gesture can be represented by a single image, whereas a dynamic gesture comprises a sequence of actions, such as finger slides and whole-palm movements. With the growing importance of dynamic gestures in applications such as sign language communication, robot control, and intelligent driving [3], their recognition has become a research hotspot in computer vision and human–computer interaction.
Gesture recognition methods fall mainly into two categories: image-based methods, which use RGB or RGB-D image sequences, and skeleton-based methods, which use 2D or 3D coordinate sequences of hand joints. The former relies on visual information to extract gesture features, while the latter recognizes gestures by analyzing geometric changes in joint positions. Despite the remarkable achievements of research on RGB video data [4], recognition accuracy suffers from feature extraction errors caused by background clutter, illumination changes, and occlusion. These challenges limit the use of RGB video for gesture recognition. Compared with the RGB modality, a hand skeleton encoding represents the key features of hand posture and joint motion in a concise and abstract way, and it is robust to viewpoint changes, action speed, external appearance, and body scale. With the advent of depth cameras (such as Microsoft Kinect [5] and Intel RealSense [6]) and continuous improvements in pose estimation algorithms (such as OpenPose [7]), researchers can now obtain skeletal data quickly and accurately, which has greatly promoted gesture recognition research.
Human skeletal data, whether for the body or the hand, naturally form a topological structure. For example, the hand skeleton sequence in Figure 1 illustrates a graph in which the joints are the nodes and the bones are the edges. Traditional convolutional neural networks (CNNs) [8,9,10,11] process the spatial information of each pose by reorganizing joint coordinates into a two-dimensional grid, while other methods [12,13,14] use recurrent neural networks (RNNs) to represent skeletal data as a sequence of joint-coordinate vectors, which captures the temporal dependence of actions. However, unlike grid-structured data, skeleton data are unordered and of variable dimensionality: the nodes have no fixed order and the number of neighbors per node varies, which complicates processing. Methods that assume fixed dimensions and a fixed order are therefore difficult to apply to skeletal data. Graph convolutional networks (GCNs) [15,16,17,18,19], in contrast, can operate directly on such non-Euclidean data and have become a powerful tool for skeleton-based problems by aggregating features over the adjacency structure and node features.
However, compared with general action recognition, gesture recognition depends more heavily on precise joint trajectories and their temporal ordering. For example, recognizing an action such as “waving” may only require analyzing the trajectory of the arm, whereas recognizing an action such as “typing” requires capturing the subtle movements of each finger and their chronological order. This strong dependence on temporal information makes feature extraction and long-term temporal modeling more challenging for gesture recognition. To address these problems, this paper proposes a spatio-temporal dynamic attention graph convolutional network for skeleton data (STDA-GCN). The method enhances interaction between channels through dynamic attention graph convolution and captures action changes at different time scales through multi-scale dynamic temporal convolution, making it well suited to the subtle motions and temporal dependencies involved in gesture recognition. In addition, STDA-GCN improves spatial feature extraction while reducing model complexity by aggregating multiple dynamic convolution kernels. To obtain more comprehensive context information, a salient location channel attention mechanism (SLCAM) is also introduced to further strengthen inter-channel dependencies.
In this paper, a novel spatio-temporal graph convolutional network for dynamic gesture recognition is proposed. The main contributions of this paper are as follows:
A dynamic attention graph convolutional layer is designed to enhance the expressive power of spatial graph convolution by aggregating multiple convolution kernels through an attention mechanism. By strengthening cross-channel information interaction, it improves the model’s flexibility in recognizing hand actions.
A salient location channel attention mechanism is proposed. By computing the salient locations of each channel from the query matrix, it better captures key contextual information. A graph convolution-based method then represents the relationships between feature vertices with an adjacency matrix, and a non-local operation recalibrates the channel features.
A multi-scale temporal modeling module based on dynamic convolution is designed to extract temporal features from gestures of different lengths through dynamic temporal convolutions with different dilation rates, improving the recognition of both short and long hand actions.
Experimental results on the SHREC’17 Track and DHG-14/28 datasets show that STDA-GCN achieves highly competitive performance. Through ablation and comparison experiments, STDA-GCN achieves accuracies of 97.14% (14 gestures) and 95.84% (28 gestures) on the SHREC’17 Track dataset, and 94.2% (14 gestures) and 92.1% (28 gestures) on the DHG-14/28 dataset.
3. Proposed Methodologies
3.1. Network Architecture
In skeleton-based hand gesture recognition, graph convolutional networks (GCNs) typically consist of spatial graph convolutions and temporal convolutions. The input data for the STDA-GCN model comprise the coordinates of the hand joints, represented by a tensor of dimensions (N, C, T, V, M), where N denotes the batch size, C the three-dimensional joint coordinates, T the number of frames in the gesture video, V the 22 hand joints, and M the number of hands involved in the video. The basic structure of the STDA-GCN model follows [15], and the overall framework is shown in Figure 2.
The proposed model architecture comprises 10 layers of spatial dynamic attention graph convolutions, salient location channel attention layers, and dynamic attention temporal convolution layers. In the salient location channel attention layer, SLCAM first selects the top k most important skeletal locations from the input 3D joint features by computing a saliency measure over the query matrix. These salient locations are then treated as vertices of a graph, and the relationships between feature points are represented by an adaptively learned adjacency matrix, further strengthening the interaction between features of different channels. Following the parameter settings in [19], the numbers of input channels are set to 64, 64, 64, 128, 128, 256, 256; the channel width is increased from 64 to 128 to 256 so that more complex features can be captured layer by layer. By setting the convolution stride to 2, the model can cover a wider range of skeletal nodes and enlarge the receptive field. The fifth and eighth layers reduce the length of the frame sequence by factors of 2 and 4, respectively, through strided convolution, lowering the computational complexity while still allowing the model to learn and extract features effectively.
Finally, after processing through these layers, the skeletal data undergo global spatial pooling along the channels and global temporal pooling along the frames in the tenth layer. Global pooling aggregates the information of the entire skeletal sequence, retaining crucial information while reducing the computational load, thereby enhancing the training and inference efficiency of the model. A fully connected layer and softmax are then used to predict the classification results of the joint stream, joint motion stream, and bone stream. The results of these three streams are fused to predict the final hand gesture recognition outcome.
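For concreteness, the following is a minimal, hypothetical PyTorch-style sketch of the input layout and classification head implied by this description; the ten STDA-GCN blocks are replaced by a placeholder 1 × 1 convolution, and all names (e.g., STDAGCNHeadSketch) are ours rather than the paper’s. Each of the joint, joint motion, and bone streams would use one such network, with the three class-score vectors fused afterwards.

```python
import torch
import torch.nn as nn

class STDAGCNHeadSketch(nn.Module):
    """Sketch of the input handling and classification head only; the backbone
    here is a placeholder, not the paper's ten-block implementation."""
    def __init__(self, num_classes, in_channels=3, num_joints=22):
        super().__init__()
        self.backbone = nn.Conv2d(in_channels, 256, kernel_size=1)  # stands in for the 10 blocks
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):
        # x: (N, C, T, V, M) = (batch, 3D coords, frames, 22 joints, hands)
        n, c, t, v, m = x.shape
        x = x.permute(0, 4, 1, 2, 3).reshape(n * m, c, t, v)  # fold hands into the batch
        x = self.backbone(x)               # -> (N*M, 256, T', V) after the stacked blocks
        x = x.mean(dim=(2, 3))             # global temporal and spatial pooling
        x = x.view(n, m, -1).mean(dim=1)   # average over hands
        return self.fc(x)                  # class scores; softmax applied at prediction time

# Example: one stream on a batch of 8 single-hand sequences of 64 frames
scores = STDAGCNHeadSketch(num_classes=14)(torch.randn(8, 3, 64, 22, 1))
```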
3.2. Spatial Graph Convolution Layer
In previously studied GCNs, the graph topology for gesture recognition tasks is typically fixed and relies solely on physical connections. This approach does not adequately capture the latent relationships between the skeleton joints. To improve the performance of graph convolutional networks, a common strategy is to allow the adjacency matrix to be updated during training, so that the relationships between nodes are learned adaptively.
In this paper, we propose Dynamic Attention Graph Convolution (DAGC), which neither increases the width nor the depth of the network. Instead, it enhances the expressive power of spatial graph convolution by aggregating multiple convolution kernels through an attention mechanism. This approach better learns the intra-frame spatial relationships within the skeleton data and reduces the model complexity.
The core idea of dynamic convolution is to use a set of parallel convolution kernels in each layer instead of a single kernel. These kernels are combined through an input-dependent attention mechanism, with attention weights computed in a manner similar to the squeeze-and-excitation (SE) mechanism [25]. First, global average pooling compresses the spatial information of each channel of the input skeletal features into a single value. Two fully connected layers (with a ReLU activation in between) followed by a softmax layer then produce normalized attention weights for the K convolution kernels. Following [26], which found k = 4 to perform best in classification tasks, this paper also sets k to 4. Finally, the kernels are dynamically weighted and aggregated according to these weights, as illustrated in Figure 3. This approach is computationally efficient and offers stronger representation power, because the aggregation of convolution kernels is performed in a nonlinear manner.
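As a concrete illustration of this attention computation, the sketch below (our code, not the paper’s) implements the SE-style weighting of K = 4 parallel kernels and their weighted aggregation; the `reduction` factor and all helper names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelAttentionSketch(nn.Module):
    """SE-style attention over K parallel kernels:
    global average pool -> FC -> ReLU -> FC -> softmax."""
    def __init__(self, in_channels, k=4, reduction=4):
        super().__init__()
        hidden = max(in_channels // reduction, 4)
        self.fc1 = nn.Linear(in_channels, hidden)
        self.fc2 = nn.Linear(hidden, k)

    def forward(self, x):
        # x: (N, C, T, V) skeletal features
        s = x.mean(dim=(2, 3))                 # squeeze: global average pooling
        s = F.relu(self.fc1(s))                # excitation: first FC + ReLU
        return F.softmax(self.fc2(s), dim=1)   # (N, K) normalized kernel weights

def aggregate_kernels(weights, kernels):
    """Weighted sum of K parallel kernels; kernels: (K, C_out, C_in, kh, kw)."""
    # weights: (N, K) -> per-sample aggregated kernel (N, C_out, C_in, kh, kw)
    return torch.einsum('nk,koihw->noihw', weights, kernels)
```

The aggregated, per-sample kernel would then be applied to the input features (e.g., via a grouped convolution over the batch), so that the effective filter changes with each input while only one convolution is executed.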
Specifically, dynamic convolution employs a set of K parallel convolution kernels that are dynamically weighted and aggregated by an input-dependent attention mechanism to form a new convolution kernel and bias (Equations (2) and (3)). Equation (4) constrains the attention weights to sum to 1, which facilitates learning of the attention model. Because the attention weights depend on the input, the parallel kernels are combined differently for each sample. The approach is computationally efficient, since the individual kernels are small, and it provides stronger representational power because the aggregation is nonlinear. In essence, K linear functions are aggregated, and the output y of the convolutional layer is expressed as in Equation (5), where g is the activation function.
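For reference, the standard dynamic convolution formulation of [26], which is consistent with the description above, can be written as follows (our notation; the published Equations (2)–(5) may use different symbols):

$$\tilde{W}(x) = \sum_{k=1}^{K} \pi_k(x)\, W_k, \qquad \tilde{b}(x) = \sum_{k=1}^{K} \pi_k(x)\, b_k,$$
$$0 \le \pi_k(x) \le 1, \qquad \sum_{k=1}^{K} \pi_k(x) = 1,$$
$$y = g\!\left(\tilde{W}(x)^{\top} x + \tilde{b}(x)\right),$$

where $\pi_k(x)$ is the attention weight of the $k$-th kernel $W_k$ with bias $b_k$, and $g$ is the activation function.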
The DAGC proposed in this study aggregates three parallel time-dependent and channel-dependent adjacency matrices, as described in Equation (6), and updates each vertex’s features using the weighted features of its aggregated neighboring vertices. Each DAGC module comprises three dynamic convolutional layers: the first transforms the input into a number of spatial-relation channels used to compute the various spatial relations; the second converts the input into the number of output channels required for feature computation; and the third maps the spatial-relation channels to the number of output channels, aligning the spatial relations with the features. These dynamic convolutional layers provide a richer representation of each input, effectively capturing the complexity and dynamic characteristics of skeletal data and adapting flexibly to different action patterns. A simplified sketch of one such branch is given below.
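The following is one plausible, simplified reading of a single DAGC branch under the three-layer description above; it is our sketch, not the paper’s implementation: plain Conv2d layers stand in for the dynamic convolutions, and the pairwise-difference relation used to refine the adjacency is an assumption.

```python
import torch
import torch.nn as nn

class DAGCBranchSketch(nn.Module):
    """Hypothetical sketch of one DAGC branch (simplified)."""
    def __init__(self, in_ch, out_ch, rel_ch=8):
        super().__init__()
        self.conv_rel = nn.Conv2d(in_ch, rel_ch, 1)     # input -> spatial-relation channels
        self.conv_feat = nn.Conv2d(in_ch, out_ch, 1)    # input -> output feature channels
        self.conv_align = nn.Conv2d(rel_ch, out_ch, 1)  # relation channels -> output channels

    def forward(self, x, adj):
        # x: (N, C, T, V) skeletal features; adj: (V, V) base adjacency matrix
        rel = self.conv_rel(x).mean(dim=2)                # (N, R, V): pool relations over time
        feat = self.conv_feat(x)                          # (N, C_out, T, V)
        diff = rel.unsqueeze(-1) - rel.unsqueeze(-2)      # (N, R, V, V) pairwise joint relations
        refine = self.conv_align(torch.tanh(diff))        # (N, C_out, V, V) channel-wise refinement
        a = adj.view(1, 1, *adj.shape) + refine           # refined adjacency per output channel
        return torch.einsum('nctv,ncvw->nctw', feat, a)   # aggregate neighbor features
```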
3.3. Channel Attention Mechanism
In recent years, the attention mechanism has attracted considerable interest across many types of neural networks, and numerous studies have applied it with notable success. For instance, references [22,23] applied attention mechanisms to dynamic gesture recognition and indeed improved performance, but at the cost of a significant increase in the number of parameters.
In this study, we introduce the salient location channel attention mechanism (SLCAM) between the spatial graph convolution and temporal convolution blocks. First, the SLCAM captures more comprehensive contextual information by selecting salient location information within the channels. Subsequently, the graph convolution method is applied to the channel attention to enhance inter-channel dependencies. This is achieved by treating the feature information from the salient locations as vertices in the graph and performing non-local operations on these features through an adaptively learned adjacency matrix.
The skeletal features output by the DAGC module are first used as the input feature matrix X. This matrix is passed through two 2D convolutional layers to generate the query matrix Q and the value matrix V (Equation (7)). The saliency measure is computed as the sum of squares of the query matrix, which is aggregated to obtain the saliency score Q_pow; the top k salient locations are then selected from it to generate the key matrix K.
The attention matrix A is then computed from the key matrix K derived from the salient locations. The product KᵀK measures the similarity between each query location and all key locations, capturing the global dependencies between features, and is normalized with the softmax function (Equation (8)). The value matrix V is multiplied by the attention matrix A to generate the output skeletal feature Y (Equation (9)). This output is transformed by a 1 × 1 convolution and finally summed with the input X to further process the features, as illustrated in Figure 4.
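In our notation, the SLCAM computation described above (cf. Equations (7)–(9)) can be summarized as

$$Q = W_q * X, \qquad V = W_v * X, \qquad Q_{pow} = \sum_{c} Q_c^{2},$$
$$A = \mathrm{softmax}\!\left(K^{\top} K\right), \qquad Y = V A, \qquad Z = \mathrm{Conv}_{1\times 1}(Y) + X,$$

where $W_q$ and $W_v$ denote the two 2D convolutions, $K$ collects the top-$k$ locations of $Q$ ranked by $Q_{pow}$, and $Z$ is the final output (our symbol); the exact indexing and the placement of the residual are assumptions made for illustration.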
Subsequently, graph convolution theory is introduced into the channel attention mechanism, enabling direct modeling and optimization of the relationships between skeletal points and capturing non-local dependencies among them, as illustrated in Figure 5. This method uses GCNs to perform Non-Local (NL) operations directly on the features, capturing and representing relationships between features more efficiently. In graph convolutional networks, the convolution operation on a vertex v_i can be expressed as shown in Equation (10), where B_i denotes the set of vertices neighboring the output vertex v_i and c(v_j) is the normalization function. The weight of each feature point is then obtained by the Adaptive Graph Channel Module (AGCM) and mapped back onto the original feature map (Equation (11)).
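A standard form of this vertex-wise graph convolution, consistent with the description of Equation (10), is (our notation)

$$f_{out}(v_i) = \sum_{v_j \in B_i} \frac{1}{c(v_j)}\, f_{in}(v_j)\, w(v_j),$$

where $f_{in}$ and $f_{out}$ are the input and output vertex features and $w$ is the learnable weight function.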
The AGCM combines three adjacency matrices: a constant matrix (A_0) that represents the feature vertices themselves; a self-attention graph (A_1), obtained through a one-dimensional convolution and a softmax layer (Equation (12)), that represents the weights of the feature vertices; and a learnable adjacency matrix (A_2), optimized by back-propagation, that generates dependencies between any two feature vertices to capture feature relationships at different levels (Equation (13)).
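Under this description, the adjacency used by the AGCM can be sketched as the combination (our formulation; cf. Equations (12) and (13))

$$A = A_0 + A_1 + A_2, \qquad A_1 = \mathrm{softmax}\big(\mathrm{Conv1D}(X)\big),$$

with $A_0$ a constant matrix over the feature vertices and $A_2$ a freely learnable matrix updated by back-propagation; whether the three matrices are summed or fused in another way is our assumption.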
The AGCM dynamically adjusts the weights of each feature point, allowing the model to adaptively emphasize or suppress different features. This is particularly important for handling complex skeletal data, as different actions may exhibit varying saliency at different skeletal points. In skeletal data, different points may contain similar or redundant information. AGCM addresses this by introducing a bottleneck structure, which reduces feature redundancy and enhances the independence and effectiveness of each skeletal point’s feature representation. This, in turn, improves the model’s discriminative ability.
3.4. Temporal Convolution Layer
Dynamic attention temporal convolution (DATC), similar to the spatial graph convolution, is employed in the temporal convolution of this study to enhance the learning of relationships between frames of skeletal data. To capture gesture information of varying lengths, we utilize a multi-scale temporal aggregation structure, as shown in Figure 6. This structure comprises three branches, described below, that extract inter-frame features of the skeletal data through dilated temporal convolution; a simplified sketch follows the branch descriptions.
First Branch: This branch consists of two convolutional layers. The output of the first dynamic convolutional layer, after normalization and nonlinear activation, is fed into a second dilated K × 1 convolutional layer, where K denotes the kernel size. The introduction of dynamic convolution facilitates the fusion of features across different time steps; in particular, the dilated convolution captures long-term dependencies more effectively through dynamic weight adjustment.
Second Branch: This branch includes a dynamic temporal convolution layer followed by a normalization layer. The dynamic temporal convolution better fuses features from different time steps, accommodating the varying significance of different time steps in the time series. By dynamically adjusting the weights of the convolutional kernel, the model can more effectively fuse information from the gesture skeletal data, capturing subtle differences between time steps. The normalization layer ensures feature stability and consistency.
Third Branch: This branch combines convolution, normalization, activation, and pooling operations. Maximal pooling is used to capture the primary temporal features, while convolution and normalization maintain a normalized and nonlinear representation of the features, capturing richer information about the actions.
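The sketch below illustrates one way the three branches could be assembled; it is our simplification, not the paper’s code: plain Conv2d layers stand in for the dynamic temporal convolutions, the dilation rates shown are illustrative, and summing the branch outputs (rather than concatenating reduced-channel branches) is an assumption.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalSketch(nn.Module):
    """Simplified three-branch temporal module over (N, C, T, V) features."""
    def __init__(self, channels, kernel_size=5, dilations=(1, 2)):
        super().__init__()
        pad = lambda d: (kernel_size - 1) // 2 * d
        # Branch 1: 1x1 conv, BN, ReLU, then a dilated K x 1 temporal conv
        self.branch1 = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, (kernel_size, 1),
                          padding=(pad(d), 0), dilation=(d, 1)),
                nn.BatchNorm2d(channels))
            for d in dilations])
        # Branch 2: temporal conv followed by normalization
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels))
        # Branch 3: conv, norm, activation, then max pooling over time
        self.branch3 = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.MaxPool2d((3, 1), stride=1, padding=(1, 0)))

    def forward(self, x):
        # x: (N, C, T, V); every branch preserves the temporal length here
        outs = [b(x) for b in self.branch1] + [self.branch2(x), self.branch3(x)]
        return torch.stack(outs, dim=0).sum(dim=0)  # aggregate multi-scale features
```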
4. Experimental Evaluation
We conducted experiments on two widely used public dynamic gesture datasets, SHREC’17 Track and DHG-14/28, to evaluate the performance of the proposed STDA-GCN model (as illustrated in Figure 7). The model comprises three parallel dynamic attention convolutions, a salient location channel attention mechanism, and a multi-scale dynamic attention temporal convolution. Additionally, we performed ablation experiments to verify the effectiveness of each module within the model.
4.1. Datasets
SHREC’17 Track dataset [27]: This dataset features gestures performed by 28 right-handed participants, captured using an Intel RealSense short-range depth camera. Each gesture was performed 1 to 10 times, resulting in 2800 sequences. Each frame includes a 640 × 480 depth image and the coordinates of 22 hand joints in both 2D and 3D space. The dataset comprises 14 distinct gestures, categorized as coarse (whole-hand actions) or fine (hand shape changes): Grab (G), Tap (T), Expand (E), Pinch (P), Rotation Clockwise (R-CW), Rotation Counter-Clockwise (R-CCW), Swipe Right (S-R), Swipe Left (S-L), Swipe Up (S-U), Swipe Down (S-D), Swipe X (S-X), Swipe + (S+), Swipe V (S-V), and Shake (Sh).
The DHG-14/28 dataset [28]: This dataset was collected with the same methodology as the SHREC’17 Track dataset. The Dynamic Hand Gesture 14/28 dataset contains 14 gesture classes, each executed in two ways: using one finger and using the whole hand. Each gesture was performed 5 times by each of 20 participants, generating 2800 sequences. All participants were right-handed.
4.2. Evaluation Metrics
Top-1 accuracy (Acc (Top-1)) is a common metric for measuring model performance in skeleton-based recognition tasks [29,30]. It represents the proportion of samples for which the class predicted with the highest probability agrees with the actual class. In this study, Acc (Top-1) was used to evaluate the performance of the STDA-GCN model on the gesture recognition task (Equation (14)).
In this context, δ(·, ·) is an indicator function that takes the value 0 if the predicted class does not match the actual class and 1 if they match, y_i represents the true class of the i-th gesture, ŷ_i represents the inferred class with the highest probability score for the i-th gesture, and N represents the total number of gesture samples.
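Written out, the metric referenced as Equation (14) takes the standard form (our notation):

$$\mathrm{Acc}(\text{Top-1}) = \frac{1}{N} \sum_{i=1}^{N} \delta\!\left(y_i, \hat{y}_i\right),$$

where $\delta(y_i, \hat{y}_i) = 1$ if $y_i = \hat{y}_i$ and 0 otherwise.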
4.3. Training Details
All experiments were conducted using the PyTorch framework. After extensive comparisons, we employed the Adam optimizer for the SHREC’17 Track dataset with a batch size of 32 and a maximum of 100 training epochs, reducing the learning rate at the 30th and 70th epochs by a factor of 10. For the DHG-14/28 dataset, we used the SGD optimizer with a batch size of 32 and a maximum of 150 training epochs.
4.4. Ablation Study and Discussion
The benchmark model [19] provided detailed metrics for each gesture type when trained on the DHG-14/28 dataset but did not report overall metrics for the 14-gesture and 28-gesture settings. We therefore supplemented the experimental results for the 14-gesture and 28-gesture settings in the ablation study, as shown in Table 1, to facilitate comparison with the STDA-GCN model in both the ablation and comparison experiments of this study.
We conducted ablation experiments on the SHREC’17 Track and DHG-14/28 datasets by removing each added component under three data modes. The three modes are as follows: joint flow (J), the raw joint coordinates; joint motion flow (JM), the differences between joint coordinates of temporally adjacent frames; and bone flow (B), the differences between coordinates of spatially connected joints. The experiments evaluated the model under the J, JM, and combined 3s (J + JM + B) modes.
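For illustration, the joint motion and bone streams can be derived from the joint stream as in the sketch below; the parent indices shown for a 22-joint hand skeleton are hypothetical and may differ from the bone pairs actually used in our experiments.

```python
import numpy as np

# Hypothetical parent indices for a 22-joint hand skeleton (wrist- and palm-rooted).
PARENTS = np.array([0, 0, 1, 2, 3, 4, 1, 6, 7, 8, 1, 10, 11, 12,
                    1, 14, 15, 16, 1, 18, 19, 20])

def joint_motion_stream(joints):
    """Joint motion flow: differences between temporally adjacent frames.
    joints: (C, T, V) array of joint coordinates."""
    jm = np.zeros_like(joints)
    jm[:, :-1] = joints[:, 1:] - joints[:, :-1]
    return jm

def bone_stream(joints, parents=PARENTS):
    """Bone flow: differences between spatially connected joints."""
    return joints - joints[:, :, parents]
```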
In the ablation study, we evaluated the effectiveness of each module in the STDA-GCN model on the two public dynamic gesture datasets, SHREC’17 Track and DHG-14/28. The full model includes the dynamic attention convolution module and the salient location channel attention mechanism. As shown in Table 2, removing the dynamic attention convolution module resulted in a 0.24% decrease in accuracy for the 14-gesture data and a 0.33% decrease for the 28-gesture data. This demonstrates that the dynamic attention convolution effectively aggregates multiple parallel convolution kernels based on input attention, improving the learning of intra- and inter-frame relationships within the skeletal data and enhancing feature representation.
Further, removing the salient location channel attention module led to an additional 0.12% decrease in accuracy for the 14-gesture data and a 0.38% decrease for the 28-gesture data in the SHREC’17 Track dataset. This indicates the SLCAM module’s effectiveness in capturing global gesture features and weighting different channels, enabling the model to focus on important feature channels.
Overall, these results suggest that the modules in the STDA-GCN model complement each other through different mechanisms, collectively enhancing the model’s performance.
As shown in Table 3, removing the dynamic attention convolution module led to a 0.48% decrease in accuracy for the 14-gesture data and a 0.24% decrease for the 28-gesture data. Further removal of the salient location channel attention module resulted in an additional 0.35% decrease in accuracy for the 14-gesture data and a 0.95% decrease for the 28-gesture data on the SHREC’17 Track dataset.
For the DHG-14/28 dataset, as shown in Table 4, removing the dynamic attention convolution module resulted in a 0.71% decrease in accuracy for the 14-gesture data and a 2.14% decrease for the 28-gesture data. Further removal of the salient location channel attention module led to an additional 0.71% decrease in accuracy for the 14-gesture data and a 0.72% decrease for the 28-gesture data.
As shown in Table 5, removing the dynamic attention convolution module resulted in a 0.72% decrease in accuracy for the 14-gesture class and a 1.42% decrease for the 28-gesture class. Further removal of the salient location channel attention module led to an additional 1.42% decrease in accuracy for the 14-gesture class and a 0.72% decrease for the 28-gesture class.
4.5. Comparison with Other GCN and Discussion
Subsequently, we compared our proposed STDA-GCN model with previously developed GCN models for 3D skeleton-based gesture recognition. As shown in Table 6, on the SHREC’17 Track dataset, our model achieved an accuracy of 97.14% for the 14 gestures and 95.84% for the 28 gestures, an improvement over current skeleton-based dynamic gesture models that demonstrates the effectiveness of our approach.
The ST-GCN model in reference [15] did not conduct experiments specifically for gesture recognition, while the later study in reference [23] applied ST-GCN to gesture recognition. This research therefore uses the ST-GCN results on gesture recognition reported in [23] for comparison with the model proposed in this paper. ST-GCN was the first to apply graph convolutional networks to human action recognition, laying the foundation for spatio-temporal feature extraction on graph structures; it achieved an accuracy of 92.7% for 14 gestures and 87.7% for 28 gestures. The HG-GCN model [17] was the first to establish a skeletal graph specifically for gesture recognition, adding three types of edges to the 22 hand joints to describe joint linkage, and achieved an accuracy of 92.8% for 14 gestures and 88.3% for 28 gestures. The TD-GCN model [19] enhanced the temporal frame representation by using different adjacency matrices for different time frames, achieving 97.02% for 14 gestures and 95.36% for 28 gestures. The STr-GCN [23], STA-GCN [31], and ResGCNeXt [32] models are recent approaches that incorporate attention mechanisms into spatio-temporal graph convolution networks. STr-GCN combined spatial graph convolution with Transformers, achieving 93.39% for 14 gestures and 89.2% for 28 gestures. STA-GCN introduced dual-stream spatio-temporal attention, achieving 95.4% for 14 gestures and 91.8% for 28 gestures. ResGCNeXt added attention mechanisms after the spatio-temporal graph convolution to enhance spatial and channel feature learning, achieving 95.36% for 14 gestures and 93.1% for 28 gestures.
Figure 8 and Figure 9 show the confusion matrices for 14 and 28 gestures on the SHREC’17 Track dataset. For gestures such as Swipe X (S-X) and Swipe + (S+), which rely mainly on spatial transformations, the model achieved an accuracy of 1.0. This indicates that the dynamic attention graph convolution in STDA-GCN enhances cross-channel information interaction, allowing the model to capture subtle spatial variations in hand actions more effectively. For gestures such as Swipe Right (S-R) and Swipe Left (S-L), which depend more on temporal changes, the model also performed very well, demonstrating the effectiveness of the proposed multi-scale temporal dynamic attention convolution. This mechanism better captures the transition of hand actions from short-term to long-term variations, improving the model’s ability to detect inter-frame changes and thereby significantly enhancing the recognition accuracy of these dynamic gestures.
Table 7 shows that our model also demonstrates excellent performance on the DHG-14/28 dataset, achieving accuracies of 94.2% for 14 gestures and 92.1% for 28 gestures.
Figure 10 and Figure 11 show the confusion matrices for 14 and 28 gestures on the DHG-14/28 dataset. Gestures involving spatial transformations, such as Grab (G) and Expand (E), and inter-frame actions such as Rotation Clockwise (R-CW) and Rotation Counter-Clockwise (R-CCW), performed well. However, because Grab (G) and Pinch (P) involve highly similar spatial motions, these actions are often misclassified as one another. Further optimization is therefore needed to improve the model’s ability to distinguish between complex or similar gesture classes.
5. Conclusions
We propose a spatio-temporal dynamic attention graph convolutional network (STDA-GCN) for dynamic gesture recognition. Unlike previous approaches, STDA-GCN employs dynamic attention convolution in the spatial graph convolution module to enhance cross-channel information interaction by aggregating multiple dynamic convolution kernels, thereby improving the learning of hand action changes. By incorporating a salient location channel attention mechanism between the spatio-temporal graph convolutions, the model effectively suppresses redundant features and increases efficiency. Additionally, we utilize multi-scale temporal dynamic attention convolution to better capture short- and long-term hand action information and inter-frame motion. Experimental results on the SHREC’17 Track and DHG-14/28 datasets show that STDA-GCN achieves highly competitive performance: through ablation and comparison experiments, STDA-GCN achieves accuracies of 97.14% (14 gestures) and 95.84% (28 gestures) on the SHREC’17 Track dataset, and 94.2% (14 gestures) and 92.1% (28 gestures) on the DHG-14/28 dataset. However, for the discrimination of certain similar gestures, such as Grab (G) and Pinch (P), there is still room to further optimize the model’s spatial feature extraction. In the future, we plan to continue improving the model’s ability to distinguish similar hand actions and to combine semantic analysis to improve performance on human-interaction hand action datasets.