1. Introduction
Dynamic gesture recognition, as a popular task in computer vision, is widely used in various scenarios such as autonomous driving, smart home, and smart healthcare. Recent related works [1,2,3,4,5,6,7,8,9] have achieved good performance on common gesture datasets [10,11].
According to the input modality, existing work on dynamic gesture recognition can be classified into three categories: image [5,11,12,13], skeleton [4,7,8,9,10,14,15,16,17], and point cloud [1,2,3,6]. RGB or RGB-D images are easy to obtain, and their large data size helps models converge faster. However, images are susceptible to irrelevant factors such as occlusion and background. Recently, more and more work [1,2,3] has begun to focus on 3D point clouds. A point cloud contains the latent spatial information of objects and preserves shape features. The first step of those works [1,2,3,6] is to convert depth images to point clouds without using skeletons. The data size of a depth image (for example, 128 points) is larger than that of a skeleton (22 joints) [10]. The larger data size helps the models converge. However, the neighborhood of a search point in the images is composed of some pixels that correspond to joints and other pixels that are weakly correlated with the hand gesture. In contrast, the skeleton information derived from the coordinates of the hand joints is more robust to irrelevant factors such as lighting or occlusion [4].
Skeletons are more strongly correlated with hand movements. Therefore, our work explores the point cloud from both skeletons and depth images for better recognition performance. We convert the skeleton information of each gesture into pairs of point clouds: every gesture is uniformly divided into groups along the time dimension, the source point cloud comes from the first frame of each group, and its corresponding target point cloud depends on the time interval. Dynamic features are then captured from these pairs of point clouds.
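To make the conversion concrete, the sketch below splits a skeleton sequence into source/target point cloud pairs; the array layout, the function name, and the rule that the target is taken a fixed number of groups after the source are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def skeleton_to_pairs(skeleton, num_groups=32, time_interval=1):
    """Split a skeleton sequence into (source, target) point cloud pairs.

    skeleton: (T, J, 3) array of J hand-joint coordinates per frame.
    The sequence is divided uniformly into num_groups groups along time; the
    source cloud is the first frame of each group, and the target cloud is the
    first frame of the group time_interval steps later (clamped to the last
    group); this pairing rule is assumed here for illustration.
    """
    T = skeleton.shape[0]
    starts = np.linspace(0, T, num_groups, endpoint=False).astype(int)
    pairs = []
    for g, s in enumerate(starts):
        t = starts[min(g + time_interval, num_groups - 1)]
        pairs.append((skeleton[s], skeleton[t]))  # each cloud is (J, 3)
    return pairs
```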
Due to the irregularity of point clouds, it is challenging to track the point-by-point correspondence between different frames [1]. Previous works [1,2,3,6] represent motion features indirectly by capturing the features of adjacent points. Kinet [1] keeps the adjacent points in the same feature-level ST-surfaces, and the normal vector of the surface represents the dynamic features well. PointLSTM [2] optimizes the long short-term memory (LSTM) by combining information from the current frame and neighbors from the past frame while preserving the spatial structure. However, as shown in Figure 1, most of these works extract the dynamic features of the search point in the feature space of Neighborhood 1. In Neighborhood 1, the adjacent points of the search point are learned from the pixels of depth images. These points are determined by the search radius of the ball query, the number of samples for adjacent points, and the time interval. These works assume that the search point and its neighborhood move consistently. Therefore, the learned motion features of adjacent points (gray) are similar, and detailed information that changes rapidly in time, such as fine-grained features (orange), is ignored.
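For reference, the ball-query grouping that bounds Neighborhood 1 can be sketched as follows; this is a generic NumPy illustration of how the search radius and sample count determine the neighborhood, not the implementation used by the cited works.

```python
import numpy as np

def ball_query(search_points, candidate_points, radius=0.5, num_samples=64):
    """Group up to num_samples candidate points within radius of each search point.

    search_points: (N, 3); candidate_points: (M, 3), e.g. points converted from
    depth-image pixels of a neighboring frame. Returns an (N, num_samples) index
    array; queries with fewer neighbors inside the ball are padded with the
    nearest candidate, mirroring common PointNet++-style grouping.
    """
    dists = np.linalg.norm(search_points[:, None, :] - candidate_points[None, :, :], axis=-1)
    idx = np.argsort(dists, axis=1)[:, :num_samples]           # nearest candidates first
    inside = np.take_along_axis(dists, idx, axis=1) <= radius  # mask points outside the ball
    return np.where(inside, idx, idx[:, :1])                   # pad with the closest point
```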
Some works aim to learn fine-grained dynamic features for higher accuracy while capturing coarse-grained features. MAE [7] is self-supervised on skeleton data to generalize across different hand gestures. Graph convolutional network (GCN)-based approaches [3,4,5,6,16] are widely explored by defining spatio-temporal graphs to incorporate spatial connectivity. SOGTNet [6] introduces multi-head attention to extract global features and an improved DGCNN to capture local features. ST-SGCN [5] pays more attention to directional or sparse interactions between the hand joints when learning subtle interactions, and introduces a cross-spatio-temporal module. FPPR-PCD [3] captures local features using DGCNN and the global position using DenseNet [18]. However, these works only convert depth images to point clouds or extract skeletons from images. Moreover, these models perform worse on SHREC’17 [10], which mixes fine and coarse gestures, because coarse gestures lack distinctive features and precision in directed interactions is important when identifying fine gestures. Therefore, it is necessary to express subtle gesture changes more accurately.
Scene flow estimation provides the mapping of each 3D point in the previous frame to each point in the next frame; in other words, it aims to capture the motion vector of each point [19]. By converting the skeleton data into scene flow, we can capture dynamic information such as the trajectory, speed, direction, and acceleration of joint movements. This not only enables researchers to analyze actions intuitively, but also allows the details of dynamic gestures to be captured. There are many mature scene flow estimators [20,21,22,23,24,25,26,27,28]. Scene flow estimation is widely used in areas such as robotics (path planning, collision avoidance), autonomous driving (tracking vehicles or people), and visual surveillance, which contain multiple tracked objects and complex spatial information. Recently, works have aimed to build self-supervised scene flow estimators that do not use ground-truth flow as labels [23,24,29,30]. Finding point cloud correspondences is widely applied in self-supervised scene flow estimators [23,31]. These works are inspired by Lang et al. [32], who suggest using latent space similarity and point features rather than regressing the true flow. However, to the best of our knowledge, scene flow has not been used for gesture recognition on dynamic point clouds [1], and no work has generated ground-truth flow for the existing gesture datasets. Besides the irregularity of point clouds, which makes training and inference longer [23], the other primary reason is that most scene flow estimators rely on large datasets with ground-truth flow (FlyingThings3D [33] and KITTI [34,35]) to make the regression network converge.
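As a rough illustration of how such estimators avoid ground-truth flow, the sketch below combines a nearest-neighbor data term with a smoothness term over a source/target pair; it is a generic correspondence-style proxy objective, not SCOOP's actual loss.

```python
import numpy as np

def self_supervised_flow_loss(source, target, flow, lambda_smooth=1.0, k=4):
    """Chamfer-style proxy objective used when no ground-truth flow is available.

    source: (N, 3), target: (M, 3), flow: (N, 3). The warped source should land
    near the target cloud (data term), and nearby source points should move
    similarly (smoothness term).
    """
    warped = source + flow
    # data term: distance from each warped point to its nearest target point
    d = np.linalg.norm(warped[:, None, :] - target[None, :, :], axis=-1)
    data_term = d.min(axis=1).mean()
    # smoothness term: each point's flow vs. the flow of its k nearest source neighbors
    ds = np.linalg.norm(source[:, None, :] - source[None, :, :], axis=-1)
    nn = np.argsort(ds, axis=1)[:, 1:k + 1]
    smooth_term = np.linalg.norm(flow[:, None, :] - flow[nn], axis=-1).mean()
    return data_term + lambda_smooth * smooth_term
```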
This paper proposes fusing skeleton-based scene flow for gesture recognition on point clouds (FSS-GR) for accurate recognition, as shown in Figure 1. We aim to use scene flow to measure the fine-grained features of dynamic gestures represented by skeletons; the scene flow and coarse-grained features are then fused with different strategies. This requires an automatic converter for the skeletons. In our work, gesture skeletons are converted into pairs of point clouds, and the point clouds are fed into self-supervised estimators to obtain scene flow. According to the time interval between source and target point clouds and the scene flow measurement indicator, four different scene flows are estimated. The different time intervals and indicators determine the feature space of the spatial and temporal neighborhoods. Our work learns fine-grained features in Neighborhood 2, as shown in Figure 1. In Neighborhood 2, the features of adjacent points are captured from hand joints, which have a stronger correlation with gestures. The size of Neighborhood 2 is determined by the number of samples for adjacent (target) points and the time interval. By setting the time interval and the number of samples for source points, the size of Neighborhood 2 is made smaller than that of Neighborhood 1. When the scene flow is extracted in Neighborhood 2, the feature space of every gesture is grouped into more distinct neighborhoods. The estimated 3D scene flow is more detailed and represents the subtle differences between the search point (red) and the adjacent point (blue). To avoid the high time cost of the scene flow estimator, the estimated scene flow is exported as a scene flow dataset.
As shown in Figure 1, multi-stream FSS-GR adds an independent scene flow branch to learn fine-grained motion features. The prediction scores from the three branches are averaged as the final output during the testing phase. Two-stream FSS-GR fuses the scene flow before the fusion of low-level static features and high-level dynamic features to supplement the fine-grained dynamic features.
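The test-time fusion of the multi-stream variant reduces to a weighted average of per-class scores, as in the minimal sketch below; the variable names are hypothetical, and equal weights correspond to the plain averaging described above.

```python
import numpy as np

def fuse_branch_scores(static_scores, dynamic_scores, flow_scores,
                       weights=(1 / 3, 1 / 3, 1 / 3)):
    """Fuse per-class scores from the static, dynamic, and scene flow branches.

    Each *_scores argument is a (num_classes,) vector; the fused prediction is
    the argmax of their weighted average.
    """
    w_s, w_d, w_f = weights
    fused = w_s * static_scores + w_d * dynamic_scores + w_f * flow_scores
    return int(np.argmax(fused)), fused
```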
In this paper, the main contributions are as follows:
Point clouds used in previous gesture recognition methods are generated from depth images. FSS-GR explores the point cloud from both skeletons and depth images, which form an effective information supplement, and the skeleton data, with their small data volume, are highly correlated with hand movements. Compared with works that feed skeletons into GCN-based networks, this work is the first to transform the skeleton information into pairs of point clouds. The time interval between source and target point clouds and the smaller number of hand joints make the feature space of a search point smaller;
We measure fine-grained features using scene flow and fuse the scene flow with coarse-grained features using different strategies. An automatic converter is designed to convert skeletons, and four scene flow datasets are obtained: SHRECsft, SHRECsfe, SHRECsfe2, and SHRECsfe3. These datasets can be fused with other static and dynamic features in gesture recognition to reduce the time cost. FSS-GR fuses scene flow and coarse-grained dynamic features with two strategies: multi-stream FSS-GR includes an independent scene flow branch, while two-stream FSS-GR fuses the scene flow before the fusion of low-level static features and high-level dynamic features. The code is available at
https://github.com/shawn-fei/fss-gr.git (accessed on 25 January 2025);
Comparative experiments are conducted on various datasets to show the efficacy of FSS-GR in terms of performance, parameter efficiency, and computational complexity. Experiments are conducted on the SHREC’17 [10] and DHG [14] datasets. Notably, on SHREC’17, multi-stream FSS-GR obtains 1.4% and 0.8% accuracy gains over Kinet [1] and SOGTNet [6], respectively.
The structure of this paper is as follows. In Section 2, we analyze the related work on dynamic gesture recognition and flow estimation. In Section 3, we introduce FSS-GR: preprocessing is described in Section 3.1, and two strategies for fusing the scene flow with dynamic features are designed in Section 3.2. In Section 4, the experimental setup, results, and cost analysis are presented and discussed. In Section 5, we conclude with the contributions and limitations of FSS-GR.
4. Experiment
To evaluate the performance of FSS-GR, comprehensive experiments are conducted on the dynamic gesture recognition datasets SHREC’17 and DHG. FSS-GR is compared with state-of-the-art works on dynamic gesture recognition from the perspectives of recognition performance and cost. An ablation study is also performed to verify the effectiveness of each module in FSS-GR.
4.1. Experimental Setup
4.1.1. Datasets
SHREC’17 [10]: SHREC’17 is a public dynamic gesture dataset that provides both the coordinates of 22 hand joints in the 3D world and depth images. It contains 2800 videos of 28 gesture types, including 1960 videos (70%) in the training set and 840 videos (30%) in the test set.
SHRECsft, SHRECsfe, SHRECsfe2, and SHRECsfe3 are four datasets of the scene flow of dynamic gestures. They are generated from the Skeletons_world data provided by SHREC’17, which gives the 3D coordinates of the 22 joints per frame for each dynamic gesture. In this paper, the skeletal information is processed into 32 pairs of point clouds, each containing a source point cloud and a target point cloud. However, the ground-truth flow corresponding to the source point cloud is not provided in Skeletons_world, so a self-supervised scene flow estimator is used to extract the scene flow for each joint. Each scene flow dataset consists of the 3D scene flow of the joints for 2800 videos. The four datasets differ in two aspects: the time interval and the scene flow metric, as detailed in Table 1.
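The generation of these datasets can be pictured as the loop below; estimate_flow is a hypothetical stand-in for the pretrained self-supervised estimator's inference call, and the pairing rule (target taken a fixed number of groups after the source) is an assumption for illustration.

```python
import numpy as np

def build_scene_flow_dataset(skeleton_sequences, estimate_flow,
                             num_pairs=32, time_interval=1):
    """Precompute per-joint scene flow for every gesture in Skeletons_world.

    skeleton_sequences: iterable of (T, 22, 3) arrays of joint coordinates.
    estimate_flow: pretrained estimator mapping a (source, target) pair of
    (22, 3) clouds to a (22, 3) flow (hypothetical wrapper around SCOOP).
    Returns one (num_pairs, 22, 3) flow array per gesture, ready to be saved
    to disk as a scene flow dataset such as SHRECsfe.
    """
    dataset = []
    for skel in skeleton_sequences:
        T = skel.shape[0]
        starts = np.linspace(0, T, num_pairs, endpoint=False).astype(int)
        flows = []
        for g, s in enumerate(starts):
            t = starts[min(g + time_interval, num_pairs - 1)]
            flows.append(estimate_flow(skel[s], skel[t]))  # per-joint 3D flow
        dataset.append(np.stack(flows))
    return dataset
```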
DHG [14]: This dataset includes 2800 dynamic gesture sequences covering 28 gesture types performed by 20 subjects using the whole hand. The modalities of the sequences are skeletons and depth images; the skeletons are the coordinates of 22 hand joints in the 3D world. Because there is no official division into training and test sets, this work follows the common evaluation protocol of leave-one-subject-out cross-validation.
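For clarity, the leave-one-subject-out protocol can be sketched as follows; train_and_eval is a hypothetical callable standing in for a full training run on the given split.

```python
import numpy as np

def leave_one_subject_out(subjects, train_and_eval):
    """Leave-one-subject-out cross-validation over the 20 DHG subjects.

    subjects: (num_sequences,) array of subject ids; train_and_eval(train_idx,
    test_idx) trains on the training indices and returns accuracy on the
    held-out subject. The reported score is the mean accuracy over all folds.
    """
    accuracies = []
    for s in np.unique(subjects):
        test_idx = np.where(subjects == s)[0]
        train_idx = np.where(subjects != s)[0]
        accuracies.append(train_and_eval(train_idx, test_idx))
    return float(np.mean(accuracies))
```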
4.1.2. Experimental Configuration
Experiments on scene flow estimation are conducted on a GPU equipped with an NVIDIA V100-32G (NVIDIA, Santa Clara, CA, USA), rented from the GPU cloud service provider Gpushare Cloud (Shanghai, China). To obtain the same key frames when processing the initial skeletons and depth images, we uniformly sample along the timeline of each dynamic gesture, following common practice. The number of key frames for each gesture, denoted as T, is set to 32, which aligns with the number of key frames commonly used in recent works. Before the fusion of scene flow and coarse-grained features, the number of feature points in each frame, denoted as N, is set to 16. During the learning of coarse-grained features, the greater the time interval, the larger the search radius; for example, in the first layer, the search radius is an arithmetic sequence ranging from 0.5 to 0.6, and the search radius in the second layer is twice that of the first layer. The number of samples for adjacent points is 64. Regarding the dynamic characteristics of source points in the source point cloud X, adjacent points have similar features. The size of the neighborhood for each source point, i.e., the number of adjacent points that share similar motion features with the source point, is set to 12. For the dimensions of points and the hyperparameters of the losses in the feature space, this paper retains the settings of the original work [23]. In the training of the flow estimator SCOOP [23], the batch size is set to 16 and the number of epochs to 100. In the training of FSS-GR, the batch size is set to 8, the number of epochs to 250, the learning rate to 0.001, the decay step to 200,000, and the decay rate to 0.7.
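For reference, the hyper-parameters above can be collected as in the illustrative snippet below; the dictionary layout and key names are ours, and the second-layer radius range is simply twice the first-layer values stated in the text.

```python
# Illustrative grouping of the hyper-parameters quoted above (key names are ours).
config = {
    "num_key_frames": 32,          # T: uniformly sampled frames per gesture
    "num_feature_points": 16,      # N: feature points per frame before fusion
    "radius_layer1": (0.5, 0.6),   # arithmetic sequence of first-layer search radii
    "radius_layer2": (1.0, 1.2),   # twice the first-layer radii
    "num_adjacent_samples": 64,    # samples for adjacent points
    "source_neighborhood": 12,     # adjacent points sharing similar motion per source point
    "scoop": {"batch_size": 16, "epochs": 100},
    "fss_gr": {"batch_size": 8, "epochs": 250, "learning_rate": 1e-3,
               "decay_steps": 200_000, "decay_rate": 0.7},
}
```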
Concerning the number of channels of the MLPs (multi-layer perceptrons) in M-FSS-GR, the channel dimensions are set to 128 and 1024, and a further channel dimension in M-FSS-GR is set to 256.
FSS-GR is implemented in TensorFlow. All recognition experiments are conducted on a GPU equipped with a V100-16G. To make a fair comparison, the static branch is trained first and then frozen; finally, the dynamic branch and the scene flow branch are trained. For other hyper-parameters, such as the learning rate and decay step, our work keeps the settings of the original work [1]. Classification accuracy is selected as the performance evaluation index, as in previous works. FLOPs (floating-point operations) and Params (the number of parameters) are the evaluation indicators of the time complexity and space complexity of the model.
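A minimal sketch of this staged schedule is given below, assuming the branches and the fusion head are Keras models, that each training batch yields point clouds, precomputed scene flow, and a label, and that an Adam optimizer with the exponential decay settings above is used; the actual FSS-GR training code may differ.

```python
import tensorflow as tf

def train_dynamic_and_flow(static_branch, dynamic_branch, flow_branch, head, train_ds,
                           lr=1e-3, decay_steps=200_000, decay_rate=0.7):
    """Stage 2 of the schedule: the (pre-trained) static branch is frozen while the
    dynamic branch, scene flow branch, and fusion head are trained. All model
    objects are hypothetical tf.keras models standing in for the FSS-GR modules."""
    static_branch.trainable = False
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(lr, decay_steps, decay_rate)
    optimizer = tf.keras.optimizers.Adam(schedule)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    for points, flow, label in train_ds:   # assumed batch structure
        with tf.GradientTape() as tape:
            logits = head([static_branch(points), dynamic_branch(points), flow_branch(flow)])
            loss = loss_fn(label, logits)
        variables = (dynamic_branch.trainable_variables
                     + flow_branch.trainable_variables + head.trainable_variables)
        optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
```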
4.2. Performance
4.2.1. Comparison with the State-of-the-Art (SOTA)
FSS-GR is compared with recent advanced methods. We replicate the point cloud approaches [1,2,3,6] on the same GPU equipped with a V100-16G.
As shown in Table 2, FSS-GR achieves the best performance on SHREC’17 with 95.2% accuracy, with gains of 1.4% and 0.8% over the advanced works Kinet [1] and SOGTNet [6]. As shown in Table 3, FSS-GR achieves the best performance on DHG with an accuracy of 93.5%, with gains of 0.9% and 0.3% over the advanced works SOGTNet [6] and Shen’s annotation framework [8]. This results from the fusion of fine-grained scene flow and coarse-grained dynamic features on point clouds. Moreover, as shown in Table 4, the proposed M-FSS-GR achieves gains of 0.5% and 0.8% in the static and dynamic branches compared to the method using point clouds from depth images. The difference between M-FSS-GR and M-FSS-GR’ is the number of channels of the MLP. In M-FSS-GR, fusing heterogeneous information through more branches allows features of different granularities to be extracted more completely, so the accuracy is improved.
As shown in Table 5, we compare FSS-GR with advanced GCN-based approaches from recent years. From the perspective of modality, FSS-GR not only converts the depth images into point clouds, but also converts the raw skeleton data into point clouds. To our knowledge, FSS-GR is the only dynamic gesture recognition model that converts skeletons into point clouds, and it has the highest accuracy on both datasets. This shows that converting skeletons into point clouds helps to improve the accuracy of gesture recognition.
From the perspective of the different modules of each model, FSS-GR and the other GCN-based approaches have similar motivations. The fine-grained features of dynamic gestures are extracted by the scene flow branch of FSS-GR, while the improved DGCNN in SOGTNet and FPPR-PCD captures local features. The accuracy of the scene flow branch alone is low, but the accuracy of FSS-GR is the highest after integrating the different types of features. This shows that the fusion of scene flow and coarse-grained features is effective.
These results indicate the following two aspects:
Compared to SOTA works, FSS-GR performs more accurate gesture recognition by capturing the scene-flow-assisted 3D static features and motion features;
FSS-GR characterizes the 3D motion of dynamic gestures more completely from skeletons and depth images, due to an automatic converter and fusion between different data granularities.
4.2.2. Comparison at Different Branches with Different Weights
It is also observed from Table 4 that there are differences between the results of the static, dynamic, and scene flow branches of the same method. By comparing the overall accuracy after the aggregation of different branches, the following four points are concluded.
The first three observations further validate that the proposed method is better than previous methods due to its improved ability to learn spatial information and capture temporal information. However, as described in the last observation, using scene flow alone for classification is inefficient. This work attributes the inefficiency to the fact that tracking the movement of each hand joint over very short time intervals increases the weight of small motions, and these small motions have less correlation with the type of gesture. However, these tiny motions can complement the normal vectors that focus on coarse-grained motion features, allowing the network to capture 3D motion more completely. Based on the above analysis, this work suggests that the learning of dynamic features should take both coarse-grained and fine-grained motion features into account, and that fine-grained motion features should take up a small proportion in gesture recognition.
To verify the impact of the proportion of the scene flow branch on recognition accuracy during aggregation, the proportion of the scene flow branch is first varied from 0.1 to 0.5, with spatial branch proportions of 0.5, 0.4, and 0.3. Then, the accuracy of M-FSS-GR and M-FSS-GR’ is compared under the different proportions, and the results are shown in Figure 5 and Table 6. The following two points can be observed from Figure 5.
When the static ratio is 0.3, the performance of M-FSS-GR and M-FSS-GR’ is better;
When the scene flow branch ratio is set to 0.1, 0.2, or 0.3, the model performance is improved. When the static ratio is 0.3, the scene flow branch ratio is 0.3, and the dynamic ratio is 0.4, the performance of M-FSS-GR and M-FSS-GR’ reaches its best.
The above results demonstrate that FSS-GR is effective in combining dynamic features of different granularities; the weighted aggregation used in this comparison is sketched below.
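As a concrete illustration, the following sketch enumerates the branch ratios described above on held-out scores; the function name, data layout, and the rule that the dynamic ratio is the remainder of the other two are assumptions for illustration.

```python
import itertools
import numpy as np

def search_branch_weights(scores, labels,
                          flow_ratios=(0.1, 0.2, 0.3, 0.4, 0.5),
                          static_ratios=(0.3, 0.4, 0.5)):
    """Enumerate aggregation weights for the static, dynamic, and flow branches.

    scores: dict with 'static', 'dynamic', 'flow' arrays of shape (B, num_classes);
    labels: (B,) ground-truth classes. The dynamic ratio is whatever remains after
    the static and scene flow ratios.
    """
    best_acc, best_weights = 0.0, None
    for w_f, w_s in itertools.product(flow_ratios, static_ratios):
        w_d = 1.0 - w_f - w_s
        if w_d <= 0:
            continue
        fused = w_s * scores['static'] + w_d * scores['dynamic'] + w_f * scores['flow']
        acc = float((fused.argmax(axis=1) == labels).mean())
        if acc > best_acc:
            best_acc, best_weights = acc, (w_s, w_d, w_f)
    return best_acc, best_weights
```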
4.2.3. Comparison on Different Scene Flow Datasets
The performance of the same method on the different scene flow datasets is shown in Table 7; the results show the impact of the time interval and the scene flow metric on recognition accuracy.
Differences between the performance of M-FSS-GR and M-FSS-GR’:
The accuracy of M-FSS-GR is 1% higher on SHRECsft and SHRECsfe compared to SHRECsfe2 and SHRECsfe3;
The accuracy of M-FSS-GR’ is nearly 0.8% higher on SHRECsfe or SHRECsfe3 compared to SHRECsfe2;
M-FSS-GR is the best-performing method on SHRECsft, SHRECsfe, and SHRECsfe2. Compared to the other methods, its accuracy is improved by 0.4% to 0.6%. This indicates that tracking the motion of points over a shorter period helps the network learn motions of different granularities;
On SHRECsfe3, M-FSS-GR’ has the highest accuracy and outperforms the other two methods by more than 1%.
T-FSS-GR: Compared to Kinet [1], the accuracy of T-FSS-GR is 0.7% higher on SHRECsfe. However, its performance is degraded on the other three scene flow datasets, and it deteriorates as the time interval increases.
The performance degradation results from the fact that, when fusing motion features of different granularities, the N (16) feature points not only fail to retain the fine-grained dynamic features, but also lose the coarse-grained dynamic features. The scene flow describes the movement of joints more accurately than F, so the difference between the scene flow and the normal vector is greater, which causes a larger shift in the newly fused motion features. This work tried appending coarse-grained or fine-grained motion features after the new motion features, but the results show that the performance was only improved to the level of the SOTA, while the computational cost of the network was already higher than that of the SOTA.
The above results show that the performance of M-FSS-GR is superior to that of T-FSS-GR. This is because adding a scene flow branch is better than fusing the normal vector and the scene flow into a new motion feature: keeping the two motion features of different granularities separate loses less information during the training phase.
4.3. Cost Analysis
Table 8 gives the floating-point operations (FLOPs) and the number of parameters (Params) of FSS-GR. In terms of Params, the number of parameters of our models (M-FSS-GR, M-FSS-GR’, and T-FSS-GR) is kept at a low level. In particular, the Params of T-FSS-GR is only 1.6 M; compared with MAE and ST-SGCN, it is reduced by 89.3% and 84.6%, respectively. This shows that FSS-GR requires less memory for training. Kinet has the smallest number of parameters, 1.5 M, and the Params of T-FSS-GR is very close to it, which suggests that FSS-GR has a lower risk of overfitting and a stronger generalization ability.
Although FSS-GR introduces computationally expensive scene flow estimation, the estimation is not performed synchronously with gesture recognition; the estimated scene flow is used as an intermediate, precomputed dataset. Nevertheless, from the perspective of FLOPs, the computational complexity of FSS-GR is high. The FLOPs of M-FSS-GR are 1.9% higher than those of Kinet, and the FLOPs of M-FSS-GR’ and T-FSS-GR decrease slightly but remain in the same order of magnitude. This indicates that FSS-GR requires more computing resources and time to recognize gestures.
In terms of recognition performance, the accuracies of M-FSS-GR, M-FSS-GR’, and T-FSS-GR are the highest among all models. Although their FLOPs are relatively high, their Params are comparatively low, which indicates that these models control the model complexity to a certain extent while maintaining high performance. Compared to Kinet, M-FSS-GR improves the recognition accuracy while the Params increase by only 0.8 M; compared with the other models, it improves the recognition accuracy while the Params decrease significantly. By comparing the FLOPs, Params, and accuracy of the different gesture recognition models, we observe that FSS-GR keeps its Params relatively low while maintaining a high recognition accuracy, which validates its effectiveness. In particular, T-FSS-GR achieves a balance between recognition accuracy and computational cost, which indicates its scalability.
In summary, FSS-GR has high accuracy and a low number of parameters, making it suitable for deployment on equipment with limited resources. However, its computational efficiency needs to be improved. In the future, we will explore lightweight architectures to balance accuracy and FLOPs, which will improve the practicality of FSS-GR.