1. Introduction
With the rapid development of Internet of Things technology, various smart devices have changed people's lives, and human–computer interaction, i.e., the exchange of information between humans and computers, has become essential. Since gestures are easy to learn, information-rich, and simple to perform, gesture recognition technology [1] has become a research hotspot in recent years, with wide applications in virtual games, driving assistance systems, sign language recognition, and intelligent robot control. Currently, existing gesture recognition methods based on wearable sensors [2,3] and cameras [4,5] suffer from inconvenience, expensive equipment, and the risk of privacy leakage, which limits the wide deployment of gesture recognition systems in practice. With the booming development of Wi-Fi sensing, gesture recognition is more practical than ever before and is progressively moving from theoretical research to real-world application, owing to its contactless operation, low cost, good privacy, and the fact that it does not require line-of-sight (LoS) propagation [1]. Specifically, gesture recognition systems are evolving from the single domain to the cross domain and from recognizing fixed gesture types to recognizing new ones. In addition, gesture recognition systems are increasingly deployed on mobile devices, and their models have shifted from heavyweight to lightweight to meet the requirements of mobile deployment.
Wi-Fi sensing technologies recognize a gesture by analyzing gesture features extracted from the channel state information (CSI) of Wi-Fi signals recorded while the gesture is performed. The convolutional neural network (CNN) [6,7,8], an important deep learning model, has excellent feature extraction capabilities, so Wi-Fi-based gesture recognition methods mainly adopt deep learning algorithms [9,10,11,12]. However, these methods concentrate on single-domain recognition: when they face new types of gestures, or gestures performed in a new domain, recognition performance degrades dramatically, and a large amount of data from the new domain is needed to adjust the model. This problem, called "domain shift", is a substantial challenge to the practicality of gesture recognition systems. In addition, deep-learning-based gesture recognition systems usually rely on complex neural network models; given the limited storage space and computational resources of mobile devices, storing and running such models on them is another substantial challenge. Therefore, designing a lightweight gesture recognition system that performs well in a new domain using only a small amount of data is essential for facilitating the application of gesture recognition technology.
Recently, an increasing body of literature has adopted transfer learning [13,14,15], generative adversarial networks [16], or a manually designed domain-independent feature, the body-coordinate velocity profile [17], to eliminate the domain shift problem. However, the good performance of these methods depends on large amounts of data, and the manual modeling approach requires analyzing complex CSI data; since the way gestures affect Wi-Fi signals is complicated, the velocity profile model is complicated as well.
In addition, inspired by few-shot learning techniques [18,19,20,21,22], Zou et al. [23] and Zhou et al. [24] combined a few-shot network with adversarial learning to remove domain-related information. Lan et al. [25] proposed a few-shot multi-task classifier to address the domain shift problem; the basic idea is to initialize the classifier's parameters so that it can quickly adapt to a new domain. Yang et al. [26] proposed a Siamese recurrent convolutional architecture to remove structured noise and used a CNN combined with long short-term memory (LSTM) to extract spatial-temporal features. Although these methods can alleviate the domain shift problem with a small amount of data, they require considerable computation, and their complex, parameter-heavy models are not suitable for mobile deployment.
To address the challenges mentioned above, we propose WiGR, a novel and practical Wi-Fi-based gesture recognition system. The core of WiGR is an improved few-shot learning network consisting of a feature extraction subnetwork and a similarity discrimination subnetwork. The feature extraction subnetwork adopts 2-D convolutional kernels [6] to simultaneously extract the spatial features and temporal dynamics of gestures. Similar to the relation network [22], the similarity discrimination subnetwork uses a learnable neural network as the similarity measure to determine the gesture type, which is more accurate than using fixed distance functions [18,19,20,21]. Through an episode-based training strategy [20], the whole network learns a transferable similarity evaluation ability from the training set and applies this knowledge to a new testing domain, thereby alleviating the domain shift problem. In addition, there is evidence that lightweight networks [27,28,29,30,31] play a crucial role in mobile deployment. We therefore introduce depthwise separable convolutions and inverted residual layers with linear bottlenecks [30,31] into the few-shot learning network to reduce model computations and parameters. To keep recognition performance from degrading as model complexity is reduced, we also introduce a squeeze-and-excitation (SE) block [32], which improves the quality of the learned features by explicitly modeling the interdependence between convolutional feature channels. Extensive experiments on two data sets (CSI-Domain Adaptation (CSIDA) and SignFi [10]) demonstrate that WiGR achieves excellent recognition performance in cross-domain evaluation and that our network design dramatically reduces model computations.
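To make these design choices concrete, the following is a minimal PyTorch sketch (PyTorch being the framework used in our experiments) of the building blocks named above: an SE block and an inverted residual layer whose depthwise separable convolution and linear bottleneck follow [30,31,32]. The channel sizes, expansion factor, and reduction ratio are illustrative assumptions, not the exact WiGR configuration.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: reweight feature channels by global context."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global average pool
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # excitation: per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                 # rescale each channel

class InvertedResidual(nn.Module):
    """Expand -> depthwise conv -> SE -> linear bottleneck (no final activation)."""
    def __init__(self, in_ch, out_ch, expand=6, stride=1):
        super().__init__()
        mid = in_ch * expand
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False),    # pointwise expansion
            nn.BatchNorm2d(mid),
            nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride, 1, groups=mid, bias=False),  # depthwise 3x3
            nn.BatchNorm2d(mid),
            nn.ReLU6(inplace=True),
            SEBlock(mid),
            nn.Conv2d(mid, out_ch, 1, bias=False),   # linear bottleneck projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_skip else y
```

The depthwise convolution applies one 3 × 3 filter per channel and the 1 × 1 projections mix channels, which is where depthwise separable convolution saves parameters and MACs, while the SE block reweights channels at negligible extra cost.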
Our contributions can be summarized as follows:
We designed a novel Wi-Fi-based gesture recognition system called WiGR that is more practical than existing gesture recognition systems. The practicality is reflected in its ability to recognize new gestures or gestures performed in new domains using just a few new samples.
A lightweight few-shot learning network, which consists of a feature extraction subnetwork and a similarity discrimination subnetwork, is proposed to address the challenging domain shift problem. Lightweight and effective blocks are introduced into the network to achieve low computational complexity and high performance.
We built the CSIDA data set, which includes CSI traces with various domain factors, to simulate real scenes. The CSIDA data set enabled us to verify the accuracy of the proposed WiGR in cross-domain evaluation.
Extensive experiments on the SignFi data set and the CSIDA data set show the superiority of the proposed WiGR over existing gesture recognition systems in terms of cross-domain accuracy and computational complexity.
4. Results
We conducted extensive experiments on two data sets (SignFi [10] and CSIDA) to verify WiGR's effectiveness. We implemented the proposed system using the PyTorch 1.8.0 framework on an Intel(R) Xeon(R) E5-2630 v4 CPU @ 2.20 GHz with an Nvidia Titan X Pascal GPU and 32.0 GB of RAM.
The SignFi data set and our CSIDA data set are both Wi-Fi data sets and include CSI data with various domain factors. The SignFi data set includes two domain factors, i.e., different environments and users. The CSIDA data set includes three domain factors, i.e., different environments, users, and locations.
4.1. Recognition Performance Evaluation
Recognizing new types of gestures. The ability to recognize new types of gestures is important for enhancing the scalability of a gesture recognition system. The few-shot learning method, the key technology used in this paper, can generalize the model to new classes using just a few support samples; this is the key difference between few-shot learning and other domain adaptation methods. To verify the ability of the proposed system to identify new types of gestures from just a few samples, we compared it with other few-shot learning methods on the SignFi and CSIDA data sets.
Table 5 shows that the WiGR model achieves 98.6%, 97.2%, and 95.8% accuracy when recognizing 10, 20, and 30 new types of gestures, respectively, under the condition that 100 gesture types are used for training and each new gesture type has three support samples. Compared with the other methods, the improvement in accuracy is more than 10 percentage points.
Table 6 shows that our WiGR model achieves better recognition performance than the other few-shot learning models. When WiGR was trained with three old gesture types, it achieved recognition accuracies of 91.4% and 84.9% for two and three new gesture types, respectively, with three support samples per new gesture type. Because our CSIDA data set does not contain enough gesture types for training, the recognition accuracy drops slightly compared with the SignFi data set. In general, the accuracy of the proposed WiGR model is remarkably higher than that of the other few-shot learning models in all evaluations.
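To illustrate how the support samples enter the recognition process, below is a minimal PyTorch sketch of relation-network-style few-shot classification for one N-way K-shot episode. The embedding and relation modules here are toy stand-ins for WiGR's feature extraction and similarity discrimination subnetworks, and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

def classify_episode(embed, relation, support, query, n_way, k_shot):
    """Classify query gestures against an N-way K-shot support set.

    `support` holds n_way * k_shot CSI samples sorted by class;
    `query` holds the gestures to classify.
    """
    s_feat = embed(support)                      # (n_way * k_shot, D, h, w)
    q_feat = embed(query)                        # (q, D, h, w)

    # Average the K support embeddings of each class into one prototype map.
    d, h, w = s_feat.shape[1:]
    protos = s_feat.view(n_way, k_shot, d, h, w).mean(dim=1)

    # Pair every query with every class prototype along the channel axis.
    q = q_feat.shape[0]
    pairs = torch.cat(
        [protos.unsqueeze(0).expand(q, -1, -1, -1, -1),
         q_feat.unsqueeze(1).expand(-1, n_way, -1, -1, -1)],
        dim=2,
    ).view(q * n_way, 2 * d, h, w)

    # A learned relation score replaces a fixed distance function.
    scores = relation(pairs).view(q, n_way)
    return scores.argmax(dim=1)                  # predicted class per query

# Toy wiring (shapes only, not the actual WiGR subnetworks):
embed = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
relation = nn.Sequential(nn.Conv2d(64, 16, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
support = torch.randn(6 * 3, 3, 30, 100)         # a 6-way, 3-shot episode
query = torch.randn(5, 3, 30, 100)
print(classify_episode(embed, relation, support, query, n_way=6, k_shot=3))
```

Because the class decision comes from comparing queries against support embeddings rather than from a fixed output layer, a trained network of this form can handle new gesture types simply by swapping in a new support set.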
Cross-domain evaluation. To verify that the proposed WiGR system is effective for cross-domain recognition, we conducted extensive cross-domain experiments by splitting the data set according to the environment layout, the user performing the gestures, and the user's location. We compared our model with traditional gesture recognition systems such as WiGeR [33], which utilizes a dynamic time warping (DTW) classifier, and WiCatch [34], which employs an SVM with the MUSIC signal processing algorithm. Since WiGR is built from CNN components, the comparison systems were chosen to verify the superiority of the CNN-based design: they are based either on classical machine learning algorithms (i.e., WiGeR and WiCatch) or on a simple CNN structure without cross-domain recognition capability (i.e., SignFi [10]). Moreover, Siamese-LSTM [26], a typical few-shot domain-adaptive method that uses a Siamese network combining a CNN and LSTM to address domain shift, was used as a baseline. These competitive methods allowed us to verify the effectiveness of the WiGR model in cross-domain evaluation.
Cross-environment evaluation. For the environmental shift, we used CSI data from two different environments. All the data from one environment were used for training, while data from the other environment were used for testing.
Figure 7 shows the accuracy of recognizing gestures collected in a new environment with three support samples per gesture, where A → B denotes that A is the training set and B is the testing set. We can see that traditional machine learning methods, such as WiGeR and WiCatch, and an ordinary convolutional network, such as SignFi, show almost no ability to transfer when tested on samples from a totally new environment, while our proposed WiGR model achieves average accuracies of 98% and 88% on the SignFi and CSIDA data sets, respectively, remarkably outperforming the other methods.
Cross-user evaluation. For the user shift, we evaluated all methods in the same environment to control variables and then conducted leave-one-user-out cross-validation using CSI traces from different users. In other words, we used the CSI traces of all but one user as the training set and the left-out user's traces as the testing set.
Figure 8 shows the results of recognizing a new user's gestures, with three support samples per gesture. From Figure 8, we can see that the cross-user recognition accuracies of WiGeR, WiCatch, and SignFi are no more than 80%, but are still better than their cross-environment performance. The reason is that the training set contains abundant user-domain information from which common features can be extracted. Our WiGR model achieves state-of-the-art performance, with average recognition accuracies of 92% and 91% on the SignFi and CSIDA data sets, respectively. Compared with the domain-adaptive Siamese-LSTM, our method improves accuracy by about 10%, which demonstrates that WiGR effectively alleviates domain shift by learning transferable knowledge from the training set and using features extracted from the support samples to recognize gestures.
Cross-location evaluation. For the location shift, we evaluated all the methods in the same environment to control variables and then performed leave-one-location-out cross-validation using CSI traces. As shown in Figure 9, our proposed WiGR model still shows excellent performance, with an average recognition accuracy of 90.8%, and therefore outperforms the other methods. In addition, when the testing CSI data are collected at Loc. 1 or Loc. 3, recognition performance is slightly reduced compared with data collected at Loc. 2. This is because a user performing gestures at Loc. 1 or Loc. 3 is very close to the Rx or Tx, respectively; in this case, the user's body blocks more signals, weakening signal propagation and in turn degrading gesture recognition performance.
Different users have different physical characteristics, gesture speeds, and hand movements for the same gestures; the two environments have different layouts; and different locations result in different signal propagation paths. These three factors can produce different CSI patterns even for the same gesture. Nevertheless, thanks to the excellent feature extraction capabilities of the CNN, CNN-based gesture recognition systems (i.e., WiGR and Siamese-LSTM) show superior cross-domain recognition performance compared with systems based on traditional machine learning methods (i.e., WiGeR and WiCatch). Although SignFi also adopts CNN components, its structure is too simple to cope with cross-domain recognition. Furthermore, the WiGR model can learn robust transferable knowledge through supervised training, thereby eliminating the influence of user, environment, and location factors on gestures, which allows WiGR to achieve gesture recognition in a new domain with only a few samples.
4.2. Model Complexity Analysis
The complexity of a gesture recognition model, which affects storage space and computational cost, plays a vital role in mobile deployment. We used two indicators, Params and MACs, to reflect model complexity. Params refers to the number of model parameters: the smaller the value, the smaller the storage space required by the model. MACs refers to the number of multiply–accumulate operations required by the model: a smaller value means fewer computing resources are consumed. M is an abbreviation for million. The key network of WiGR is an improved few-shot learning model into which lightweight blocks are introduced. Therefore, to verify the effectiveness of these lightweight blocks, we compared WiGR with standard few-shot learning models [18,20,21].
Table 7 shows that WiGR outperforms other popular few-shot learning methods [18,20,21,22] in terms of model complexity by a clear margin, and that Params and MACs reach their smallest values when the scaling factor fac = 1/6. Thus, the value of fac also plays an important role in the model's computational complexity. The experimental results show that WiGR with fac = 1/6 is a state-of-the-art lightweight gesture recognition model.
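For reference, below is a minimal sketch of how Params and MACs can be measured for a PyTorch model. The stand-in model, the input shape, and the use of the third-party thop package are illustrative assumptions rather than part of the WiGR implementation.

```python
import torch
import torch.nn as nn
from thop import profile  # third-party MAC-counting package (assumed installed)

# A stand-in model; in practice this would be the WiGR network.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

# Params: total number of trainable parameters.
params = sum(p.numel() for p in model.parameters() if p.requires_grad)

# MACs: multiply-accumulate operations for one forward pass on a dummy input.
dummy = torch.randn(1, 3, 30, 100)  # (batch, channels, subcarriers, time) -- assumed CSI shape
macs, _ = profile(model, inputs=(dummy,))

print(f"Params: {params / 1e6:.3f} M, MACs: {macs / 1e6:.3f} M")
```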
4.3. The Influence of the Number of Antennas
Since only some high-end mobile devices support multiple-input, multiple-output (MIMO) communication with several antennas, it is necessary to study the influence of the number of antennas on the recognition performance of the WiGR model.
We conducted cross-domain and single-domain recognition evaluations with different numbers of receiving antennas. Specifically, the CSI data collected in Room 2 were selected as the test data in the cross-environment evaluation, the CSI data of gestures performed by User 5 were selected as the test data in the cross-user evaluation, and the CSI data collected at Location 3 were selected as the test data in the cross-location evaluation. In the single-domain evaluation, we selected some CSI data of six gestures performed by User 1 at Location 1 of Room 1 as training data and the remaining CSI data of each gesture as testing data. As before, three support samples were provided for each gesture. From Table 8, we can see that the larger the number of receiving antennas, the better the recognition performance; multiple receiving antennas provide richer CSI data, which helps the WiGR model recognize gestures more accurately. When only one transmitting antenna and one receiving antenna are used, the cross-domain recognition accuracy of the WiGR model reaches only 70.2–73.2%, while its single-domain recognition accuracy reaches 91.3%. Thus, even without MIMO, the model still shows a degree of cross-domain recognition ability and good single-domain recognition ability, although the effect is not as good as with MIMO.
5. Discussion
There are several limitations to the proposed WiGR, and they suggest fruitful directions for further investigation. Firstly, we only discuss the impact of a finite set of domain factors (i.e., environment, users, and locations). In fact, CSI signals are also affected by the user's facing orientation [17] and by other signal sources. These factors need to be considered in future work.
Secondly, in many human–computer interaction scenarios, such as virtual games, driving assistance systems, sign language recognition, and intelligent robot control, the distance between the user and the transmitter/receiver, as well as the distance between the transmitter and the receiver, is not fixed. Therefore, we simply set these distances according to [17,41]. In future work, we will focus on a specific application scenario (e.g., controlling a mobile phone with gestures) and determine the distance settings based on that scenario.
Finally, in our experiments, the gestures were performed under LoS conditions. Since Wi-Fi signals do not require LoS propagation, we are interested in extending WiGR to the non-line-of-sight (NLoS) scenario. For example, we could separate the transmitter and receiver with a wall and then study the impact on the Wi-Fi signal in this case.