1. Introduction
The COVID-19 virus has been spreading all over the world for more than one and a half years, forcing many people to be isolated at home or quarantined in designated spaces and thereby restricting their physical activity. As reported by a previous study, physical inactivity causes more than 5 million deaths worldwide and places a heavy financial burden on public health systems [1]. López-Bueno et al. investigated changes in physical activity (PA) levels during the first week of confinement in Spain, where participants reduced their weekly PA levels by 20% [2]. The study of Matos et al. showed that for Brazilians during the pandemic, body weight increased while weekly energy expenditure and quality of life decreased [3]. Thus, people should maintain their levels of physical activity to stay healthy when they are isolated at home or quarantined. A human activity recognition (HAR) system can be applied to monitor persons at risk of COVID-19 infection and manage their activity status. In addition, HAR can also be used in telecare and/or health management by observing the fitness of healthy people or patients infected with COVID-19 in daily life, such as the time spent exercising and resting [4,5]. Therefore, research on HAR has received much attention in recent years.
To perform HAR, various sensors are used to collect human activity data [6,7]. Image-based and sensor-based methods are the two commonly used sensing approaches [8]. Image-based methods usually use visual sensing devices, such as video cameras and photo cameras [9,10], to monitor human activities. However, their major disadvantages include the invasion of privacy, large size, and the restriction to indoor installation (no mobility). On the other hand, in sensor-based methods the user needs to wear various sensors, such as an accelerometer, gyroscope, and strain gauge, on the wrist or limbs [11,12]. Although this approach enables ubiquitous HAR, wearing sensors on the body for a long time makes users uncomfortable. As reported by the National Development Commission, the number of smartphone users in Taiwan reached 29.25 million in June 2019, or 1.24 smartphones per person [13]. Since smartphones have already penetrated people's daily lives, many studies on HAR using the accelerometers and gyroscopes embedded in smartphones have been conducted.
In the work of [14], the logistic model trees (LMT) machine learning method was employed to recognize human activity, and the experimental results indicated that the accuracy of the LMT method reaches 94.0%. Bao et al. [15] first segmented the action signals into windows of 128 samples with 50% overlap, then used a geometric template matching algorithm to classify each segment into the corresponding action; in the last stage, the Bayesian principle and a voting rule were combined to fuse the results of a k-nearest neighbor classifier. Cruciani et al. [16] performed HAR utilizing gold-standard hand-crafted features and a one-dimensional (1D) convolutional neural network (CNN), achieving an F1-score of 93.4%. Wu and Zhang [17] employed a CNN model for HAR, attaining an accuracy of 95%. Wang et al. [18] proposed an attention-based CNN for HAR using weakly labeled activity data. Similarly, to save manpower and computing resources during strict data labeling, a weakly supervised model based on recurrent attention learning (RAL) was presented [19]. Taking advantage of CNNs in learning complex activities and of long short-term memory (LSTM) networks in capturing temporal information from time-series data, He et al. [20] suggested a combination of CNN and LSTM networks for HAR. Recently, a deep neural network that combined convolutional layers with LSTM was also proposed [21], achieving an accuracy of 95.8%. Yu et al. [22] proposed a multilayer parallel LSTM network for HAR. Sikder et al. [23] presented a two-channel CNN for HAR, which employed the frequency and power features of the activity data. Intisar and Zhao [24] proposed a selective modular neural network (SMNN), a stacking architecture consisting of a routing module and an expert module, to enhance the accuracy of HAR; however, this model required considerable training time. Previous studies have suggested that the signals extracted by the accelerometer and gyroscope for different activities should be treated as temporal features. Since recurrent neural networks (RNNs) can effectively capture the time dependency between samples through their memory function, many studies have applied the LSTM to perform HAR [19,20,21,22,23].
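To make the segmentation step above concrete, the following is a minimal NumPy sketch of fixed-length windowing with 50% overlap, as used in [15] (the 128-sample window also matches the UCI-HAR preprocessing); the function name and the 50 Hz example rate are illustrative assumptions.

```python
import numpy as np

def sliding_windows(signal, window=128, overlap=0.5):
    """Segment a (timesteps, channels) signal into fixed-length windows.

    With window=128 and overlap=0.5, consecutive windows share 64 samples,
    matching the segmentation described above.
    """
    step = int(window * (1.0 - overlap))
    n = (signal.shape[0] - window) // step + 1
    return np.stack([signal[i * step : i * step + window] for i in range(n)])

# Example: 10 s of 50 Hz tri-axial accelerometer data -> (timesteps, 3)
raw = np.random.randn(500, 3)
segments = sliding_windows(raw)
print(segments.shape)  # (6, 128, 3)
```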
The signals of accelerometers, gyroscopes, and strain gauges obtained from human activities can be considered time-series data. In recent years, various ensemble deep learning models have been proposed for the problem of time-series classification. Fawaz et al. [25] proposed an ensemble of CNN models, named InceptionTime, to deal with time-series classification. Karim et al. [26] transformed the LSTM fully convolutional network (LSTM-FCN) and attention LSTM-FCN (ALSTM-FCN) into multivariate time-series classification models by augmenting the fully convolutional block with a squeeze-and-excitation block. Xiao et al. [27] proposed a robust temporal feature network (RTFN), which consists of a temporal feature network (TFN) and an attention LSTM network, for feature extraction in time-series classification.
Stacking deep neural networks have also been used to improve the activity recognition rate. A stacking architecture integrates various neural networks to combine their advantages for specific tasks. Li et al. [28] used the Dempster-Shafer (DS) evidence theory to build an ensemble DS-CNN model for event sound recognition. Batchuluun et al. [29] employed a CNN stacked with an LSTM and a deep CNN followed by score fusion to capture more spatial and temporal features for gait-based human identification. Du et al. [30] applied a two-dimensional (2D) CNN stacked with a gated recurrent unit (GRU) to obtain features of micro-Doppler spectrograms; the features, together with their time steps, were then recognized by the GRU for HAR.
The ensemble learning algorithm (ELA) is a technique that combines the predictions of multiple classifiers into a single classifier, which generally results in a higher accuracy than that of any of the individual classifiers [31,32]. Theoretical and practical studies have demonstrated that a good ELA consists of individual classifiers whose accuracies are comparable and whose errors are distributed over different parts of the input space [33,34]. In general, designing an ELA involves two issues: how to generate diverse individual classifiers and how to fuse them. For the generation of individual classifiers, two strategies, the heterogeneous type and the homogeneous type, are commonly employed. In the former, individual classifiers are generated using different learning algorithms; in the latter, the same learning algorithm is used, so different settings are necessary. For example, Deng et al. [35] adopted linear and log-linear stacking methods to fuse convolutional, recurrent, and fully connected deep neural networks (DNNs). Xie et al. [36] proposed three DNN-based ensemble methods, which fuse a series of classifiers whose inputs are the representations of intermediate layers.
This study aims to recognize human activities using data extracted from the sensors embedded in a smartphone. An ELA combining a GRU, a stacking CNN + GRU, and a DNN is proposed to perform HAR. The raw sensor data are the input samples of the GRU and the stacking CNN + GRU, while the 561 parameters derived from the raw sensor data are the input samples of the DNN. The outputs of the stacking CNN + GRU, GRU, and DNN branches are then combined by fully connected DNNs to classify the six activities. The HAR dataset employed in this work is an open dataset provided by the UCI School of Information and Computer Sciences [28]. This dataset collects six sets of activity data via the accelerometer and gyroscope built into smartphones, and an extra feature vector consisting of 561 time-domain and frequency-domain parameters is generated from the raw sensor data. The experimental results show that the proposed ELA scheme outperforms existing studies.
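As a rough illustration of this three-branch design, the following PyTorch sketch wires a GRU branch and a stacking CNN + GRU branch over the raw 128-sample, 9-channel UCI-HAR windows together with a DNN branch over the 561-parameter vector, fusing them with a fully connected head; all layer widths, depths, and kernel sizes are illustrative assumptions rather than the exact configuration used in this study.

```python
import torch
import torch.nn as nn

class ELA(nn.Module):
    """Minimal sketch of the three-branch ensemble; layer sizes and the
    fusion head are illustrative assumptions, not the paper's exact design."""

    def __init__(self, n_channels=9, n_features=561, n_classes=6):
        super().__init__()
        # Branch 1: GRU over the raw sensor windows (batch, 128, n_channels).
        self.gru = nn.GRU(n_channels, 64, num_layers=2, batch_first=True)
        # Branch 2: stacking CNN + GRU; 1D convolutions extract local motifs,
        # then a GRU summarizes them over time.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.cnn_gru = nn.GRU(64, 64, batch_first=True)
        # Branch 3: DNN over the 561 hand-crafted parameters.
        self.dnn = nn.Sequential(
            nn.Linear(n_features, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
        )
        # Fully connected fusion head over the concatenated branch outputs.
        self.head = nn.Sequential(nn.Linear(64 * 3, 64), nn.ReLU(),
                                  nn.Linear(64, n_classes))

    def forward(self, x_raw, x_feat):
        _, h1 = self.gru(x_raw)                  # h1: (layers, batch, 64)
        c = self.cnn(x_raw.transpose(1, 2))      # (batch, 64, 128)
        _, h2 = self.cnn_gru(c.transpose(1, 2))  # (1, batch, 64)
        z = torch.cat([h1[-1], h2[-1], self.dnn(x_feat)], dim=1)
        return self.head(z)                      # logits for the 6 activities

model = ELA()
logits = model(torch.randn(8, 128, 9), torch.randn(8, 561))
print(logits.shape)  # torch.Size([8, 6])
```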
3. Results
In this study, the hardware employed was an Intel Core i7-8700 CPU and a GeForce GTX 1080 GPU. The operating system was Ubuntu 16.04 LTS, the development environment was Anaconda 3 with Python 3.7, the deep learning framework was PyTorch 1.10, and the code was run in Jupyter Notebook. A series of experiments was conducted to evaluate the performance of the GRU model, the stacking CNN + GRU model, and the proposed ELA model.
Figure 4 shows the training and validation curves attained by the GRU model. The loss obtained in training (blue line) and validation (orange line) is exhibited in Figure 4a, and the accuracies attained in training (blue line) and validation (orange line) are illustrated in Figure 4b. The best accuracy was achieved at epoch 37.
Table 7 shows the performance of the GRU model for the six activities. As shown, the average precision, recall, F1-score, and accuracy are 92.7%, 92.6%, 92.5%, and 92.5%, respectively. Although all average measures over the six activities are higher than 90%, the GRU model performed unsatisfactorily in recognizing the sitting and standing activities, whose F1-scores are below 90%.
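For reference, the per-class and average measures reported in Tables 7-10 can be computed from test-set predictions as in the following sketch; the dummy labels here merely stand in for real model outputs (e.g., logits.argmax over the UCI-HAR test split), and the activity names follow the UCI-HAR label set.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

ACTIVITIES = ["walking", "walking_upstairs", "walking_downstairs",
              "sitting", "standing", "laying"]

# Dummy labels/predictions stand in for the test set; in practice
# y_pred = model(x_raw, x_feat).argmax(dim=1) over the UCI-HAR test split.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 6, size=2947)   # the UCI-HAR test set has 2947 windows
y_pred = np.where(rng.random(2947) < 0.9, y_true, rng.integers(0, 6, size=2947))

# Per-class precision/recall/F1 plus macro averages, as in Tables 7-10.
print(classification_report(y_true, y_pred, target_names=ACTIVITIES, digits=3))
print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")
```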
Figure 5 shows the training and validation curves obtained by the stacking CNN + GRU model. The resultant loss in training (blue line) and validation (orange line) is presented in Figure 5a, and the obtained accuracies in training (blue line) and validation (orange line) are shown in Figure 5b. The best accuracy was reached at epoch 39.
Table 8 illustrates the performance of the stacking CNN + GRU model for the six activities. As shown, the average precision, recall, F1-score, and accuracy are 93.0%, 92.9%, 92.9%, and 92.7%, respectively. Although all average measures for the six activities are higher than 90%, the stacking CNN + GRU model also performed unsatisfactorily in recognizing the sitting and standing activities, whose F1-scores are below 90%.
Figure 6 shows the training and validation curves of the proposed ELA model when it combines only the GRU and stacking CNN + GRU. The obtained loss in training (blue line) and validation (orange line) is exhibited in Figure 6a, and the attained accuracies in training (blue line) and validation (orange line) are illustrated in Figure 6b. The best accuracy was achieved at epoch 10.
Table 9 shows the performance of the ELA model without the 561 parameters, which fuses the outputs of the two remaining branches for activity classification. The average precision, recall, F1-score, and accuracy are 93.5%, 93.6%, 93.5%, and 93.4%, respectively. However, without the 561 parameters, the recognition of the sitting and standing activities does not improve significantly.
Figure 7 shows the training and validation curves of the proposed ELA model. The obtained loss in training (blue line) and validation (orange line) is exhibited in Figure 7a, and the attained accuracies in training (blue line) and validation (orange line) are illustrated in Figure 7b. The best accuracy was achieved at epoch 18.
Table 10 shows the performance of the ELA model, which fuses the outputs of all three branches for activity classification. The average precision, recall, F1-score, and accuracy are 96.8%, 96.8%, 96.8%, and 96.7%, respectively. Notably, the F1-scores of all six activities are higher than 90%. In particular, the F1-scores for the sitting and standing activities are 91.7% and 92.9%, respectively, a significant improvement compared with the GRU and stacking CNN + GRU models.
To further verify the effectiveness of the models employed in this study, the WISDM and OPPORTUNITY datasets were also used. Since these datasets do not provide the 561 time-domain and frequency-domain parameters, the proposed ELA model could not be applied; therefore, only the GRU and stacking CNN + GRU models were used for performance evaluation.
Figure 8 illustrates the F1-scores of the GRU and stacking CNN + GRU models on the UCI-HAR, WISDM, and OPPORTUNITY datasets. The F1-scores of the GRU and stacking CNN + GRU models are 83.8% and 86.2% on the WISDM dataset, and 91.7% and 87.4% on the OPPORTUNITY dataset, respectively, which are lower than those obtained with the UCI-HAR dataset.
Table 11 presents the computation time required for testing each activity sample with the GRU, stacking CNN + GRU, and ELA models. The results indicate that the ELA model required the longest time (1.681 ms) and the GRU model the shortest (0.031 ms).
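One possible way to measure such per-sample inference times is sketched below; the warm-up pass, repeat count, and CUDA synchronization are illustrative choices (the `model` here is the ELA sketch from the Introduction), and the paper's exact timing protocol may differ.

```python
import time
import torch

@torch.no_grad()
def per_sample_latency(model, x_raw, x_feat, repeats=100):
    """Average per-sample inference time in milliseconds (one possible
    protocol; the paper's exact measurement procedure may differ)."""
    model.eval()
    model(x_raw, x_feat)              # warm-up pass
    if torch.cuda.is_available():
        torch.cuda.synchronize()      # flush queued GPU kernels before timing
    start = time.perf_counter()
    for _ in range(repeats):
        model(x_raw, x_feat)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * x_raw.shape[0]) * 1e3

# Example call with a single test window:
# print(per_sample_latency(model, torch.randn(1, 128, 9), torch.randn(1, 561)))
```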
4. Discussion
In pattern recognition, the traditional procedure is to first extract feature vectors from the raw data and then employ a suitable model based on those feature vectors for classification [41]. In recent years, a great success in complicated pattern recognition fields has been the DNN with more than three layers, which combines feature extraction and classification into a single learning structure and directly constructs a decision function [42]. The core of the generation strategies is to produce individual classifiers whose errors are diverse, thereby enhancing classification performance; commonly used fusion schemes include the simple average and weighted average [43]. In addition, other schemes for combining multiple classifiers have been suggested, such as Dempster-Shafer combination rules [44], the stacking method [32], and second-level trainable combiners [45]. Ensemble learning has been proved, in both theory and practice, to effectively improve generalization ability [46].
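As a small illustration of the simple average and weighted average fusion schemes mentioned above, the following sketch fuses the class-probability outputs of several classifiers; the function name and the accuracy-based weights are illustrative assumptions.

```python
import numpy as np

def weighted_average_fusion(probs, weights=None):
    """Fuse per-classifier class-probability matrices (n_samples, n_classes).

    weights=None gives the simple average scheme; otherwise the weights are
    typically set from each classifier's validation accuracy.
    """
    stacked = np.stack(probs)                    # (n_clf, n_samples, n_classes)
    if weights is None:
        fused = stacked.mean(axis=0)             # simple average
    else:
        w = np.asarray(weights, dtype=float)
        fused = np.tensordot(w / w.sum(), stacked, axes=1)  # weighted average
    return fused.argmax(axis=1)                  # fused class predictions

# Example: fuse two hypothetical classifiers, weighting by validation accuracy.
p1 = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
p2 = np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])
print(weighted_average_fusion([p1, p2], weights=[0.92, 0.95]))  # -> [0 1]
```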
In this study, we have proposed the ELA model to classify the six activities. A distinctive aspect of the training samples is that they combine the raw sensor data with a feature vector extracted from them, the feature vector being the 561 time-domain and frequency-domain parameters generated from the raw sensor data. As shown in Table 9, the performance of the ELA model that combines only the GRU and stacking CNN + GRU is better than that of the individual GRU and stacking CNN + GRU models. However, the recognition of the sitting and standing activities remains unsatisfactory, as their F1-scores are below 90%; these results mirror those of the individual GRU and stacking CNN + GRU models shown in Tables 7 and 8.
According to Tables 7 and 8, the performance of the stacking CNN + GRU model is slightly better than that of the GRU model: the averages of precision, recall, F1-score, and accuracy are 93.0% vs. 92.7%, 92.9% vs. 92.6%, 92.9% vs. 92.5%, and 92.7% vs. 92.5%, respectively. The major problem of the individual GRU and stacking CNN + GRU models is that two activities, sitting and standing, cannot be classified well enough. However, as shown in Table 10, the proposed ELA model delivers significantly improved performance; in particular, the F1-scores of the sitting and standing activities exceed 90%. The averages of precision, recall, F1-score, and accuracy achieved by the proposed ELA are 96.8%, 96.8%, 96.8%, and 96.7%, respectively. Thus, extracting useful features as input patterns can effectively improve the practical performance of the ELA model.
Table 12 compares our method with other studies using the UCI-HAR dataset. Notably, previous studies usually used only the sensor signals with deep learning methods [17,18,19,20,21,22,23], or the 561 parameters with machine learning methods [14,15,16]. As shown, the proposed ELA model attains an F1-score of 96.8% and an accuracy of 96.7%, which are among the best results.
We analyzed the confusion matrices of the GRU and stacking CNN + GRU models, as shown in Figure 9. The results show that misclassification between the standing and sitting activities occurs frequently. Therefore, in this study, the 561 time-domain and frequency-domain parameters were applied to enhance the HAR performance.
Figure 10 illustrates the differences in the mean and standard deviation (SD) of the 561 parameters between the training and testing data of the standing and sitting activities. In Figure 10a, the blue line shows the mean differences of the 561 parameters between the training set of the standing activity (x4_mean) and the testing set of the sitting activity (x5_mean), and the orange line shows the mean differences between the same training set (x4_mean) and the testing set of the standing activity (x6_mean). The values on the blue line are much higher than those on the orange line. In Figure 10b, the blue line shows the SD differences of the 561 parameters between the training set of the standing activity (x4_SD) and the testing set of the sitting activity (x5_SD), and the orange line shows the SD differences between the same training set (x4_SD) and the testing set of the standing activity (x6_SD); again, the values on the blue line are much higher than those on the orange line. These results indicate that introducing the 561 parameters widens the feature difference between the training data of the standing activity and the testing data of the sitting activity, while narrowing the feature difference between the training and testing data of the standing activity. As shown in Table 9, the results are distinct from those above when the 561 parameters are not included in the ELA scheme.
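The per-parameter differences plotted in Figure 10 can be reproduced along the lines of the following sketch; the x4/x5/x6 names follow the figure's labels, while the random arrays are placeholders for the actual activity subsets of the 561-parameter vectors.

```python
import numpy as np

def feature_gap(train_feats, test_feats):
    """Per-parameter |mean| and |SD| differences between two sets of
    561-parameter vectors, as plotted in Figure 10."""
    d_mean = np.abs(train_feats.mean(axis=0) - test_feats.mean(axis=0))
    d_sd = np.abs(train_feats.std(axis=0) - test_feats.std(axis=0))
    return d_mean, d_sd

# x4: training vectors of "standing"; x5: testing "sitting"; x6: testing
# "standing" (labels follow Figure 10; the arrays below are placeholders
# for the real activity subsets).
x4 = np.random.randn(500, 561)
x5 = np.random.randn(300, 561)
x6 = np.random.randn(300, 561)
cross_mean, cross_sd = feature_gap(x4, x5)    # standing-train vs. sitting-test (blue lines)
within_mean, within_sd = feature_gap(x4, x6)  # standing-train vs. standing-test (orange lines)
print(cross_mean.mean() > within_mean.mean())  # expected True on the real data
```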
The proposed ELA scheme fuses deep learning and machine learning methods. To allow a fair comparison with previous studies, we did not evaluate the generalization of the ELA model with k-fold cross-validation: in the UCI-HAR dataset, the training and testing samples are pre-separated, and all previous studies used the same split to validate their proposed methods. Moreover, since considerable time is required to implement a system for real-time application, it is difficult at the current stage to see how well the proposed model works in actual (real-life) testing. In addition, HAR cannot be performed when the smartphone is charging or is not placed at the waist, which is another limitation of this approach. In the near future, we will design a wearable device equipped with an accelerometer and gyroscope, and embed the parameters of the proposed ELA model in a Movidius neural compute stick, such as the Intel® Neural Compute Stick 2, to verify the HAR performance in a real scenario.