1. Introduction
Human activity recognition (HAR) is a transformative field with diverse applications in healthcare, fitness, military, and robotics. Caregivers can monitor older adults’ activities to detect the need for assistance [
1], while physical therapists can provide real-time feedback to ensure patients perform exercises correctly [
2]. In fitness and training scenarios, HAR can track movements, count steps, and calculate calorie expenditure to support overall wellness [
2]. Additionally, HAR is employed in surveillance systems to detect threats and inform decision-making in critical infrastructure and combat situations [
4]. In general, HAR aims to analyze and predict human behavior through activity signals collected from a variety of sensors such as magnetometers, gyroscopes, accelerometers, cameras, and LiDAR [
4]. However, as smartphones have become ubiquitous, HAR models leveraging their inertial sensors (e.g., gyroscopes, accelerometers) have gained traction, providing an unobtrusive solution for monitoring daily activities [
4]. The research in this manuscript focuses solely on HAR from mobile sensors: the gyroscope (a sensor that measures orientation and angular velocity) and the accelerometer (an electronic sensor that measures the acceleration forces acting on an object). However, the computational complexity of traditional HAR models presents challenges for deployment on portable devices with limited resources.
Human-engineered features are the fundamental components upon which shallow machine-learning models were built [
5,
6,
7,
8,
9]. However, feature engineering is inherently time-consuming and subject to the influence of human biases and assumptions. To address this issue, researchers have explored the use of deep learning, which has revolutionized HAR with its automatic feature extraction capabilities [
4]. Several relevant studies have investigated the methodologies of one-dimensional (1D) and two-dimensional (2D) convolutional neural networks (CNNs), as well as recurrent neural networks (RNNs). A 1D-CNN for HAR that employs a divide-and-conquer-based classifier with two stages was proposed by [
10]. The first stage uses a binary classifier to recognize abstract activities designated “dynamic” or “static”, while the second stage uses two multi-class 1D-CNN models to identify the individual activities within each binary class. The disadvantage is that each stage depends on the one before it: any mistake made at the start prevents the model from achieving correct action recognition. Dua et al. [
11] proposed a multi-input CNN-GRU model for HAR comprising a three-head architecture that uses three different convolutional filter sizes to capture various spatio-temporal dependencies. Ragab et al. [
12] introduced the random search 1D-CNN for HAR. Lee et al. [
13] developed a deep learning model for semantic segmentation in HAR, focusing on transition activities using a multi-channel CNN; an attention layer further refines the feature focus, improving model performance on transition-activity recognition. Zhang et al. [
14] proposed a 1DCNN-ResBiLSTM-Attention model that combines a 1D-CNN, residual bidirectional Long Short-Term Memory (BiLSTM), and an attention mechanism to improve the accuracy of recognizing similar activities by leveraging their distinctive leg-movement patterns. Mehmood et al. [
15] drew inspiration from DenseNet [
16] and proposed an architecture that utilizes inertial sensors, in which each layer receives the feature maps of all preceding layers.
2D-CNNs have also been employed for HAR [
17,
18,
19]. Researchers [
20] proposed a new HAR approach featuring separate spatial and temporal feature extraction phases. The model utilized preprocessing techniques, spatial and temporal blocks, and attention mechanisms, achieving high
F1-scores across multiple datasets. Xia et al. [
21] used a hybrid LSTM-CNN. Wang et al. [
22] proposed an attention-based HAR method for weakly labeled data. The model leverages spatial-temporal feature integration and attention mechanisms to focus on relevant activity data, improving performance on noisy, weakly labeled datasets compared to CNN and LSTM-based approaches.
1D-CNNs stand out in their ability to effectively analyze time-series data, characterized by a single sequence of values. Unlike 2D-CNNs, which process data in two dimensions, 1D-CNNs have fewer parameters, rendering them more efficient [
23]. Compared to RNNs, which are also commonly employed for time-series analysis, 1D-CNNs offer computational efficiency and simplicity in training. Moreover, they mitigate the vanishing gradient issue often encountered in LSTMs [
24,
25]. Lego filters [
26,
27], CondConv [
28], and the matched filter CNN classifier [
29] are lightweight deep learning approaches for HAR that outperform conventional models. These methods utilize modular filter units, dynamic expert kernels, and signal processing techniques to enhance accuracy and computational efficiency on diverse datasets. CNNs can extract spatial and temporal features for human action recognition. Larger temporal filters capture long-term patterns, while smaller ones excel at short-term changes. Filter size selection is therefore a critical hyperparameter, balancing performance against computational complexity.
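To make this trade-off concrete, the following minimal Keras sketch counts the parameters of a single Conv1D layer at several temporal kernel sizes; the 128-sample window and six inertial channels mirror the setup used later in this work, while the filter count is an illustrative assumption.

```python
import tensorflow as tf

def conv_branch(kernel_size, window=128, channels=6, filters=64):
    """A single 1D-CNN branch; kernel_size sets the temporal receptive field."""
    inp = tf.keras.Input(shape=(window, channels))  # (time steps, sensor axes)
    x = tf.keras.layers.Conv1D(filters, kernel_size, padding="same",
                               activation="relu")(inp)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    return tf.keras.Model(inp, x)

# Small kernels favor short-term changes; large kernels capture long-term
# patterns at the cost of more parameters per layer.
for k in (3, 7, 15):
    print(f"kernel={k:2d} -> {conv_branch(k).count_params()} parameters")
```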
Graph neural networks, in particular graph convolutional networks (GCNs), known for modeling complex interactive activities, have also been reported for HAR [
30,
31,
32,
33,
34,
35]. Ghalan and Aggarwal [
30] proposed a novel ensemble model, Graph Engineered EnsemCNN HAR (GE-EnsemCNN-HAR), combining CNNs with GCNs for improved classification of complex activities. Yang et al. [
31] presented the Graph Domain Adaptation (GDA) network, a novel approach for sensor-based HAR that enhances model generalization, especially with limited data. By leveraging a graph neural network with adaptive learning and a local residual structure, the GDA network effectively captured non-Euclidean relationships in sensor signals. Ref. [
32] introduced MG-WHAR, a novel method for wearable human activity recognition (WHAR) that models relationships among multiple sensors using graph structures. By constructing three types of graphs—based on body structure, sensor modality, and data patterns—MG-WHAR leverages multi-graph convolutional networks to enhance feature interactions and improve model performance. Belal et al. [
33] explored HAR using sensory data and demonstrated the effectiveness of feature fusion for improving recognition accuracy, employing a Parameter-Optimized Multi-Stage Graph Convolutional Network (PO-MS-GCN) and a Transformer. The study highlighted the limitations of existing models in capturing both spatial and temporal features. Duhme et al. [
34] introduced Fusion-GCN, a method for multimodal action recognition that integrated various sensor data modalities into a graph for training with a GCN. By incorporating sensor measurements through additional node attributes or new nodes, Fusion-GCN flexibly fused RGB sequences, inertial measurements, and skeleton sequences. Huang et al. [
35] introduced a deep framework for micro-gesture classification that utilizes ensemble models based on hypergraph-convolution transformers. The proposed approach enhanced the self-attention mechanism to better capture complex correlations within the skeleton data. Furthermore, the method employed data grouping and model ensemble techniques to address the challenges posed by imbalanced datasets.
Residual networks (ResNets) [
36], a CNN variant with skip connections, have shown strong performance on IMU data [
20]. Their ease of training, ability to learn from smaller datasets, and adaptability to data changes are key strengths [
37]. However, gradient noise can slow convergence and limit performance in certain tasks. Gated recurrent units (GRUs) [
38] and LSTMs [
25], both RNN variants, are strong for sequential data but face gradient issues. Residual connections in ResNets, or transformers that leverage past/future context, can mitigate these challenges in HAR. BiLSTM [
39] networks are variants of LSTM, offering clear advantages. By processing the data in both forward and backward directions, they can capture both past and future context, which is essential for understanding and predicting actions. However, implementing BiLSTMs for HAR also requires more computational resources due to the need to process data in both directions [
40]. Hybrids of CNNs with both GRUs and LSTMs have been also utilized for HAR [
21,
41,
42]. These hybrid approaches leverage the strengths of CNNs in extracting spatial features and the sequential learning capabilities of RNNs to identify spatial and temporal patterns in the data effectively. LSTMs are exceptionally skilled at modeling long-term dependencies in sequential data, while GRUs are simpler versions of LSTMs that offer similar capabilities with fewer parameters [
43], and less memory [
42].
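As an illustration of this family of hybrids, the sketch below outlines a generic CNN-GRU model for windowed inertial data; the layer widths and depth are illustrative assumptions and do not reproduce any of the cited architectures.

```python
import tensorflow as tf

def cnn_gru(window=128, channels=6, n_classes=6):
    """Generic CNN-RNN hybrid: Conv1D extracts local motion features,
    and a GRU models their temporal ordering (illustrative sizes only)."""
    inp = tf.keras.Input(shape=(window, channels))
    x = tf.keras.layers.Conv1D(64, 5, activation="relu")(inp)
    x = tf.keras.layers.MaxPooling1D(2)(x)
    x = tf.keras.layers.GRU(64)(x)  # similar power to an LSTM, fewer parameters
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inp, out)

model = cnn_gru()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```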
Zhang et al. [
44] and Hassan et al. [
9] both explored the use of deep belief networks (DBNs) for HAR. While Zhang et al. [
44] recommended DBNs for real-time activity recognition, Hassan et al. [
9] found that DBNs outperformed shallow classification methods like ANN and SVM, achieving the highest recognition rate and accuracy.
Human actions usually involve long-term space-time interactions [
45]. Transformers are used in human action recognition because their attention mechanisms better suppress redundancy and better model long-range interactions. Luptáková et al. [
46] explored using transformer models for recognizing human activities through time-series data from wearable sensors. Employing the self-attention mechanism, the transformer model processed sequences effectively without recurrent structures. The study also employed data augmentation techniques to artificially expand the training dataset, thereby enhancing the model’s generalization capabilities. On public leaderboards, CNN models are still often preferred; however, vision-transformer-based models outperform CNNs in recognition accuracy, a crucial factor for human action recognition. LIMU-BERT [
42], inspired by the bidirectional encoder representations from transformers (BERT) [
47], can extract features from IMU data, but faces issues with transfer learning and handling rare instances, limiting its utility as a pre-trained model for HAR. The MobileHART model [
48], a combination of transformers and CNNs, aims to be lightweight for smartphone deployment. However, it has a larger parameter count compared to other lightweight HAR models explored in the literature.
Yet, the computationally intensive nature of deep learning models hinders their implementation on resource-constrained platforms. Recurrent Neural Networks like LSTMs excel at modeling temporal dependencies but struggle with spatial data, while Convolutional Neural Networks excel at spatial feature extraction but lack temporal awareness [
21,
41,
42,
49]. Bidirectional LSTMs address this by capturing long-term dependencies by processing inputs in both forward and backward directions [
40]. We investigate whether applying data augmentation techniques like time reversal [
50] can emulate the contextual learning of BiLSTMs, potentially leading to more efficient HAR models.
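To make this idea concrete, a minimal sketch of time-reversal (flipping) augmentation is given below; the array layout follows the windowed inertial format used throughout this work, while the function name and random data are purely illustrative.

```python
import numpy as np

def flip_augment(X, y):
    """Append a time-reversed copy of every window.

    X: (n_windows, time_steps, channels) inertial windows; y: labels.
    The flipped copies expose the model to reversed temporal context at
    training time, emulating the backward pass of a BiLSTM.
    """
    X_rev = np.flip(X, axis=1)  # reverse along the time axis only
    return np.concatenate([X, X_rev]), np.concatenate([y, y])

# Example: 100 windows of 128 samples from 6 sensor axes
X = np.random.randn(100, 128, 6).astype(np.float32)
y = np.random.randint(0, 6, size=100)
X_aug, y_aug = flip_augment(X, y)
print(X_aug.shape)  # (200, 128, 6)
```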
Table A1 provides a concise summary of the performance of various models from reviewed scholarly works along each dataset.
The 1DCNN-ResBiLSTM-Attention model [
14] is a deep learning architecture that combines three components: a 1D-CNN, a residual BiLSTM, and an attention mechanism. Its primary goal is to improve the accuracy of classifying similar actions. However, the integration of the additive operation with the BiLSTM increases the number of parameters in the model, which raises the question of whether the BiLSTM can be replaced with other components that reduce the parameter count and keep the model lightweight while maintaining its performance. Can we approximate the bidirectional nature of the BiLSTM with the data-flipping augmentation technique, reading the input from both directions? Specifically, the following are the anticipated contributions of the manuscript:
The paper investigates replacing the computationally intensive BiLSTM component in a HAR model with a combination of standard and residual LSTMs, as well as convolutional networks, to reduce the number of parameters and maintain model performance on resource-constrained devices.
The study explores using data flipping augmentation to replicate the bidirectional context awareness provided by BiLSTMs, aiming to achieve similar performance with lower computational demands.
The proposed modifications are evaluated on multiple datasets (i.e., UCI-HAR, WISDM, and KU-HAR) collected from mobile phone sensors (i.e., gyroscope, accelerometer) to demonstrate the effectiveness of the proposed methods.
3. Results
Table 5,
Table 6 and
Table 7 present the results obtained by the different configurations on the UCI-HAR, WISDM, and KU-HAR datasets, respectively. Each table includes detailed metrics for the various model configurations: accuracy, precision (P), recall (R), F1-score, the number of model parameters, and the training time in seconds. The substituted component is denoted by (Sub. Comp.).
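For reference, these metrics can be computed as in the sketch below; the macro averaging is an assumption (chosen because it weights all classes equally), which matters for the imbalanced datasets discussed later.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(y_true, y_pred):
    """Compute the four reported metrics; 'macro' averaging is an assumption."""
    return {
        "Acc": accuracy_score(y_true, y_pred),
        "P": precision_score(y_true, y_pred, average="macro"),
        "R": recall_score(y_true, y_pred, average="macro"),
        "F1": f1_score(y_true, y_pred, average="macro"),
    }

print(evaluate([0, 1, 2, 2, 1, 0], [0, 1, 2, 1, 1, 0]))
```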
In our analysis of
Table 5,
Table 6 and
Table 7, we observed that Group A, which includes the BiLSTM and ResBiLSTM architectures, not only achieves the highest performance metrics but also demonstrates a substantial increase in the number of parameters compared to the other models. This suggests that the higher performance of Group A comes at the cost of increased model complexity and computational requirements. Specifically, the ResBiLSTM often slightly outperforms the BiLSTM, which could be attributed to the additional parameters helping the model learn more nuanced features of the data.
Group B, comprising LSTM and ResLSTM, strikes a balance between performance and the number of parameters. These models have significantly fewer parameters than those in Group A, potentially making them more efficient in terms of computational resources while still maintaining good performance. The ResLSTM, which slightly outperforms the standard LSTM in most cases, shows how minor adjustments and additional parameters in the LSTM architecture can enhance performance without dramatically increasing complexity.
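The residual idea behind the ResLSTM can be sketched as an additive skip connection around an LSTM layer, as below; the layer width and the linear projection used to match channel dimensions are illustrative assumptions, not the exact configuration evaluated here.

```python
import tensorflow as tf

def res_lstm_block(x, units=64):
    """Residual LSTM block: output = LSTM(x) + x (projected if widths differ)."""
    shortcut = x
    h = tf.keras.layers.LSTM(units, return_sequences=True)(x)
    if shortcut.shape[-1] != units:
        # Linear projection so the skip connection matches the LSTM width
        shortcut = tf.keras.layers.Dense(units)(shortcut)
    return tf.keras.layers.Add()([h, shortcut])

inp = tf.keras.Input(shape=(128, 6))  # one window: 128 steps, 6 sensor axes
x = res_lstm_block(inp)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
out = tf.keras.layers.Dense(6, activation="softmax")(x)
model = tf.keras.Model(inp, out)
```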
Group C, which includes CNN and ResCNN, consistently shows the lowest performance. However, it also tends to have fewer parameters, especially in the ResCNN variants. This highlights a crucial aspect: CNNs are less capable of effectively capturing temporal dependencies.
3.1. Performance Analysis for Results Obtained on the UCI-HAR Dataset
As shown in
Figure 3a, the accuracy of Group A, which used BiLSTM base models without data augmentation, was compared to the results of Group B and Group C. Group B and Group C both utilized data augmentation techniques, with Group B coupling data augmentation with LSTM-based models and Group C pairing it with CNN models. The results for the ResLSTM model were similar to, or even higher than, the performance of the other groups. The ResLSTM model achieves the highest accuracy score among the evaluated models with an accuracy of 96.34%, as shown in
Table 5. Additionally, the ResLSTM model has fewer parameters than the CNN model, as shown in
Figure 3b.
Compared with other models in the literature review, the 1DCNN-ResBiLSTM [
14] model reported a training-and-validation accuracy of 98.37%, but its test accuracy decreased to 95.96% (Group A). CNN-LSTM models [
21,
41] reported 97.89% and 95.8% test accuracies, respectively, higher than our LSTM and ResLSTM models in Group B. The confusion matrices and learning curves for each model provide detailed classification results and training dynamics, with early stopping used to determine the optimal number of epochs.
3.1.1. Numerical Analysis of Group A Models
The confusion matrix for the BiLSTM model in
Figure 4a shows high accuracy in identifying Walking and Laying with no misclassifications. However, Upstairs and Downstairs have some misclassifications, particularly with Walking. Sitting is often confused with Standing and Laying, while Standing has instances misclassified as Sitting. The confusion matrix for the ResBiLSTM model (
Figure 4b) shows high accuracy in identifying Laying with no misclassifications. However, Walking, Upstairs, and Downstairs have some misclassifications, particularly involving Walking. Sitting and Standing are often confused with each other.
The learning curve of the BiLSTM model in Group A using the UCI-HAR dataset in
Figure A1 and
Figure A2 shows both model loss and accuracy over 60 epochs. The training and validation loss curves converge rapidly, stabilizing near zero, reflecting efficient learning and minimal error. Similarly, the accuracy curves for both training and validation data plateau near 1.0, indicating high accuracy and good generalization to unseen data.
The learning curve of the ResBiLSTM model using the UCI-HAR dataset in
Figure A1 and
Figure A2 shows both model loss and accuracy over 30 epochs. The training and validation loss curves converge rapidly, similar to the rest of the models.
3.1.2. Numerical Analysis of Group B Models
The confusion matrix for the LSTM model in
Figure 5a shows high accuracy in identifying Laying with no misclassifications. However, sitting and standing are often confused. The confusion matrix for the ResLSTM model in
Figure 5b shows high accuracy in identifying Laying and Downstairs with no misclassifications. Sitting and Standing are often confused with each other.
In
Figure A1 and
Figure A2, the learning curve of the LSTM model using the UCI-HAR dataset shows both model loss and accuracy over 25 epochs. The training and validation loss curves converge rapidly, similar to the previous models.
In
Figure A1 and
Figure A2, the learning curve of the ResLSTM model using the UCI-HAR dataset shows both model loss and accuracy over 60 epochs. The training and validation loss curves converge rapidly, as with the previous models.
3.1.3. Numerical Analysis of Group C Models
The confusion matrix for the CNN model in
Figure 6a shows high accuracy in identifying Laying with no misclassifications. Sitting and Standing are often confused with each other. The confusion matrix for the ResCNN model in
Figure 6b shows high accuracy in identifying Laying and Walking with no misclassifications. Sitting and Standing are often confused with each other.
In
Figure A1 and
Figure A2, the learning curve of the CNN model using the UCI-HAR dataset shows both model loss and accuracy over 40 epochs. The training and validation loss curves converge rapidly, as with the previous models.
In
Figure A1 and
Figure A2, the learning curve of the ResCNN model using the UCI-HAR dataset shows both model loss and accuracy over 40 epochs. The training and validation loss curves converge rapidly, as with the previous models.
3.2. Performance Analysis for Results Obtained on the WISDM Dataset
The WISDM dataset is imbalanced, since the “Walking” class takes almost 39% of the class distribution, as opposed to, for example, 4% for the “Standing” class [
56]. The
F1-score provides a crucial metric in this case.
The graphs in
Figure 7 represent the
F1-scores, through which we can see that the Group A models demonstrated robust performance, with the ResBiLSTM achieving the highest
F1-score of 97.71%, as also shown in
Table 6. This illustrates the effectiveness of bidirectional architectures in grasping the complex temporal dependencies within the activity data. The standard BiLSTM also performed commendably with an
F1-score of 97.23% (
Table 6), reinforcing the capability of LSTM-based architectures in context recognition.
In contrast, Group B leveraged data augmentation, with the ResLSTM model, as shown in
Table 6, achieving an
F1-score of 97.20%, marginally lower than Group A’s top performer but still highly effective, demonstrating that residual connections can enhance the LSTM’s performance by deepening the feature extraction process without an excessive parameter increase.
Table 6 shows that Group C explored simpler CNN and ResCNN models, achieving
F1-scores of 96.40% and 96.45%, respectively. These results are competitive, especially when considering the computational efficiency of CNNs. The ResCNN’s slight edge over the standard CNN highlights the benefit of integrating residual learning to bolster feature learning capabilities.
These outcomes can be compared with other models reported in the literature, such as the ensemble of GRU, INC, ResNet, and CBAM attention mechanisms [
20], which reported an
F1-score of 99.12%. Compared with this ensemble approach, which combines RNN, CNN, and attention base components, our Group B models show roughly a 2% lower
F1-score.
3.2.1. Numerical Analysis of Group A Models
The confusion matrix in
Figure 8a shows the performance of the BiLSTM model across different activities. The model demonstrates high accuracy in identifying JOGGING and WALKING, with 1087 and 1271 correct classifications, respectively. However, there are some notable misclassifications, such as UPSTAIRS being confused with DOWNSTAIRS and WALKING, and SITTING being occasionally misclassified as STANDING. In
Figure 8b, the confusion matrix of the ResBiLSTM model using the WISDM dataset shows several misclassifications, especially for UPSTAIRS, which is often confused with DOWNSTAIRS and WALKING. Despite high accuracy in identifying JOGGING and WALKING, there are notable errors, such as SITTING being misclassified as STANDING and vice versa. The imbalanced nature of the dataset accentuates these misclassifications.
The learning curves of the BiLSTM model using the WISDM dataset in
Figure A3 and
Figure A4 show both model loss and accuracy over 20 epochs. The training and validation loss curves converge rapidly, stabilizing near 0.1, reflecting efficient learning and minimal error. Similarly, the accuracy curves for both training and validation data plateau near 0.975, indicating high accuracy and good generalization to unseen data.
The learning curve in
Figure A3 and
Figure A4 of the ResBiLSTM model using the WISDM dataset shows both model loss and accuracy over 70 epochs. The training and validation loss curves converge rapidly, stabilizing near 0.1, reflecting efficient learning and minimal error. Similarly, the accuracy curves for both training and validation data plateau near 0.98, indicating high accuracy and good generalization to unseen data.
3.2.2. Numerical Analysis of Group B Models
The confusion matrices of the LSTM and ResLSTM models on the WISDM dataset (
Figure 9) reveal misclassifications, particularly for UPSTAIRS, which is often confused with DOWNSTAIRS and WALKING. Despite high accuracy for JOGGING and WALKING, there are errors like SITTING being misclassified as STANDING, likely due to dataset imbalance.
The LSTM model’s learning curves on the WISDM dataset (
Figure A3 and
Figure A4) show rapid convergence of training and validation loss near 0.1, with both training and validation accuracy plateauing at around 0.97, indicating effective learning and good generalization to unseen data. A similar observation can be drawn from
Figure A3 and
Figure A4 for the ResLSTM model.
3.2.3. Numerical Analysis of Group C Models
The CNN and ResCNN models’ confusion matrices on the WISDM dataset (
Figure 10) show difficulties classifying UPSTAIRS, often confused with DOWNSTAIRS and WALKING. While accurate for JOGGING and WALKING, there are significant errors, such as SITTING being misclassified as STANDING, likely due to dataset imbalance.
The CNN and ResCNN models’ learning curves on the WISDM dataset in
Figure A3 and
Figure A4 show rapid convergence of training and validation loss near 0.1, with accuracy plateauing around 0.97, indicating effective learning and good generalization, despite some fluctuations in the ResCNN’s validation accuracy.
3.3. Performance Analysis for Results Obtained on the KU-HAR Dataset
The KU-HAR dataset, similar to the WISDM dataset, suffers from an imbalanced class distribution [
57]. Therefore, the
F1-score is utilized for evaluating model performance.
Group A models, including BiLSTM and ResBiLSTM, deliver exceptional performance with
F1-scores of 98.22% and 98.72%, respectively (
Figure 11). Group B, which includes LSTM and its enhanced variant, ResLSTM, shows notable effectiveness with
F1-scores of 97.78% and 96.00%, respectively. Data augmentation was employed to potentially enhance model performance by augmenting the dataset with time-reversed sequences. Group C, comprising basic CNN and ResCNN models, demonstrates commendable performance with
F1-scores of 88.12% and 94.10%, respectively. The noticeable improvement in the ResCNN model underscores the benefits of integrating additive operations between layers to enhance feature extraction capabilities. Data augmentation techniques were similarly utilized in this group to improve model robustness and performance.
Our model achieves an
F1-score of 98.72% on the KU-HAR dataset, slightly exceeding the previously reported 98.16% for an attention-based Residual BiLSTM model [
14]. This improvement indicates effective learning of complex temporal patterns with high accuracy and reasonable computational load. In contrast, a Transformer-based model [
46] attained a higher 99.2%
F1-score, but it requires significantly more resources, making our model more practical for applications prioritizing efficiency over absolute peak performance.
3.3.1. Numerical Analysis of Group A Models
For the BiLSTM model, the confusion matrix (
Figure 12a) indicates generally high performance, with high true positive rates. However, there are some misclassifications between similar activity classes. Similarly, the ResBiLSTM model also demonstrates reasonable performance, with the confusion matrix (
Figure 12b) again exhibiting high true positive rates for most activity classes, but facing some challenges in distinguishing closely related activities.
The model loss plot in
Figure A5 exhibits significant fluctuations in the early epochs, with the training loss spiking dramatically before gradually stabilizing. The validation loss also shows substantial oscillations, particularly in the initial stages. Looking at the model accuracy plot in
Figure A6, the training and validation accuracy curves start around 0.8 and steadily improve, eventually reaching their peak performance of around 0.98 after approximately 60 epochs.
The ResBiLSTM loss plot (
Figure A5) exhibits significant fluctuations in the early epochs, with the training loss spiking dramatically before gradually stabilizing. The validation loss also shows substantial oscillations, particularly in the initial stages. Looking at the model accuracy plot (
Figure A6), the training and validation accuracy curves start around 0.87 and gradually improve, eventually reaching their peak performance of around 0.99 after approximately 100 epochs.
3.3.2. Numerical Analysis of Group B Models
For the LSTM model, the confusion matrix (
Figure 13a) generally shows strong true positive rates along the diagonal. However, there are some misclassifications between similar activity classes such as “Sit-up” and “Walk-backward”. For the ResLSTM model, the confusion matrix (
Figure 13b) demonstrates the model’s strong classification abilities, with high true positive rates along the diagonal, indicating accurate recognition of most activity classes. However, there are some misclassifications between similar activities.
The learning curves for the LSTM and ResLSTM (
Figure A5 and
Figure A6) models exhibit significant early fluctuations, with training loss spiking before stabilizing, and validation loss showing substantial oscillations. However, the accuracy curves steadily improve, reaching peak performance around 0.98 and 0.97 after 80 and 40 epochs respectively. This pattern indicates the models effectively learned complex patterns, despite the initial instability in loss.
3.3.3. Numerical Analysis of Group C Models
The confusion matrix for the CNN model (
Figure 14a) and the ResCNN model (
Figure 14b) demonstrate strong performance on the KU-HAR dataset. Both matrices show high true positive rates along the diagonal, indicating the models’ ability to accurately classify most activity classes. However, there are some noticeable misclassifications, especially between similar activities such as “Sit-up” and “Walk-backward”.
The learning curves for both models (
Figure A5 and
Figure A6) show significant early fluctuations, with training loss spiking before stabilizing and validation loss exhibiting substantial oscillations. However, the accuracy curves steadily improve, reaching peak performance around 0.9 and 0.95 after 40 epochs. This pattern indicates effective learning of complex patterns despite the initial instability in loss, with the second model achieving higher peak accuracy.
3.4. Statistical Analysis
We performed the Wilcoxon test using DATAtab [
58], based on the 5-fold cross-validation (5CV) results. During each fold of the 5CV, with 75 epochs and early stopping applied, we obtained the validation accuracy for each model and then the test accuracy across all the datasets. The results were combined for each group: Group A includes BiLSTM and ResBiLSTM, Group B includes LSTM and ResLSTM, and Group C includes CNN and ResCNN. We used a significance level of 0.05 for our tests.
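The same comparison can be reproduced outside DATAtab with SciPy's Wilcoxon signed-rank implementation, as sketched below; the paired fold accuracies are placeholders, not the values obtained in our runs.

```python
from scipy.stats import wilcoxon

# Paired per-fold test accuracies for two groups (placeholder values only)
group_a = [0.962, 0.958, 0.971, 0.968, 0.977]  # e.g., BiLSTM/ResBiLSTM folds
group_b = [0.960, 0.955, 0.972, 0.961, 0.975]  # e.g., LSTM/ResLSTM folds

stat, p = wilcoxon(group_a, group_b)
print(f"W = {stat:.3f}, p = {p:.3f}")  # reject the null hypothesis if p < 0.05
```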
The Wilcoxon test results show a p-value of 0.215 when comparing Group A, which includes BiLSTM and ResBiLSTM, and Group B, which includes LSTM and ResLSTM. Since the p-value is greater than the typical alpha level of 0.05, we fail to reject the null hypothesis, indicating no statistically significant difference between the performance of Group A and Group B.
Additionally, when comparing Group A, which includes BiLSTM and ResBiLSTM, and Group C, which includes CNN and ResCNN, the Wilcoxon test results show a p-value of 0.005. This p-value is less than the significance level of 0.05, indicating a statistically significant difference between the performance of Group A and Group C.
Our results show that substituting BiLSTM with LSTM and data flipping (Group B) maintains performance, with no statistically significant difference from the original BiLSTM model (Group A). This makes the model lighter while preserving BiLSTM’s capabilities. However, replacing BiLSTM with CNN and data flipping (Group C) results in a significantly lower accuracy, despite also reducing model size. This suggests the LSTM-based approach can effectively approximate BiLSTM’s behavior, while the CNN-based approach lacks the same level of effectiveness, despite both techniques reducing model complexity. The findings address our research questions by demonstrating a viable path to optimize model architecture without sacrificing performance.
3.5. Input Sensor Impact
We conducted an analysis to determine the impact of different sensor types, specifically accelerometers and gyroscopes, on the performance of our proposed ResLSTM model. The UCI-HAR dataset was selected for this experiment as it provides comprehensive data from both accelerometers and gyroscopes, enabling a thorough evaluation of their individual and combined contributions to the model’s performance. The WISDM dataset was not utilized because it is limited to accelerometer data, which would not allow us to assess the impact of gyroscope data. Similarly, the KU-HAR dataset was excluded due to its complexity and to keep in line with our previous testing on the UCI-HAR dataset. The ResLSTM model was trained using the same hyperparameters as before: the same dropout rates, a learning rate of 0.001, and a batch size of 32.
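A sketch of how such a sensor ablation can be set up is given below; the channel ordering is an assumption about the preprocessing, not a property guaranteed by the dataset.

```python
import numpy as np

# Assumed channel order: [acc_x, acc_y, acc_z, gyro_x, gyro_y, gyro_z]
SENSORS = {"acc": slice(0, 3), "gyro": slice(3, 6), "both": slice(0, 6)}

def select_sensor(X, sensor):
    """Keep only the channels of the requested sensor for an ablation run."""
    return X[:, :, SENSORS[sensor]]

X = np.random.randn(32, 128, 6).astype(np.float32)  # dummy batch of windows
print(select_sensor(X, "gyro").shape)  # (32, 128, 3)
```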
The results in
Table 8 indicate that the combined use of gyroscope and accelerometer data significantly improves the performance of the ResLSTM model across all evaluated metrics. The model achieved an accuracy of 96.34%, a precision of 96.35%, a recall of 96.33%, and an
F1-score of 96.32% when both sensor types were used. This represents a substantial improvement compared to using either sensor type alone. The model’s performance with only accelerometer data was notably better than with only gyroscope data, achieving an accuracy of 82.42% compared to 79.98%, suggesting that accelerometer data may provide more relevant features for the activity recognition task.
The inclusion of both sensor types not only enhances the model’s predictive capabilities but also increases the number of parameters, highlighting a trade-off between model complexity and performance. Despite the increased parameter count, the significant improvement in performance justifies the use of both sensors for this application. This also suggests that the complementary nature of the data provided by these two types of sensors is crucial for capturing the nuances of human motion, thereby enhancing the model’s ability to accurately classify activities.
3.6. Window Size Impact
We also performed an analysis to understand how different window sizes impact the performance of the ResLSTM model. This experiment was conducted using the UCI-HAR dataset, which provides comprehensive data from both accelerometers and gyroscopes. The window sizes tested were 512, 256, 128 (previously used), 64, and 32. The same model architecture and hyperparameters were used for all experiments to ensure a fair comparison: the same dropout rates as before, a learning rate of 0.001, and a batch size of 32. See
Table 9.
The results indicate that the window size of 128, which was previously used, yielded the highest performance across all evaluated metrics, with an accuracy of 96.34%. The next best window size was 256, with an accuracy of 94.13%. Smaller window sizes, such as 64 and 32, resulted in lower performance, with accuracies of 93.79% and 92.87%, respectively. The largest window size of 512 also showed a slightly lower performance than 256 and 128, with an accuracy of 93.52%.
These findings suggest that a moderate window size of 128 strikes the best balance between capturing sufficient temporal dependencies and maintaining computational efficiency. Larger window sizes, while potentially capturing more context, may introduce noise and increase computational complexity, which does not necessarily translate to better performance. Smaller window sizes, on the other hand, may not capture enough temporal context, leading to reduced model performance.
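For reference, the segmentation step behind this experiment can be sketched as follows; the 50% overlap matches the UCI-HAR convention, while the majority-vote labeling rule and the helper name are illustrative assumptions.

```python
import numpy as np

def sliding_windows(signal, labels, window=128, overlap=0.5):
    """Segment a continuous multi-channel recording into fixed-size windows.

    signal: (n_samples, channels); labels: (n_samples,) integer classes.
    Each window is labeled by majority vote over its samples (an assumption).
    """
    step = int(window * (1 - overlap))
    X, y = [], []
    for start in range(0, len(signal) - window + 1, step):
        X.append(signal[start:start + window])
        y.append(np.bincount(labels[start:start + window]).argmax())
    return np.array(X), np.array(y)

sig = np.random.randn(10_000, 6)       # dummy 6-axis recording
lab = np.random.randint(0, 6, 10_000)  # dummy per-sample labels
X, y = sliding_windows(sig, lab, window=128)
print(X.shape)  # (155, 128, 6) with a step of 64 samples
```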
3.7. Comparison with State of the Art Transformer Model
To provide a comprehensive comparison of models within our research, we included an evaluation of the Transformer model alongside our proposed ResLSTM architecture. The goal was to determine how well the Transformer would perform in the context of human activity recognition using the UCI-HAR dataset. For the implementation, the Transformer architecture from the Keras library [
59] was used with a learning rate of 0.001, a batch size of 32, the ReLU activation function, and a dropout rate of 0.1.
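For context, the sketch below shows a single encoder block in the style of the Keras timeseries-classification Transformer example; the head sizes, feed-forward width, and single-block depth are illustrative assumptions and may differ from the configuration behind the reported numbers.

```python
import tensorflow as tf

def encoder_block(x, head_size=64, num_heads=4, ff_dim=128, dropout=0.1):
    """One Transformer encoder block: self-attention + feed-forward,
    each wrapped in a residual connection with layer normalization."""
    a = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=head_size, dropout=dropout)(x, x)
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + a)
    f = tf.keras.layers.Dense(ff_dim, activation="relu")(x)
    f = tf.keras.layers.Dropout(dropout)(f)
    f = tf.keras.layers.Dense(x.shape[-1])(f)  # project back to input width
    return tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + f)

inp = tf.keras.Input(shape=(128, 6))  # one window: 128 steps, 6 sensor axes
x = encoder_block(inp)
x = tf.keras.layers.GlobalAveragePooling1D()(x)
out = tf.keras.layers.Dense(6, activation="softmax")(x)
model = tf.keras.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```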
The results (see
Table 10) indicate that the Transformer model achieved an accuracy of 91.18%, which did not surpass our ResLSTM model’s 96.34%. This outcome is significant, as it suggests that, for the specific task of human activity recognition with the UCI-HAR dataset, the ResLSTM model provides superior performance. In terms of precision, recall, and
F1-score, the Transformer model also demonstrated strong but slightly lower performance metrics compared to the ResLSTM model. More specifically, the Transformer achieved a precision of 91.20%, a recall of 91.17%, and an
F1-score of 91.15%, compared to the 96.35%, 96.33%, and 96.32% observed with the ResLSTM model, respectively.
Moreover, it is essential to also highlight the difference in model complexity, particularly the parameter count. The Transformer model, with 7,112,454 parameters, is significantly more complex than the ResLSTM model, which has 576,702 parameters. Despite this increased complexity, the Transformer did not outperform the ResLSTM model, indicating that higher parameter counts and model complexity do not necessarily lead to better performance.
4. Discussion
This research aimed to explore whether the ResBiLSTM model in HAR could be replaced by lighter components while maintaining high accuracy, and if the bidirectional nature of BiLSTM could be approximated through data augmentation techniques.
The findings show that the ResLSTM model achieved the highest accuracy of 96.34% on the UCI-HAR dataset, demonstrating its ability to effectively capture complex temporal sequences. This performance is notable, especially when compared to other high-performing models in the literature. On the WISDM dataset, the ResBiLSTM model achieved a higher F1-score of 97.71% compared to the ResLSTM model’s 97.20%, but the ResLSTM model required fewer parameters, indicating a substantial reduction in computational complexity. The minor difference in F1-score is offset by the reduction in parameters, suggesting the ResLSTM with flipped data could be an alternative. For the KU-HAR dataset, the ResBiLSTM model significantly outperformed the ResLSTM model, with an F1-score of 98.72% compared to 96.00%. However, the LSTM model was closer to the ResBiLSTM F1-score, at 97.78%, with the benefit of fewer parameters. This indicates that for datasets with imbalanced class distributions, the bidirectional nature of BiLSTM provides a distinct advantage.
Across all datasets,
Figure 15 shows that the ResBiLSTM model consistently outperformed the ResCNN model in terms of accuracy and
F1-score, highlighting the superior temporal processing capability of BiLSTM over CNN, even with flipped-data augmentation. The ResCNN model, while showing lower performance, had a significantly lower parameter count, suggesting it is more computationally efficient.
The results indicate that ResLSTM can achieve comparable accuracy with fewer parameters, making it a potential alternative, especially in datasets like UCI-HAR. However, in imbalanced datasets such as KU-HAR and WISDM, ResBiLSTM demonstrated superior performance. The replacement strategies achieved a reduction in model parameters, but the ability to effectively preserve the bidirectional input processing characteristic of BiLSTMs varied across datasets.
The misclassifications between the “Sitting” and “Standing” classes were observed and are primarily due to the similarity in sensor data. The accelerometer and gyroscope readings for these activities often overlap, especially when the person is relatively still in both positions. This similarity makes it challenging for models to distinguish between the two states. Additionally, transitional movements, such as sitting down or standing up, can cause temporary confusion, further complicating accurate classification. Sensor placement and sensitivity also play a crucial role; if the sensors are not positioned optimally or lack the necessary sensitivity, they might fail to capture the subtle differences in posture.
The overfitting was observed in the training of models on the KU-HAR dataset and can be attributed to several factors. A smaller or less diverse dataset, such as KU-HAR, can lead to the model memorizing training samples rather than learning generalizable features, resulting in high training accuracy but poor validation performance. Complex models like ResBiLSTM and ResCNN, with numerous parameters, are particularly prone to overfitting when the dataset does not provide sufficient examples for effective training. The rapid convergence followed by a divergence between training and validation loss in learning curves is indicative of this issue. Furthermore, a lack of robust regularization techniques, such as dropout, L2 regularization, or data augmentation, may have exacerbated the overfitting problem by allowing the model to learn noise in the training data.
This study has several limitations. Its heavy reliance on specific datasets (UCI-HAR, WISDM, KU-HAR) limits the generalizability of the results. Performance might vary significantly with different datasets, potentially limiting the models’ applicability to real-world scenarios. Variability in sensor types, placement, and sampling rates, which the study does not address, could also impact model accuracy and robustness. Moreover, the computational complexity of advanced models like ResBiLSTM and ResCNN might hinder their deployment in real-time or resource-constrained environments, such as wearable devices. The study also focuses on a predefined set of activities, whereas real-world applications often involve a broader and more complex range of activities not covered in this research.