1. Introduction
Wearable exoskeleton robots have recently emerged as a viable solution to assist human physical movements in various fields, such as muscular rehabilitation [1], daily activity assistance [2], and movement support in manufacturing tasks [3,4]. Recognizing human activities would help wearable robots better assist human actions by enhancing and customizing the robot control for each activity [5]. Since exoskeletons are generally equipped with multiple IMU sensors and encoders, it is possible to implement a human activity recognition (HAR) system based on these sensors with high accuracy and a low-latency response [6].
HAR approaches have been widely reported with RGB-D cameras and IMU sensors via supervised machine-learning techniques. Typically, RGB-D video-based methods have been applied to pose estimation [7] and activity recognition [8,9]. These computer-vision-based approaches generally require multiple fields of view and cannot directly measure body movements. Sensor-based HAR approaches can overcome these limitations by using lightweight, compact, body-attached wearable sensors that measure body movements directly. Sensor-based HAR typically relies on body-mounted sensors, smartwatches, and wristbands [10,11,12,13,14], collecting movement data directly from specific body areas or positions. With these kinds of sensors, traditional machine-learning methods have been applied to HAR systems [9,15]. Subsequently, deep learning networks have become the leading solution [9], overcoming the limitations of conventional machine-learning approaches [9,10]. Deep-learning-based HAR studies have mainly been carried out with multiple motion sensors deployed on the chest, waist, and wrist, capturing body activities and hand gestures [11] and achieving overall HAR accuracies higher than 95% [15,16,17,18,19]. However, most of these studies were conducted with offline setups, such as PCs or laptops, which limits practical HAR in real-time application environments. Only a few recent studies have proposed real-time systems based on edge computing devices [20,21,22,23]. These approaches were tested on a Raspberry Pi 3 board with embedded lightweight classification models such as k-nearest neighbors (KNN), convolutional neural networks (CNN), and recurrent neural networks (RNN); more than three IMU sensors were used to recognize lower- and upper-limb activities such as walking, running, jogging, and opening doors, reporting a HAR accuracy of 96.28% with an overall latency of 1.325 s per recognition and an inference time of 115.18 ms in real-time tests. Despite the prevalence of deep learning models for activity recognition, a few studies still use traditional machine-learning methods for real-time HAR on edge devices [24]. It should be noted that most of these studies relied on activity signals measured directly from body-attached sensors. Since HAR accuracy is highly affected by sensor positions and deployments, it is necessary to investigate the feasibility of HAR based on robot-mounted motion sensors.
To achieve HAR of the robot wearer’s activities, a soft wearable robot was used in [4] for HAR of industrial assembling tasks in a controlled environment. A total of 12 IMU sensors were integrated into the wearable robot, and a hybrid deep learning model composed of CNN and RNN layers was presented, achieving only 77.5% accuracy in an offline setup due to the complexity of the assembling tasks. In another recent study, embedded KNN and support vector machine (SVM) models were deployed on a Raspberry Pi 3 board for real-time HAR, yielding an inference time of 5.7 ms with an overall accuracy of 98.75%. However, only four simple tasks (walk, run, stairs ascend, and stairs descend) were recognized, using a thigh-mounted IMU and one force-sensitive resistor on the ankle of the same leg.
In this work, we have implemented a real-time HAR system with an actual wearable exoskeleton robot using integrated motion sensors, an edge device, and embedded lightweight deep learning models. We have aimed to achieve a reasonable inference and latency time for real-time HAR. Therefore, we first tested and evaluated five deep learning models for HAR on a PC. Then, among the PC-trained models, the best ones in terms of accuracy with an inference time under 10 ms were optimized and embedded into the selected edge device. Finally, we tested and validated the performance of the optimized models on the edge device during a continuous real-time test. The main contributions of our work are as follows. First, we have tested the feasibility of real-time HAR with a wearable robot with integrated motion sensors and embedded lightweight deep learning models on an edge device. Second, the successful results of this work demonstrate a standalone HAR system that could be used to assist human motions and tasks using wearable robots. Finally, the presented HAR approach reduces the total latency response of prior attempts while maintaining a recognition accuracy higher than 97% in real time.
The subsequent sections are organized as follows: Section 2 provides a detailed description of the design of the HAR system; Section 3 describes the results achieved on the PC and the edge device; Section 4 discusses the results against prior works; finally, Section 5 describes the main drawbacks, possible future work, and the conclusions of our approach.
2. Materials and Methods
The components of the implemented real-time HAR system with a wearable robot are illustrated in Figure 1. From the left, Figure 1a shows the wearable exoskeleton robot and its embedded sensors used for data collection and HAR. Figure 1b shows samples of time-series data collected from the IMU integrated in the robot backpack. The implemented deep learning models for HAR and the computing devices used are listed in Figure 1c,d, respectively. Finally, the HAR results are illustrated in Figure 1e. The following sections describe each of these components in more detail.
2.1. Wearable Exoskeleton Robot and Sensors
The WEX platform is a waist-assist wearable robot developed by Hyundai Rotem. It is designed to reduce the load on the spine, prevent musculoskeletal diseases, and assist in walking or lifting heavy objects. These functions are achieved by driving the integrated motors in the same direction as the human motion to enhance muscle strength.
The wearable robot structure is carried on the shoulders and fastened at the chest, waist, and thighs by belts. The robot weighs about 6 kg, including actuators, controller units, sensors, and batteries. The assist torque is generated by a set of two 170 BLDC motors at the hip joints. Each actuator provides one active degree of freedom (DOF) near the hip, with two passive DOFs in the thigh frame, arranged symmetrically on both sides. In addition, the robot has two main kinds of sensor elements. First, two rotary encoders inside the actuator modules measure the angle of each hip joint. Second, one nine-axis IMU sensor, composed of a triaxial accelerometer, a triaxial gyroscope, and a triaxial magnetometer, is integrated into the robot backpack located at the lower back of the platform. With all these elements, the WEX platform can operate continuously for approximately 2 h.
2.2. Activity Data Collection
Two kinds of datasets were collected to train, test, and validate the HAR system. For both datasets, we considered the following eight activities: stand, walk, bend, crouch, stand-up, sit-down, stairs ascend, and stairs descend. The activity signals were recorded from the one IMU and two rotary encoder sensors integrated into the wearable exoskeleton robot.
The datasets were collected according to two protocols. In the first protocol, the same activity was repeated multiple times and the activity signals of each iteration were recorded; this record is named the epoch dataset. The epoch dataset was used to train, validate, and optimize the deep learning models on the PC and the edge device. The second protocol is illustrated in Figure 2, where the eight proposed activities were performed in a specific order to obtain a continuous activity record, named the continuous dataset. This continuous dataset was used to test the feasibility of HAR on the edge device with multiple actions in sequence. The data collection processes for both protocols are described in more detail in the following subsections.
2.2.1. Epoch Dataset
In the epoch dataset, the signals were collected from repetitive movements per activity from four male subjects, aged between 25 and 30 years and with heights between 1.60 and 1.80 m. Each subject performed sets of 10 repetitions over a total of 15 trials (i.e., 150 movements per activity). Once the signals were collected for each trial, they were separated into epochs of three seconds at a sampling rate of 50 Hz. The number of epochs per activity was as follows: stand (402), walk (818), bend (1659), crouch (1388), stand-up (706), sit-down (1701), stairs ascend (816), and stairs descend (714), totaling 8222 raw epochs for all activities.
2.2.2. Continuous Activity Dataset
In the continuous dataset, each subject performed the continuous activity sequence twice according to the second protocol. During the recording procedure, the data labels were assigned manually using physical buttons attached to the exoskeleton robot to mark the activity label at each timestep. Each subject's record contained a total of 332 epochs spanning 8.3 min. The continuous protocol for both subjects was carried out indoors, including floors, corridors, and stairs. These continuous datasets were used to validate real-time continuous HAR with the edge device.
2.3. Data Preprocessing and Augmentation
For both the epoch and continuous datasets, a set of preprocessing steps for sensor-based HAR was applied to clean and prepare the data for training and testing the models [9]. First, a data drop-out technique was used to clean up incorrect data caused by hardware disconnection errors during the recording process; the same drop-out was applied to outliers. The mean value was then removed from each epoch, followed by a global normalization using the maximum and minimum values of the records to preserve the magnitude information of each activity. Subsequently, a 5-point moving average filter was applied to all epochs to smooth and de-noise the recorded signals; this filter was selected for its low complexity and fast execution. Then, to augment the epochs, we used a sliding-window overlap technique [25] to balance the epoch datasets of the eight activities. Finally, data segmentation was performed, dividing the epochs into training and validation datasets using an 80/20 ratio for five-fold tests.
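As an illustration of these steps, the following Python sketch implements the mean removal, global normalization, 5-point moving average, and sliding-window augmentation described above; the function names, the treatment of channels, and the window step size are our own illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

def preprocess_epoch(epoch, rec_min, rec_max, ma_points=5):
    """Mean removal, global min-max normalization, and moving-average smoothing.
    `epoch` is a (timesteps, channels) array; `rec_min`/`rec_max` are the global
    per-channel extrema of the whole recording (illustrative argument names)."""
    x = epoch - epoch.mean(axis=0)                   # remove the per-epoch mean
    x = (x - rec_min) / (rec_max - rec_min + 1e-8)   # global normalization
    kernel = np.ones(ma_points) / ma_points          # 5-point moving average
    return np.stack([np.convolve(x[:, c], kernel, mode="same")
                     for c in range(x.shape[1])], axis=1)

def sliding_windows(signal, window=150, step=30):
    """Sliding-window overlap augmentation: 150 timesteps = 3 s at 50 Hz.
    The step size (overlap) is a tunable assumption used only for illustration."""
    return np.stack([signal[i:i + window]
                     for i in range(0, len(signal) - window + 1, step)])
```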
2.4. Deep Learning Models for HAR
In this work, we have adopted and implemented five deep learning models for HAR: CNN, RNN, LSTM, Bi-LSTM, and GRU. These five models have shown their merits and advantages in previous sensor-based HAR works [25,26,27,28,29]. Figure 3 shows the architectures of these models. The characteristics and implementation details of the five models are given in the following subsections.
2.4.1. Convolutional Neural Network
The CNN model is a neural network capable of extracting local dependencies by enforcing sparse local connections on the input data. It extracts features by sliding a kernel over the data timesteps in each layer, capturing the patterns or features unique to each activity signal. For our HAR application, a one-dimensional variant was selected, since it can extract such features at a low computational cost.
Our implemented CNN model, named CNN-3L, is shown in Figure 3a. This model comprises one input layer with a length of 150 timesteps, followed by three CNN layers of 32 units, each with a kernel size of three and a rectified linear unit (ReLU) activation function. After each convolutional layer, a max pooling layer with a pool size of two is applied to reduce the number of trainable parameters and control overfitting. Finally, a dense layer with 272 hidden units is added, followed by a SoftMax layer with eight output neurons.
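A minimal Keras sketch of the CNN-3L architecture is given below. The number of input channels, the activation of the hidden dense layer, and the optimizer (Adam, with the learning rate of 0.0003 reported in Section 2.5) are assumptions on our part; only the layer sizes follow the description above.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_3l(n_timesteps=150, n_channels=11, n_classes=8):
    """CNN-3L: three Conv1D blocks (32 units, kernel 3, ReLU) with max pooling,
    a 272-unit dense layer, and an 8-way SoftMax output.
    `n_channels` (9-axis IMU + 2 encoders) is an assumption about the input."""
    model = models.Sequential([
        layers.Input(shape=(n_timesteps, n_channels)),
        layers.Conv1D(32, 3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(32, 3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Conv1D(32, 3, activation="relu"),
        layers.MaxPooling1D(2),
        layers.Flatten(),                          # flatten before the dense head
        layers.Dense(272, activation="relu"),      # hidden activation assumed
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(3e-4),  # optimizer assumed
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```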
2.4.2. Vanilla Recurrent Neural Network
The RNN model is a basic framework applied in natural language processing (NLP) and speech recognition problems due to its capability of extracting features and patterns from sequential signals. Unlike feed-forward neural networks, the RNN processes data recurrently using hidden states, commonly referred to as memory components, at each node to retain sequential information from past inputs. The model presents a lower computational cost during training because the weight values are shared across the data timesteps. Compared with CNN models on time-series data, it can also handle arbitrary input/output lengths, making it suitable for prediction applications based on prior data.
Our implemented RNN model, named RNN-2L, is presented in Figure 3b. It is composed of two RNN layers with 32 units and ReLU activation functions, followed by a dense layer with 88 hidden units and a SoftMax layer with eight output neurons.
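Under the same assumptions as the CNN-3L sketch, the RNN-2L architecture can be expressed in Keras roughly as follows.

```python
from tensorflow.keras import layers, models

def build_rnn_2l(n_timesteps=150, n_channels=11, n_classes=8):
    """RNN-2L: two SimpleRNN layers (32 units, ReLU), an 88-unit dense layer,
    and an 8-way SoftMax output; compiled as in the CNN-3L sketch."""
    return models.Sequential([
        layers.Input(shape=(n_timesteps, n_channels)),
        layers.SimpleRNN(32, activation="relu", return_sequences=True),
        layers.SimpleRNN(32, activation="relu"),
        layers.Dense(88, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```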
2.4.3. Long Short-Term Memory
The long short-term memory (LSTM) model is an enhanced version of the RNN. It can overcome the vanishing gradient problem since it retains feature information for a longer time. The model uses a mechanism comprising three gates, namely the forget, input, and output gates. These structures allow the model to choose which information is stored and which is forgotten, preserving long-term dependencies in the context state. The process starts with the forget gate, which uses the hidden state from the previous step and the current input to decide which relevant information is kept in the current LSTM cell. The input gate then determines which new data from the current timestep can be added, and the context state is updated with the result of these two gates. Finally, the output gate combines the updated context state with the current input and previous hidden state, creating the new hidden and context states used at the next timestep.
Our LSTM model, named LSTM-2L, is presented in Figure 3c, in which two LSTM layers with 128 and 64 units were implemented using a ReLU activation function, followed by a fully connected dense layer with 704 units and a SoftMax layer with eight output neurons.
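A corresponding Keras sketch of LSTM-2L, under the same assumptions as the CNN-3L example, could look like this.

```python
from tensorflow.keras import layers, models

def build_lstm_2l(n_timesteps=150, n_channels=11, n_classes=8):
    """LSTM-2L: LSTM layers of 128 and 64 units (ReLU), a 704-unit dense layer,
    and an 8-way SoftMax output; compiled as in the CNN-3L sketch."""
    return models.Sequential([
        layers.Input(shape=(n_timesteps, n_channels)),
        layers.LSTM(128, activation="relu", return_sequences=True),
        layers.LSTM(64, activation="relu"),
        layers.Dense(704, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```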
2.4.4. Bidirectional Long Short-Term Memory
The bidirectional long short-term memory (Bi-LSTM) model processes the input flow in two directions, backward and forward, unlike the baseline LSTM model, which only admits a single direction. This allows the model to extract features relevant to both past and future timesteps.
Our Bi-LSTM model, named Bi-LSTM-2L, is presented in Figure 3d. The model is composed of two Bi-LSTM layers with 64 and 32 units and a ReLU activation function, followed by a dense layer with 352 neurons and a SoftMax layer with eight output neurons.
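A Keras sketch of Bi-LSTM-2L follows; interpreting the 64 and 32 units as the size of each directional LSTM is our assumption.

```python
from tensorflow.keras import layers, models

def build_bilstm_2l(n_timesteps=150, n_channels=11, n_classes=8):
    """Bi-LSTM-2L: bidirectional LSTM layers of 64 and 32 units per direction
    (ReLU), a 352-unit dense layer, and an 8-way SoftMax output."""
    return models.Sequential([
        layers.Input(shape=(n_timesteps, n_channels)),
        layers.Bidirectional(layers.LSTM(64, activation="relu",
                                         return_sequences=True)),
        layers.Bidirectional(layers.LSTM(32, activation="relu")),
        layers.Dense(352, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```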
2.4.5. Gated Recurrent Unit
The gated recurrent unit (GRU) is a compact neural network version of the LSTM that removes the context state and uses only the hidden state to pass on prior relevant information. This retains the memory capability in a compact form, reducing the number of tensor operations and making the model faster to train.
Our implemented GRU model, named GRU-2L, is shown in Figure 3e, where two GRU layers with 128 units and a ReLU activation function are used, followed by a fully connected dense layer with 704 neurons and a SoftMax layer with eight output neurons.
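Finally, a Keras sketch of GRU-2L under the same assumptions:

```python
from tensorflow.keras import layers, models

def build_gru_2l(n_timesteps=150, n_channels=11, n_classes=8):
    """GRU-2L: two GRU layers of 128 units (ReLU), a 704-unit dense layer,
    and an 8-way SoftMax output; compiled as in the CNN-3L sketch."""
    return models.Sequential([
        layers.Input(shape=(n_timesteps, n_channels)),
        layers.GRU(128, activation="relu", return_sequences=True),
        layers.GRU(128, activation="relu"),
        layers.Dense(704, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
```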
2.5. HAR Training and Evaluation on PC
The training and validation processes were carried out on a PC using the epoch datasets of the four subjects. For this process, a total of 892,839 training epochs and 224,209 validation epochs were used. Training was performed on a PC with an Nvidia RTX 2070 GPU with 8 GB of memory, using a learning rate of 0.0003 and a batch size of 64 for each model. All the models were written in Python 3.8 with TensorFlow and Keras. To evaluate the performance of the five implemented deep learning models, two conventional criterion metrics were used: accuracy and inference time. The accuracy was calculated with Equation (1), where $T_P$, $T_N$, $F_P$, and $F_N$ represent the number of true positive, true negative, false positive, and false negative samples, respectively:

$$\mathrm{Accuracy} = \frac{T_P + T_N}{T_P + T_N + F_P + F_N} \quad (1)$$

The inference time $T_{inf}$ represents the time needed for the model to output a classification label and is given by Equation (2), where $t_{in}$ is the time at which the data are input to the model and $t_{out}$ is the time at which the resulting classification label is obtained:

$$T_{inf} = t_{out} - t_{in} \quad (2)$$
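For reference, both metrics can be computed on a held-out set roughly as follows; the helper name and the use of time.perf_counter for timing are illustrative assumptions.

```python
import time
import numpy as np

def evaluate(model, x_val, y_val):
    """Accuracy (Eq. 1) and average per-epoch inference time (Eq. 2)."""
    t_in = time.perf_counter()                    # t_in: data enters the model
    y_pred = np.argmax(model.predict(x_val, verbose=0), axis=1)
    t_out = time.perf_counter()                   # t_out: labels obtained
    accuracy = float(np.mean(y_pred == y_val))    # fraction of correctly classified epochs
    t_inf = (t_out - t_in) / len(x_val)           # mean inference time per epoch
    return accuracy, t_inf
```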
2.6. HAR Training and Evaluation on Edge Device
After the HAR classifiers were trained and validated on the PC, the best models in terms of inference time and accuracy were transferred to our edge computing device. For the edge computing of HAR, we selected the Nvidia Jetson Nano, among other edge devices, due to its capacity for training, optimizing, and testing the implemented models directly on the device. This single-board computer performs these tasks through its integrated quad-core ARM Cortex-A57 CPU, a dedicated 128-core Nvidia Maxwell GPU, 4 GB of RAM, and an ARM operating system based on Ubuntu 18.04. Furthermore, thanks to its compatibility with deep learning frameworks such as TensorFlow and TensorRT under Python, the implemented models could be optimized, decreasing the computational cost and inference time of each classification. After testing the five models on the PC, the best models were selected and optimized based on an accuracy higher than 95% and an inference time under 10 ms. This model optimization was carried out by converting the trained models with the TF-TensorRT (TF-TRT) engine, which reduces the numerical precision used in each layer from FP32 to FP16, decreasing the cost of each mathematical operation. To validate the performance of the HAR models on the edge device, we compared them against the selected models on the PC. Finally, we tested the real-time recognition of the eight activities with the optimized model.
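A minimal sketch of the TF-TRT FP16 conversion step is shown below, assuming the trained classifier has been exported as a TensorFlow SavedModel; the paths are illustrative, and the converter API details may vary slightly across TensorFlow versions.

```python
from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a trained SavedModel to a TF-TRT engine with FP16 precision.
# Requires a TensorFlow build with TensorRT support (e.g., the JetPack wheel
# on the Jetson Nano); the model paths here are illustrative.
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="saved_models/cnn_3l",
    conversion_params=params,
)
converter.convert()                        # build the reduced-precision graph
converter.save("saved_models/cnn_3l_trt")  # deployable optimized SavedModel
```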
4. Discussion
In this paper, we have performed real-time HAR of the exoskeleton wearer’s activities using the integrated sensors of the wearable robot. The proposed HAR system has been implemented, tested, and validated using the proposed deep learning models on the edge device. First, we trained and tested the deep learning models on a PC, where the Bi-LSTM-2L model achieved the highest accuracy of 99.79% on the epoch dataset among the five proposed models when inference time was not considered. When both criteria were considered, the CNN-3L model was selected as the best classifier, optimized, and embedded in the Jetson Nano, achieving an average accuracy of 97.35% with an inference time of 4.97 ms and an overall latency of 0.506 s in the real-time tests.
Recent HAR studies based on edge devices [20,21,22,23] have used a Raspberry Pi 3 board to implement SVM, custom CNN, and GRU models in real-time tests. In all these cases, a minimum of three IMU sensors had to be placed on different body parts, such as the neck, wrists, waist, or ankles. Due to the multiple feature channels, the processing time for collection and inference was prolonged, yielding a recognition accuracy above 96.28% with an inference time of 115.18 ms and an overall latency higher than 1.32 s. In contrast to these approaches, we have used a Jetson Nano board as an edge device to embed, train, and optimize the deep learning HAR classifiers. In our approach, only one IMU sensor and two rotary encoders were used to achieve high recognition accuracy of the eight activities. Our work achieved a latency of 0.506 s, the shortest among these studies. These results demonstrate that real-time HAR can be performed for the wearable robot using a standalone system. Among the mentioned HAR works based on edge devices, the best approach was recently reported in [24]. In this study, a HAR approach based on a Raspberry Pi 3 board was presented as an application for a wearable robot or leg prosthesis. The traditional machine learning models KNN and SVM used in this attempt achieved an overall accuracy of 99.41% and a latency window of 0.566 s using a single nine-axis IMU and one force-sensitive resistor. Although the latency difference between this approach and ours is only 60 ms, only four simple locomotion tasks (walk, stand, stairs ascend, and stairs descend) were recognized, using traditional machine learning approaches. In addition, that work used body-attached sensors instead of robot-mounted sensors.
Despite the previously mentioned HAR studies based on edge devices, few approaches link this architecture with wearable exoskeleton robots. One instance is presented in [4], where an actual exoskeleton robot was used for offline HAR without real-time tests on edge devices. In that study, an accuracy of 77.5% was achieved when recognizing complex assembling tasks, using a hybrid CNN-RNN model with twelve six-axis IMU sensors distributed over different locations such as the head, forearms, thighs, wrists, and ankles. In contrast, our work aims to provide a HAR system based on an exoskeleton and an edge device, creating a standalone system capable of real-time operation with higher accuracy and a lower latency response.
5. Conclusions
The main practical applications of this study are related to the use of an exoskeleton robot to assist human motions. For instance, by recognizing the current human activity, the wearable robot could reduce the workload of carrying or lifting heavy objects and help prevent musculoskeletal diseases by supporting the user’s muscle strength during the rehabilitation process. The present HAR system has one drawback: the misrecognition of transition activities. This problem could be solved by modeling these transitions and training the deep learning models on them. Furthermore, it is possible to deploy more sensors on the wearable robot and to extend our models to recognize more complex or intricate tasks.
To summarize, we have validated and confirmed the feasibility of real-time HAR with the wearable exoskeleton robot and the proposed HAR system. The presented results demonstrate that real-time HAR of the robot wearer’s eight activities is achievable. In real-time tests, we achieved an overall accuracy of 97.35% with an inference time under 10 ms, using the Jetson Nano board as an edge device with deep learning classifiers based on the robot’s integrated sensors.