1. Introduction
Wearable exoskeleton robots have been developed to aid individuals in a range of activities, including carrying heavy objects, alleviating the burden of physically demanding tasks, and assisting in-patient rehabilitation. Studies have indicated that exoskeletons can substantially assist walking and lower its metabolic cost [1,2]. Numerous powered exoskeleton robots have facilitated the improvement of lower extremity movement deficits resulting from strokes [3,4,5] or injuries such as amputations [6,7] by applying assistive torques to the joints [8]. However, despite these successful applications, several challenges persist in developing safe and versatile control systems [9], including identifying the wearer’s intended movement without external commands and transitioning autonomously between activity-specific controllers.
One approach to identifying intended activity involves using a locomotor activity intent recognition framework [10,11]. This method is predominantly applied in medical rehabilitation, analyzing patients’ gait patterns to furnish clinicians with a quantitative overview of motor function over extended durations, thereby supporting objective treatment planning [12]. For instance, due to postural instability and gait disturbances, Parkinson’s disease patients have an increased susceptibility to fall-related injuries [13,14]. Real-time movement monitoring can mitigate injury risks by promptly identifying fall hazards. Intent recognition technology augments current methods by pinpointing disease-specific predictors such as tremors and hyperkinesia [15,16], differentiating symptoms across varied motor activities. Accurately discerning an individual’s intended locomotion can also provide data that facilitate the adaptive control of assistive devices or wearable robots. Several studies have implemented activity intent recognition strategies by leveraging sensor fusion [10,11,17]. Specifically, the authors of [10] employed multiple sensors to monitor the internal state of the prosthesis (i.e., joint angles and angular velocities) and to gather information about user-environment interactions (i.e., forces and torques) in order to control the prosthesis across various activity modes (e.g., walking, standing, sitting). Trials with a unilateral amputee subject demonstrated that their Gaussian mixture model (GMM)-based intent recognition framework can identify user intent in real time and transition to the appropriate activity controller. However, intent recognition in that study relied on handcrafted features extracted from the prosthesis signals, such as the mean and standard deviation. This poses a challenge because temporal feature extraction becomes complex under the continuous changes that occur during transitions between the wearer’s intended movements [18]. Consequently, domain-specific knowledge and trial-and-error approaches become necessary to derive meaningful features [9,19,20,21,22].
Deep learning (DL) technology has risen in popularity as a tool to autonomously detect users’ locomotion activities or intents in the field of human activity recognition (HAR) [18,23,24]. Unlike traditional machine learning (ML) techniques, DL significantly reduces the need for laborious extraction of valuable features from wearable sensor data. In particular, convolutional neural networks (CNNs), with their local dependency and scale invariance, have become the most widely used approach for many practical problems, such as image classification [25,26], object recognition [27], and natural language processing [28,29,30,31,32]. Several recent studies have formulated hybrid architectures by incorporating additional layers, such as long short-term memory (LSTM) [33,34,35,36], gated recurrent unit (GRU) [20,21,22,37], or squeeze-and-excitation network (SENet) [38] layers. These state-of-the-art techniques aim not only to minimize computational cost (i.e., the number of parameters) but also to enhance prediction performance in HAR. While LSTM and GRU, variants of the recurrent neural network (RNN), can improve the accuracy of activity or intention recognition, they often entail prolonged training times because each computational step depends on the result of the previous one and must be executed sequentially. CNNs have fewer parameters and train faster than LSTMs and GRUs owing to their local connectivity and weight-sharing mechanisms [22]. However, their feature extraction capability and accuracy are contingent on the network’s depth, and as the depth increases, the number of model parameters grows rapidly. Therefore, choosing the appropriate network depth in CNN or hybrid architectures, such as CNN + LSTM (GRU) and LSTM + CNN, in addition to the model hyperparameters, is critical.
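To make the parameter-count argument concrete, the following minimal PyTorch sketch compares a single 1D convolutional layer with an LSTM layer of equal output width. The channel and unit sizes are illustrative assumptions, not the models evaluated in this study.

```python
import torch.nn as nn

# Illustrative comparison only; sizes are assumptions, not this study's models.
conv = nn.Conv1d(in_channels=7, out_channels=64, kernel_size=3)   # local, weight-shared
lstm = nn.LSTM(input_size=7, hidden_size=64, batch_first=True)    # sequential recurrence

def n_params(module):
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

print(f"Conv1d parameters: {n_params(conv):,}")  # 7*64*3 + 64 = 1,408
print(f"LSTM parameters:   {n_params(lstm):,}")  # 4*(7*64 + 64*64 + 2*64) = 18,688
```

Even at equal output width, the recurrent layer carries roughly an order of magnitude more parameters, and its time steps cannot be parallelized, which illustrates the training-time gap discussed above.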
In this paper, we introduce multivariate single- and multi-head CNN architectures for human locomotion activity recognition while wearing a lower limb wearable robot. In our design, two CNN architectures with different network depths and numbers of convolutional filters each maintain a fixed kernel size. These architectures extract local temporal features from the multivariate signals acquired from the EMG sensors and the wearable robot, respectively. Each architecture then connects to fully connected layers with varying numbers of neurons and ultimately identifies five locomotor activities: level ground walking (LW), stair ascent (SA), stair descent (SD), ramp ascent (RA), and ramp descent (RD). These activities are measured across three terrains: flat ground, staircase, and ramp.
The main contributions of this study are as follows. First, we collected prospective research data evaluating the locomotion activity of 500 healthy adults aged 19 to 64. Second, using the different multivariate signals collected from eight electromyography (EMG) sensors and a wearable robot, we compared the prediction performance for five locomotor activities between our two CNN architectures and three competing models, namely a CNN and two hybrid architectures (i.e., CNN + LSTM and LSTM + CNN). Lastly, we demonstrated that by using only the encoder signals (i.e., hip angles and velocities) and the postural signals (i.e., roll/pitch/yaw from an inertial measurement unit (IMU)) of the lower limb wearable robot, a deeper single-head CNN architecture significantly outperforms the three competing architectures.
The rest of this paper is organized as follows: Section 2 presents the related works. Section 3 explains data collection, the proposed CNN model architecture, and hyperparameter optimization. Section 4 describes the collected data characteristics and compares the proposed and three competing models. The conclusion and future study plans are summarized in Section 5.
3. Methods
3.1. Participant Demographics and Recruitment Process
This study conducted a prospective analysis of five distinct locomotor activities (LW, SA, SD, RA, and RD) performed by 500 adults aged 19 to 64 years between 1 September and 30 November 2022. We recruited participants through in-hospital advertisements targeting outpatients and their guardians. During recruitment, each participant was informed about the study’s objectives, the personal details to be collected (e.g., name, gender, residential area, date of birth, contact information), and the equipment and procedures for data collection. The exclusion criteria encompassed individuals who declined to participate in the clinical study, those unable to walk independently, and those unable to communicate verbally.
3.2. Ethical Considerations
To address privacy and research ethics, we offered participants the following provisions: (1) Participants voluntarily agreed to join the clinical study without forfeiting any rights by signing the consent form. (2) While participant consent forms and other records might be accessed by research staff and pertinent agencies, all documents will remain confidential. (3) Participants consented to the use of portrait rights for photos and videos captured during physical data measurements as raw data for clinical research. Should consent be retracted, the associated data will be promptly deleted. (4) Participants have the liberty to rescind their consent for this clinical study at any point. All participants gave informed consent, encompassing the research subject consent form, security pledge, personal information collection and use agreement, and portrait rights use form. The study received approval from the Institutional Review Board (IRB) (No. GNUCH 2022-08-007-001) at Gyeongsang National University Hospital, Republic of Korea.
3.3. Data Collection
During the five locomotion behaviors, the participants, who were wearing a lower limb wearable robot, were instrumented with EMG sensors and a motion capture system in a simulated space for activities of daily living (ADL), as illustrated in Figure 1.
They performed the five locomotor activities on three types of terrain with the following specifications: (1) The flat ground terrain had a total length of 3000 mm. (2) The ramp terrain had a total length of 3600 mm, a total height of 400 mm, and a slope of 4.3 degrees. (3) The staircase terrain had a total height of 692 mm across four steps and a total footrest depth of 1519 mm, comprising a footrest depth of 303 mm for each of the first through third steps and a final footrest depth of 610 mm (3 × 303 mm + 610 mm = 1519 mm).
The Hector H30A wearable robot, produced by HEXAR Humancare, Republic of Korea, was employed in this study. The robot is designed to assist the hip joint’s muscle strength while walking on various terrains, such as flat, uphill, and downhill surfaces [47]. The robot comprises actuators, control units, sensors, and batteries and weighs approximately 4.3 kg. Its two brushless DC (BLDC) motors can each provide up to 12 Nm of torque to the user’s hip joint. The robot is equipped with two types of sensors: rotary encoders and an IMU. The encoders, placed within the actuator modules, measure the hip joint’s angle and angular velocity. The IMU sensor, which includes a tri-axial accelerometer and a tri-axial gyroscope, is used to estimate the wearer’s posture. The robot can operate continuously for about 2 h. During the study, we collected 7-channel wireless signals at the lowest level (i.e., default mode) of the three torque modes that support the hip joint’s muscle strength. These signals, sampled at a rate of 71.42857 Hz, included the left/right hip angles (in degrees), left/right hip velocities (in rpm), and three posture angles (roll, pitch, and yaw; in degrees).
In addition to the robot’s sensor data, we used an 8-channel wireless surface electromyography (EMG) system (Delsys Trigno, Delsys, Inc., Boston, MA, USA), acquired at 2000 Hz [48], to record EMG signals from four muscles of each lower limb: the vastus lateralis (VL), tibialis anterior (TA), biceps femoris (BF), and gastrocnemius lateralis (GAL) [49]. Prior to placing the EMG sensors, the skin over each muscle was cleaned using alcohol wipes to remove dry skin and skin oils. The EMG electrodes were then affixed to the skin using double-sided adhesive tape, and their placement was adjusted as necessary. To measure kinematic motion information, an eight-camera motion capture system (Kestrel 2200, Motion Analysis Corp., Santa Rosa, CA, USA) was used. This system captured information about the spine, shoulders, elbows, hands, feet, and ankles at a sampling rate of 100 Hz [50].
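Because the two signal streams arrive at very different rates (~71.43 Hz for the robot, 2000 Hz for the EMG system), each stream must be segmented into fixed-length windows before being fed to the networks. The following sketch illustrates one way this might be done; the window length and overlap are purely illustrative assumptions, as this section does not specify the segmentation parameters actually used.

```python
import numpy as np

def sliding_windows(signal: np.ndarray, win: int, step: int) -> np.ndarray:
    """Segment a (time, channels) array into overlapping windows.

    Window length and step are illustrative assumptions; the study's
    actual segmentation parameters are not specified in this section.
    """
    n = (signal.shape[0] - win) // step + 1
    return np.stack([signal[i * step : i * step + win] for i in range(n)])

# Hypothetical shapes: one second of robot data (7 channels at ~71.43 Hz)
# and one second of EMG data (8 channels at 2000 Hz).
robot = np.zeros((71, 7))
emg = np.zeros((2000, 8))

robot_windows = sliding_windows(robot, win=35, step=17)   # ~0.5 s windows, ~50% overlap
emg_windows = sliding_windows(emg, win=1000, step=500)    # ~0.5 s windows, ~50% overlap
print(robot_windows.shape, emg_windows.shape)             # (3, 35, 7) (3, 1000, 8)
```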
3.4. Model Architecture
The architecture of the proposed model is depicted in Figure 2. It leverages either a single- or multi-head CNN structure to extract richer features from the two types of multivariate signals gathered from the wearable robot and the EMG sensors. The two architectures are similar in structure but vary in the number of blocks containing convolutional layers, the filter sizes, and the number of fully connected layers.
In the single-head CNN architecture, each block (i.e., the feature extractor) captures local temporal features from the EMG sensor and wearable robot signals. Each block can encompass up to three convolutional layers; we limited the number of convolutional blocks to three to avoid degradation from potential gradient vanishing and exploding as network depth increases [51,52,53]. The number of filters in a convolutional layer varied among four sizes: 16, 32, 64, or 128, with adjacent convolutional layers differing twofold in the number of feature maps. We employed a fixed kernel size of 3 with a stride of 1 to augment the decision function and ensure quicker network convergence with non-linear activations. To hasten training and convergence, a batch normalization (BN) layer and a rectified linear unit (ReLU) activation followed each convolutional layer. Each block concluded with a pooling layer, facilitating down-sampling to minimize parameters, preserve dominant features, and filter noise from involuntary human body jitter [34]. We considered max-pooling or average-pooling layers with a pool size of two. Additionally, we restricted the number of fully connected layers to three. In the first fully connected layer, the number of neurons could be set to 32, 64, 128, 256, or 512; adjacent layers likewise differed twofold in the number of nodes, mirroring the design of the convolutional layers.
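As a concrete illustration, the PyTorch sketch below builds one possible single-head configuration (two blocks, kernel size 3 with stride 1, BN and ReLU after each convolution, max pooling with pool size 2, and fully connected layers with a twofold difference in neurons). The specific depths and sizes shown are one point in the described search space, not the optimized architecture reported later.

```python
import torch
import torch.nn as nn

class SingleHeadCNN(nn.Module):
    """One possible single-head configuration from the described search
    space; depths and sizes are illustrative, not the tuned result."""

    def __init__(self, in_channels: int = 7, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: Conv -> BN -> ReLU (x2), twofold filter growth, then pooling
            nn.Conv1d(in_channels, 16, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(16), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(32), nn.ReLU(),
            nn.MaxPool1d(2),
            # Block 2
            nn.Conv1d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(128), nn.ReLU(),   # first FC layer: 32-512 neurons
            nn.Linear(128, 64), nn.ReLU(),   # adjacent FC layers differ twofold
            nn.Linear(64, n_classes),        # softmax is applied in the loss
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g., a window of the 7 robot channels
        return self.classifier(self.features(x))

logits = SingleHeadCNN()(torch.randn(8, 7, 35))  # hypothetical window length of 35
print(logits.shape)  # torch.Size([8, 5])
```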
The multi-head CNN architecture, as displayed in Figure 2, was designed as a separable structure to independently preserve the unique characteristics of the different signals from the EMG sensors and the wearable robot. The temporal features extracted from the various blocks were combined to form the final feature representation. These features were then forwarded to the fully connected layers, and a classifier with a softmax layer was used to identify the five locomotor activities.
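A minimal sketch of the multi-head idea follows: one hypothetical feature extractor per signal source, with the per-head features concatenated before the shared fully connected classifier. Head sizes and depths are again assumptions for illustration.

```python
import torch
import torch.nn as nn

class MultiHeadCNN(nn.Module):
    """Illustrative two-head variant: one head per signal source
    (robot: 7 channels, EMG: 8 channels); sizes are assumptions."""

    def __init__(self, n_classes: int = 5):
        super().__init__()
        def head(in_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv1d(in_ch, 32, kernel_size=3, padding=1),
                nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(2),
                nn.Flatten(),
            )
        self.robot_head = head(7)
        self.emg_head = head(8)
        self.classifier = nn.Sequential(
            nn.LazyLinear(128), nn.ReLU(), nn.Linear(128, n_classes),
        )

    def forward(self, robot: torch.Tensor, emg: torch.Tensor) -> torch.Tensor:
        # Concatenate per-head features into the final representation.
        fused = torch.cat([self.robot_head(robot), self.emg_head(emg)], dim=1)
        return self.classifier(fused)

out = MultiHeadCNN()(torch.randn(8, 7, 35), torch.randn(8, 8, 1000))
print(out.shape)  # torch.Size([8, 5])
```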
3.5. Hyperparameter Optimization
Hyperparameter optimization, also known as hyperparameter tuning, is the process of selecting the combination of hyperparameters that maximizes the performance of a learning algorithm. Traditional methods such as grid search are exhaustive in their approach, trialing a subset of hyperparameter values to find the optimal configuration. However, due to the high number of trials required and the need to keep track of them, this approach can be quite time-consuming. More recently, alternative methods such as random search and Bayesian optimization have gained popularity. One specific Bayesian optimization method is the tree-structured Parzen estimator (TPE) [54]. TPE sequentially builds models that estimate the performance of hyperparameters based on past measurements [55,56]. It utilizes the conditional probability P(x|y), where x represents the hyperparameters and y represents the quality score (e.g., loss, accuracy) on the objective function. This method offers the advantage of efficient convergence toward a global minimum in a relatively short time.
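For context, the standard TPE formulation (shown here as background; the notation is ours, not this paper’s) models P(x|y) with two densities split at a quantile threshold y* of the observed scores:

```latex
% Standard TPE formulation, shown for context.
p(x \mid y) =
\begin{cases}
\ell(x), & y < y^{*} \\
g(x),    & y \ge y^{*}
\end{cases}
\qquad \gamma = p(y < y^{*})
```

Maximizing the expected improvement then reduces to preferring candidates with a large ratio ℓ(x)/g(x), i.e., hyperparameter settings that are likely under the well-performing trials and unlikely under the poorly performing ones.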
In this study, our focus was on structural optimization issues, more specifically, determining the depth of the convolutional and fully connected layers in the proposed architecture (i.e., the number of blocks, convolutional layers, and fully connected layers). For this purpose, we employed the Hyperopt library [56,57] to identify the hyperparameters that yield the highest identification ability on the validation data. Subsequently, the predictive performance of our models, designed using these optimal hyperparameters, was evaluated on the test data.
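The sketch below shows how such a search space and TPE search might be wired up with Hyperopt. The space mirrors the architectural choices described in Section 3.4, while train_and_validate is a hypothetical stand-in for model construction, training, and validation-set evaluation; the evaluation budget is likewise illustrative.

```python
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

# Search space mirroring the architectural choices described above.
space = {
    "n_blocks": hp.choice("n_blocks", [1, 2, 3]),
    "n_conv_per_block": hp.choice("n_conv_per_block", [1, 2, 3]),
    "base_filters": hp.choice("base_filters", [16, 32, 64, 128]),
    "pooling": hp.choice("pooling", ["max", "avg"]),
    "n_fc_layers": hp.choice("n_fc_layers", [1, 2, 3]),
    "fc_units": hp.choice("fc_units", [32, 64, 128, 256, 512]),
}

def objective(params):
    # train_and_validate is a hypothetical helper that builds the CNN
    # from `params`, trains it, and returns the validation error.
    val_error = train_and_validate(params)
    return {"loss": val_error, "status": STATUS_OK}

trials = Trials()
best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=100, trials=trials)  # budget is illustrative
print(best)  # best values are reported as indices into the hp.choice lists
```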