Multi-Branch Attention-Based Grouped Convolution Network for Human Activity Recognition Using Inertial Sensors
Abstract
1. Introduction
- (1) A new multi-branch neural network is proposed, in which each branch consists of two layers that combine grouped convolution with a dual attention submodule.
- (2) A new way of feeding and processing the sensor data is proposed: data collected at different positions on the human body are separated and fed into different network branches, which process them independently during training and testing (a schematic sketch of this routing is given after this list).
- (3) We evaluate our method on three large HAR datasets; the experimental results show that our network outperforms existing state-of-the-art methods.
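To make contributions (1) and (2) concrete, the PyTorch-style sketch below routes the sensor channels recorded at each body position into its own branch and fuses the branch features before a shared classifier. It is only a minimal illustration: the class names (`Branch`, `MultiBranchHAR`), the plain Conv1d branch body, and the late-fusion classifier are assumptions standing in for the grouped-convolution/dual-attention branches and fusion scheme detailed in Section 3, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Branch(nn.Module):
    """Stand-in for one branch (grouped convolution + dual attention in the paper)."""
    def __init__(self, in_channels, out_channels=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(),
        )

    def forward(self, x):                      # x: (batch, channels, time)
        return self.net(x)

class MultiBranchHAR(nn.Module):
    """One independent branch per on-body sensor position; branch features are fused for classification."""
    def __init__(self, channels_per_position, num_classes, window_len):
        super().__init__()
        self.branches = nn.ModuleList([Branch(c) for c in channels_per_position])
        feat_dim = 32 * window_len * len(channels_per_position)
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Dropout(0.5), nn.Linear(feat_dim, num_classes))

    def forward(self, xs):                     # xs: list of per-position tensors
        feats = [branch(x) for branch, x in zip(self.branches, xs)]
        return self.classifier(torch.cat(feats, dim=1))

# Example shapes for a PAMAP2-like setup: 3 positions (chest, wrist, ankle),
# 6 channels each (3-axis accelerometer + 3-axis gyroscope), window length 168.
model = MultiBranchHAR(channels_per_position=[6, 6, 6], num_classes=12, window_len=168)
xs = [torch.randn(8, 6, 168) for _ in range(3)]
print(model(xs).shape)                         # torch.Size([8, 12])
```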
2. Related Works
3. Proposed Method
3.1. Sensor Data Preprocessing
3.2. Network Architecture
3.3. Dual Attention Mechanism
3.4. Multi-Branch Network
4. Experiments
4.1. Experimental Datasets
- (1) PAMAP2 [31]: This dataset was collected from nine subjects, each of whom was asked to complete 12 protocol activities, including lying, sitting, standing, walking, running, Nordic walking, ascending stairs, descending stairs, vacuum cleaning, ironing, and rope jumping, and 6 optional activities, including watching TV, computer work, car driving, folding laundry, house cleaning, and playing soccer; all are everyday human activities. In this paper, only the 12 protocol activities were tested. Three IMUs were placed on the chest, wrist, and ankle of each subject, and each IMU contained an accelerometer, a gyroscope, and a magnetometer. In our experiments, the inertial sensor data, originally collected at 100 Hz, were downsampled to 33 Hz to reduce the amount of data in subsequent processing, and only the accelerometer and gyroscope data from the protocol activities were used for analysis. The sliding window length was set to 168 samples (5.12 s) with an overlap rate of 78%, consistent with [19] (a minimal segmentation sketch is given after this list). All segments of the dataset were randomly separated into a training set (80%) and a test set (20%).
- (2) UT-dataset [15]: This dataset was collected from 10 subjects, who were asked to perform 13 activities: walking, standing, jogging, sitting, biking, going upstairs, going downstairs, typing, writing, drinking coffee, talking, smoking, and eating. Two sensor units were placed on each subject's wrist and ankle, and each unit included an accelerometer, a linear accelerometer, a gyroscope, and a magnetometer; in this paper, only the accelerometer and gyroscope data were used. The sensor sampling frequency was 50 Hz. The sliding window length was 200 samples (4 s) and the overlap rate was 50%, the same as in [26]. As with PAMAP2, the segments were randomly separated into a training set (70%) and a test set (30%).
- (3) OPPORTUNITY [32,33]: This dataset was collected from 12 subjects, all of whom were required to perform various daily activities. Each subject wore 7 IMUs and 12 three-axis acceleration sensors so that rich activity data could be collected, and all sensors were sampled at 30 Hz. We used the data from the five IMUs on the upper body and the two IMUs attached to the shoes. For the five upper-body IMUs, only the 3-axis gyroscope and 3-axis accelerometer data were used; for the two shoe IMUs, the 3-axis Euler angles and 3-axis body acceleration were used, giving test data with a total of 42 dimensions. The sliding window length was set to 30 samples (1 s) and the overlap rate was 50%, consistent with [21,25]. The whole dataset was randomly divided into a training set (70%) and a test set (30%).
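The window lengths, overlap rates, and random splits quoted above are the only parameters needed to reproduce the segmentation step. The NumPy sketch below is a minimal illustration under those settings; the function name `sliding_windows`, the crude decimation used for downsampling, and the synthetic recording are assumptions for demonstration only, not the authors' preprocessing code.

```python
import numpy as np

def sliding_windows(signal, window_len, overlap):
    """Segment a (time, channels) array into overlapping fixed-length windows.

    overlap is the fraction each window shares with the next one,
    e.g. 0.78 for PAMAP2 (window_len=168) or 0.5 for UT/OPPORTUNITY.
    """
    step = max(1, int(round(window_len * (1.0 - overlap))))
    starts = range(0, signal.shape[0] - window_len + 1, step)
    return np.stack([signal[s:s + window_len] for s in starts])

# Illustrative use on a fake 100 Hz recording, crudely decimated to ~33 Hz.
raw = np.random.randn(10000, 6)          # 6 channels: 3-axis acc + 3-axis gyro
downsampled = raw[::3]                   # 100 Hz -> ~33 Hz
segments = sliding_windows(downsampled, window_len=168, overlap=0.78)

# Random split of the segments into training (80%) and test (20%) sets.
rng = np.random.default_rng(0)
idx = rng.permutation(len(segments))
split = int(0.8 * len(segments))
train, test = segments[idx[:split]], segments[idx[split:]]
print(segments.shape, train.shape, test.shape)
```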
4.2. Experimental Settings
4.3. Experimental Result and Discussion
4.3.1. Results on PAMAP2
4.3.2. Results on UT
4.3.3. Results on OPPORTUNITY
4.3.4. Comparison with Related Work
5. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- [1] Wang, J.; Chen, Y.; Hao, S. Deep learning for sensor-based activity recognition: A survey. Pattern Recogn. Lett. 2019, 119, 3–11.
- [2] Zhang, Y.; Zhang, Z.; Zhang, Y.; Bao, J.; Zhang, Y.; Deng, H. Human Activity Recognition Based on Motion Sensor Using U-Net. IEEE Access 2019, 7, 75213–75226.
- [3] Chen, Z.; Jiang, C.; Xiang, S.; Ding, J.; Wu, M.; Li, X. Smartphone Sensor-Based Human Activity Recognition Using Feature Fusion and Maximum Full a Posteriori. IEEE Trans. Instrum. Meas. 2020, 69, 3992–4001.
- [4] Ordonez, F.J.; Roggen, D. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition. Sensors 2016, 16, 115.
- [5] Ramanujam, E.; Perumal, T.; Padmavathi, S. Human Activity Recognition with Smartphone and Wearable Sensors Using Deep Learning Techniques: A Review. IEEE Sens. J. 2021, 21, 13029–13040.
- [6] Ignatov, A. Real-time human activity recognition from accelerometer data using Convolutional Neural Networks. Appl. Soft Comput. 2018, 62, 915–922.
- [7] Huang, W.; Zhang, L.; Gao, W.; Min, F.; He, J. Shallow Convolutional Neural Networks for Human Activity Recognition Using Wearable Sensors. IEEE Trans. Instrum. Meas. 2021, 70, 1–11.
- [8] Xia, K.; Huang, J.; Wang, H. LSTM-CNN Architecture for Human Activity Recognition. IEEE Access 2020, 8, 56855–56866.
- [9] Abdel-Basset, M.; Hawash, H.; Chakrabortty, R.K.; Ryan, M.; Elhoseny, M.; Song, H. ST-DeepHAR: Deep Learning Model for Human Activity Recognition in IoHT Applications. IEEE Internet Things J. 2021, 8, 4969–4979.
- [10] Nafea, O.; Abdul, W.; Muhammad, G.; Alsulaiman, M. Sensor-Based Human Activity Recognition with Spatio-Temporal Deep Learning. Sensors 2021, 21, 2141.
- [11] Bloomfield, R.A.; Teeter, M.G.; Mcisaac, K.A. A Convolutional Neural Network Approach to Classifying Activities Using Knee Instrumented Wearable Sensors. IEEE Sens. J. 2020, 20, 14975–14983.
- [12] Imran, H.A. UltaNet: An Antithesis Neural Network for Recognizing Human Activity Using Inertial Sensors Signals. IEEE Sens. Lett. 2022, 6, 1–4.
- [13] Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 2011–2023.
- [14] Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 1–17.
- [15] Shoaib, M.; Bosch, S.; Incel, O.D.; Scholten, H.; Havinga, P.J. Complex Human Activity Recognition Using Smartphone and Wrist-Worn Motion Sensors. Sensors 2016, 16, 426.
- [16] Bianchi, V.; Bassoli, M.; Lombardo, G.; Fornacciari, P.; Mordonini, M.; De Munari, I. IoT Wearable Sensor and Deep Learning: An Integrated Approach for Personalized Human Activity Recognition in a Smart Home Environment. IEEE Internet Things J. 2019, 6, 8553–8562.
- [17] Cheng, X.; Zhang, L.; Tang, Y.; Liu, Y.; Wu, H.; He, J. Real-Time Human Activity Recognition Using Conditionally Parametrized Convolutions on Mobile and Wearable Devices. IEEE Sens. J. 2022, 22, 5889–5901.
- [18] Tang, Y.; Zhang, L.; Min, F.; He, J. Multi-scale Deep Feature Learning for Human Activity Recognition Using Wearable Sensors. IEEE Trans. Ind. Electron. 2022.
- [19] Gao, W.; Zhang, L.; Teng, Q.; He, J.; Wu, H. DanHAR: Dual Attention Network for multimodal human activity recognition using wearable sensors. Appl. Soft Comput. 2021, 111, 107728.
- [20] Tong, L.; Ma, H.; Lin, Q.; He, J.; Peng, L. A Novel Deep Learning Bi-GRU-I Model for Real-Time Human Activity Recognition Using Inertial Sensors. IEEE Sens. J. 2022, 22, 6164–6174.
- [21] Mahmud, S.; Tonmoy, M.T.H.; Bhaumik, K.K.; Rahman, A.K.M.M.; Amin, M.A.; Shoyaib, M.; Khan, M.A.H.; Ali, A.A. Human Activity Recognition from Wearable Sensor Data Using Self-Attention. In Proceedings of the 24th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, 29 August–8 September 2020; pp. 1–8.
- [22] Buffelli, D.; Vandin, F. Attention-Based Deep Learning Framework for Human Activity Recognition with User Adaptation. IEEE Sens. J. 2021, 21, 13474–13483.
- [23] Shavit, Y.; Klein, I. Boosting Inertial-Based Human Activity Recognition with Transformers. IEEE Access 2021, 9, 53540–53547.
- [24] Xu, C.; Chai, D.; He, J.; Zhang, X.; Duan, S. InnoHAR: A Deep Neural Network for Complex Human Activity Recognition. IEEE Access 2019, 7, 9893–9902.
- [25] Liu, T.; Wang, S.; Liu, Y. A lightweight neural network framework using linear grouped convolution for human activity recognition on mobile devices. J. Supercomput. 2022, 78, 6696–6716.
- [26] Huan, R.; Jiang, C.; Ge, L.; Shu, J.; Zhan, Z.; Chen, P.; Chi, K.; Liang, R. Human Complex Activity Recognition with Sensor Data Using Multiple Features. IEEE Sens. J. 2022, 22, 757–775.
- [27] He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
- [28] Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
- [29] Gao, S.; Cheng, M.; Zhao, K.; Zhang, X.; Yang, M.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662.
- [30] Mekruksavanich, S.; Jantawong, P.; Jitpattanakul, A. A Lightweight Deep Convolutional Neural Network with Squeeze-and-Excitation Modules for Efficient Human Activity Recognition using Smartphone Sensors. In Proceedings of the 2021 2nd International Conference on Big Data Analytics and Practices (IBDAP), Bangkok, Thailand, 26–27 August 2021; pp. 23–27.
- [31] Reiss, A.; Hendeby, G.; Stricker, D. Confidence-based Multiclass AdaBoost for Physical Activity Monitoring. In Proceedings of the ACM International Symposium on Wearable Computers, Zurich, Switzerland, 8–12 September 2013; pp. 13–20.
- [32] Chavarriaga, R.; Sagha, H.; Calatroni, A.; Digumarti, S.T.; Tröster, G.; Millán, J.D.R.; Roggen, D. The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition. Pattern Recogn. Lett. 2013, 34, 2033–2042.
- [33] Roggen, D.; Calatroni, A.; Rossi, M.; Holleczek, T.; Förster, K.; Tröster, G.; Lukowicz, P.; Bannach, D.; Pirkl, G.; Ferscha, A.; et al. Collecting complex activity datasets in highly rich networked sensor environments. In Proceedings of the 2010 Seventh International Conference on Networked Sensing Systems (INSS), Kassel, Germany, 15–18 June 2010; pp. 233–240.
- [34] Teng, Q.; Wang, K.; Zhang, L.; He, J. The Layer-Wise Training Convolutional Neural Networks Using Local Loss for Sensor-Based Human Activity Recognition. IEEE Sens. J. 2020, 20, 7265–7274.
- [35] Teng, Q.; Zhang, L.; Tang, Y.; Song, S.; Wang, X.; He, J. Block-Wise Training Residual Networks on Multi-Channel Time Series for Human Activity Recognition. IEEE Sens. J. 2021, 21, 18063–18074.
Backbone architectures compared (each cell lists the blocks applied in sequence):

| Layer | M-Branch G-CNN | S-Branch Att-Based G-CNN | M-Branch CNN | M-Branch Att-Based G-CNN |
|---|---|---|---|---|
| Layer 1 | C-B-R 1 × 1, 64 → G-C-B-R 3 × 3, 16 × 4 → C-B-R 1 × 1, 64 | C-B-R 1 × 1, 64 → G-C-B-R 3 × 3, 16 × 4 → C-B-R 1 × 1, 64 → Dual attention | C-B-R 3 × 3, 64 | C-B-R 1 × 1, 64 → G-C-B-R 3 × 3, 16 × 4 → C-B-R 1 × 1, 64 → Dual attention |
| Layer 2 | C-B-R 1 × 1, 32 → G-C-B-R 3 × 3, 8 × 4 → C-B-R 1 × 1, 32 | C-B-R 1 × 1, 32 → G-C-B-R 3 × 3, 8 × 4 → C-B-R 1 × 1, 32 → Dual attention | C-B-R 3 × 3, 32 | C-B-R 1 × 1, 32 → G-C-B-R 3 × 3, 8 × 4 → C-B-R 1 × 1, 32 → Dual attention |
| Layer 3 | Flatten, FC, Dropout (0.5), Softmax (shared by all backbones) | | | |
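Reading the table column-wise, each layer of the grouped-convolution backbones stacks a 1 × 1 convolution, a 3 × 3 grouped convolution (e.g., 16 channels × 4 groups in Layer 1), another 1 × 1 convolution, and a dual attention submodule, where C-B-R presumably abbreviates Conv–BatchNorm–ReLU and G-C-B-R its grouped counterpart. The PyTorch-style sketch below is one plausible reading of such a layer; treating each window as a 2D (time × sensor-axis) map and substituting a CBAM-style channel-plus-spatial attention [13,14] for the paper's exact dual attention submodule are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def cbr(cin, cout, k):
    """C-B-R block: Conv -> BatchNorm -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

def gcbr(cin, cout, groups):
    """G-C-B-R block: grouped 3x3 Conv -> BatchNorm -> ReLU."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1, groups=groups, bias=False),
        nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class DualAttention(nn.Module):
    """CBAM-like channel attention followed by spatial (temporal) attention."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                       # x: (batch, C, time, sensor axes)
        b, c, _, _ = x.shape
        # Channel attention from pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention over the (time, sensor-axis) map.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))

class BranchLayer(nn.Module):
    """C-B-R 1x1 -> G-C-B-R 3x3 (grouped) -> C-B-R 1x1 -> dual attention."""
    def __init__(self, cin, width, groups=4):
        super().__init__()
        self.block = nn.Sequential(
            cbr(cin, width, 1),
            gcbr(width, width, groups),
            cbr(width, width, 1),
            DualAttention(width))

    def forward(self, x):
        return self.block(x)

# Two stacked layers as in the table: width 64 (16 x 4 groups), then 32 (8 x 4).
layer1 = BranchLayer(cin=1, width=64, groups=4)
layer2 = BranchLayer(cin=64, width=32, groups=4)
out = layer2(layer1(torch.randn(8, 1, 168, 6)))   # (batch, 1, time, sensor axes)
print(out.shape)                                  # torch.Size([8, 32, 168, 6])
```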
Results on PAMAP2:

| Method | | Accuracy (%) | F1-Score (%) | Params | FLOPs |
|---|---|---|---|---|---|
| M-branch G-CNN | mean | 96.19 | 96.15 | 1.219 M | 10.567 M |
| | std | 0.09 | | | |
| S-branch Att-based G-CNN | mean | 95.45 | 95.55 | 1.181 M | 10.492 M |
| | std | 0.12 | | | |
| M-branch CNN | mean | 95.63 | 95.51 | 1.219 M | 10.566 M |
| | std | 0.03 | | | |
| Our network | mean | 97.35 | 97.03 | 1.221 M | 10.573 M |
| | std | 0.08 | | | |
Results on UT:

| Method | | Accuracy (%) | F1-Score (%) | Params | FLOPs |
|---|---|---|---|---|---|
| M-branch G-CNN | mean | 98.83 | 98.83 | 1.037 M | 9.063 M |
| | std | 0.04 | | | |
| S-branch Att-based G-CNN | mean | 98.97 | 98.97 | 1.018 M | 9.026 M |
| | std | 0.04 | | | |
| M-branch CNN | mean | 98.09 | 98.09 | 1.037 M | 9.062 M |
| | std | 0.03 | | | |
| Our network | mean | 99.34 | 99.35 | 1.038 M | 9.067 M |
| | std | 0.04 | | | |
Results on OPPORTUNITY:

| Method | | Accuracy (%) | F1-Score (%) | Params | FLOPs |
|---|---|---|---|---|---|
| M-branch G-CNN | mean | 90.42 | 75.12 | 0.9 M | 7.2 M |
| | std | 0.05 | | | |
| S-branch Att-based G-CNN | mean | 89.31 | 71.81 | 0.79 M | 7 M |
| | std | 0.07 | | | |
| M-branch CNN | mean | 89.07 | 72.08 | 0.9 M | 7.2 M |
| | std | 0.07 | | | |
| Our network | mean | 90.83 | 76.16 | 0.9 M | 7.2 M |
| | std | 0.03 | | | |
Comparison with related work:

| Dataset | Method | Acc (%) | F1-Score (%) | Params | FLOPs |
|---|---|---|---|---|---|
| PAMAP2 | Multi-scale CNN [18] | 93.75 | 93.66 | 1.20 M | 220.5 M |
| | Hybrid Model [26] | 96.01 | 95.52 | 75.6 K | 897.3 M |
| | CNN + local loss [34] | 93.95 | - | - | - |
| | ResNet + attention [19] | 93.16 | - | 3.51 M | - |
| | Linear Grouped CNN [25] | 91.46 | - | 8.83 M | 39.7 M |
| | Our network | 97.35 | 97.03 | 1.22 M | 10.5 M |
| UT | DT [15] | - | 90.32 | - | - |
| | KNN [15] | - | 92.50 | - | - |
| | Hybrid Model [26] | 99.7 | 99.7 | - | - |
| | Our network | 99.34 | 99.35 | 1.03 M | 9 M |
| OPPORTUNITY | Linear Grouped CNN [25] | 81.54 | - | 2.24 M | 536.79 M |
| | Self-attention [21] | - | 67 | - | - |
| | ResNet + attention [19] | 82.75 | - | 1.57 M | - |
| | Teng et al. [35] | 83.06 | - | - | - |
| | Our network | 90.83 | 76.16 | 0.913 M | 7.25 M |
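For reference, the accuracy and F1-score columns can be computed from predicted and true window labels with standard tooling. The sketch below uses scikit-learn and assumes macro-averaged F1, a common choice for class-imbalanced datasets such as OPPORTUNITY; the paper's exact averaging scheme is not restated here, so treat this as illustrative only.

```python
# Hedged sketch of the reported metrics (not the authors' evaluation script).
from sklearn.metrics import accuracy_score, f1_score

y_true = [0, 1, 2, 2, 1, 0]        # toy ground-truth activity labels
y_pred = [0, 1, 2, 1, 1, 0]        # toy predicted labels
print(100 * accuracy_score(y_true, y_pred))             # accuracy in %
# Macro averaging is an assumption; weighted F1 is another common choice.
print(100 * f1_score(y_true, y_pred, average="macro"))  # F1-score in %
# For the "Params" column of a PyTorch model:
#   n_params = sum(p.numel() for p in model.parameters())
```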