1. Introduction
Actions play a particularly important role in human communication: they convey essential messages, such as feelings and underlying intentions, that help us understand a person. Giving intelligent machines the same ability to understand human behavior is important for natural human–computer interaction and for many other practical applications, and it has attracted much attention in recent years. Today, modern sensor technology and algorithms for human pose estimation make it much easier to obtain the 2D/3D human skeleton. Human skeleton data, which can be extracted from video frames using pose estimation algorithms or captured directly by depth sensors, consist of time series of skeletal joints given as 2D or 3D coordinates. Compared with traditional RGB video recognition, action recognition based on skeleton data has the advantage of effectively reducing the influence of interference factors such as illumination changes, environmental background, and occlusion, and it adapts better to dynamic environments and complex backgrounds.
Early deep learning methods treated the human joints as a set of individual features and organized them into feature sequences or pseudo-images so that an RNN or CNN could recognize the action; however, these methods ignore the intrinsic correlation between joints, even though the human topology carries important information about the skeleton. ST-GCN [1] was the first to introduce graph convolution together with one-dimensional temporal convolution to extract motion features, using the graph structure to model the correlation between human joints. GCN-based methods have since become increasingly popular, and many excellent works have emerged on this basis. AS-GCN [2] and 2s-AGCN [3] propose methods for adaptively learning the relationships between spatial joints from data. CTR-GCN [4] embeds three kinds of shared adjacency matrices, divided according to the graph, into a dynamic space in the channel dimension. However, most of these methods are biased towards modeling the spatial dimension while neglecting the temporal dimension. For temporal feature extraction, the many existing works [1,2,3,5] that rely only on fixed-size convolution kernels are far from sufficient. To model actions of different durations, recent works [4,6] introduce multi-scale temporal convolutions to enhance ordinary temporal convolution. These models use fixed-size convolution kernels at each layer of the network and different dilation rates [7] to obtain larger receptive fields. Hence, some recent models [8,9] also adopt this multi-scale temporal modeling scheme.
While multi-scale temporal convolution uses several fixed-size convolution kernels or dilation rates at each layer of the network, we argue that this scheme is inflexible. Skeleton-based action recognition models are usually built by stacking GCN blocks. GCN modules close to the network output tend to have larger receptive fields and can capture more contextual information, whereas the receptive field of GCN modules close to the network input is relatively small [10]. It follows that different layers contribute differently to skeleton recognition. Therefore, for learning in the temporal dimension, simply using a few convolution kernels of fixed size or dilation rate in every layer of the network is not enough to capture the fine-grained semantics of skeletal actions and achieve effective modeling. Furthermore, most current models first extract spatial features and then connect the original input to the output of the temporal module through residuals. Residual connections allow information to be passed directly to subsequent layers, thereby preserving the original features and preventing them from vanishing layer by layer. In practice, however, the effective receptive field is not large, and there is a great deal of redundancy in deep residual networks, which leads to a loss of context when aggregating spatio-temporal information. In addition, approaches such as [3,11,12,13] use multi-stream networks to extract high-order features from the skeletal data, and this multi-stream approach is adopted by many advanced models. However, these approaches treat the skeleton as a single tree-like data structure, so the entire input skeleton tree in each frame is processed as a whole, even though many actions only require the participation of local body joints. For example, action categories such as “waving” and “victory” involve only the hand joints, and action categories such as “walking” and “kicking” involve only the leg joints [14]. Some excellent body partitioning methods split the skeleton tree into several separate part groups [15]. However, we believe this is still not sufficient, especially for actions such as “running” and “high jump” that require the participation of most joints or body parts.
To address the problems with current models mentioned above, we introduce a new training framework for skeleton-based action recognition named MMAFF, which contains a temporal modeling module with a multi-scale adaptive attention feature fusion mechanism. First, to extract adaptive multi-scale spatio-temporal topology features with larger receptive fields, we replace residual connections with an attention fusion mechanism that efficiently aggregates features across spatio-temporal scales, addressing the problems of context aggregation and initial feature integration. The temporal modeling module adaptively integrates topological features to help model the actions. Specifically, the extracted channel-specific spatial topology features are fed into our temporal adaptive feature fusion module, which consists of two parts. In the first part, we optimize traditional multi-scale temporal convolution with multi-scale adaptive convolution kernels and dilation rates, using a simple and effective self-attention mechanism that allows different network layers to adaptively select kernel sizes and dilation rates instead of keeping them fixed. In the second part, we use an attention feature fusion mechanism that replaces the residual connection between the initial features and the output of the temporal module: we attend to the initial features and the temporal features separately and then fuse them, effectively combining the initial features with the high-dimensional temporal features and solving the problems of context aggregation and initial feature integration. Building on current multi-stream approaches, we also propose a partial-stream processing method called the limb stream. The limb stream integrates the joint-motion and bone-motion modality data in the channel dimension and uses fewer joints and network layers to train the joint-motion and bone-motion streams simultaneously, thus effectively reducing the number of training iterations and the number of parameters of the entire model. Since most movements are completed with the cooperation of the limbs, we treat the limbs as a whole that not only reflects local detailed movements but also captures joint movements requiring the cooperation of multiple body parts. This allows a more complete and concentrated representation of motions involving a subset of joints. The framework is shown in Figure 1. Our contributions mainly include the following points:
- (1)
We optimize traditional multi-scale temporal convolution to make it more adaptive and give it the ability to fuse initial features, so that a larger receptive field and local and global context can be obtained.
- (2)
We propose the limb stream as a supplement to the traditional independent modality processing method; it captures finer features of limb joint-group motion, enhances the recognition ability of the model, and is combined with the joint stream and bone stream in the final score fusion.
- (3)
We conduct extensive experiments on NTU-RGB+D and NTU-RGB+D 120 to compare our proposed methods with state-of-the-art models. Experimental results demonstrate the significant improvement of our methods.
The remainder of this paper is organized as follows. Section 2 presents related work and recent progress. Section 3 details our proposed optimization. In Section 4, we compare our results with state-of-the-art methods and conduct ablation experiments. Section 5 summarizes the paper.
3. Method
In this section, we introduce the specific method details for skeleton-based action recognition, as shown in Figure 2. Section 3.1 introduces the basic theory in this field. Section 3.2 introduces the components of MMAFF. Section 3.3 details the architecture of the SAGC module. In Section 3.4, we describe the TAFF module in detail. In Section 3.5, we replace the commonly used multi-stream fusion method [4,29] with a multi-modality approach.
3.1. Preliminaries
In most skeleton-based action recognition tasks, GCN-based methods construct the human skeleton as an undirected spatio-temporal graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ represent the sets of joints and bone edges, respectively. Consequently, a temporal skeleton sequence can be described as $X \in \mathbb{R}^{C \times T \times N}$, where $N$ and $T$ represent the number of joints and the size of the temporal window, respectively. According to the relationship between the joints and the barycenter, the nodes are divided into three subgraphs to build the adjacency matrix. The GCN operation with input feature map $F_{in}$ is as follows:

$$F_{out} = \sum_{p \in \mathcal{P}} W_p \left( F_{in} A_p \right),$$

where $\mathcal{P} = \{\mathrm{id}, \mathrm{cf}, \mathrm{cp}\}$ denotes the graph subsets, whose elements indicate the identity, centrifugal, and centripetal joint subsets, respectively; $W_p$ denotes the pointwise convolution operation; and $A_p$ is the $p$-th channel shared adjacency matrix.
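To make the notation concrete, the following is a minimal PyTorch sketch of this subset-wise graph convolution; the module name and the tensor layout (batch, channels, frames, joints) are our own assumptions and not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class UnitGCN(nn.Module):
    """Subset-wise graph convolution sketch: out = sum_p W_p (F_in A_p)."""

    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        # A: (3, N, N) adjacency matrices for the identity, centrifugal,
        # and centripetal joint subsets
        self.register_buffer("A", A)
        # one pointwise (1x1) convolution W_p per joint subset
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, kernel_size=1) for _ in range(A.size(0))]
        )

    def forward(self, x):  # x: (batch, C_in, T, N)
        out = 0
        for p, conv in enumerate(self.convs):
            # aggregate over neighbouring joints of subset p, then apply W_p
            x_agg = torch.einsum("bctv,vw->bctw", x, self.A[p])
            out = out + conv(x_agg)
        return out  # (batch, C_out, T, N)
```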
3.2. MMAFF
Next, we focus on the specific details of the MMAFF framework in the single-stream setting, since the same network structure is shared across the different streams. Figure 2a gives an overview of the network structure of an MMAFF single stream. The initial skeleton data $X$ are first processed by the Multi-Modality Adaptive Feature Fusion Framework (MMAFF), and the result is passed through global average pooling (GAP) and a fully connected (FC) layer to obtain the prediction of the individual stream. MMAFF consists of multiple Spatio-Temporal Adaptive Fusion (STAF) blocks.
The specific structure of STAF is shown in Figure 2b. Each STAF block contains a Spatial Attention Graph Convolution (SAGC) module and a Temporal Adaptive Feature Fusion (TAFF) module. The SAGC module in Figure 2c dynamically extracts information along the spatial dimension, and the TAFF module in Figure 2d adaptively extracts temporal relations between joints and fuses multi-scale temporal features with the initial features. Next, we elaborate on the details of each module.
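As an illustration of the single-stream pipeline in Figure 2a, the sketch below stacks STAF blocks and applies GAP and an FC classifier; the block factory, channel schedule, and class count are illustrative assumptions rather than the authors' exact configuration.

```python
import torch.nn as nn

class SingleStream(nn.Module):
    """Single-stream pipeline sketch: stacked STAF blocks, then GAP and FC.

    `staf_block` stands for the SAGC + TAFF block of Sections 3.3 and 3.4 and
    is assumed to be defined elsewhere.
    """

    def __init__(self, staf_block, in_channels=3, num_classes=60,
                 channels=(64, 64, 64, 64, 128, 128, 128, 256, 256, 256)):
        super().__init__()
        blocks, c_prev = [], in_channels
        for c in channels:
            blocks.append(staf_block(c_prev, c))
            c_prev = c
        self.blocks = nn.ModuleList(blocks)
        self.fc = nn.Linear(c_prev, num_classes)

    def forward(self, x):  # x: (batch, C, T, N)
        for block in self.blocks:
            x = block(x)
        x = x.mean(dim=(2, 3))  # global average pooling over frames and joints
        return self.fc(x)       # per-stream class scores
```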
3.3. SAGC Module
Following [4], we construct a dynamic topology with spatial attention graph convolution to model channel-specific joint relations. As shown in Figure 2c, the initial skeleton data are fed into two parallel branches, each processed by a 1 × 1 convolution and a temporal pooling block. The attention features are obtained by subtracting the pooled outputs of the two branches. This feature map is summed with the predefined adjacency matrix $A$ to obtain the final channel-wise topologies $R$. In addition, we use the attentional feature fusion mechanism in place of the ordinary residual connection to better aggregate information across spatial and temporal scales; for details, see the TAFF module in Section 3.4. The topology refinement is

$$R = A + \alpha \cdot Q,$$

where $\alpha$ is a learnable parameter, $A$ is the $p$-th channel shared topology, and $Q$ is the channel-specific topological relationship, defined as

$$Q = \xi\big(\mathrm{Pool}(\phi(X)) - \mathrm{Pool}(\psi(X))\big),$$

where $\phi$, $\psi$, and $\xi$ are 1 × 1 convolutions and $\mathrm{Pool}$ is temporal pooling. After obtaining the channel-wise topologies $R$, we feed the initial skeleton features into a 1 × 1 convolution and multiply the result with $R$ to aggregate the spatial dimension information as follows:

$$F_{out} = \eta(X) \otimes R,$$

where $\eta$ is a 1 × 1 convolution block and ⊗ is the matrix multiplication operation.
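The sketch below illustrates this channel-wise topology refinement in PyTorch; the symbols phi, psi, xi, eta, and alpha mirror the equations above, while the reduced channel width and the use of mean pooling are our assumptions.

```python
import torch
import torch.nn as nn

class SAGC(nn.Module):
    """Sketch of the SAGC module: channel-wise refinement R = A + alpha * Q."""

    def __init__(self, in_channels, out_channels, A, rel_channels=8):
        super().__init__()
        self.register_buffer("A", A)                        # (3, N, N) shared topologies
        self.phi = nn.Conv2d(in_channels, rel_channels, 1)  # branch 1
        self.psi = nn.Conv2d(in_channels, rel_channels, 1)  # branch 2
        self.xi = nn.Conv2d(rel_channels, out_channels, 1)  # lifts Q to the output channels
        self.eta = nn.Conv2d(in_channels, out_channels, 1)  # feature transform
        self.alpha = nn.Parameter(torch.zeros(1))           # learnable refinement scale

    def forward(self, x):  # x: (batch, C, T, N)
        # temporal pooling of both branches, then pairwise subtraction over joints
        a = self.phi(x).mean(dim=2)                         # (batch, r, N)
        b = self.psi(x).mean(dim=2)                         # (batch, r, N)
        q = self.xi(a.unsqueeze(-1) - b.unsqueeze(-2))      # (batch, C_out, N, N)
        out = 0
        for p in range(self.A.size(0)):
            r = self.A[p] + self.alpha * q                  # R = A + alpha * Q
            out = out + torch.einsum("bctv,bcvw->bctw", self.eta(x), r)
        return out
```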
3.4. TAFF Module
The multi-scale temporal adaptive feature fusion module consists of two parts: the TA (temporal adaptive) module and the TFF (temporal feature fusion) module. The first part dynamically adjusts the convolution kernel size and dilation rate at different network layers. As shown in Figure 2d, this module improves on traditional multi-scale temporal convolution and contains four branches, each of which uses a 1 × 1 convolution to reduce the channel dimension. The two branches on the left are the core of the adaptive function: by introducing a simple attention mechanism, the convolution kernel size $k$ and dilation rate $d$ are dynamically adjusted according to the output channel dimension. Inspired by the attention mechanism of [30], we use the following mapping:

$$k = \left| \frac{\log_2(C_l)}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}},$$

where $C_l$ is the output channel dimension of the $l$-th network layer, and $\gamma$ and $b$ are the parameters of the mapping function, set to 2 and 1, respectively. With this mapping, $k$ is 3 at layers 1–4 of the network and 5 at layers 5–10. Similarly, $d$ is 2 at layers 1–7 of the network and 3 at layers 8–10. The four branches of different scales are combined through the aggregation function to obtain the multi-scale feature $X_m$. Since we do not introduce more branches, there is almost no change in the parameter count or computational complexity. Specific experiments can be found in Section 4.4.
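As a concrete reading of this mapping, the snippet below computes an odd kernel size from a layer's output channel dimension; the rounding convention and the printed channel schedule are assumptions for illustration only.

```python
import math

def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Map the layer's output channel dimension to an odd temporal kernel size
    (a sketch of the adaptive selection described above)."""
    t = abs(math.log2(channels) / gamma + b / gamma)
    k = int(t)
    return k if k % 2 == 1 else k + 1  # force an odd kernel size

# Illustrative values for a typical channel schedule:
#   64 channels (early layers)        -> k = 3
#   128 or 256 channels (later layers) -> k = 5
print(adaptive_kernel_size(64), adaptive_kernel_size(128), adaptive_kernel_size(256))
```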
In the second part, we use an attention feature fusion module to aggregate contextual information of different scales and dimensions along the channel dimension. For the initial feature fusion problem in spatio-temporal modeling, we were inspired by the TCA-GCN [27] model; however, we use the AFF module of [31] to fuse the features of the different branches, with two input branches: one is $X$ (the initial skeleton data) and the other is $X_m$ (the multi-scale aggregated features). We attend to the initial features and the temporal features separately and then fuse them, effectively combining the initial features with the high-dimensional temporal features, solving the problem of context aggregation and initial feature integration and improving the effectiveness of modeling. As shown in Figure 2d, this is expressed as

$$Z = M(X \uplus X_m) \odot X + \big(1 - M(X \uplus X_m)\big) \odot X_m,$$

where $X$ denotes the residual connection of the input, $X_m$ is the concatenated output of the multi-scale convolution, and $\odot$ is element-wise multiplication. The fusion weight $M$ [31] is expressed as

$$M(F) = \sigma\big(L(F) \oplus g(F)\big),$$

where $L(F)$ and $g(F)$ are the local channel context and global channel context, respectively. The local context information is added to the global context information within the attention module, and initial feature integration $\uplus$ is performed on the input features $X$ and $X_m$. After the sigmoid activation function $\sigma$, the output values lie between 0 and 1. We take a weighted average of $X$ and $X_m$, with the fusion weights and their complement (one minus the weights) acting as a soft selection; through training, the network determines the respective weights.
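The sketch below illustrates this attentional fusion in PyTorch, following the AFF design of [31]: a sigmoid fusion weight built from local and global channel contexts softly selects between the initial features and the multi-scale temporal features. The reduction ratio and the exact layers inside each context branch are assumptions.

```python
import torch
import torch.nn as nn

class AFF(nn.Module):
    """Sketch of the attentional feature fusion used in the TFF part."""

    def __init__(self, channels, r=4):
        super().__init__()
        mid = max(channels // r, 1)

        def channel_context(use_global_pool):
            layers = [nn.AdaptiveAvgPool2d(1)] if use_global_pool else []
            layers += [
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels),
            ]
            return nn.Sequential(*layers)

        self.local_ctx = channel_context(use_global_pool=False)   # L(.)
        self.global_ctx = channel_context(use_global_pool=True)   # g(.)

    def forward(self, x, x_m):  # x: initial features, x_m: multi-scale features
        z = x + x_m                                                # initial integration
        m = torch.sigmoid(self.local_ctx(z) + self.global_ctx(z))  # fusion weights in (0, 1)
        return m * x + (1.0 - m) * x_m                             # soft selection
```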
3.5. Limb-Modality Generator
As mentioned above, we retain the multi-stream fusion scheme to train the joint and bone data separately, while integrating the joint-motion stream and bone-motion stream into what we call the limb stream. The limb joint set contains all joints belonging to the limbs; for the NTU-RGB+D dataset, the number of limb joints is 22. Since most actions are performed by the limbs, the limbs better reflect the characteristics of the motion, and the global motion information can be propagated to a specific local joint group without being lost. Therefore, as shown in Figure 1, training uses data from three modalities: joint, bone, and limb.
The final classification is then obtained by weighted averaging of the predicted scores of each modality. Adjusting the number of limb-stream network layers can limit the total number of parameters used in the entire model, and replacing the motion streams with the limb stream can also reduce the training frequency and total training time.
Finally, we integrate the joint-motion and bone-motion modality data in the channel dimension to generate the limb-stream representation, which we then use as part of the input to the network. Concatenating the modality data helps to model the inter-modality relations in a more direct manner.
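A minimal NumPy sketch of this limb-modality generation is given below. It assumes the motion modalities are frame-to-frame temporal differences (a common convention in this literature), and it leaves the exact list of the 22 limb joint indices open, since that depends on the dataset's joint ordering.

```python
import numpy as np

def build_limb_stream(joint_seq, bone_seq, limb_joints):
    """Sketch of the limb-modality generator.

    joint_seq, bone_seq: arrays of shape (C, T, N) for one sample.
    limb_joints: indices of the limb joints (22 for NTU-RGB+D); the exact
    index list depends on the dataset's joint ordering and is left open here.
    """
    # frame-to-frame motion (temporal difference), padded so T is preserved
    joint_motion = np.diff(joint_seq, axis=1, append=joint_seq[:, -1:, :])
    bone_motion = np.diff(bone_seq, axis=1, append=bone_seq[:, -1:, :])
    # keep only the limb joints and stack the two motion modalities channel-wise
    limb = np.concatenate(
        [joint_motion[:, :, limb_joints], bone_motion[:, :, limb_joints]], axis=0
    )
    return limb  # shape: (2 * C, T, len(limb_joints))
```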
5. Conclusions
In this paper, we propose a multi-modality adaptive feature fusion framework (MMAFF) to increase the receptive field of the model in the spatial and temporal dimensions. First, we propose the TAFF module, consisting of a TA module and a TFF module, which dynamically adjusts the convolution kernel size and dilation rate at different network layers and aggregates multi-scale context information along the channel dimension. Then, we introduce the limb stream; as a supplement to the traditional independent modality processing method, the limb stream enables richer and more dedicated representations for actions involving a subset of joints. Extensive experiments show that our model achieves advanced performance on different types of datasets.
In future research, we plan to extract features from the temporal and spatial graph dimensions simultaneously, rather than extracting spatial information first and then temporal information, in order to reduce feature redundancy. We will also consider introducing language text models to improve the performance of action recognition and reduce computational complexity.