3.1. Motion Collaborative Spatio-Temporal Vector
Human action arises from the integrated, cooperative movement of all joints rather than the isolated movement of a few joints. Therefore, this paper proposes the Motion Collaborative Spatio-Temporal Vector (MCSTV), which considers the cooperative movement of the limbs' joints.
Most actions are the result of multiple joints moving together. In this paper, the scattered, separate joint motion information is spliced by vector superposition into a comprehensive vector that describes the action as a whole and highlights the integrity and cooperativity of the movement. This comprehensive vector is called MCSTV; its basic principle is shown in Figure 2, where $\boldsymbol{L}_1$ represents the motion vector of the left upper limb, $\boldsymbol{L}_2$ that of the right upper limb, $\boldsymbol{L}_3$ that of the left lower limb, $\boldsymbol{L}_4$ that of the right lower limb, and $\boldsymbol{S}$ the original MCSTV. Because MCSTV is stitched together from the limbs' motion vectors, it can reflect the combined effect of multiple motion vectors to a certain degree. The human skeleton is shown in Figure 3a, and the skeletal frame of the high wave is shown in Figure 3b.
As shown in Figure 3b, we select the motion vector from the spine to the left hand to represent the motion of the left upper limb, that from the spine to the right hand to represent the motion of the right upper limb, that from the spine to the left foot to represent the motion of the left lower limb, and that from the spine to the right foot to represent the motion of the right lower limb, denoted $\boldsymbol{L}_1$, $\boldsymbol{L}_2$, $\boldsymbol{L}_3$ and $\boldsymbol{L}_4$, respectively.
For different actions, the contribution degree of each joint is different. As shown in Figure 2, the limbs' motion vectors are directly accumulated; because each limb contributes to a different degree, direct accumulation cannot describe the action accurately, so the contribution degree of each limb must first be obtained. These motion vectors are three-dimensional, and the change of a vector in space is the result of its changes in multiple views. Therefore, the motion vectors are projected onto the three orthogonal Cartesian planes $XOY$, $YOZ$ and $XOZ$. The offset of $\boldsymbol{L}_1$ on each plane is expressed as:

$$d_{xy}^{\,i} = \left\| \boldsymbol{l}_{xy}^{\,i+1} - \boldsymbol{l}_{xy}^{\,i} \right\|_2, \qquad d_{yz}^{\,i} = \left\| \boldsymbol{l}_{yz}^{\,i+1} - \boldsymbol{l}_{yz}^{\,i} \right\|_2, \qquad d_{xz}^{\,i} = \left\| \boldsymbol{l}_{xz}^{\,i+1} - \boldsymbol{l}_{xz}^{\,i} \right\|_2 \tag{4}$$

where $\boldsymbol{l}_{xy}^{\,i}$, $\boldsymbol{l}_{yz}^{\,i}$ and $\boldsymbol{l}_{xz}^{\,i}$ respectively represent the projections of the $i$-th frame of $\boldsymbol{L}_1$ on $XOY$, $YOZ$ and $XOZ$, and $d_{xy}^{\,i}$, $d_{yz}^{\,i}$ and $d_{xz}^{\,i}$ respectively represent the offsets of the $i$-th frame of $\boldsymbol{L}_1$ on $XOY$, $YOZ$ and $XOZ$.
The offset of $\boldsymbol{L}_1$ in each frame is the sum of its offsets on the three orthogonal planes:

$$d^{\,i} = d_{xy}^{\,i} + d_{yz}^{\,i} + d_{xz}^{\,i} \tag{5}$$

where $d^{\,i}$ represents the offset of the $i$-th frame of $\boldsymbol{L}_1$.
Each action consists of $N$ frames, so the total offset of $\boldsymbol{L}_1$ is the sum of the offsets of all frames:

$$D_1 = \sum_{i=1}^{N-1} d^{\,i} \tag{6}$$

where $D_1$ represents the total offset of $\boldsymbol{L}_1$. Similarly, according to Equations (4)–(6), the total offsets of $\boldsymbol{L}_2$, $\boldsymbol{L}_3$ and $\boldsymbol{L}_4$ are obtained as $D_2$, $D_3$ and $D_4$, respectively.
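As a concrete illustration of Equations (4)–(6), a minimal NumPy sketch of the per-frame offset computation might look as follows; the (N, 3) array layout and the function name limb_offsets are our own assumptions, not the authors' released code.

```python
import numpy as np

def limb_offsets(limb: np.ndarray) -> np.ndarray:
    """Per-frame offsets of one limb vector (Equations (4)-(5)).

    limb: array of shape (N, 3), the limb's motion vector in each of
          the N frames, with columns (x, y, z).
    Returns an array of the N-1 per-frame offsets d^i.
    """
    d = np.zeros(len(limb) - 1)
    # Project onto XOY, YOZ and XOZ by dropping one coordinate, take
    # the offset between consecutive frames on each plane (Eq. (4)),
    # and sum the three per-plane offsets (Eq. (5)).
    for a, b in [(0, 1), (1, 2), (0, 2)]:
        d += np.linalg.norm(np.diff(limb[:, [a, b]], axis=0), axis=1)
    return d

# Equation (6): the total offset is the sum over all frames, e.g.
# D1 = limb_offsets(L1).sum() for an (N, 3) array L1.
```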
The contribution degree of each limb is the ratio of that limb's offset to the total offset of all limbs:

$$\mu_m = \frac{D_m}{\sum_{n=1}^{4} D_n}, \qquad m = 1, 2, 3, 4 \tag{7}$$

where $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ respectively represent the contribution degrees of $\boldsymbol{L}_1$, $\boldsymbol{L}_2$, $\boldsymbol{L}_3$ and $\boldsymbol{L}_4$.
Finally, the contribution degree of each limb is used to constrain that limb's motion vector in each frame, and the weighted motion vectors are accumulated to form MCSTV:

$$\boldsymbol{S}^{\,i} = \sum_{m=1}^{4} \mu_m \boldsymbol{L}_m^{\,i} \tag{8}$$

where $\boldsymbol{L}_m^{\,i}$ represents the $i$-th frame of $\boldsymbol{L}_m$ and $\boldsymbol{S}^{\,i}$ represents the MCSTV formed by the weighted accumulation of the limbs' motion vectors. The MCSTV obtained by this weighted accumulation is shown in Figure 4. As can be seen from the figure, after weighted accumulation the MCSTV of this action is dominated by the limb with the largest contribution degree. Compared with the direct accumulation of motion vectors in Figure 2, this method reflects the dominant motion joints of the action more directly.
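Building on the limb_offsets sketch above, Equations (7) and (8) can then be expressed as a contribution-weighted accumulation; again, this is an illustrative reading under the same assumed array layout, not the authors' implementation.

```python
import numpy as np

def mcstv(limbs: list) -> np.ndarray:
    """Contribution-weighted accumulation of the four limb vectors.

    limbs: [L1, L2, L3, L4], each of shape (N, 3).
    Returns the MCSTV, shape (N, 3).
    """
    D = np.array([limb_offsets(L).sum() for L in limbs])  # total offsets D_m
    mu = D / D.sum()                      # contribution degrees, Equation (7)
    # Equation (8): weight each limb's per-frame motion vector by its
    # contribution degree and accumulate across the four limbs.
    return sum(m * L for m, L in zip(mu, limbs))
```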
3.2. Motion Spatio-Temporal Map
To describe the action information more comprehensively and accurately, this paper proposes a feature representation algorithm called Motion Spatio-Temporal Map (MSTM), which can completely express the spatial structure and temporal information of an action. The algorithm first computes the difference between adjacent frames of the depth sequence to obtain the motion energy. Next, the key information is extracted from the motion energy by key information extraction based on inter-frame energy fluctuation. Then, the key energy is projected onto three orthogonal axes to obtain the motion energy lists of the three axes. Finally, the motion energy lists are spliced in temporal order to form the MSTM. The flow of MSTM construction is illustrated in Figure 5.
As shown in Figure 5, the motion energy of an action is obtained through the difference operation between two adjacent frames of the depth sequence:

$$E_k = \left| I_{k+1} - I_k \right| \tag{9}$$

where $I_k$ and $I_{k+1}$ respectively represent the body energy at the $k$-th and $(k+1)$-th moments, i.e., the $k$-th and $(k+1)$-th frames of the depth video sequence, and $E_k$ represents the motion energy at the $k$-th moment.
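Equation (9) amounts to an absolute frame difference over the depth sequence, sketched below; the signed-integer cast guards against unsigned underflow and is our own addition.

```python
import numpy as np

def motion_energy(depth_seq: np.ndarray) -> np.ndarray:
    """Motion energy of a depth sequence (Equation (9)).

    depth_seq: array of shape (N, H, W), one depth map per frame.
    Returns an (N-1, H, W) array with E_k = |I_{k+1} - I_k|.
    """
    frames = depth_seq.astype(np.int32)  # depth maps are often uint16
    return np.abs(frames[1:] - frames[:-1])
```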
Because some joints sway habitually, the motion energy obtained by Equation (9) contains considerable redundancy. To address this problem, we propose an energy selection algorithm, i.e., key information extraction based on inter-frame energy fluctuation, and use it to remove the redundant motion energy at each moment. The main idea of the algorithm is to divide the human body into four areas according to the height range and width range of the activity area at the initial moment of the action, calculate the proportion of each region's motion energy in the whole body, and select a certain amount of motion energy as the main energy; the remaining energy is treated as redundancy and removed. The detailed steps of the algorithm are as follows:
Firstly, we calculate the height range $[h_{\min}, h_{\max}]$ and width range $[w_{\min}, w_{\max}]$ of the human activity area at the initial moment of the action. The body is divided into upper and lower halves at the hip center and into left and right halves along the axis of symmetry, yielding four regions: left upper body (LU), right upper body (RU), left lower body (LL) and right lower body (RL). The motion energy of the body is the sum of the motion energies of the four regions. The division of the human activity area is shown in Figure 6.
Next, we calculate the proportion of each region's motion energy in the whole body:

$$\rho_r = \frac{\sum_{x=1}^{W_r} \sum_{y=1}^{H_r} E_k^{\,r}(x, y)}{\sum_{x=1}^{W} \sum_{y=1}^{H} E_k(x, y)}, \qquad r \in \{\mathrm{LU}, \mathrm{RU}, \mathrm{LL}, \mathrm{RL}\} \tag{10}$$

where $\rho_r$, $r \in \{\mathrm{LU}, \mathrm{RU}, \mathrm{LL}, \mathrm{RL}\}$, represent the motion energy proportions of the four regions; $H$ and $W$ respectively represent the height and width of the whole body; $H_r$ and $W_r$ respectively represent the height and width of each region; and $E_k^{\,r}$ is the motion energy inside region $r$.
Then, we rank the motion energy proportions of the four regions from largest to smallest. The maximum value is denoted $\rho_1$ and the minimum value $\rho_4$, so that $\rho_1 \geq \rho_2 \geq \rho_3 \geq \rho_4$. In this paper, we select a proportion $T$ of the whole-body energy as the key energy; the remaining energy is considered redundant and removed from the original motion energy. The value of $T$ is determined by the experimental results and recognition accuracies in Section 4.2.1. The selection of key energy follows these criteria:
If $\rho_1 \geq T$, the motion energy of the region corresponding to $\rho_1$ is retained as the key energy, and the motion energy of the other three regions is considered redundant. If $\rho_1 < T$ and $\rho_1 + \rho_2 \geq T$, the motion energy of the regions corresponding to $\rho_1$ and $\rho_2$ is retained as the key energy, and that of the other two regions is considered redundant. If $\rho_1 + \rho_2 < T$ and $\rho_1 + \rho_2 + \rho_3 \geq T$, the motion energy of the regions corresponding to $\rho_1$, $\rho_2$ and $\rho_3$ is retained as the key energy, and that of the region corresponding to $\rho_4$ is considered redundant. If none of these conditions is met, the whole-body motion energy is considered key energy and retained.
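The criteria above can be condensed into a single greedy loop over the ranked proportions, as in the following sketch; splitting the regions by a symmetry column cx and a hip-center row cy, and the default T = 0.7, are illustrative assumptions (the paper tunes T experimentally in Section 4.2.1).

```python
import numpy as np

def key_energy(E: np.ndarray, cx: int, cy: int, T: float = 0.7) -> np.ndarray:
    """Key information extraction based on inter-frame energy fluctuation.

    E      : motion energy at one moment, shape (H, W).
    cx, cy : column of the body's symmetry axis and row of the hip
             center, splitting the activity area into LU/RU/LL/RL.
    T      : proportion of whole-body energy kept as key energy.
    """
    regions = [(slice(0, cy), slice(0, cx)),        # left upper (LU)
               (slice(0, cy), slice(cx, None)),     # right upper (RU)
               (slice(cy, None), slice(0, cx)),     # left lower (LL)
               (slice(cy, None), slice(cx, None))]  # right lower (RL)
    total = E.sum()
    if total == 0:
        return E
    rho = np.array([E[r].sum() / total for r in regions])  # Equation (10)
    kept = np.zeros_like(E)
    acc = 0.0
    for idx in np.argsort(rho)[::-1]:     # largest proportion first
        kept[regions[idx]] = E[regions[idx]]
        acc += rho[idx]
        if acc >= T:                      # enough key energy collected:
            break                         # remaining regions are redundant
    return kept
```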
The key energy is projected onto three orthogonal Cartesian planes to form three 2D projection maps corresponding to the front view, side view and top view, denoted $map_f$, $map_s$ and $map_t$. The 2D projection maps are expressed as:

$$map_f(x, y) = E'_k(x, y), \qquad map_s(z, y) = E'_k(x, y), \qquad map_t(x, z) = E'_k(x, y), \qquad z = I_k(x, y) \tag{11}$$

where $E'_k$ is the key energy at the $k$-th moment; $(x, y)$, $(z, y)$ and $(x, z)$ respectively represent the coordinates of a pixel point on $map_f$, $map_s$ and $map_t$; and $map_f(x, y)$, $map_s(z, y)$ and $map_t(x, z)$ respectively represent the values of a pixel point on $map_f$, $map_s$ and $map_t$.
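One possible reading of Equation (11) is a DMM-style scatter in which each energetic pixel contributes its energy at the depth value it takes in the current frame; the quantisation of depth into z_bins levels is our own assumption.

```python
import numpy as np

def project_views(E_key: np.ndarray, depth: np.ndarray, z_bins: int = 256):
    """Project one moment's key energy onto the three orthogonal views.

    E_key : key motion energy at one moment, shape (H, W).
    depth : corresponding depth frame I_k, shape (H, W); its values
            supply the z coordinate of each energetic pixel.
    Returns map_f (H, W), map_s (H, z_bins) and map_t (z_bins, W).
    """
    H, W = E_key.shape
    map_f = E_key.copy()                  # front view: height x width
    map_s = np.zeros((H, z_bins))         # side view: height x depth
    map_t = np.zeros((z_bins, W))         # top view: depth x width
    z = np.zeros((H, W), dtype=int)
    if depth.max() > 0:                   # quantise depth into z_bins levels
        z = (depth.astype(float) / depth.max() * (z_bins - 1)).astype(int)
    for y, x in zip(*np.nonzero(E_key)):  # scatter energy along the z axis
        map_s[y, z[y, x]] += E_key[y, x]
        map_t[z[y, x], x] += E_key[y, x]
    return map_f, map_s, map_t
```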
To obtain the energy distribution along the width, height and depth axes, we select $map_f$ and $map_s$ and continue to project them onto the corresponding orthogonal axes, i.e., we take the row sums or column sums of the 2D energy projection maps. According to the width, height and depth axes, three 1D motion energy lists are generated, expressed as $E_w$, $E_h$ and $E_d$ respectively:

$$E_w(j) = \sum_{y=1}^{H'} map_f(j, y), \qquad E_h(j) = \sum_{x=1}^{W'} map_f(x, j), \qquad E_d(j) = \sum_{y=1}^{H'} map_s(j, y) \tag{12}$$

where $E_w(j)$, $E_h(j)$ and $E_d(j)$ respectively represent the $j$-th elements of the energy lists on the width, height and depth axes, and $W'$ and $H'$ respectively represent the width and height of the 2D energy projection map.
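Given the view layout of the previous sketch, Equation (12) reduces to row and column sums; a minimal sketch:

```python
import numpy as np

def energy_lists(map_f: np.ndarray, map_s: np.ndarray):
    """1D motion energy lists on the three axes (Equation (12)).

    map_f : front-view energy map, shape (H, W).
    map_s : side-view energy map, shape (H, Z), height x depth.
    """
    E_w = map_f.sum(axis=0)  # column sums: distribution along the width axis
    E_h = map_f.sum(axis=1)  # row sums: distribution along the height axis
    E_d = map_s.sum(axis=0)  # column sums: distribution along the depth axis
    return E_w, E_h, E_d
```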
According to the temporal order, the 1D motion energy lists are spliced to form the MSTMs of the three axes, represented as $MSTM_w$, $MSTM_h$ and $MSTM_d$ respectively. For a depth sequence of $N$ frames, the MSTM is calculated as:

$$MSTM_u(k, :) = E_u^{\,k}, \qquad u \in \{w, h, d\}, \quad k = 1, \ldots, N-1 \tag{13}$$

where $u \in \{w, h, d\}$ ($w$ is the width axis, $h$ the height axis and $d$ the depth axis), $E_u^{\,k}$ represents the 1D motion energy list of the $k$-th frame of the action sequence on the $u$ axis, $MSTM_u$ represents the MSTM on the $u$ axis, and $MSTM_u(k, :)$ represents the $k$-th row of $MSTM_u$.
With the maximum width $w_{\max}$, minimum width $w_{\min}$, maximum height $h_{\max}$ and minimum height $h_{\min}$ of the body's activity area as the bounds, the MSTM is processed with a region of interest (ROI) [28], i.e., the image is cropped and normalized.
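Equation (13) together with the ROI step can be sketched as stacking the per-moment lists into rows and then cropping and normalising; using the span of non-zero columns as a stand-in for the activity-area bounds is our simplification.

```python
import numpy as np

def build_mstm(energy_lists_seq: list) -> np.ndarray:
    """Splice per-moment 1D energy lists into one MSTM (Equation (13)).

    energy_lists_seq: the N-1 energy lists of one axis, in temporal
    order; row k of the resulting map is the list of the k-th moment.
    """
    mstm = np.stack(energy_lists_seq, axis=0)
    # ROI processing [28]: crop to the bounds of the body's activity
    # area (approximated here by the span of non-zero columns) and
    # normalise the map to [0, 1].
    active = np.nonzero(mstm.sum(axis=0))[0]
    if active.size:
        mstm = mstm[:, active[0]:active[-1] + 1]
    if mstm.max() > 0:
        mstm = mstm / mstm.max()
    return mstm
```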
The actions in the original depth sequence are defined as positive-order actions; the positive-order high throw is shown in Figure 7a. Actions whose order is contrary to the original depth sequence are defined as reverse-order actions; the reverse-order high throw is shown in Figure 7b. The various feature maps of the positive- and reverse-order high throw are shown in Figure 8.
Figure 8a,b are the MSTMs of the positive-order and reverse-order high throw respectively; from left to right are the MSTMs of the height, width and depth axes. Because MSTM reflects the change of the energy information along the three orthogonal axes, it preserves the spatial and temporal information completely. Positive-order and reverse-order actions have the same motion trajectories but opposite temporal orders, so their MSTMs are symmetric along the time axis and easy to distinguish. In contrast, Figure 8e,f respectively represent the MHIs of the positive-order and reverse-order high throw. MHI retains part of the temporal information of an action and can distinguish positive-order from reverse-order actions; however, owing to the coverage of the trajectory and the absence of depth information, MHI cannot fully express the spatial information. Figure 8c,d are the MEIs, and Figure 8g,h the DMMs, of the positive-order and reverse-order high throw, respectively. MEI and DMM do not involve temporal information, so the two orders cannot be distinguished. MEI also lacks depth information, so its spatial information is incomplete, whereas DMM contains depth information and expresses the spatial information fully.
3.3. Feature Fusion
To describe the action information more accurately, fused features are commonly used in human action recognition research. Therefore, this paper fuses the skeleton feature MCSTV with the image feature MSTM; the fused feature not only reflects the integrity and cooperativity of the action, but also expresses the spatial structure and temporal information more completely.
Let $X$ denote the $N$ action samples, where the $i$-th sample contains features from $M$ different modalities, but these features correspond to the same representation in a common space, denoted $b_i$. Here $x_i$ is a sample of the $i$-th category, $b_i$ is the target projection center of the $i$-th category, and modality $M$ means there are $M$ types of data.
In this paper, we propose Multi-Target Subspace Learning (MTSL) to learn a common subspace for the different features. The minimization problem is as follows:

$$\min_{U_p} \sum_{p=1}^{M} \left\| U_p^{\top} X_p - B \right\|_F^2 + \lambda \sum_{p=1}^{M} \left\| U_p \right\|_{2,1} + \beta \sum_{p=1}^{M} \sum_{c=1}^{C-1} \left\| U_p^{\top} X_p - \tilde{B}_c \right\|_F^2 \tag{14}$$

where $U_p$ is the projection matrix of the $p$-th modality; $X_p$ is the sample features of the $p$-th modality before the projection; $U_p^{\top} X_p$ is the sample features of the $p$-th modality after the projection; $B$ is the primary target projection matrix in the subspace, $B = [b_1, b_2, \ldots, b_C]$; $C$ is the number of categories; $\lambda$ and $\beta$ are weighting parameters; and $\tilde{B}_c$ is the $c$-th auxiliary projection target center matrix for the samples of each category. An auxiliary projection target center is the point symmetric to another category's projection target center with respect to the projection target center of the current category. The selection of $\tilde{B}_c$ is shown in Algorithm 1.
Algorithm 1 The selection of $\tilde{B}_c$.
Input: The target projection matrix of the subspace: $B = [b_1, b_2, \ldots, b_C]$; the number of categories: $C$.
Output: The auxiliary projection target center matrices: $\tilde{B}_1, \ldots, \tilde{B}_{C-1}$.
$\tilde{B}_1, \ldots, \tilde{B}_{C-1} \leftarrow \mathbf{0}$
for all $i = 1$ to $C$ do
  for all $j = 1$ to $C$ do
    if $j \neq i$ then
      $a_j \leftarrow 2 b_i - b_j$
    else
      $a_j \leftarrow \varnothing$
    end if
  end for
  $A_i \leftarrow [a_1, \ldots, a_C]$ with the empty entry $a_i$ removed
  $\tilde{B}_c(:, i) \leftarrow$ the $c$-th column of $A_i$, for $c = 1, \ldots, C-1$
end for
Note: $b_j$ is the $j$-th column of $B$.
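A NumPy sketch of Algorithm 1, under our reading that the $c$-th auxiliary center of category $i$ is the point 2*b_i - b_j symmetric to another category's center b_j about b_i:

```python
import numpy as np

def auxiliary_targets(B: np.ndarray):
    """Auxiliary projection target centers (Algorithm 1).

    B: primary target matrix, shape (d, C); column i is the target
       center b_i of category i.
    Returns a list of C-1 matrices; column i of the c-th matrix is the
    c-th auxiliary center of category i.
    """
    d, C = B.shape
    B_aux = [np.zeros((d, C)) for _ in range(C - 1)]
    for i in range(C):
        c = 0
        for j in range(C):
            if j == i:
                continue  # a category yields no auxiliary center from itself
            # symmetric point of b_j with respect to b_i
            B_aux[c][:, i] = 2 * B[:, i] - B[:, j]
            c += 1
    return B_aux
```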
In Equation (14), the first term is used to learn the projection matrices, the second term is used for feature selection, and the third term is used to enlarge the inter-class distance between samples of different categories and to reduce the dimension of the projection target area.
According to the analysis of the $L_{2,1}$-norm by He et al. [29], the second term is optimized: $\left\| U_p \right\|_{2,1}$ is deduced to $\mathrm{tr}\left( U_p^{\top} D_p U_p \right)$, where $D_p$ is an auxiliary diagonal matrix of the $L_{2,1}$-norm. The $i$-th diagonal element of $D_p$ is $d_{ii} = \frac{1}{2 \left\| u^i \right\|_2 + \varepsilon}$, where $u^i$ is the $i$-th row vector of $U_p$. To keep the denominator from being 0, an infinitesimal $\varepsilon$ is introduced so that the denominator is never 0. Equation (14) is rewritten as:

$$\min_{U_p} \sum_{p=1}^{M} \left\| U_p^{\top} X_p - B \right\|_F^2 + \lambda \sum_{p=1}^{M} \mathrm{tr}\left( U_p^{\top} D_p U_p \right) + \beta \sum_{p=1}^{M} \sum_{c=1}^{C-1} \left\| U_p^{\top} X_p - \tilde{B}_c \right\|_F^2 \tag{15}$$
Equation (15) is differentiated with respect to $U_p$ and set to zero, and the computational formula of the projection matrix is obtained as:

$$U_p = \left( \left( 1 + \beta (C - 1) \right) X_p X_p^{\top} + \lambda D_p \right)^{-1} X_p \left( B + \beta \sum_{c=1}^{C-1} \tilde{B}_c \right)^{\top} \tag{16}$$

The projection matrix of each modality is obtained through Equation (16), and the test samples of each modality are sent into the common subspace to acquire the fusion features, which are then used for human action recognition.
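Since $D_p$ depends on $U_p$, Equation (16) is naturally applied in an alternating loop. The following sketch implements that reading; broadcasting the per-category centers to per-sample targets via a label vector, and fusing by averaging the projected modalities, are our own assumptions rather than details fixed by the paper.

```python
import numpy as np

def mtsl_fit(Xs, B, B_aux, labels, lam=0.1, beta=0.1, iters=20, eps=1e-8):
    """Alternating solution of Equations (15)-(16), a minimal sketch.

    Xs     : list of M modality feature matrices X_p, each (d_p, N).
    B      : target centers, (d_s, C); column c is b_c.
    B_aux  : list of C-1 auxiliary center matrices from Algorithm 1,
             each (d_s, C).
    labels : length-N array of category indices, used to broadcast the
             per-category centers to per-sample target matrices.
    """
    T = B[:, labels]                              # per-sample targets, (d_s, N)
    T_aux = sum(Bc[:, labels] for Bc in B_aux)    # summed auxiliary targets
    C1 = len(B_aux)                               # C - 1
    Us = []
    for X in Xs:
        U = np.random.randn(X.shape[0], B.shape[0]) * 0.01
        XXt = (1.0 + beta * C1) * (X @ X.T)       # fixed across iterations
        rhs = X @ (T + beta * T_aux).T            # X_p (B + beta sum_c B~_c)^T
        for _ in range(iters):
            # D_p depends on U_p, so Equation (16) is applied iteratively:
            # d_ii = 1 / (2 ||u^i||_2 + eps), u^i the i-th row of U_p.
            D = np.diag(1.0 / (2.0 * np.linalg.norm(U, axis=1) + eps))
            U = np.linalg.solve(XXt + lam * D, rhs)   # Equation (16)
        Us.append(U)
    # Project each modality into the common subspace and fuse; simple
    # averaging is our assumption, as the paper does not fix the fusion op.
    Z = sum(U.T @ X for U, X in zip(Us, Xs)) / len(Xs)
    return Us, Z
```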