3.1. Action Definition
Human hands are the primary interaction objects for the robot during the human–robot collaborative assembly process. To prevent collisions with the worker's hands and upper body during robot motion, it is essential to predict the future positions of the worker's arms. Focusing on hand trajectory prediction during the assembly process, this work considers the joints of the worker's upper body, including the neck, left (right) clavicle, left (right) shoulder, left (right) elbow, left (right) wrist, left (right) hand, and left (right) fingertip, a total of 13 joints that form a symmetrical structure. The human upper body skeleton is shown in Figure 1.
The historical observation of frame length $T_h$ is represented as $X_{1:T_h} = [x_1, x_2, \dots, x_{T_h}] \in \mathbb{R}^{T_h \times J \times 3}$, where $J$ is the number of joints and $x_t \in \mathbb{R}^{J \times 3}$ represents the human pose at frame $t$, consisting of the 3D coordinates of the $J$ joints. The future human motion of length $T_f$ is $X_{T_h+1:T_h+T_f}$. Given the historical observation $X_{1:T_h}$, human motion prediction aims to predict the future skeletal position sequence $\hat{X}_{T_h+1:T_h+T_f}$ and to maximize the symmetry between $\hat{X}_{T_h+1:T_h+T_f}$ and the ground truth $X_{T_h+1:T_h+T_f}$.
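For concreteness, the sketch below shows the tensor shapes implied by this problem definition; the frame lengths and the error measure are illustrative assumptions, not values taken from this work.

```python
# Minimal shape sketch of the prediction problem (illustrative values, not from the paper).
import numpy as np

J = 13          # upper-body joints: neck, clavicles, shoulders, elbows, wrists, hands, fingertips
T_HIST = 10     # assumed length of the historical observation
T_FUT = 25      # assumed length of the future motion to predict

X_hist = np.zeros((T_HIST, J, 3))      # observed poses: 3D coordinates per joint per frame
X_future = np.zeros((T_FUT, J, 3))     # ground-truth future motion
X_pred = np.zeros_like(X_future)       # model output with the same layout

# One common way to compare prediction and ground truth (illustrative only):
# mean Euclidean distance per joint over all predicted frames.
error = np.linalg.norm(X_pred - X_future, axis=-1).mean()
```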
3.2. Discrete Cosine Transform
The discrete cosine transform (DCT) maps the joint coordinate trajectories from Cartesian space to trajectory space using a set of pre-constructed discrete orthogonal bases, so that the original joint trajectories are converted into linear combinations of the basis trajectories. The DCT operation can extract both the current and periodic temporal properties of the motion sequences, which represents the motion dynamics of the joints more intuitively and is beneficial for obtaining continuous motions [30]. In [18], the DCT operation was first applied to the motion prediction problem, and its effectiveness was verified. Following this precedent [18], the DCT transform is first used to process the joint trajectory sequence.
Given a $T$-frame motion sequence $X \in \mathbb{R}^{T \times 3J}$, the sequence is projected to the DCT domain via the DCT operation:
$$C = DX,$$
where $D \in \mathbb{R}^{T \times T}$ is the predefined DCT basis and $C$ is the matrix of DCT coefficients. As the DCT transform is an orthogonal transform, the inverse discrete cosine transform (iDCT) can be used to recover the trajectory sequence from the DCT coefficients; the iDCT uses the transpose of $D$ as the basis matrix:
$$X = D^{\top}C.$$
Considering the requirement of rapid response for HRC and the smoothness of human motion, only the first $L$ rows of $D$ (and the corresponding $L$ columns of $D^{\top}$) are selected for the transformation, denoted as $D_L \in \mathbb{R}^{L \times T}$, and a $T$-frame trajectory is compressed into an $L$-dimensional coefficient vector. Although this operation is a lossy compression of the original trajectory, it offers two benefits to MHP: (1) the high-frequency part of the motion dynamics is discarded, so the subtle jitter of the prediction results is reduced and the predicted trajectory becomes smoother; and (2) by compressing the length of the sequence, computation costs during training and inference are reduced, and the real-time performance is enhanced [18].
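As an illustration of this transform, the following numpy sketch builds an orthonormal DCT basis, keeps only its first $L$ rows, and applies the DCT/iDCT to a joint trajectory. The variable names and the value of $L$ are assumptions for illustration, not the authors' implementation.

```python
# A numpy sketch of the truncated DCT/iDCT described above (not the authors' code).
import numpy as np

def dct_basis(T: int) -> np.ndarray:
    """Orthonormal DCT-II basis matrix D of shape (T, T), so that D @ D.T is the identity."""
    k = np.arange(T)[:, None]   # frequency index (rows)
    t = np.arange(T)[None, :]   # time index (columns)
    D = np.sqrt(2.0 / T) * np.cos(np.pi * (2 * t + 1) * k / (2 * T))
    D[0] /= np.sqrt(2.0)
    return D

T, J, L = 35, 13, 20                 # frames, joints, retained DCT coefficients (assumed values)
X = np.random.randn(T, 3 * J)        # a T-frame trajectory, one column per coordinate channel

D = dct_basis(T)
D_L = D[:L]                          # keep only the first L (low-frequency) basis rows

C = D_L @ X                          # DCT: (L, 3J) coefficients, a lossy compression of X
X_rec = D_L.T @ C                    # iDCT with the transposed basis: smoothed (T, 3J) trajectory
```

Discarding the $T - L$ high-frequency rows is what removes jitter from the recovered trajectory and shortens the sequence the network has to process.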
3.3. Fast Spatial–Temporal Transformer Network
The inter-joint spatial–temporal relationships in movement play a vital role in human action prediction and have typically been modeled with RNNs or spatial–temporal GCNs. However, RNNs cannot capture the spatial structure well, and GCNs have a limited temporal receptive field, which limits their ability to predict the future. Operating on the full spatial–temporal resolution, Transformer models have stronger spatial–temporal perception and inference ability. However, due to the large number of linear layers, their parameter count is larger than that of RNNs or GCNs, which places higher computing power requirements on HRC devices.
This work proposes the Fast Spatial–Temporal Transformer Network (FST-Trans) to address these two issues. FST-Trans is designed with a symmetrical encoder–decoder structure, as shown in Figure 2.
Firstly, the historical observation $X_{1:T_h}$ is padded with its final frame $x_{T_h}$ to obtain a $(T_h + T_f)$-frame sequence. After the DCT transformation, the DCT coefficients $C$ are fed into the network. Feature embedding is performed by a single linear layer, and the initial feature $F^{0}$ is obtained by adding the temporal position encoding $E_{t}$ and the joint position encoding $E_{j}$:
$$F^{0} = \mathrm{Linear}(C) + E_{t} + E_{j},$$
where $E_{t}$ and $E_{j}$ are both learnable parameters, and $d$ represents the hidden dimension. The initial feature $F^{0}$ goes through $N$ layers of FST-Trans blocks to obtain the final feature $F^{N}$, and another single linear layer is used as the prediction head to obtain the DCT coefficients of the predicted trajectory:
$$\hat{C} = \mathrm{Linear}(F^{N}).$$
Via an iDCT operation, the joint 3D coordinate trajectory $\hat{X}$ can be obtained, and $\hat{X}_{T_h+1:T_h+T_f}$ is the future motion estimation.
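The following PyTorch sketch illustrates this pipeline end to end (padding, DCT, embedding with the two position encodings, a stack of blocks, the prediction head, and the iDCT). The class name, tensor layout, and hidden size are assumptions for illustration, not the authors' code; the FST-Trans block itself is sketched further below.

```python
# A hedged PyTorch sketch of the overall prediction pipeline (not the authors' code).
import torch
import torch.nn as nn

class FSTTransPredictor(nn.Module):
    def __init__(self, blocks=None, n_joints=13, n_coeff=20, d_model=128):
        super().__init__()
        self.embed = nn.Linear(3, d_model)                              # single-linear feature embedding
        self.pos_t = nn.Parameter(torch.zeros(n_coeff, 1, d_model))     # temporal position encoding E_t
        self.pos_j = nn.Parameter(torch.zeros(1, n_joints, d_model))    # joint position encoding E_j
        self.blocks = nn.ModuleList(blocks if blocks is not None else [nn.Identity()])
        self.head = nn.Linear(d_model, 3)                               # single-linear prediction head

    def forward(self, x_hist, D_L):
        # x_hist: (T_h, J, 3); D_L: (L, T_h + T_f) truncated DCT basis.
        t_total = D_L.shape[1]
        pad = x_hist[-1:].repeat(t_total - x_hist.shape[0], 1, 1)       # repeat the last observed frame
        x_pad = torch.cat([x_hist, pad], dim=0)                         # (T_h + T_f, J, 3)

        c = torch.einsum('lt,tjc->ljc', D_L, x_pad)                     # DCT coefficients: (L, J, 3)
        f = self.embed(c) + self.pos_t + self.pos_j                     # initial feature F^0: (L, J, d)
        for block in self.blocks:                                       # N FST-Trans blocks
            f = block(f)
        c_hat = self.head(f)                                            # predicted DCT coefficients
        x_hat = torch.einsum('lt,ljc->tjc', D_L, c_hat)                 # iDCT: (T_h + T_f, J, 3)
        return x_hat                                                    # last T_f frames = prediction
```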
Both the encoder and decoder consist of FST-Trans blocks, and skip-layer connections are applied to minimize feature degeneracy. The sub-modules of a single FST-Trans block include a temporal multi-head attention, a spatial multi-head attention, and a feed-forward neural network; its computational flow is as follows:
For the $l$-th layer FST-Trans block, the input feature $F^{l-1}$ is first layer normalized by LayerNorm (LN) [31], and then the joint attention coefficient matrix $A_{joint}$ and the temporal attention coefficient matrix $A_{temp}$ are computed, respectively. $W_{Q}$, $W_{K}$, and $W_{V}$ are all learnable weights for the query, key, and value feature vectors in the attention mechanism. $A_{joint}$ and $A_{temp}$ are computed via the multi-head attention method [24], in which a softmax operation normalizes the attention weights.
After the linear projection $W_{V}$, the normalized feature is multiplied with the attention coefficient matrices $A_{joint}$ and $A_{temp}$, then added to the original layer input feature $F^{l-1}$ to form the residual connection, and layer normalization is performed to obtain the feature $\bar{F}^{l}$, which is used as the input to the feed-forward neural network (Equations (10) and (11)). In Equation (10), $\mathrm{GELU}$ represents the GELU activation function [32], $W_{1}$ and $W_{2}$ are two learnable weights, and $b_{1}$ and $b_{2}$ are the learnable biases. The output of the feed-forward neural network is added to the layer input $\bar{F}^{l}$ to obtain the block output feature $F^{l}$.
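A single FST-Trans block could be sketched in PyTorch as below. This is a simplified single-head rendering under stated assumptions: the spatial and temporal attention share one value projection, their outputs are summed into the residual (the exact composition is not fully specified by the text above), and all names are illustrative rather than the authors' code.

```python
# A hedged single-head PyTorch sketch of one FST-Trans block (not the authors' code).
import torch
import torch.nn as nn

class FSTTransBlock(nn.Module):
    def __init__(self, d_model=128, d_ff=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Separate query/key projections for the spatial (joint) and temporal attention ...
        self.q_s = nn.Linear(d_model, d_model, bias=False)
        self.k_s = nn.Linear(d_model, d_model, bias=False)
        self.q_t = nn.Linear(d_model, d_model, bias=False)
        self.k_t = nn.Linear(d_model, d_model, bias=False)
        # ... but a single value projection shared by both attention coefficient matrices.
        self.v = nn.Linear(d_model, d_model, bias=False)
        # One feed-forward network with GELU (two weights W1, W2 and their biases).
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, f):
        # f: (L, J, d) -- L DCT coefficients (temporal axis), J joints (spatial axis).
        h = self.norm1(f)                               # layer normalization of the block input
        v = self.v(h)                                   # shared value vectors
        scale = h.shape[-1] ** 0.5

        # A_joint: attention across joints at each temporal index, shape (L, J, J).
        a_joint = torch.softmax(self.q_s(h) @ self.k_s(h).transpose(-1, -2) / scale, dim=-1)
        # A_temp: attention across DCT coefficients for each joint, shape (J, L, L).
        ht = h.transpose(0, 1)
        a_temp = torch.softmax(self.q_t(ht) @ self.k_t(ht).transpose(-1, -2) / scale, dim=-1)

        # Apply both coefficient matrices to the shared values, add the residual, normalize.
        spatial = a_joint @ v                                       # (L, J, d)
        temporal = (a_temp @ v.transpose(0, 1)).transpose(0, 1)     # (L, J, d)
        f_mid = self.norm2(f + spatial + temporal)

        # Feed-forward network with a second residual connection gives the block output.
        return f_mid + self.ff(f_mid)
```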
The structural difference between the FST-Trans block and a 2-layer vanilla Transformer is illustrated in Figure 3. To achieve spatial–temporal decoupled modeling [28,29], two Transformer blocks are usually required per layer, with spatial and temporal features extracted in different latent spaces. Inspired by these results [33,34,35], this work simplifies and merges the two vanilla Transformer blocks for better feature extraction and computational efficiency. In FST-Trans, the spatial attention mechanism shares the value vectors with the temporal attention. This design means the two attention coefficient matrices $A_{joint}$ and $A_{temp}$ act in the same latent space, fusing the spatial–temporal features in a more compact manner. If the feature is reshaped into a 2D tensor of size $(L \cdot J) \times d$, applying the attention mechanism over the merged $L \cdot J$ dimension can also achieve simultaneous extraction of spatial–temporal features; however, since the number of joints $J$ is usually pre-fixed, the size of the resulting attention coefficient matrix is mainly determined by the temporal dimension $L$. To make longer-term predictions, the expanded temporal dimension would increase runtime costs, affecting the real-time performance of the HRC system. There are 10 learnable parameter matrices ($W_{Q}$, $W_{K}$, $W_{V}$, $W_{1}$, and $W_{2}$, 2 each) in the 2 vanilla Transformer blocks; in the FST-Trans block, the value projection is shared and only 1 feed-forward neural network is kept, so the number of learnable weights is reduced to 7. The number of layer normalizations and activation functions is also halved, which reduces the overall computational complexity significantly.
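Continuing the FSTTransBlock sketch above, this weight-matrix count can be checked directly: 4 query/key projections, 1 shared value projection, and 2 feed-forward weights give the 7 matrices mentioned here, versus 2 × ($W_{Q}$, $W_{K}$, $W_{V}$, $W_{1}$, $W_{2}$) = 10 for two vanilla blocks.

```python
# Counting the 2D weight matrices in the FSTTransBlock sketch above.
block = FSTTransBlock(d_model=128, d_ff=256)
n_weight_matrices = sum(1 for _, p in block.named_parameters() if p.dim() == 2)
print(n_weight_matrices)  # 7: q_s, k_s, q_t, k_t, shared v, and the two feed-forward weights
```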