1. Introduction
Gaze direction is an important non-invasive cue to human behavior and intention, making it valuable in human-behavior-related fields such as human–computer interaction [1,2,3], autonomous driving [4,5], and virtual reality [6,7]. Gaze estimation tasks can be divided into three main categories: (1) 3D gaze direction estimation [8,9]; (2) gaze target estimation [10,11]; (3) gaze tracking [12,13]. In this paper, we focus on 3D gaze direction estimation.
3D gaze estimation methods can be divided into model-based gaze estimation methods and appearance-based gaze estimation methods, as shown in
Figure 1. Model-based gaze estimation methods [14,15,16,17] usually rely on the eye's geometric features, such as the eyeball shape and the pupil position, and use machine learning methods (for example, the support vector machine) to predict gaze direction; however, they usually require specific detection equipment, which makes them impractical in real-world environments. In contrast, appearance-based gaze estimation methods do not require specific detection equipment: they predict gaze direction directly from face or eye images, but they depend on large volumes of training data. Zhang et al. [18] first applied deep learning to gaze estimation; since then, many researchers [18,19,20,21] have proposed appearance-based deep learning gaze estimation methods and, more recently, a new gaze estimation pipeline has been proposed [22]. This new pipeline of appearance-based gaze estimation includes static feature extraction, dynamic feature extraction, and gaze estimation, and such methods are usually highly accurate, even when head pose and illumination change.
There are several significant problems with appearance-based deep learning gaze estimation methods: for example, the feature fusion stage has not been fully considered. Recent research [23,24] has proposed several feature fusion strategies to address this issue. In this paper, we focused on the feature fusion stage, applying the self-attention mechanism [25] to fuse the coarse-grained facial feature and the fine-grained eye features. The self-attention mechanism can effectively learn the interrelation among different features and retain the more important ones. To the best of our knowledge, ours is the first study to apply the self-attention mechanism to feature fusion in appearance-based deep learning gaze estimation. The self-attention mechanism [25] was first proposed in natural language processing (NLP), and the vision transformer (ViT) [26] later brought it into computer vision (CV). The ViT has two variants: the pure ViT and the hybrid ViT. The pure ViT splits the original image into patches, whereas the hybrid ViT uses a convolutional network to extract a feature map from the original image before splitting the feature map. Both variants are extremely competitive in image classification; Cheng et al. [27] applied them to gaze estimation and achieved highly accurate results. Compared to the hybrid ViT, the pure ViT does not perform quite as well, because slicing the facial image into several patches can destroy global information in the image, such as the head pose. To avoid destroying global information, we treated each entire feature as an independent patch. In particular, the face, left eye, and right eye features were extracted using convolutional neural networks, and each was flattened to a one-dimensional vector that serves as one patch. We implemented this operation in a static transformer module (STM): this module is described in detail in
Section 3.1.
Another key problem in appearance-based deep learning gaze estimation methods is the efficient extraction of dynamic features. Unlike static features, which are extracted from a single image, dynamic features must be extracted from an input video. Some research [22,28] has extended a basic static network with a dynamic network, for example using a long short-term memory (LSTM) approach [29]. However, these methods do not define dynamic features clearly; in other words, they obtain implicit dynamic features through RNNs. Liu et al. [30] proposed a differential network that predicts differential information to solve the personal calibration problem in gaze estimation. The differential information was defined as the difference between the gaze directions of two images, reflecting sight movement over continuous time. Inspired by this, we defined the dynamic feature to be sight movement, and we proposed a new RNN cell, the temporal differential module (TDM), to obtain it: this module can steadily extract effective dynamic features using differential information. The TDM is described in detail in
Section 3.2. The TDM is the core of the temporal differential network (TDN). We then combined the STM and the TDN into an end-to-end gaze estimation network—that is, a static transformer temporal differential network (STTDN)—to achieve better results.
To our knowledge, this is the first study to use self-attention mechanisms to fuse both coarse-grained facial features and fine-grained eye features. The contribution of our work can be summarized as follows:
- (1)
We proposed a novel STTDN for gaze estimation, which could achieve better accuracy, compared to state-of-the-art algorithms;
- (2)
We proposed the STM, to extract and fuse features. In the STM, we used a convolutional neural network to extract features from the face, left eye, and right eye images. Then, we treated each feature as an independent patch, to avoid the information loss caused by patch splitting in the pure ViT. Lastly, we used multi-head self-attention to fuse these patches.
- (3)
The proposed TDM was used to obtain the dynamic information from video, and we clearly defined the dynamic feature to be sight movement.
The rest of the paper is organized as follows: we discuss the related work in Section 2; we describe our proposed STTDN method in Section 3; we summarize and analyze the experimental results in Section 4; finally, the study conclusions are discussed in Section 5.
2. Related Work
In recent years, appearance-based deep learning methods have been the mainstream approach in the gaze estimation field. Compared to other traditional gaze estimation methods, appearance-based deep learning methods remain accurate when the head pose and illumination change. Zhang et al. [18] proposed the first deep learning model for gaze estimation, based on LeNet [31], which predicted the gaze direction from a grayscale image of the eye. Zhang et al. later extended the convolutional neural network to 13 layers [20], achieving more accurate results with a VGG16-inherited network [32]. Zhang et al. also considered facial features as inputs in their work [21]. Other studies [32,33,34] further developed appearance-based deep learning gaze estimation methods. Fischer et al. [33] used two VGG16 networks simultaneously, to extract features from the two eye images, and concatenated the two eye features for regression. Cheng et al. [34] established a four-channel network for gaze estimation, in which two channels extracted features from the left eye and right eye images, and the other two channels extracted facial features. Additionally, Chen et al. [19] applied dilated convolutions to extract high-dimensional features, the dilated convolution effectively increasing the receptive field while avoiding reduced image resolution. Krafka et al. [11] added face grid information to their model.
Although past appearance-based deep learning methods demonstrated high performance, they still leave room for improvement. For example, existing methods [33,34] simply concatenate the different extracted features (such as the left eye and right eye features) into one feature, without considering their internal relationships; the feature fusion stage is not fully considered in these methods. More recently, some research [23,24] has proposed feature fusion strategies. Bao et al. [23] applied the squeeze-and-excitation mechanism to fuse eye features, and used adaptive group normalization to adjust the fused eye features with facial features. Cheng et al. [24] proposed an attention module to fuse fine-grained eye features under the guidance of coarse-grained facial features.
As the self-attention mechanism developed in CV, Cheng et al. [27] first applied the ViT to the gaze estimation field, and achieved an advanced level of performance. The self-attention mechanism [25] was initially proposed in the NLP field. Dosovitskiy et al. [26] proposed the ViT to bring it into the CV field. Liu et al. [35] further developed the ViT into the Swin Transformer (Swin), applying a shifted-window technique to reduce the computing resource overhead. At this stage, the Swin method remains at the forefront of several CV tasks, including image classification [36], semantic segmentation [37], and object detection [38], among others. In this paper, we further develop the application of the self-attention mechanism in the gaze estimation field, by specifically applying it to the feature fusion process.
Apart from the static features extracted from images, another important kind of feature, the dynamic feature, can be obtained from video. Several studies [22,28,39,40] have proposed gaze estimation methods that incorporate dynamic features. Kellnhofer et al. [22] proposed a video-based gaze estimation model, and used an LSTM to obtain the dynamic features from a video. Zhou et al. [28] applied a Bi-LSTM to obtain dynamic features. Such research [22,28] applies RNNs to obtain dynamic features implicitly: at this stage, dynamic features are difficult to learn, because the extraction of ambiguously defined dynamic features is unstable. More recently, some research [39,40] has defined the extracted dynamic features explicitly. Wang et al. [39] defined optical flow as a dynamic feature, and then used it to reconstruct the three-dimensional face structure. Wang et al. [40] considered eye movement as a dynamic feature, and proposed a gaze tracking algorithm. Inspired by differential information, we also defined sight movement as a dynamic feature. Liu et al. [30] proposed a differential network to settle the personal calibration problem: in their work, the differential network predicted the differential information, which reflected eye movement. In this paper, we propose a new RNN cell, the TDM, to obtain this dynamic feature more efficiently.
3. Proposed Model and Algorithm
In this section, we describe in detail how we designed this end-to-end 3D gaze estimation network. The pipeline of the proposed STTDN is as shown in
Figure 2. The STTDN consisted of two main components: (1) the STM, which was the static network; (2) the TDN, which was the dynamic network. The network used input video frames to predict the gaze direction corresponding to the last input frame. At this stage, the STM was responsible for extracting the static features from a single face image and the left eye and right eye images corresponding to it, while the TDN was responsible for extracting the dynamic features from the static features. The flow chart of the proposed model is shown in
Figure 3. There were two basic states of our proposed model: back propagation and forward propagation. Forward propagation could be used to calculate the intermediate variables of each layer, and back propagation could be used to calculate the gradient of each layer.
3.1. The Design of the STM
The overall structure of the STM is as shown in
Figure 2, where we integrated two convolutional neural networks and a multi-head self-attention fusion block. The STM took a face image and its corresponding left eye and right eye images as input, and output static features.
Given a face image and its corresponding left eye and right eye images, where (H, W) denotes the resolution of the original image, C denotes the number of channels, and i denotes the ith image in the input video frames, we used two independent convolutional neural networks for feature extraction. We distinguished the left eye and right eye images in the feature extraction stage, because the two eyes contributed differently to the gaze direction, due to head pose and illumination. The first convolutional neural network extracted the facial feature; the second extracted the left eye and right eye features simultaneously. Each feature was then flattened to a one-dimensional vector, giving f_face, f_le, f_re ∈ R^d, where d was the feature dimension and d = 32 in our experiments.
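To make this stage concrete, the following PyTorch sketch shows one plausible arrangement of the two feature extractors, following the topology settings reported in Section 3.3 (ResNet18 backbones, 32 output feature maps, average pooling down to a one-dimensional vector). The 1 × 1 projection, the weight sharing between the two eyes, and the variable names are assumptions made for illustration, not exact implementation details.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class FeatureExtractor(nn.Module):
    """Backbone that maps an image to a flattened d-dimensional feature vector.

    A minimal sketch: a torchvision ResNet18 trunk, a 1x1 convolution that
    projects its 512-channel output down to d = 32 feature maps, and an
    average-pooling layer that yields a one-dimensional vector. The exact
    projection used in the paper is not specified, so this is an assumption.
    """
    def __init__(self, in_channels: int = 3, d: int = 32):
        super().__init__()
        trunk = resnet18(weights=None)
        # Allow grayscale eye crops as well as RGB face crops.
        trunk.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2,
                                padding=3, bias=False)
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])  # drop avgpool/fc
        self.proj = nn.Conv2d(512, d, kernel_size=1)   # 32 feature maps
        self.pool = nn.AdaptiveAvgPool2d(1)            # average-pooling down-sampling

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.proj(self.backbone(x))
        return self.pool(f).flatten(1)                 # (batch, d)

# Two independent extractors: one for the face, one shared by both eyes (assumed).
face_net = FeatureExtractor(in_channels=3, d=32)
eye_net = FeatureExtractor(in_channels=1, d=32)

face = torch.randn(4, 3, 224, 224)                        # face crops
left, right = torch.randn(4, 1, 36, 60), torch.randn(4, 1, 36, 60)
f_face, f_le, f_re = face_net(face), eye_net(left), eye_net(right)  # each (4, 32)
```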
Similar to the position encoding and patch embedding process in [26], we created the feature matrix F as follows: firstly, we created a learnable token f_token; secondly, we coded the feature positions, with f_token coded as position 0 and f_face, f_le, f_re coded as positions 1, 2, and 3, respectively; finally, we concatenated the four one-dimensional vectors into a feature matrix F ∈ R^{4×d}.
The feature matrix F could be further fused by using the multi-head self-attention fusion block. The core of the multi-head self-attention fusion block was the multi-head self-attention mechanism [25], which was a derivative of the self-attention mechanism. The self-attention mechanism used a multi-layer perceptron (MLP) to map the feature matrix F to Queries Q, Keys K, and Values V, where n was the batch size, d_k was the dimension of the Queries and Keys, and d_v was the dimension of the Values. For our experiment, d_k and d_v followed the transformer settings given in Section 3.3. The formulaic definition of self-attention can be expressed as follows:

Attention(Q, K, V) = softmax(Q K^T / √d_k) V
Unlike the self-attention mechanism, the multi-head self-attention mechanism used projection matrices W_i^Q, W_i^K, and W_i^V to project the feature matrix F into different representation subspaces. Moreover, a fusion matrix W^O fused the information extracted from the different representation subspaces, where i denoted the ith representation subspace, and h denoted the number of representation subspaces. In our experiment, we employed h = 4. The definition of the multi-head self-attention mechanism calculation can be expressed as follows:

MultiHead(F) = Concat(head_1, …, head_h) W^O

where head_i = Attention(F W_i^Q, F W_i^K, F W_i^V), and Concat(·) represents the concatenation operation.
We could stack multiple transformer layers [25] to implement the multi-head self-attention fusion block, the structure of which is shown in Figure 4. In Figure 4, L represents the number of transformer layers. An independent transformer layer comprises an MLP and a multi-head self-attention layer (MSA). The definition of the transformer calculation can be expressed as follows:

F'_l = MSA(LN(F_{l−1})) + F_{l−1}
F_l = MLP(LN(F'_l)) + F'_l

where F_{l−1} denotes the feature matrix received by each transformer layer, F'_l denotes the intermediate variable output within a single transformer layer, LN denotes layer normalization, and F_l denotes the feature matrix output by each transformer layer.
Finally, the fused feature produced by the multi-head self-attention fusion block is the independent output of the STM, and represents the static feature of each frame.
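The fusion block described above can be sketched in PyTorch as follows, using the hyperparameters given in Section 3.3 (six transformer layers, four heads, input dimension 32, inner dimension 128). Treating the output at position 0 as the STM static feature is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Multi-head self-attention fusion block of the STM (a sketch).

    The face, left-eye, and right-eye vectors are treated as whole patches,
    a learnable token is prepended at position 0, learnable position
    embeddings are added, and the 4 x d matrix passes through stacked
    transformer encoder layers.
    """
    def __init__(self, d: int = 32, heads: int = 4, layers: int = 6, inner: int = 128):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, d))   # learnable token, position 0
        self.pos = nn.Parameter(torch.zeros(1, 4, d))     # position embeddings 0..3
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=inner, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, f_face, f_le, f_re):
        n = f_face.size(0)
        token = self.token.expand(n, -1, -1)
        x = torch.stack([f_face, f_le, f_re], dim=1)      # (n, 3, d)
        x = torch.cat([token, x], dim=1) + self.pos       # (n, 4, d)
        x = self.encoder(x)
        # Assumption: the fused token at position 0 serves as the STM static feature.
        return x[:, 0]                                    # (n, d)

f_face, f_le, f_re = (torch.randn(4, 32) for _ in range(3))
fusion = FusionBlock()
static_feature = fusion(f_face, f_le, f_re)               # (4, 32)
```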
3.2. Design of the TDN
In this section, we will take a closer look at the TDN in the STTDN. The structure of the TDN was as shown in
Figure 5. In the TDN, we used five RNN cells (TDM) to obtain the dynamic features from five input video frames, and applied a fully connected layer to predict the gaze direction of the fifth input video frame.
The TDM was the core component in the TDN, its structure being as shown in
Figure 6. In the TDM, we defined the dynamic feature explicitly as sight movement, using differential information. Liu et al. [30] initially proposed the concept of differential information, which represents the difference between the gaze directions of two eye images. We generalized this definition from eye images to face images, and from a single image pair to consecutive video frames, so that the dynamic feature is the sight movement between adjacent frames. A generalized definition of the dynamic feature can be expressed as follows:

Δg_{i,i+1} = g_{i+1} − g_i

where Δg_{i,i+1} denotes the sight movement from the ith frame to the (i+1)th frame, g_i denotes the gaze direction of the ith frame, and g_{i+1} denotes the gaze direction of the (i+1)th frame.
We could apply Δg in the TDM to extract the dynamic features from the video frames. Compared to the LSTM cell, we kept only two gates in the TDM: the forgotten gate and the output gate, the coefficients of both gates being determined by the differential information computed in (6). The proposed algorithm of the TDM is shown in Algorithm 1:
Algorithm 1: TDM
Input: Feature vectors of 5 frames
Output: The state of the hidden layer
1: procedure TDM(feature vectors)
2:   Initialize two zero vectors
3:   repeat
4:     for each input frame do
5:       Calculate the differential information, based on the previous hidden state and the current feature vector, using (6)
6:       Calculate the forgotten gate and the output gate, based on the differential information, using (7) and (8)
7:       Update the cell state, based on the forgotten gate, using (9)
8:       Update the hidden state, based on the output gate, using (10)
9:     end for
10:   until the eigenvector calculation is completed
11:   return the hidden state
12: end procedure
In the definitions of (6)–(10), each W represents a weight matrix and each b represents a bias matrix.
Finally, the hidden state output by the last TDM in the TDN was fed to a fully connected layer, which was implemented after the last TDM and used to predict the gaze direction.
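Because (6)–(10) are referenced above without being reproduced here, the following sketch gives one plausible reading of the TDM and the TDN: a differential term is computed from the previous hidden state and the current static feature, and this term drives LSTM-style forgotten and output gates. The concrete gate equations, the candidate-state term, and the two-dimensional (pitch, yaw) output are assumptions for illustration, not the exact definitions of (6)–(10).

```python
import torch
import torch.nn as nn

class TDMCell(nn.Module):
    """A sketch of the temporal differential module (TDM).

    Assumption: the differential information is predicted from the previous
    hidden state and the current static feature, and the two gates (forgotten
    and output) are computed from that differential term, mirroring (6)-(10)
    in spirit rather than verbatim.
    """
    def __init__(self, d: int = 32):
        super().__init__()
        self.diff = nn.Linear(2 * d, d)    # cf. (6): differential information
        self.forget = nn.Linear(d, d)      # cf. (7): forgotten gate coefficients
        self.output = nn.Linear(d, d)      # cf. (8): output gate coefficients
        self.cand = nn.Linear(2 * d, d)    # candidate state (assumed)

    def forward(self, x, h, c):
        delta = torch.tanh(self.diff(torch.cat([h, x], dim=-1)))   # sight-movement term
        f = torch.sigmoid(self.forget(delta))                      # forgotten gate
        o = torch.sigmoid(self.output(delta))                      # output gate
        c = f * c + (1 - f) * torch.tanh(self.cand(torch.cat([h, x], dim=-1)))  # cf. (9)
        h = o * torch.tanh(c)                                       # cf. (10)
        return h, c

class TDN(nn.Module):
    """Temporal differential network: 5 TDM steps followed by a regression head."""
    def __init__(self, d: int = 32):
        super().__init__()
        self.cell = TDMCell(d)
        self.fc = nn.Linear(d, 2)          # assumed 2-D (pitch, yaw) gaze output

    def forward(self, static_features):    # (batch, 5, d) static features from the STM
        n, t, d = static_features.shape
        h = torch.zeros(n, d, device=static_features.device)
        c = torch.zeros(n, d, device=static_features.device)
        for i in range(t):
            h, c = self.cell(static_features[:, i], h, c)
        return self.fc(h)                  # gaze direction of the last frame

tdn = TDN()
gaze = tdn(torch.randn(4, 5, 32))          # (4, 2)
```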
3.3. Topology of the STTDN Network
In this section, we supply additional topology details. We integrated the convolutional neural network and the multi-head self-attention fusion block in the STM. We implemented ResNet18 [41] as the convolutional neural network for feature extraction, with 32 output feature maps; two ResNet18 networks were used to extract features from the eye and face images, respectively. An additional average-pooling down-sampling layer followed the convolutional layers, to ensure a consistent feature dimension. We then stacked six transformer encoders as the multi-head self-attention fusion block. The number of heads of each transformer encoder was 4, the input dimension of each transformer encoder was 32, and the inner dimension of each transformer encoder was 128. The TDM had fewer parameters: the hidden layer dimension was 32, and the number of layers was 1.
4. Experimental Results
In this section, we discuss the experimental performance of the proposed STTDN on two public datasets, MPIIFaceGaze [20] and Eyediap [42], and the effectiveness of the STM and the TDM. The remainder of this section is organized as follows: first, we introduce the two public datasets used in this study (Section 4.1) and the evaluation metric (Section 4.2); then, we compare our proposed method with state-of-the-art methods in Section 4.3; in Section 4.5 and Section 4.6, we evaluate the effectiveness of the two main ideas in the STTDN, that is, (1) the way the STM fuses features, and (2) the way the TDN extracts dynamic features; in Section 4.7, we discuss an ablation study conducted for the STTDN.
To allow readers to reproduce our proposed architecture and conduct further research, we provide several important parameters related to this study. We implemented our model using PyTorch, and evaluated it on two TITAN RTX platforms. The optimizer used was AdamW [43], the loss function was L1 loss, the number of epochs was set to 30, the initial learning rate was set to
, and the number of video frame inputs was set to 5.
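A minimal training-loop sketch with these settings (AdamW, L1 loss, 30 epochs, five-frame inputs) is given below; the data loader, the model object, and the learning rate value are placeholders to be supplied by the reader.

```python
import torch
import torch.nn as nn

def train(model, train_loader, lr, device="cuda", epochs=30):
    """Sketch of the training procedure: AdamW optimizer, L1 loss, 30 epochs.

    `train_loader` is assumed to yield (frames, gaze) pairs, where `frames`
    packs 5 consecutive face/eye crops and `gaze` is the gaze direction of
    the last frame; `lr` is the initial learning rate.
    """
    model = model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    criterion = nn.L1Loss()
    for epoch in range(epochs):
        model.train()
        running = 0.0
        for frames, gaze in train_loader:
            frames, gaze = frames.to(device), gaze.to(device)
            optimizer.zero_grad()
            pred = model(frames)            # prediction for the 5th frame
            loss = criterion(pred, gaze)
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean L1 loss = {running / len(train_loader):.4f}")
```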
4.1. Datasets
To better evaluate the performance of the STTDN, we used two public datasets: (i) MPIIFaceGaze [20] and (ii) Eyediap [42]. Figure 7 shows example face images from the two public datasets, together with their corresponding left eye and right eye images.
The MPIIFaceGaze dataset [20] was proposed by Zhang et al., and is the most popular dataset for appearance-based gaze estimation methods. It contains a total of 213,659 images collected from 15 subjects over several months of daily life, without head pose constraints. Because the images come from real-world environments, the dataset covers abundant illumination conditions and head poses. We considered the two face images with the shortest time interval to be two adjacent frames.
The Eyediap dataset [42] contains 94 video clips from 16 participants recorded in experimental scenes, with three types of visual target: continuous moving targets, screen targets, and floating balls. Each participant was recorded with six static head postures and six free head postures. As the data were collected in a laboratory environment, the images lack illumination variation. Because 2 subjects lacked screen-target video, we used images from 14 subjects in our study.
Following the dataset pre-processing procedure in [32], we cropped RGB face images with a resolution of 224 × 224 pixels, as well as grayscale left eye and right eye images with a resolution of 36 × 60 pixels.
4.2. Evaluation Metric
We used the leave-one-person-out criterion as the experimental evaluation protocol, a common choice in gaze estimation studies. In each run, 14 subjects were used as the training set and the remaining subject as the validation set; each of the 15 subjects was selected as the validation set in turn, and the average error over the 15 runs was taken as the model performance. We used the angular error as the evaluation metric: the greater the angular error, the lower the accuracy of the model. The definition of the angular error can be expressed as follows:

error = arccos( (g · ĝ) / (‖g‖ ‖ĝ‖) )

where g denotes the actual gaze direction, and ĝ denotes the estimated gaze direction.
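The angular error can be computed directly from the two gaze vectors, as in the following sketch; the pitch/yaw-to-vector conversion shown here uses a common convention and is included only for illustration.

```python
import numpy as np

def angular_error(gaze_true: np.ndarray, gaze_pred: np.ndarray) -> float:
    """Angle (in degrees) between the actual and estimated gaze directions.

    Both inputs are 3-D direction vectors; normalization makes the result
    independent of vector length.
    """
    g = gaze_true / np.linalg.norm(gaze_true)
    g_hat = gaze_pred / np.linalg.norm(gaze_pred)
    cos = np.clip(np.dot(g, g_hat), -1.0, 1.0)   # guard against rounding outside [-1, 1]
    return float(np.degrees(np.arccos(cos)))

def pitchyaw_to_vector(pitch: float, yaw: float) -> np.ndarray:
    """Convert (pitch, yaw) angles in radians to a 3-D gaze direction vector
    (a common convention in gaze estimation; the exact convention of the
    datasets is not reproduced here)."""
    return np.array([-np.cos(pitch) * np.sin(yaw),
                     -np.sin(pitch),
                     -np.cos(pitch) * np.cos(yaw)])

# Example: a prediction 5 degrees off in yaw yields an error close to 5 degrees.
err = angular_error(pitchyaw_to_vector(0.0, 0.0),
                    pitchyaw_to_vector(0.0, np.radians(5.0)))
print(f"angular error = {err:.2f} degrees")
```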
4.3. Comparison with State-of-the-Art Methods
To evaluate the performance of the model, we selected several networks that had exhibited advanced performance for comparison: Hybrid-ViT [27]; Gaze360 [22]; iTracker [11]; DilatedNet [19]; RTGene [33]; AFFNet [23]; and CANet [24]. We recorded the performance of the STTDN and the comparison methods, as shown in Table 1, with a more intuitive representation given in Figure 8.
The STTDN achieved an angular error of 3.73° on the MPIIFaceGaze dataset [20], which was highly competitive with the previous best method, AFFNet [23], which also achieved an angular error of 3.73° on MPIIFaceGaze. The STTDN achieved an angular error of 5.02° on Eyediap [42], an improvement of 2.9% over the previous best method, Hybrid-ViT [27], which achieved an angular error of 5.17° on the Eyediap dataset. It is evident that our model outperformed the other models, as shown in Figure 8.
We also analyzed the performance errors from two different perspectives: (1) the angular error of the STTDN for different experiment participants; (2) the angular error distribution of the STTDN at different gaze directions.
4.4. Performance Analysis
4.4.1. The Angular Error of the STTDN on Different Experiment Participants
First, we analyzed the error of the STTDN on different experiment participants. We recorded the angular error of each experiment participant, as shown in
Figure 9. As shown in Figure 9a, the STTDN performed best on person ID p0 (2.4°), and worst on person ID p14 (4.74°), the difference being 2.34° (4.74° − 2.4°). In Figure 9b, the STTDN performed best on person ID p14 (3.33°), and worst on person ID p7 (7.57°), the difference being 4.24° (7.57° − 3.33°).
There was still a large difference among the experiment participants, which prevented our model from performing better. This problem in the gaze estimation field is called the personal calibration problem. The calibration problem can be considered as a domain adaptation problem, where the training set is the source domain and the test set is the target domain. The proposed method did not use any calibration sample from the target domain: thus, it did not solve the personal calibration problem. For example, the facial contrast of person ID p8 in Figure 9b was quite different from the others, leading to a higher angular error. Moreover, the personal calibration problem exists in other gaze estimation methods, too. Compared to the Eyediap dataset [42], the difference was smaller on the MPIIFaceGaze dataset [20] (4.24° > 2.34°). We attribute this mainly to the following: the MPIIFaceGaze dataset has a larger data scale and more trainable experiment participants, and it also has richer illumination conditions. Improving the dataset could therefore alleviate the personal calibration problem, to a certain extent.
4.4.2. The Angular Error Distribution of the STTDN on Different Angles
We analyzed the angular error distribution of the STTDN at different gaze directions. We recorded the distribution of the gaze direction for the two datasets, as shown in
Figure 10. Figure 11 shows the recorded distribution of the angular error of the STTDN for different gaze directions. The proposed method performed poorly at some extreme gaze directions, because the training dataset lacked data samples at extreme angles; conversely, the more concentrated the gaze direction distribution, the better the proposed model performed.
4.5. The Effectiveness of the STM
We applied the MSA method to the STM, to fuse the coarse-grained facial features and the fine-grained eye features. We evaluated the effectiveness of the fusion feature in this subsection. Specifically, we evaluated the effectiveness of: (1) adding fine-grained eye features; (2) fusing two differently-grained features, using the MSA method.
To evaluate the effectiveness of the fusion feature, we appended one fully connected layer to the STM, forming an end-to-end gaze estimation network called the static transformer network (STN). We evaluated the angular error of the STN on the two public datasets, MPIIFaceGaze [20] and Eyediap [42], and recorded the results, as shown in Table 2. We also set up two comparison models, the STN-W/O self-attention and the STN-W/O eye patches, whose angular errors are also recorded in Table 2. The specific implementation details of the STN-W/O self-attention and the STN-W/O eye patches were as follows:
- (1)
STN-W/O self-attention: we removed the multi-head self-attention fusion block from the STN, to obtain the STN-W/O self-attention. At this stage, the fine-grained eye features and the coarse-grained facial feature were concatenated and directly input to the fully connected layer, which then predicted the gaze direction.
- (2)
STN-W/O eye patches: this model was approximately the same as the hybrid-ViT network architecture [27]. The hybrid-ViT [27] used only the face image as its input, and its main structure also comprised two parts: ResNet18 and the transformer layers. The hybrid-ViT used ResNet18 to extract a feature map from the face image, split the feature map into patches, and finally input these patches into the transformer layers. The hyperparameters used in this model were kept consistent.
Table 2 shows that the STN achieved the best performance on both datasets, that is, an angular error of 5.07° on the Eyediap dataset and 3.75° on the MPIIFaceGaze dataset. After removing the multi-head self-attention fusion block from the STN, the angular error of the STN-W/O self-attention increased by 0.24° on MPIIFaceGaze and by 0.02° on the Eyediap dataset, demonstrating the effectiveness of the multi-head self-attention fusion block. Compared to the STN, the STN-W/O eye patches also exhibited different degrees of degradation on the two public datasets: specifically, its angular error increased by 0.25° on the MPIIFaceGaze dataset and by 0.17° on the Eyediap dataset, demonstrating the effectiveness of adding fine-grained eye features.
4.6. The Effectiveness of the TDN
In this section, we explore the effectiveness of using dynamic features. We set up two comparison methods, STM–LSTM and STM–BiLSTM: specifically, we replaced the TDM with an LSTM and a BiLSTM to obtain STM–LSTM and STM–BiLSTM, respectively. The number of input video frames was set to five. We recorded the angular error of these models, as shown in Table 3. The STTDN performed best among these models. Compared to STM–LSTM, the STTDN improved by 0.7% and 2.7% on the MPIIFaceGaze dataset [20] and the Eyediap dataset [42], respectively. Compared to STM–BiLSTM, the STTDN improved by 4.6% and 4.1% on the MPIIFaceGaze dataset [20] and the Eyediap dataset [42], respectively. This indicates that the TDN extracts dynamic features better than these common RNNs.
We also discovered another critical issue: compared to the STN, which did not use dynamic features, STM–LSTM and STM–BiLSTM showed degradation on both the MPIIFaceGaze dataset [20] and the Eyediap dataset [42]. The STTDN, STM–LSTM, and STM–BiLSTM all used an RNN to extract dynamic features and predicted the gaze direction based on those dynamic features; in other words, all three models added dynamic features. Specifically, the STN achieved an angular error of 3.75° on the MPIIFaceGaze dataset in Table 2, while STM–BiLSTM and STM–LSTM achieved 3.76° and 3.91°, respectively; the STN achieved an angular error of 5.07° on the Eyediap dataset in Table 2, while STM–BiLSTM and STM–LSTM reached 5.16° and 5.24°, respectively. However, this does not prove that adding dynamic features is ineffective, because the proposed STTDN performed better than the STN on both datasets. A reliable explanation for this degradation is that an RNN is difficult to train when it is used to extract ambiguously defined dynamic features; this phenomenon was more obvious in non-lightweight models. Thus, the proposed TDN can extract better dynamic features and predict the gaze direction more accurately.
4.7. Ablation Study
We conducted ablation experiments to evaluate the effectiveness of the main modules—(1) the STM and (2) the TDN—in the STTDN. For this purpose, we set up two variant models—namely, the STTDN-W/O STM and STTDN-W/O TDN:
- (1)
The STTDN-W/O STM: unlike the STTDN, we replaced the Resnet18 structure in the STM with four 3 × 3 convolutional layers and a global average down-sampling layer, and replaced the multi-head self-attention fusion block with a fully connected layer.
- (2)
The STTDN-W/O TDN: unlike the STTDN, we removed the TDN from the STTDN. With the removal of the TDN, an external fully connected layer was implemented after the STM, to predict the gaze direction.
We recorded the angular error of these variants and the STN on the two public datasets, as shown in Table 4. The angular error of the STTDN reached 3.73° on the MPIIFaceGaze dataset [20] and 5.02° on the Eyediap dataset [42] when the number of input frames was set to five. After removing the STM, the angular error of the STTDN-W/O STM reached 4.67° on the MPIIFaceGaze dataset [20] and 5.85° on the Eyediap dataset [42]. Both the STTDN-W/O STM and the STTDN-W/O TDN exhibited larger angular errors than the STTDN, demonstrating the effectiveness of the STM and the TDN in the STTDN.
4.8. Computational Complexity
Our proposed model included a convolution structure, a transformer structure, and an RNN structure. In order to compute the whole model complexity, we defined the complexity calculation formulas of the three main structures. The complexity of the convolution structure can be expressed as follows:

O( Σ_{l=1}^{D} M_l² · K_l² · C_{l−1} · C_l )    (12)

where D is the network depth, l is the lth convolution layer, M_l is the side length of the feature map output by the lth convolution layer, K_l is the kernel side length of the lth convolution layer, C_l is the number of channels output by the lth convolution layer, and C_{l−1} is the number of channels output by the (l−1)th convolution layer.
The complexity of the RNN structure can be expressed as follows:

O( Σ_{l=1}^{D} n_l · d_l² )    (13)

where D is the network depth, l is the lth RNN layer, n_l is the number of RNN cells in the lth RNN layer, and d_l is the input feature dimension of the lth RNN layer.
The complexity of the transformer structure can be expressed as follows:

O( Σ_{l=1}^{D} n_l² · d_l )    (14)

where D is the network depth, l is the lth transformer layer, n_l is the number of patches input to the lth transformer layer, and d_l is the input feature dimension of the lth transformer layer.
We could compute the time complexity of the convolution part of our proposed model using (12), of the multi-head self-attention fusion module using (14), and of the TDN using (13). The overall time complexity of our proposed model combined these three terms, where N was the number of input frames. Compared to a pure convolutional network (e.g., the STN-W/O self-attention), the computational complexity of the proposed model only increased when (1) using the multi-head self-attention mechanism to fuse the different-grained features, and (2) extracting dynamic features. Fortunately, the complexity did not grow by a factor of N when using N frames as input: the fusion features of the most recent N-1 frames were saved during operation, so they did not need to be recomputed for the next prediction.
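Under the caching scheme just described, one prediction requires a single static pass plus N TDM steps; a sketch of the combined per-prediction cost, using the symbols of (12)–(14) (this combined form is our reading of the above, stated here only as an approximation), is:

T_STTDN ≈ O( Σ_l M_l² K_l² C_{l−1} C_l ) + O( Σ_l n_l² d_l ) + N · O( d² )

where the first term is the convolutional feature extraction (12), the second is the multi-head self-attention fusion (14), and the third is the N TDM steps of the TDN (13) with hidden dimension d.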
5. Conclusions
This study proposed a novel gaze estimation network, the STTDN, which integrated the STM and the TDM. We provided the STM, a multi-head self-attention fusion strategy, for fusing fine-grained eye features and coarse-grained facial features. Additionally, we defined a new dynamic feature (sight movement) and proposed an innovative RNN cell, the TDM, to obtain it. In experimental evaluation, the STTDN demonstrated strong competitiveness compared to state-of-the-art methods on two publicly available datasets: MPIIFaceGaze and Eyediap. In future work, we will apply contrastive learning to gaze estimation, to address the personal calibration challenge and to increase performance at extreme gaze angles. In addition, we will apply the proposed STTDN to cognitive workload estimation, which represents the occupancy rate of human mental resources under working conditions.