1. Introduction
Facial expression recognition (FER) is a research topic that continues to receive attention in the fields of computer vision and human–computer interaction (HCI). Seven basic emotion categories were first proposed by Ekman and Friesen [1], and many discrete-domain FER studies based on this criterion have since been performed. Recently, various methods for recognizing sophisticated emotional intensity in the continuous domain of arousal and valence have also been proposed [2]. FER has various applications, such as in-vehicle driver monitoring, facial modeling [3,4], psychology and digital entertainment [5,6]. With recent advances in machine learning, such as deep learning, considerable progress has been made in FER as well as in related areas such as image encryption and fault diagnosis [7,8]. However, robust operation is still difficult because real-life environments contain various performance degradation factors such as occlusion, complex motion and low illumination. The purpose of this study is to propose an FER algorithm that works robustly even in such wild situations.
Most previous studies on FER have adopted discrete emotion categories such as anger, disgust, fear, happiness, sadness, surprise and neutral, while some FER algorithms have used a continuous domain consisting of valence and arousal. This paper employs the universal discrete-domain emotion categories.
Early FER techniques focused on classifying human emotions in still images, but research on analyzing the emotions of people in videos is becoming increasingly popular. There are several datasets for training or verifying video-based FER techniques. For example, the extended Cohn–Kanade (CK+) dataset was collected in a relatively controlled environment and has been used to evaluate the performance of many algorithms such as [9,10,11]. As shown in Figure 1a, CK+ includes videos in which subjects express an artificial emotion. The frontal faces were acquired in an environment with constant illumination, so CK+ cannot be used for designing FER algorithms that must take into account changes of head pose and illumination, as shown in Figure 1b. Thus, robust FER techniques that work well even in wild videos such as Figure 1b are required.
For effective FER in wild environments, we previously proposed a visual scene-aware hybrid neural network (VSHNN) that can effectively combine global and local features from video sequences [12]. The VSHNN consists of a three-dimensional (3D) convolutional neural network (CNN), a 2D CNN and an RNN. In analyzing facial features in a video, the 3D CNN and 2D CNN are useful for learning temporal information and spatial information, respectively, and the RNN can learn the correlation of the extracted features. Meanwhile, it has been reported that multi-task analysis is more effective for feature analysis than a single task [13]. Thus, this paper assumes that jointly analyzing facial features from multiple aspects can perform much better than analysis from a single aspect. Based on this assumption, this paper analyzes the temporal and spatial information of an input video, and their association, by using a 3D CNN, a 2D CNN and an RNN, which together analyze facial features from various aspects. The 3D CNN with an auxiliary classifier extracts overall spatiotemporal features from a given video. Next, the 2D CNN (a fine-tuned DenseNet) extracts local features such as small details from each frame of the video. The RNN properly fuses the two latent features, and then the final classification is performed.
To improve the ultimate performance of FER, this paper extends the architecture of the VSHNN and presents a new multi-modal neural network that can exploit facial landmark features as well as video features. To this end, we propose three methods. First, to maximize the performance of the RNN in VSHNN, we propose a frame substitution (FS) module that replaces the latent features of less important frames with those of important frames in terms of inter-frame correlation. Second, we propose effective facial landmark features based on a new relationship between adjacent frames. Third, we propose a new multi-modal fusion method that can effectively integrate video and facial landmark information at the feature level. The multi-modal fusion further improves performance by applying attention based on the characteristics of each modality. The contributions of this paper are summarized as follows.
We propose an FS module to select meaningful frames and show that the FS module can enhance the spatial appearance features of the video modality during learning.
We propose a method to extract features based on facial landmark information to help FER analysis. Because this method reflects correlation between frames, it can extract sophisticated landmark features.
We propose an attention-based multi-modal fusion method that can smartly unite video modality and facial landmark modality.
The rest of the paper is organized as follows. Section 2 introduces existing studies on various features and modalities that are used in the proposed method. Section 3 and Section 4 describe the proposed method in detail. Section 5 presents the performance verification of the proposed scheme, and Section 6 concludes the paper.
3. Visual Scene-Aware Hybrid Neural Network
This section proposes an extended architecture of VSHNN. VSHNN is basically a scene-aware hybrid neural network that combines CNN-RNN and a 3D CNN to capture global context information effectively [12]. The 3D CNN processes spatial-domain and temporal-domain information at the same time, so it encodes the overall context information of a given video. Therefore, fusing the 3D CNN structure with a typical CNN-RNN structure makes it possible to learn based on global context information. In order to select only those frames that are useful for learning the video modality, we newly propose a frame substitution (FS) module that replaces the CNN features of less important frames with those of important frames. The importance of a frame is defined based on the correlation between the features of frames. As a result, only the CNN features of meaningful frames are input into the RNN in VSHNN, so global context information can be considered more effectively during the learning process.
As shown in Figure 2, the VSHNN basically follows the CNN-RNN structure. We use DenseNet-FC, which is based on the well-known 2D CNN DenseNet, as the 2D CNN, and a scene-aware RNN (SARNN) as the RNN module. The local features of each frame extracted by DenseNet-FC are called "appearance features," and the global features extracted by the 3D CNN are called "scene features." VSHNN first extracts appearance features that contain information about small details using the fine-tuned DenseNet-FC, which has learned sufficient emotion knowledge for each frame. Thanks to the proposed FS module, the appearance features consist only of features of meaningful frames. Next, the 3D CNN extracts scene features consisting of temporal and spatial information. Finally, emotion classification is performed by using the SARNN to fuse the two types of information effectively.
3.1. Pre-Processing
As shown in Figure 2, each piece of input data is pre-processed in advance. The pre-processing for VSHNN consists of face detection, alignment and frame interpolation. Wild datasets such as AFEW include a number of videos with low illumination and occlusion, so we use the multi-task cascaded CNN (MTCNN) [38] for robust face detection. After performing face detection and alignment with MTCNN, the detected area is properly cropped and used as the input image for the following networks. If the input video is too short to obtain temporal features, we increase its length to at least 16 frames using separable convolution-based frame interpolation [39].
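A minimal pre-processing sketch is given below. It assumes the MTCNN implementation from the facenet-pytorch package as a stand-in for the detector of [38], and approximates the separable convolution-based interpolation network of [39] by simple frame duplication; the function name and parameter values are illustrative only.

```python
# Pre-processing sketch: face detection/cropping and clip-length padding.
# Assumptions: facenet-pytorch's MTCNN replaces [38]; frame duplication
# replaces the interpolation network of [39].
import torch
from facenet_pytorch import MTCNN

mtcnn = MTCNN(image_size=224, margin=20, post_process=True)

def preprocess(frames, min_len=16):
    """frames: list of PIL.Image video frames -> tensor of shape (T, 3, 224, 224)."""
    faces = []
    for img in frames:
        face = mtcnn(img)              # detect and crop the most prominent face
        if face is not None:           # skip frames where no face is found
            faces.append(face)
    if not faces:
        return None
    while len(faces) < min_len:        # pad short clips to at least 16 frames
        faces.append(faces[-1].clone())
    return torch.stack(faces)
```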
3.2. Fine-Tuned DenseNet-FC
Recently, transfer learning has been attracting much attention in the field of deep learning as a solution to insufficient training data. If transfer learning is applied, the model learned in the source domain can improve the performance in the target domain [40]. We use a fine-tuned 2D CNN as a transfer learning tool to utilize learned knowledge of facial expression.
Transfer learning has previously been realized based on the VGG-face model [27,29,30]. Instead of VGG, we use the state-of-the-art (SOTA) classification model DenseNet [41] because it can extract more distinctive features for data analysis. The original DenseNet [41] averages the feature maps through the global average pooling (GAP) layer. To input feature vectors with rich information into the main network, we replace the GAP layer with two fully connected (FC) layers prior to training. This modified network is called DenseNet-FC.
DenseNet-FC is fine-tuned in two steps. First, DenseNet is pre-trained with the ImageNet dataset [42]. Second, DenseNet-FC is re-trained with the FER2013 dataset [43] to facilitate the analysis of facial expression changes. As a result, the fine-tuned DenseNet-FC can work for any test video in the AFEW dataset at the inference stage. The dimension of the appearance feature is fixed to 4096.
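The following sketch illustrates one way to realize DenseNet-FC. It assumes a DenseNet-121 backbone from torchvision, 224 × 224 inputs and seven FER2013 classes; the layer sizes and class name are illustrative, not the authors' exact configuration.

```python
# DenseNet-FC sketch: DenseNet trunk with the GAP head replaced by two FC layers
# producing a 4096-dim appearance feature (assumed DenseNet-121, 224x224 input).
import torch
import torch.nn as nn
from torchvision import models

class DenseNetFC(nn.Module):
    def __init__(self, num_classes=7, feat_dim=4096):
        super().__init__()
        backbone = models.densenet121(weights="IMAGENET1K_V1")  # step 1: ImageNet pre-training
        self.features = backbone.features                # keep the convolutional trunk
        self.fc = nn.Sequential(                          # replace GAP with two FC layers
            nn.Flatten(),                                 # 1024 x 7 x 7 -> 50176
            nn.Linear(1024 * 7 * 7, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Linear(feat_dim, num_classes)  # head used for FER2013 re-training

    def forward(self, x, return_feature=False):
        f = self.fc(self.features(x))                     # 4096-dim appearance feature
        return f if return_feature else self.classifier(f)
```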
3.3. 3D Convolutional Neural Network (3D CNN)
A 3D CNN plays a role in grasping the overall visual scene of an input video sequence because it receives the entire sequence as input. The 3D CNN can be one of three well-known networks: C3D [44], ResNet3D or ResNeXt3D [45]. The 3D CNN is trained via transfer learning. For instance, C3D is usually pre-trained using the Sports-1M dataset [46], and the Kinetics dataset [47] is used for pre-training ResNet3D and ResNeXt3D. Note that pre-training is performed with action datasets because more dynamic videos provide richer scene features. The pre-trained 3D CNN is then fine-tuned under the assumption that the convolution layers, which act as spatiotemporal filters, have already learned sufficient motion information. Therefore, we freeze the convolution layers during fine-tuning; in other words, only the top FC layers are trained so as to refine the scene features in the FC layers.
Next, we add an auxiliary classifier to the last layer of the 3D CNN [48]. The auxiliary classifier makes the learning more stable because of its regularization effect. As in Figure 3, the structure of the auxiliary classifier consists of FC layers, batch normalization [49], dropout and ReLU. Moreover, the softmax function is used in the loss function.
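A sketch of the two pieces described above is shown below. It assumes the convolutional trunk of the pre-trained 3D CNN is exposed as `backbone3d.features`, a 512-dimensional scene feature and seven emotion classes; these names and sizes are assumptions for illustration.

```python
# Freezing the 3D-CNN convolution layers and attaching an auxiliary classifier
# (assumed attribute name `features`, 512-dim scene feature, 7 classes).
import torch.nn as nn

def freeze_conv_layers(backbone3d):
    # keep the spatio-temporal filters fixed; only the top FC layers are trained
    for p in backbone3d.features.parameters():
        p.requires_grad = False

class AuxiliaryClassifier(nn.Module):
    def __init__(self, in_dim=512, num_classes=7, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(inplace=True),
            nn.Dropout(p_drop),
            nn.Linear(256, num_classes),   # softmax is applied inside the loss
        )

    def forward(self, scene_feature):
        return self.net(scene_feature)     # logits for the auxiliary loss
```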
3.4. Frame Substitution Module
Conventional FER studies for video sequences [29] employed window-based data pre-processing to select meaningful frames. However, such window-based pre-processing is essentially passive and cannot adaptively select frames that help FER during the learning process. Therefore, we insert an FS module that reflects the statistical characteristics of the video sequence between the fine-tuned DenseNet-FC and the 3D CNN, which makes it possible to choose useful frames during the learning process.
On the other hand, the FS module may inevitably cause duplicate information in the sequence or change the original temporal pattern. However, a facial expression in a video clip occurs only in a specific section of the clip. In other words, the RNN does not consider all the frames of the video clip, but mainly learns from frames in which the expression is activated. Therefore, the FS module has a positive effect on performance because it increases the number of activated frames that the RNN considers. Moreover, the length of a training video sequence, that is, the time interval for recognizing one facial expression, is usually less than 1 s. During such a short time, different facial expressions seldom coexist. Therefore, it is unlikely that the changes in temporal patterns caused by the FS module will have a negative impact on the recognition of a certain emotion. Experimental evidence for this is given in Section 5. The detailed operation of the FS module is as follows.
Let $X = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ be the appearance features obtained from DenseNet-FC (see Figure 2), where $\mathbf{x}_n \in \mathbb{R}^{D}$. D and N are the feature dimension and the number of frames in the input video, respectively. To calculate the inter-frame correlation information, we define $Y = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_N]$ based on X, where $\mathbf{y}_n = \mathbf{x}_n$ is a column vector corresponding to the n-th frame. Based on Y, we calculate the correlation matrix R:

$$R = \begin{bmatrix} r_{11} & \cdots & r_{1N} \\ \vdots & \ddots & \vdots \\ r_{N1} & \cdots & r_{NN} \end{bmatrix} \tag{1}$$

In Equation (1), r_jk is the Pearson correlation coefficient between variables $\mathbf{y}_j$ and $\mathbf{y}_k$:

$$r_{jk} = \frac{\sum_{d=1}^{D} (y_{jd} - \bar{y}_j)(y_{kd} - \bar{y}_k)}{\sqrt{\sum_{d=1}^{D} (y_{jd} - \bar{y}_j)^{2}}\,\sqrt{\sum_{d=1}^{D} (y_{kd} - \bar{y}_k)^{2}}} \tag{2}$$

where $\bar{y}_j = \frac{1}{D}\sum_{d=1}^{D} y_{jd}$ and $\bar{y}_k = \frac{1}{D}\sum_{d=1}^{D} y_{kd}$.

The value in the j-th row and the k-th column of R indicates how similar the j-th frame and the k-th frame are among the N frames. An example of this is shown in Figure 4.

Next, we compute the row-wise means of R and obtain the frame indexes corresponding to the top K means, denoted $\{t_1, \ldots, t_K\}$. Similarly, the frame indexes corresponding to the bottom K means are denoted $\{b_1, \ldots, b_K\}$. K was set to 3 for an input video composed of 16 frames; this value does not damage the temporal structure of the video sequence but only substitutes unnecessary frames. Finally, we replace the feature corresponding to the $b_i$-th frame in X with the feature corresponding to the $t_i$-th frame, which yields X' of the same size as X. This method helps the RNN to detect facial changes effectively with only a small amount of computation.
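As a concrete illustration of this substitution step, the following sketch computes the correlation matrix of Equations (1) and (2) and replaces the bottom-K frames with the top-K frames. The feature layout (one row per frame) and the one-to-one pairing of bottom and top indexes are assumptions for illustration.

```python
# Frame substitution (FS) module sketch.
# Assumption: X holds DenseNet-FC appearance features of shape (N, D).
import torch

def frame_substitution(X, K=3):
    # Pearson correlation between frames: standardize each frame feature, then
    # R[j, k] is the correlation between frames j and k (Equations (1)-(2)).
    Xc = X - X.mean(dim=1, keepdim=True)
    Xn = Xc / (Xc.norm(dim=1, keepdim=True) + 1e-8)
    R = Xn @ Xn.t()                                   # (N, N) correlation matrix

    row_means = R.mean(dim=1)                         # how "typical" each frame is
    top_idx = torch.topk(row_means, K, largest=True).indices
    bot_idx = torch.topk(row_means, K, largest=False).indices

    X_sub = X.clone()
    X_sub[bot_idx] = X[top_idx]                       # replace least correlated frames
    return X_sub                                      # X' fed to the RNN
```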
3.5. Visual Scene-Aware RNN (SARNN)
We propose a temporal network that can fuse global scene features and local appearance features at the feature level. Inspired by a previous study [29], a scene-aware RNN is proposed as an improvement of the RNN for the feature-wise fusion of two signals: scene features and appearance features. The scene features from the 3D CNN can be used as context because they have temporal information that does not exist in individual frames. We present three types of connections for the efficient fusion of scene features and appearance features (types A, B and C), as shown in Figure 5. Here, we assume that LSTM [50] is used as the RNN model. This scene-aware LSTM is called SALSTM.
3.5.1. Type A
Conventional LSTM recognizes the relation between the current and past information by using the sequence input $\mathbf{x}_t$ and the hidden state $\mathbf{h}_{t-1}$. However, the proposed SALSTM has a total of three inputs, including the scene feature. In Figure 5a, $\mathbf{x}_t$ represents an appearance feature, and $\mathbf{v}$ represents the scene feature. The appearance feature is input into each unit of the SALSTM in temporal order. At the same time, the scene feature is delivered to all units of the SALSTM. Thus, the SALSTM is designed to take the visual scene into account, which can be summarized as follows. With an input sequence $X' = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$, the LSTM is operated as follows:

$$\mathbf{i}_t = \sigma(W_{xi}\mathbf{x}_t + W_{hi}\mathbf{h}_{t-1} + W_{vi}\mathbf{v} + \mathbf{b}_i) \tag{3}$$
$$\mathbf{f}_t = \sigma(W_{xf}\mathbf{x}_t + W_{hf}\mathbf{h}_{t-1} + W_{vf}\mathbf{v} + \mathbf{b}_f) \tag{4}$$
$$\mathbf{o}_t = \sigma(W_{xo}\mathbf{x}_t + W_{ho}\mathbf{h}_{t-1} + W_{vo}\mathbf{v} + \mathbf{b}_o) \tag{5}$$
$$\mathbf{g}_t = \tanh(W_{xg}\mathbf{x}_t + W_{hg}\mathbf{h}_{t-1} + W_{vg}\mathbf{v} + \mathbf{b}_g) \tag{6}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \mathbf{g}_t \tag{7}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \tag{8}$$

Equations (3)–(6) represent the four gates of the SALSTM: $\mathbf{i}_t$ is the input gate, $\mathbf{f}_t$ is the forget gate, $\mathbf{o}_t$ is the output gate, and $\mathbf{g}_t$ is the candidate gate. In Equations (7) and (8), $\mathbf{c}_t$ is the cell state, and $\mathbf{h}_t$ is the hidden state. $\sigma(\cdot)$ denotes the sigmoid function, and $\tanh(\cdot)$ denotes the hyperbolic tangent function. W is a weight matrix, b is a bias, and $\odot$ is the Hadamard product. As a result, the scene-feature term added in Equations (3)–(6) enables feature-wise fusion of the scene feature and the appearance feature. The subsequent learning process is equivalent to that of a conventional LSTM.
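A compact sketch of a Type-A cell is shown below. It assumes the scene feature enters every gate through its own weight matrix, as in Equations (3)–(6); the class name and dimensions are illustrative.

```python
# Type-A scene-aware LSTM cell sketch: the scene feature v is added to all
# four gates with its own weights (Equations (3)-(8)).
import torch
import torch.nn as nn

class SALSTMCellA(nn.Module):
    def __init__(self, x_dim, v_dim, hid_dim):
        super().__init__()
        self.x2g = nn.Linear(x_dim, 4 * hid_dim)               # appearance-feature term
        self.h2g = nn.Linear(hid_dim, 4 * hid_dim)             # recurrent term
        self.v2g = nn.Linear(v_dim, 4 * hid_dim, bias=False)   # scene-feature term

    def forward(self, x_t, v, h_prev, c_prev):
        gates = self.x2g(x_t) + self.h2g(h_prev) + self.v2g(v)
        i, f, o, g = gates.chunk(4, dim=-1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c_t = f * c_prev + i * g                               # Equation (7)
        h_t = o * torch.tanh(c_t)                              # Equation (8)
        return h_t, c_t
```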
3.5.2. Type B
In general, RNN-based networks initialize all hidden states to zero because they have no previous information ($\mathbf{h}_0 = \mathbf{0}$). However, type B makes use of the scene feature extracted from the 3D CNN as the previous information, as shown in Figure 5b. The advantage of such a connection is that temporal information is derived by using the whole visual scene, i.e., the scene feature, as the temporally previous information. This can also be easily implemented without changing the LSTM equations. The scene feature $\mathbf{v}$ can be summarized as follows:

$$\mathbf{v} = F_{3D}(V; \theta) \tag{9}$$
$$\mathbf{h}_0 = \mathbf{v} \tag{10}$$

where V is the video input, $F_{3D}(\cdot)$ is the 3D CNN, and $\theta$ represents all the training parameters of the 3D CNN.
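The sketch below shows one way to realize the Type-B connection. It assumes the scene feature is linearly projected to the LSTM hidden size and used as the initial hidden state; the projection layer is an illustrative assumption.

```python
# Type-B connection sketch: initialize the LSTM hidden state with the scene
# feature (Equations (9)-(10)); the cell state starts at zero.
import torch
import torch.nn as nn

class SALSTMTypeB(nn.Module):
    def __init__(self, x_dim, v_dim, hid_dim):
        super().__init__()
        self.proj = nn.Linear(v_dim, hid_dim)   # map scene feature to hidden size
        self.lstm = nn.LSTM(x_dim, hid_dim, batch_first=True)

    def forward(self, x_seq, v):                # x_seq: (B, N, x_dim), v: (B, v_dim)
        h0 = self.proj(v).unsqueeze(0)          # h_0 derived from the scene feature
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(x_seq, (h0, c0))
        return out
```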
3.5.3. Type C
As shown in Figure 5c, the simplest method to fuse the multi-modal features is feature-wise concatenation, as follows:

$$\mathbf{i}_t = \sigma(W_{i}[\mathbf{x}_t, \mathbf{v}] + U_{i}\mathbf{h}_{t-1} + \mathbf{b}_i) \tag{11}$$
$$\mathbf{f}_t = \sigma(W_{f}[\mathbf{x}_t, \mathbf{v}] + U_{f}\mathbf{h}_{t-1} + \mathbf{b}_f) \tag{12}$$
$$\mathbf{o}_t = \sigma(W_{o}[\mathbf{x}_t, \mathbf{v}] + U_{o}\mathbf{h}_{t-1} + \mathbf{b}_o) \tag{13}$$
$$\mathbf{g}_t = \tanh(W_{g}[\mathbf{x}_t, \mathbf{v}] + U_{g}\mathbf{h}_{t-1} + \mathbf{b}_g) \tag{14}$$

These equations represent the four gates of the LSTM, and $[\cdot]$ is the feature-wise concatenation operation. The last hidden state of the SALSTM is converted into a probability distribution over emotions via a softmax function prior to classification. Note that the LSTM part of the SALSTM can be replaced with various RNN units, such as a gated recurrent unit (GRU) [51].
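A minimal Type-C sketch follows: the scene feature is repeated along the time axis, concatenated with each appearance feature and fed to a standard LSTM. The classification head and class count are illustrative assumptions.

```python
# Type-C connection sketch: feature-wise concatenation of appearance and scene
# features before a standard LSTM (Equations (11)-(14)).
import torch
import torch.nn as nn

class SALSTMTypeC(nn.Module):
    def __init__(self, x_dim, v_dim, hid_dim, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(x_dim + v_dim, hid_dim, batch_first=True)
        self.head = nn.Linear(hid_dim, num_classes)

    def forward(self, x_seq, v):                          # (B, N, x_dim), (B, v_dim)
        v_rep = v.unsqueeze(1).expand(-1, x_seq.size(1), -1)
        out, _ = self.lstm(torch.cat([x_seq, v_rep], dim=-1))
        return self.head(out[:, -1])                      # logits; softmax in the loss
```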
3.6. Training
The training of VSHNN consists of two steps. In the first step, DenseNet-FC is trained, and the appearance features are extracted on a frame basis. In the second step, the 3D CNN and SARNN are trained. As mentioned in Section 3.3, only the FC layers of the pre-trained 3D CNN are fine-tuned in the second step.
We use the cross-entropy function as the loss function of both the main network and the auxiliary classifier. Using the cross-entropy loss, the final loss function $L_{total}$ is defined as follows:

$$L_{total} = L_{main} + \lambda L_{aux} \tag{15}$$

where $L_{main}$ is the main network loss, $L_{aux}$ is the loss of the auxiliary classifier, and $\lambda$ is a hyper-parameter that determines the rate of reflection of the auxiliary classifier. The training proceeds such that Equation (15) is minimized.
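The combined loss of Equation (15) amounts to the following short sketch; the value of the weight `lam` is a hypothetical placeholder, since the text treats it as a hyper-parameter.

```python
# Combined loss of Equation (15): main cross-entropy plus weighted auxiliary loss.
import torch.nn.functional as F

def vshnn_loss(main_logits, aux_logits, target, lam=0.3):   # lam is illustrative
    loss_main = F.cross_entropy(main_logits, target)        # main network loss
    loss_aux = F.cross_entropy(aux_logits, target)          # auxiliary classifier loss
    return loss_main + lam * loss_aux
```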
4. Multi-Modal Neural Network Using Facial Landmarks
We need to utilize another modality, i.e., facial landmarks that can be acquired from video, because the performance of FER using only the image modality (pixel information) is limited. Facial landmarks are a kind of hand-crafted feature. Note that FER using only landmark information is inferior to image-based FER [34,35,37,48]. Consequently, multi-modal networks in which the landmark modality is merged with the image modality have been developed. However, the low performance of conventional landmark-based FER techniques reduces the effectiveness of multi-modal fusion. To solve this problem, we enhance the landmark modality and merge the enhanced feature with VSHNN.
First, we describe the facial landmark in detail, and then we provide three different multi-modal fusion schemes: intermediate concatenation, weighted summation and low-rank multi-modal attention fusion.
4.1. Facial Landmark as an Additional Modality
We extend the landmark Euclidean distance (LMED) [37] and propose a new LMED using the correlation between adjacent frames, called LMED-CF. LMED is useful for extracting features from wild data, but it is highly dependent on the learning model. Therefore, we present LMED-CF, which is less dependent on the learning model.
The LMED-CF is generated as shown in Figure 6. First, 68 3D landmarks are extracted by [36]. The reason for using 3D landmarks is to cope with the extreme head poses that occur frequently in wild data. Next, the Euclidean distances between the landmarks that are closely related to facial expression changes are calculated and transformed into a 34-dimensional feature $\mathbf{I}_n \in \mathbb{R}^{34}$ for each frame. This becomes $L = [\mathbf{I}_1, \ldots, \mathbf{I}_N] \in \mathbb{R}^{34 \times N}$ for an N-frame video sequence. In [48], LMED is a 102-dimensional descriptor created by calculating the max, mean and standard deviation of the 34-dimensional features over the frames.

The LMED-CF is defined by adding the inter-frame correlation to the conventional LMED, as shown in Figure 6. The method of computing the inter-frame correlation is discussed in Section 3.4. In detail, Y is obtained based on L, and then a correlation matrix is calculated using Equation (1). Next, the upper triangular part of the correlation matrix, excluding the diagonal elements, is flattened to generate an N(N-1)/2-dimensional feature based on the inter-frame correlation. Finally, the statistics and the correlation feature are concatenated and normalized to [-1, 1] to construct the final LMED-CF. If N is 16, the dimension of LMED-CF is 222 (= 102 + 120). Since LMED-CF provides frame-to-frame, i.e., temporal, correlation, it can contribute to learning more information than LMED.
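The sketch below assembles the 222-dimensional LMED-CF descriptor. It assumes the 34 landmark distances per frame have already been computed and that a simple min-max scaling realizes the normalization to [-1, 1]; both are illustrative assumptions.

```python
# LMED-CF construction sketch.
# Assumption: `dist` holds the 34 landmark Euclidean distances per frame, shape (N, 34).
import numpy as np

def lmed_cf(dist):
    # LMED part: max, mean, std over the frame axis -> 3 x 34 = 102 dims
    stats = np.concatenate([dist.max(0), dist.mean(0), dist.std(0)])

    # inter-frame correlation part: Pearson correlation between frames (Equation (1))
    R = np.corrcoef(dist)                          # (N, N)
    iu = np.triu_indices(len(dist), k=1)           # upper triangle w/o diagonal
    corr_feat = R[iu]                              # N(N-1)/2 = 120 dims for N = 16

    feat = np.concatenate([stats, corr_feat])      # 102 + 120 = 222 dims
    # normalize the descriptor to [-1, 1] (min-max scaling as an assumed choice)
    return 2 * (feat - feat.min()) / (feat.max() - feat.min() + 1e-8) - 1
```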
4.2. Multi-Modal Fusion
This section introduces two representative multi-modal fusion methods and proposes a new multi-modal fusion method that compensates for the shortcomings of the previous methods.
4.2.1. VSHNN-IC: Intermediate Concatenation
Intermediate concatenation (IC) simply concatenates the features extracted from the networks of multiple modalities. Applying this technique to VSHNN, as shown by the dotted arrow in Figure 7, the scene feature extracted from the 3D CNN and the landmark feature obtained through LMED-CF are concatenated, and then the fused feature is input into SARNN. Because the two features come from different modalities, their scales may also differ. To avoid this problem, prior to concatenation, batch normalization (BN) is applied to the scene feature as in Equation (16):

$$\tilde{\mathbf{v}} = [\,BN(\mathbf{v}),\ \mathbf{l}\,] \tag{16}$$

where $BN(\cdot)$ and $\mathbf{l}$ denote a BN layer and the landmark feature extracted by LMED-CF, respectively. The remaining process is the same as in ordinary VSHNN. LMED-CF reflects the statistical characteristics of geometric changes, which can help improve the performance by capturing minute facial expressions that are missed by scene features.
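The fusion of Equation (16) reduces to a few lines; the scene-feature dimension below is an illustrative assumption.

```python
# Intermediate-concatenation fusion sketch (Equation (16)): batch-normalize the
# scene feature, then concatenate the 222-dim LMED-CF landmark feature.
import torch
import torch.nn as nn

class ICFusion(nn.Module):
    def __init__(self, scene_dim=512):                    # scene_dim is assumed
        super().__init__()
        self.bn = nn.BatchNorm1d(scene_dim)               # align the feature scales

    def forward(self, scene_feat, lm_feat):               # (B, scene_dim), (B, 222)
        return torch.cat([self.bn(scene_feat), lm_feat], dim=-1)   # input to SARNN
```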
4.2.2. VSHNN-WS: Weighted Summation
As a type of late fusion, weighted summation is a method of properly merging the score vectors from heterogeneous networks. For instance, VSHNN-WS simply performs a weighted average of the outputs of the networks of different modalities:

$$\mathbf{z} = \alpha\,\mathbf{z}_v + (1 - \alpha)\,\mathbf{z}_l \tag{17}$$

where $\mathbf{z}_v$ is the score vector output from VSHNN, $\mathbf{z}_l$ is the score vector output from the landmark-based network, and $\mathbf{z}$ is the final score. $\alpha$ is the weight applied to each output and is determined experimentally so that the recognition performance is maximized. In this paper, $\alpha$ was set to 0.6 for AFEW and 0.5 for the CK+ and MMI datasets.
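Equation (17) corresponds directly to the following sketch; the softmax calls assume the two networks output unnormalized scores, which is an assumption of this example.

```python
# Weighted-summation (late) fusion of Equation (17); alpha = 0.6 for AFEW.
import torch

def weighted_summation(z_video, z_landmark, alpha=0.6):
    p_video = torch.softmax(z_video, dim=-1)        # VSHNN scores
    p_landmark = torch.softmax(z_landmark, dim=-1)  # landmark-network scores
    return alpha * p_video + (1 - alpha) * p_landmark
```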
4.2.3. VSHNN-LRMAF: Low-Rank Multi-Modal Attention Fusion
The IC scheme is simple but does not fully account for the characteristics of each modality. The WS method is also inefficient because it determines the weights manually. Recently, a bilinear pooling based on low-rank decomposition was proposed to improve on the drawbacks of typical multi-modal fusion methods [52]. However, the performance improvement in [52] is still not large compared to the IC and WS methods.
Inspired by [52], we propose a new method to further improve the efficiency of low-rank fusion, in which self-attention is applied to enhance the feature of each modality. We call it low-rank multi-modal attention fusion (LRMAF). Like VSHNN-WS, LRMAF receives the score vectors from VSHNN and from the LMED-CF branch with an MLP (see Figure 7).
Figure 8 depicts the operation of LRMAF. First, a value of 1 is appended to the last element of each feature vector; let the resulting vectors be $\tilde{\mathbf{z}}_v$ and $\tilde{\mathbf{z}}_l$, respectively. Using these vectors, we calculate the self-attentions of Equation (18). Note that each attention value is mapped to a probability between 0 and 1 through the sigmoid function $\sigma(\cdot)$:

$$\mathbf{a}_v = \sigma\!\left(FC(\tilde{\mathbf{z}}_v; \theta_v)\right), \quad \mathbf{a}_l = \sigma\!\left(FC(\tilde{\mathbf{z}}_l; \theta_l)\right) \tag{18}$$

where $FC(\cdot)$ indicates an FC layer, and $\theta_v$ and $\theta_l$ stand for the trainable parameters of the video and landmark FC layers, respectively.

After element-wise multiplication of the attention and feature vectors of each modality, we take the outer product of the results of the two modalities. Then, the low-rank fusion tensor $\mathbf{z}$ is obtained as in Equation (19):

$$\mathbf{z} = (\mathbf{a}_v \odot \tilde{\mathbf{z}}_v) \otimes (\mathbf{a}_l \odot \tilde{\mathbf{z}}_l) \tag{19}$$

Next, in order to reflect the importance of each modality in $\mathbf{z}$, we apply low-rank decomposition to the learnable weight matrix W. Then, W is decomposed into $\mathbf{w}_v^{(i)}$ and $\mathbf{w}_l^{(i)}$ as in Equation (20):

$$W = \sum_{i=1}^{r} \mathbf{w}_v^{(i)} \otimes \mathbf{w}_l^{(i)} \tag{20}$$

where r is the rank of W.

Finally, the output fusion vector $\mathbf{h}$ is obtained by applying Equations (19) and (20):

$$\mathbf{h} = W \cdot \mathbf{z} \tag{21}$$
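The sketch below illustrates one possible realization of LRMAF under stated assumptions: score-vector sizes, rank and output dimension are placeholders, and the fused output is computed directly from the per-modality low-rank factors of Equation (20) so the outer product of Equation (19) is never formed explicitly.

```python
# LRMAF sketch: append 1, apply sigmoid self-attention per modality (Equation (18)),
# then fuse with modality-specific low-rank factors (Equations (19)-(21)).
# Dimensions and rank below are illustrative assumptions.
import torch
import torch.nn as nn

class LRMAF(nn.Module):
    def __init__(self, v_dim=7, l_dim=7, out_dim=7, rank=4):
        super().__init__()
        self.att_v = nn.Linear(v_dim + 1, v_dim + 1)   # video self-attention FC
        self.att_l = nn.Linear(l_dim + 1, l_dim + 1)   # landmark self-attention FC
        self.W_v = nn.Parameter(torch.randn(rank, v_dim + 1, out_dim) * 0.01)
        self.W_l = nn.Parameter(torch.randn(rank, l_dim + 1, out_dim) * 0.01)

    def forward(self, z_v, z_l):                       # (B, v_dim), (B, l_dim) score vectors
        one = z_v.new_ones(z_v.size(0), 1)
        z_v = torch.cat([z_v, one], dim=-1)            # append 1 to each vector
        z_l = torch.cat([z_l, one], dim=-1)
        z_v = torch.sigmoid(self.att_v(z_v)) * z_v     # attention gating (Equation (18))
        z_l = torch.sigmoid(self.att_l(z_l)) * z_l
        # low-rank bilinear fusion: per-modality projections multiplied and summed over rank
        h_v = torch.einsum('bd,rdo->rbo', z_v, self.W_v)
        h_l = torch.einsum('bd,rdo->rbo', z_l, self.W_l)
        return (h_v * h_l).sum(dim=0)                  # fused output vector h
```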