1. Introduction
Understanding human emotional states is important for interacting effectively with another person, for supporting patients’ needs in hospitals, and for providing user-friendly services with service robots, intelligent agents or smartphones [
1]. Facial expressions are frequently used to understand human emotional states. The basic facial expressions, including anger, disgust, fear, happiness, sadness and surprise, are observed in most cultures [
2,
3] and used in many facial expression recognition systems, even though other evidence shows that facial expressions of emotion are culture-specific [
4]. Recently, complex facial expressions, which are combinations of basic facial expressions, have also been analyzed and categorized [
5]. A facial expression recognition system may also be useful for performance-based human emotional intelligence assessment [
6] by measuring the accurate generation of facial expressions for specific emotions, and for the accurate understanding of expressions generated by other persons. Specific configurations of facial muscle movements appear as if they summarily broadcast or display a person’s emotions, which is why they are routinely referred to as emotional expressions and facial expressions [
7]. For example, Westerners represent each of the six basic emotions with a distinct set of facial movements common to the group, while Easterners do not [
4].
Many facial models have been developed based on 2D and 3D landmarks [
8], facial appearances [
9], geometry [
10] and 2D/3D spatio-temporal representations [
11]. A comprehensive literature review on this subject can be found in review papers [
12,
13,
14]. These approaches can be categorized as model-based, image-based and deep learning-based. Most of the image-based approaches are based on engineered features such as Histogram of Oriented Gradients (HOG) [
15], Local Binary Pattern Histogram (LBPH) [
16] and Gabor filters [
17].
Facial landmark points are one of the key features of model-based facial expression analysis. Statistical shape models such as active shape models (ASMs) [
8] or appearance models such as active appearance models (AAM) [
9] are frequently used with classifiers such as a support vector machine (SVM). A facial expression recognition system using multi-class AdaBoost [
18] with pairs of landmarks exhibits high performance in facial expression recognition [
19]. Automatic landmark point detection and tracking using multiple differential evolution Markov chain particle filters [
20] exhibits improved performance over conventional landmark point tracking based on AAM [
9]. An SVM classifier on the displacement of facial landmark points yields high classification accuracy [
21].
Recently, convolutional neural networks (CNNs) have been used for face recognition [
22,
23] and facial expression recognition [
24,
25,
26]. The face alignment and normalization process using 3D face modeling shows an improvement in face recognition performance in CNN-based face recognition [
22]. CNN architectures specialized for facial action unit recognition for supporting region-based multi-label learning have also been developed [
27]. A deep CNN provides high-level features from the trained model. Features extracted from CNNs are powerful and are used in numerous computer vision applications [
28], including facial expression recognition using SVM [
24,
29,
30] and micro-expression recognition tasks with evolutionary search for features [
31]. A hybrid model that integrates CNN and SVM by replacing the fully connected classification layer of the CNN with an SVM classifier has also been developed [
32].
Efforts have been made to combine multiple approaches for the improved classification of facial expression. Two-channel CNNs with different kernels are combined in a fully connected layer and exhibit better performance than hand-coded features [
33]. The ensemble of multiple deep CNNs is used for static facial expression recognition [
34]. Deep belief networks (DBN) are used to extract and combine audio and visual features for emotion recognition from audio–video data [
35].
Especially for the recognition of facial expressions, combining geometric and texture features is important for achieving improved performance, since each compensates for the limitations of the other: the geometry provides global facial motions, whereas the texture provides subtle and detailed variations in expression, such as a wink or eyebrow movement. Texture features extracted by discriminative non-negative matrix factorization (DNMF) and geometric displacements from grid tracking have been used for distance measurement and fused by an SVM classifier [
36].
Recently, learning shape, appearance and dynamics with a combination of a CNN and bi-directional long short-term memory neural network (BLSTM) were used for facial action unit recognition [
37]. As an alternative to this approach, the joint fine-tuning of deep neural networks from a deep temporal appearance network (DTAN) and deep temporal geometry network (DTGN) is proposed [
38]. The two pre-trained networks are jointly fine-tuned with additional fully connected layers and a joint loss function.
A great deal of work has been done using a hybrid approach that combines transfer learning with pre-trained deep convolutional networks [
39]. That model classifies facial expressions into seven classes and achieves an accuracy of 74.39% on the FER-2013 database. Another facial expression recognition method uses graph-based feature extraction and a hybrid classification approach (GFE-HCA) [
40]. Feature dimensions from facial parts such as the right eye, left eye, nose and mouth are optimized using a weighted visibility graph, which derives the graph-based features from edge-based invariant features. The combination of a deep convolutional network and a modified joint trilateral filter has also been used to recognize facial emotions efficiently [
41]. However, a system that combines a conventional feature-based classifier with deep learning approaches does not exist.
The method discussed in [
42] presents an artificial intelligence (AI)-based system for speech emotion recognition (SER) that utilizes hierarchical blocks of convolutional long short-term memory (ConvLSTM) with sequence learning. Four ConvLSTM blocks extract local emotional features in a hierarchical manner, with convolution operations adopted for the input-to-state and state-to-state transitions to capture spatial cues. In addition, a sequence learning strategy extracts global information and adaptively adjusts the relevant global feature weights according to the correlation of the input features.
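For illustration only, the following is a minimal sketch of such a hierarchical ConvLSTM feature extractor, assuming TensorFlow/Keras; the block count, filter sizes and input shape are assumptions and not the exact design in [42].

```python
import tensorflow as tf

def conv_lstm_ser_backbone(time_steps=32, freq_bins=64, channels=1):
    # Input: a sequence of spectrogram-like frames (time, freq, 1, channels).
    inputs = tf.keras.Input(shape=(time_steps, freq_bins, 1, channels))
    x = inputs
    # Four stacked ConvLSTM blocks: convolutions perform the input-to-state and
    # state-to-state transitions, capturing local spectro-temporal cues hierarchically.
    for filters in (16, 32, 64, 128):
        x = tf.keras.layers.ConvLSTM2D(filters, kernel_size=(3, 3), padding="same",
                                       return_sequences=True)(x)
        x = tf.keras.layers.BatchNormalization()(x)
    # Global pooling stands in for the sequence-learning stage that re-weights features.
    x = tf.keras.layers.GlobalAveragePooling3D()(x)
    return tf.keras.Model(inputs, x)
```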
In this paper, we present a hybrid model: a fusion of CNN-based facial expression recognition and conventional high-performance geometry-based facial expression recognition using an SVM classifier. To extract facial motion characteristics efficiently, we apply the deep learning model to the dense facial motion flow, i.e., the optical flow extracted from the initial frame (neutral expression) to the given frame of the sequence. For the geometric features, the displacement direction and strength of the facial landmark points are extracted and used as the feature vector for the SVM classifier. The weighted summation of the outputs of the hybrid classifiers provides state-of-the-art results for the Cohn–Kanade (CK+) database [
43] and results for the BU4D database that remain comparable with state-of-the-art schemes.
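As a rough illustration of this pipeline, the sketch below assumes OpenCV’s Farnebäck dense optical flow and generic cnn_model/svm_model objects; the function names, flow algorithm and default weight are illustrative assumptions, not the exact implementation.

```python
import cv2
import numpy as np

def motion_flow(neutral_gray, frame_gray):
    # Dense optical flow from the neutral (first) frame to the current frame:
    # two channels holding the horizontal and vertical displacement of every pixel.
    return cv2.calcOpticalFlowFarneback(neutral_gray, frame_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # shape (H, W, 2)

def landmark_displacement(neutral_pts, frame_pts):
    # Displacement of each facial landmark point, flattened into one feature vector.
    return (frame_pts - neutral_pts).reshape(-1)

def hybrid_predict(cnn_model, svm_model, flow, geo_feat, w_cnn=0.5):
    p_cnn = cnn_model.predict(flow[np.newaxis])[0]            # CNN class probabilities
    p_svm = svm_model.predict_proba(geo_feat[np.newaxis])[0]  # SVM class probabilities
    p = w_cnn * p_cnn + (1.0 - w_cnn) * p_svm                 # weighted summation
    return int(np.argmax(p))
```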
3. Experiments
In this section, we evaluate and compare our approach with other state-of-the-art algorithms for facial expression recognition using the CK+ and BU4D databases. We used 80% of the data for training and 20% for validation. First, we evaluate the proposed method using the CNN-based motion flow. Second, we test the facial-geometry-displacement-based SVM on the CK+ database. Finally, we evaluate the hybrid model, the mixture of the CNN and SVM. The proposed method is also evaluated and compared with state-of-the-art methods using the BU4D database.
3.1. Facial Expression Recognition Using CNN-Based Facial Motion Flow for CK+ Database
The CK+ database was released to promote research into automatically detecting individual facial expressions. Since then, the CK+ database has become one of the most widely used datasets for algorithm development and evaluation, as it provides AU coding. Based on the facial action unit criteria, it provides seven facial expression categories: Anger (An), Contempt (Co), Disgust (Di), Fear (Fe), Happiness (He), Sadness (Sa), and Surprise (Su). We used these seven facial expression categories as the class labels of the facial expression recognition system, and the CK+ database with these seven class labels is used in all of the following experiments. The number of samples per class differs across expressions.
We evaluated the facial expression recognition performance using four-fold cross-validation for the CK+ database. From the training data, 20% are used as the validation set for learning the CNN-based facial expression recognition. For training the proposed CNN-based facial expression recognition from facial motion flow, we used a batch size of 256 and a learning rate starting at 0.0003 and decreasing gradually.
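A training setup of this kind can be written along the following lines, assuming Keras; the decay schedule and optimizer choice are assumptions for illustration, only the batch size of 256 and the initial learning rate of 0.0003 come from the text.

```python
import tensorflow as tf

# Learning-rate schedule that starts at 3e-4 and decays gradually; the decay steps,
# decay rate and the choice of the Adam optimizer are illustrative assumptions.
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=3e-4, decay_steps=1000, decay_rate=0.9)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

# model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_flows, train_labels, batch_size=256, validation_split=0.2, epochs=100)
```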
Figure 6 shows the feature map visualization of the second convolution layer of the model when training data of different expressions are applied. We selected representative feature maps that capture the distinguishing characteristics of each facial expression. The visualized feature maps show different activation areas for each expression, which may correspond to the action units of each expression type. This implies that the network properly extracts features for the different expression types.
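A feature map visualization like Figure 6 can be produced roughly as in the sketch below, assuming a Keras model; the layer index and plotting details are assumptions.

```python
import tensorflow as tf
import matplotlib.pyplot as plt

def show_feature_maps(model, flow_input, layer_index=3, n_maps=8):
    # Sub-model that outputs the activations of the chosen convolution layer.
    probe = tf.keras.Model(model.input, model.layers[layer_index].output)
    maps = probe.predict(flow_input[None])[0]   # (H, W, n_filters)
    for i in range(n_maps):
        plt.subplot(1, n_maps, i + 1)
        plt.imshow(maps[..., i], cmap="viridis")
        plt.axis("off")
    plt.show()
```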
Table 2 shows the CK+ facial expression recognition results, including the average recognition rate, obtained using CNN-based classification with the network architecture presented in Figure 4a.
With the modified network architecture presented in Figure 4b, we achieve a higher average recognition rate than with the basic network architecture, as shown in Table 3. We also evaluated the facial expression recognition performance with different input data types.
Table 4 shows the performance of facial expression recognition for the CK+ database with different input data types. Three channels are used when the two motion flow components are combined with either the flow angle or the flow intensity (I). The table shows that the proposed two-channel motion flow (the horizontal and vertical flow components) outperforms the other input data types, and the modified CNN-based facial expression recognition using the two-channel motion flow performs better than the basic CNN-based model.
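The input encodings compared in Table 4 can be assembled roughly as follows; this is a sketch assuming the dense flow array from the earlier example, and the channel names are descriptive rather than the paper's notation.

```python
import numpy as np

def flow_inputs(flow):
    # flow: dense motion flow of shape (H, W, 2) from the neutral frame to the current frame.
    fx, fy = flow[..., 0], flow[..., 1]
    intensity = np.sqrt(fx ** 2 + fy ** 2)                    # magnitude of the motion flow
    angle = np.arctan2(fy, fx)                                # direction of the motion flow
    two_channel = np.stack([fx, fy], axis=-1)                 # proposed two-channel input
    with_angle = np.stack([fx, fy, angle], axis=-1)           # three-channel variant
    with_intensity = np.stack([fx, fy, intensity], axis=-1)   # three-channel variant
    return two_channel, with_angle, with_intensity
```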
3.2. Facial Expression Recognition Using Facial Geometry Displacement-Based SVM for CK+ Database
Using the SVM classifier based on facial landmark displacement, we classified the seven facial expression categories.
Table 5 shows the confusion matrix of the facial expression recognition performance from the landmark displacement in the sequence. The facial-geometry-based SVM classifier shows superior facial expression recognition performance to the CK+ baseline facial expression recognition by the multi-class SVM using the combination of similarity-normalized shape and canonical appearance features in [
43] or the SVM classifier using facial feature point displacement [
21].
We further investigate the facial expression recognition performance of the SVM with different representations of the facial geometry and its displacement. The landmark points themselves, the points normalized by the eye-center locations, the intensity and angle of the landmark flow, and the landmark flow in Cartesian coordinates are tested.
Table 6 shows the performance of facial expression recognition for the CK+ database according to the different SVM input data types. In this additional experiment, the SVM classifier using the direct landmark displacement in Equation (5) achieves higher recognition performance than the classifier using the displacement intensity and angle, as shown in Table 7.
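A minimal sketch of this geometry-based classifier is given below, assuming scikit-learn and a 2D landmark array; the kernel choice and feature layout are assumptions, not the exact configuration used in the experiments.

```python
import numpy as np
from sklearn.svm import SVC

def displacement_features(neutral_pts, frame_pts, polar=False):
    # neutral_pts, frame_pts: (n_landmarks, 2) arrays of landmark coordinates.
    d = frame_pts - neutral_pts
    if polar:
        intensity = np.linalg.norm(d, axis=1)     # displacement strength
        angle = np.arctan2(d[:, 1], d[:, 0])      # displacement direction
        return np.concatenate([intensity, angle])
    return d.reshape(-1)                          # direct (x, y) displacement

# probability=True so the SVM can output class probabilities for the later fusion step.
svm = SVC(kernel="rbf", probability=True)
# svm.fit(train_features, train_labels)
```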
3.3. Mixture of CNN and SVM Classifier for CK+ Database
For the mixture of the facial expression classifiers using a weighted summation of the probabilistic estimates of the facial expression categories, it is necessary to determine the weighting factors of the heterogeneous classifiers. The weighting factors can be found using the validation set that is not used for training the classifiers.
Figure 7 shows the variations in the facial expression classification performance according to the weight values when we combine the modified CNN-based facial expression classification and SVM-based classification using the displacement intensity and angle of facial landmarks.
The combined facial expression recognition performance varies with the weight values and can be higher than the classification performance of either individual classifier. The performance reaches a peak at a particular pair of weightings for the CNN-based and SVM-based classifiers; these weighting factors are computed from the validation set in the experiment.
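One simple way to obtain such weighting factors from the validation set is a grid search over the CNN weight, as in the hypothetical sketch below; the grid resolution and search strategy are assumptions.

```python
import numpy as np

def find_fusion_weight(p_cnn_val, p_svm_val, y_val):
    # p_cnn_val, p_svm_val: (n_samples, n_classes) class-probability matrices on the
    # validation set; y_val: integer ground-truth labels.
    best_w, best_acc = 0.0, 0.0
    for w in np.linspace(0.0, 1.0, 101):
        pred = np.argmax(w * p_cnn_val + (1.0 - w) * p_svm_val, axis=1)
        acc = float(np.mean(pred == y_val))
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```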
Using the mixture of the basic CNN-based classifier and the SVM-based classifier, we evaluated the facial expression recognition performance for the CK+ database.
Table 8 presents the confusion matrix of the facial expression recognition performance of the heterogeneous classifier.
Figure 8 shows the comparison of the CNN-based classifier, SVM-based classifier, and hybrid CNN-based and SVM-based classifiers. For most of the facial expressions, the fused classifier surpasses or at least matches the peak classification performance of the two heterogeneous classifiers in each facial expression category.
Using the mixture of the modified CNN-based classifier and the SVM-based classifier, we achieved 99.69% facial expression recognition performance, which is higher than that of the mixture of the basic CNN-based classifier and the SVM-based classifier. The confusion matrix for the fusion of the modified CNN-based classifier and the SVM classifier is shown in
Table 9.
We compared the performance of the CK+ database with other approaches in the literature. First, we compared the proposed method with AU-Aware Deep Networks (AUDN), in which the researchers proposed a deep learning method for facial expression recognition by incorporating the existing knowledge that the appearance variations caused by expressions can be decomposed into a batch of local facial action units (AUs) [
47]. Second, we compared the proposed method with occlusion-robust expression prediction using local subspace random forest (WLS-RF) [
52], in which the researchers proposed a framework for expression identification that extracts a few salient facial patches depending upon the locations of the facial landmarks. Fourth, we compared the proposed scheme with the joint model of the deep temporal appearance model and deep temporal geometry model (DTAGN) [
52], in which the researchers suggested a novel foundation for expression identification by extracting a few facial patches, depending upon the location of facial markers. Fourth, we compared the proposed scheme with the joint model of the deep temporal appearance model and deep temporal geometry model (DTAGN) [
38], which is based on two different networks: one extracts temporal appearance features from the image sequence and the other extracts temporal geometry features from temporal facial landmark points. Finally, we compared the proposed method with a CNN combined with SIFT (Scale-Invariant Feature Transform) [
48], in which the researchers suggested a CNN for face recognition and attained very good accuracy using a small portion of the training data.
Table 10 shows the facial expression recognition performance of different approaches for the CK+ database. The proposed model shows state-of-the-art facial expression recognition performance on the CK+ database among recent approaches.
3.4. Facial Expression Recognition for BU4D Database
We have also evaluated the facial expression recognition performance using the BU4D database [
67]. BU4D is a high-resolution 3D dynamic facial expression database that contains facial expression sequences captured from 101 subjects. For each subject, there are six sequences, one for each of the six prototypic facial expressions (anger, disgust, happiness, fear, sadness and surprise). Each expression sequence contains approximately 100 frames.
Figure 9 shows a sample sequence of a surprise expression and its landmark points in the BU4D database. Depth data and texture data are provided; we used only the texture data for the analysis of facial expressions. Using the mixture of the modified CNN-based classifier and the SVM-based classifier, we achieved 94.69% facial expression recognition performance using four-fold cross-validation.
Table 11 shows the confusion matrix of facial expression recognition for the BU4D database using the proposed method.
We also compared the facial expression recognition performance with other facial expression recognition systems using the BU4D database. First, we compared the proposed method with the bags of motion word (BoMW) model [
44], which finds facial expressions by using local facial motion descriptors and motion words from clustering its descriptors. Second, we compared the proposed method with the 3D facial surface descriptor, spatial HMM and temporal HMM (2D-HMM) model used for facial expression recognition from 3D facial surfaces [
51]. The method uses three classes of HMMs: temporal 1D-HMM, pseudo 2D-HMM (a fusion of a spatial HMM and a temporal HMM) and real 2D-HMM. Third, we compared the proposed method with the Local Binary Pattern Three Orthogonal Plane (LBP-TOP), which is used to extract the spatiotemporal features of facial expressions [
53]. The method examines two space–time detectors and four descriptors and uses the bag of features framework for human facial expression recognition. Fourth, we compared the proposed method with 3D motion-based features captured using Free-Form Deformations (FFDs) [
49], in which a revealing pattern is geared to contain an onset followed by an apex and an offset. Fifth, we have also compared the proposed method with the Dynamic Geometric Image Network (DGIN), which uses different geometric quantities for dynamic 3D facial expression recognition [
45]. A two-stage long-term and short-term sliding window scheme is presented for data augmentation and temporal pooling throughout the training phase. We have also compared the proposed method with the method in [
54], which extracts spatiotemporal features using HOG3D and applies discriminative feature selection and a hierarchical classifier for interpretable facial expression recognition. They showed the best performance on the BU4D database by extracting real 4D features in each hierarchical classifier layer from depth sequences around onset frames. However, when all frames are used, the same setting as the proposed model, their performance is much lower than that of the proposed method. Finally, the proposed method is compared with the Dense Scalar Fields (DSFs) and their deformation magnification used for 4D facial expression recognition [
46]. The method combines two complementary ideas: evaluating the spatial facial deformations using tools from Riemannian geometry and enhancing them using temporal filtering. The proposed method shows results comparable with state-of-the-art performance on the BU4D database, as shown in
Table 12.
4. Conclusions and Future Works
This paper presents the state-of-the-art performance of a facial expression classification system obtained using a hybrid model: a mixture of CNN-based facial motion flow classification and SVM-based facial geometry displacement classification. The hybrid model exhibits higher facial expression classification performance than each individual facial expression classifier. The facial motion flow, which is the extracted facial motion displacement at every pixel, provides data from which motion characteristics can be extracted without a specialized network architecture for motion extraction. A CNN architecture that can model facial motion flow is also presented. In addition, we have compared the proposed model with state-of-the-art models; it shows recognition performance of 99.69% and 94.69% for the CK+ and BU4D databases, respectively. For the CK+ database, if we compare our recognition performance with [
48], which uses a combination of CNN and SIFT, there is an improvement of 0.59%, and if we compare the proposed model with [
38], which uses DTAN and DTGN, there is an improvement of 3%. For the BU4D database, if we compare the recognition performance of the proposed model with [
54], which uses HOG3D for onset frames, there is a slight degradation of 1.95%. However, if we compare performance with all frames, our proposed model performs better by 11.84%. In addition, if we compare our proposed model with [
46], there is an improvement of 0.51%, which shows the effectiveness of the proposed approach. In conclusion, the proposed method outperformed the others in the case of the CK+ database and is still comparable for the BU4D database with the previous schemes.
A limitation of the proposed system is that we have used only Western databases for the recognition of facial expressions; in the future, we will also include an Eastern database. Even though our system is suitable for detecting facial expressions, the detections should be interpreted carefully because of the limited congruency between a facial expression and the emotion actually experienced by a person. The proposed hybrid model could be improved by improving the CNN-based facial expression classification performance. The weighting factor can be further optimized for better classification by determining the optimal weighting for each expression category or by training another, nonlinear classifier on the outputs of the individual classifiers. In the future, the proposed model can be used for entertainment applications, such as changing music tracks by identifying a person's mood from his/her facial expressions.