1. Introduction to Facial Expression Recognition (FER)
Facial Expression Recognition is a computer-based technology that uses mathematical algorithms to analyze faces in images or video. The analysis proceeds in three primary phases: face detection, facial landmark detection, and facial expression and emotion classification. The last step analyzes the movement of facial features and classifies them into emotion or attitude categories; it is also known as Facial Emotion Recognition, a branch of emotion recognition that involves the analysis of human facial expressions in multimodal forms. More generally, emotion recognition is the automatic processing of human emotions, most typically from facial expressions and verbal expressions, but also from body movements and gestures. In the literature, the acronym FER often refers to both facial expression recognition and facial emotion recognition [
1]. In this paper, it stands for Facial Expression Recognition, the recognition of emotional states based on facial expressions.
Facial expressions play an important role in expressing internal emotions and intentions and are one of the most significant non-verbal ways in daily emotional communication. Nowadays, there is a considerable demand for improving performance in facial expression recognition, due to the broad set of its potential applications, such as surveillance, security, and communication, even in real-time.
The majority of face recognition research and commercial face recognition systems typically use intensity images of the face (i.e., 2D texture data), but other approaches, which are becoming more and more common, consist of using 3D models of the face (i.e., 3D shapes) or both 2D and 3D face data (multi-modal FER) [
2].
Although significant progress has been made, most existing algorithms that use 2D features fail to solve the challenging problems of illumination and pose variations, which are naturally overcome by 3D approaches. For this reason, and thanks to the rapid technological development of 3D scanning, 3D FER has attracted more and more attention. A promising research direction to meet the requirements of real applications is multi-modal 2D+3D FER, as considerable complementarity exists among the different modalities. Data modality is not the only possible classification. In general, it is possible to distinguish two other perspectives for classifying existing FER methods: expression granularity and temporal dynamics. From the first perspective, methods are divided into recognition of prototypical facial expressions (basic emotions) and detection of facial Action Units; from the second, they are classified into methods for still images and for image sequences.
In contrast to traditional approaches, in recent years, deep-learning-based algorithms have been used for feature extraction, classification, and recognition tasks. Deep learning architectures, such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), have been applied to computer vision tasks such as object recognition, face recognition, and FER, yielding state-of-the-art results in many studies thanks to the availability of big data. One of the main advantages of CNNs is that they enable “end-to-end” learning directly from input images, removing altogether, or greatly reducing, the dependence on pre-processing techniques [
3].
The study of facial expressions and emotions is a topic that has interested the research community for several years. In the absence of a uniquely comprehensive and widely accepted theory, multiple approaches for facial expression recognition have emerged, and extensive and detailed surveys of FER focusing on traditional approaches are given in [
4,
5,
6,
7]; the reader is referred to [
8,
9] for 3D and [
10,
11] for earlier 2D approaches. Recently, deep learning-based FER approaches have emerged and have been surveyed in [
1,
12,
13,
14].
This study aims to highlight the advantages and drawbacks of 3D methods over 2D methods, as 3D methods are commonly thought to have the potential for greater recognition accuracy and robustness. Compared to previous works on this topic, our survey offers newcomers the current scenario and the best outcomes in the whole facial expression recognition field. We present and analyze both traditional and deep learning methods available for static and dynamic 3D FER, with a particular focus on feature-based algorithms. Here, we mainly concentrate on facial landmarks, feature extraction, and the main differences between 2D and 3D approaches.
After this brief overview of the possible types of classification of existing 3D FER algorithms, the remainder of the paper is structured as follows.
Section 2 focuses on basic emotions and action units, and some models of emotion classification are presented.
Section 3 gives an overview of the conventional and deep learning-based methods for 2D, 3D, and multimodal FER, keeping the distinction between feature-based and model-based approaches. Moreover, two exhaustive tables are used to outline and organize the articles surveyed. Contributions have been chosen from the years 2010 to 2019, to provide the most up-to-date view of the latest research.
Section 4 deals with facial animation, discussing some critical references.
Section 5 presents a summary of available facial databases mainly used in this research field.
Section 6 and
Section 7 investigate, respectively, the importance of facial landmarks and the role of time in FER algorithms. In
Section 8, the results of the main works of the last decade on the three-dimensional recognition of facial expressions are reported, divided between basic emotions and action units. Finally,
Section 9 presents a comparison between the advantages and the drawbacks of the 2D and 3D algorithms. It also mentions how some research groups have positively addressed these disadvantages and what challenges are still open for the future.
Section 10 concludes the work and summarizes the obtained results to identify the best characteristics of an excellent automatic facial expression recognition system.
2. Basic Emotions and Action Units
Ekman [
15], in 1971, defined a set of six emotions that are accepted as universal: anger, disgust, fear, happiness, sadness, and surprise. Ekman and Friesen named this group the basic emotions, which are universally recognized regardless of language and culture and cannot be decomposed into smaller semantic labels. The seven characteristics of emotions identified by Ekman et al. [
16] are: “presence in other primates, distinctive physiology, universal commonalities in antecedent events, quick onset, brief duration, automatic appraisal, and unbidden occurrence”. Most research studies on FER have been limited to these six “cardinal” categories of emotions. However, humans make use of a much fuller range of facial expressions for everyday communication than these six; some are even combinations of these basic ones [
17]. Martinez et al. state that “there are approximately 7000 different expressions that people frequently use in everyday life” [
18]. Furthermore, some of the expressions can have multiple interpretations depending on the context in which they are shown [
19].
Another possibility for studying facial expressions is the use of action units. An action unit (AU) is the action of muscles typically seen when an individual produces facial expressions. Defined by Ekman and Friesen in 1978 [
20], the Facial Action Coding System (FACS) is a set of AUs used to classify the movement of a distinct muscle or the activation of a muscle group during a facial expression. Facial features are often divided into two groups, the upper face and the lower face, since facial actions in the upper part have only small interactions with facial motion in the lower one, and vice versa. The parameters associated with the upper face features describe the motion and the shape of the eyes, the brows, and the cheeks. On the other hand, the parameters associated with the lower face features describe the motion and the shape of the lips, and the furrows in the nasolabial and nasal root regions [
21]. Initially, they divided muscular activity into 46 AUs; although this number is relatively small, more than 7000 combinations of action units have been observed [
22], allowing the details of facial expressions to be described. For example, AU 12 (lip corner puller) defines the contraction of the zygomatic major muscle, typically observed in feelings of happiness, together with AU 25 (lips part). The anger expression, on the other hand, is characterized by AU 4, AU 7, and AU 24, which cause the lowering of the eyebrows, the tightening of the eyelids, and the pressing of the lips, respectively.
Table 1 summarizes the action units for Ekman’s six universal emotions.
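As a minimal illustration of this coding, the Python sketch below maps the two emotions exemplified above to their typical AU sets and checks which emotions are compatible with a set of detected AUs; the AU lists follow only the examples in the text, not the full mapping of Table 1.

```python
# Illustrative emotion-to-AU mapping, limited to the two examples discussed in the text.
EMOTION_TO_AUS = {
    "happiness": {12, 25},  # lip corner puller, lips part
    "anger": {4, 7, 24},    # brow lowerer, lid tightener, lip pressor
}

def plausible_emotions(detected_aus):
    """Return the emotions whose typical AU set is contained in the detected AUs."""
    detected = set(detected_aus)
    return [emotion for emotion, aus in EMOTION_TO_AUS.items() if aus <= detected]

print(plausible_emotions([4, 7, 24, 17]))  # ['anger']
```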
The emotion categories can be classified into two groups: basic emotions, described above, and compound emotions, constructed as a combination of two basic emotion categories. Some examples of compound emotions most typically expressed by people are: happily surprised, sadly fearful, fearfully angry, and disgustedly surprised. It is important to note that in the categories of compound emotions there is often a conflict between the action units, because it is impossible to perform some muscular movements simultaneously, such as lip pressor (AU 24) and lips part (AU 25) in happily surprised. In total, the researchers mapped twenty-one different facial emotions from Ekman’s six basic emotions, plus the neutral expression [
17].
However, there are other emotions, such as guilt, jealousy, pride, and shame, that people may experience in their everyday lives but do not tend to show through clear and distinct expressions. Ekman investigated them at the beginning of his studies and proposed an expanded list of basic emotions, including a range of positive and negative emotions that are not all encoded in facial muscles.
Basic emotion classification is, still today, a more popular research topic than automatic AU recognition, even though the latter is independent of interpretation and more suitable for describing spontaneous facial behaviors [
11]. At first, this disparity in favor of basic emotion classification [
23,
24,
25,
26,
27,
28,
29,
30] at the expense of AU recognition [
31,
32,
33] in both 2D and 3D domains was partly due to the lack of FACS-coded databases. The Bosphorus is the first, and so far the only, publicly available 3D database that contains AU annotations and provides facial action coding. Sun et al. [
31] proposed an AU recognition system manually labeling eight AUs in the BU-4DFE database, but their annotations are private.
Returning to the basic emotion approach, in 2014 Jack et al. [
34] from the University of Glasgow reported new research, published in the journal
Current Biology, suggesting that there are only four basic emotions. They reached this conclusion by studying the facial muscles involved in each emotion, as well as the activation of each muscle over time. The results indicated that while happiness and sadness are distinct over time, fear and surprise share some common signals, such as wide-open eyes; similarly, anger and disgust share the wrinkled nose.
Other scientists have developed the theme of basic emotions. For example, Plutchik [
35,
36] developed a new model, called the “wheel of emotions”, where basic emotions can be expressed at different intensities and can mix to form several emotions, as shown in
Figure 1. Plutchik’s eight basic emotions, here called primary, are joy, trust, fear, surprise, sadness, anticipation, anger, and disgust. Additional emotions exist, like aggressiveness, optimism, love, and submission: these are all seen as a combination of primary emotions.
Russell [
37,
38] developed the Circumplex Model of Affect, presented in
Figure 2. It is a circular model divided into quadrants, to show the level of valence and arousal of emotional states. The x-axis represents the continuum between pleasant and unpleasant emotions, the y-axis represents the continuum between high and low arousal emotions, while the center of the circle represents a neutral valence and a medium level of arousal.
Another theory that has gained notoriety is Parrott’s approach [
39], in which he identified over 100 emotions and conceptualized them as a tree-structured list, as seen from
Figure 3. The first layer is composed of six primary emotions (love, joy, surprise, anger, sadness, and fear) that branch out into different forms of feeling; the secondary emotions are derivations of the primary ones rather than combinations of them. Some of these emotions are not characterized by a distinct facial expression but are, rather, emotional states.
In recent years, some research groups have tried to enlarge the number of emotions and facial expressions considered, testing their algorithm not only on basic emotions. For example, Fabiano et al. [
40] also consider embarrassment, nervousness, and pain in their study. Zhang et al. [
41,
42] add startle to these expressions, whereas Wei and Jia [
43] include yawn. Others, on the contrary, have chosen to focus on a reduced set of these basic facial expressions, like Le et al. [
44] who test their algorithm only on happiness, sadness, and surprise, or Sandbach et al. [
45,
46] whose algorithm works on anger, happiness and surprise emotions. An algorithm that works only on a small group of facial expressions has limited capabilities but can be useful for particular application fields. An in-depth analysis of the main works that use basic emotions or action units, with their respective recognition rates, is presented in
Section 8, both for traditional and deep learning approaches.
3. Conventional and Deep Learning-Based Approaches
In this section, the main traditional methods for 3D facial expression recognition are presented, comparing feature-based, model-based, and multi-modal algorithms. Next, the leading deep learning techniques applied to FER for feature extraction and classification are described, distinguishing between 2D and 3D.
3.1. Feature-Based VS Model-Based Algorithms
Existing conventional approaches for FER based on 3D static data can be categorized into feature-based algorithms and model-based ones. The first category focuses on extracting, directly from the input data, facial surface geometric information such as curvature, spatial relations between pairs of landmarks of interest (Euclidean and geodesic distances), gradient, and local shape. Features are usually calculated on the regions surrounding the principal facial landmarks, or on the mouth and eyes, which inherently contain essential information for emotion recognition. These key features, considered closely related to expression categories, are fed to various classifiers to perform FER, such as Support Vector Machines (SVM) [
47,
48,
49,
50,
51], Adaboost, k-Nearest Neighbors (k-NN), Linear Discriminant Analysis (LDA), Modified Principal Component Analysis (PCA), Hidden Markov Model (HMM) [
44,
45,
46], Random Forest [
52] or Neural Networks [
51,
53,
54].
Feature-based methods are straightforward but present two main drawbacks. First, most approaches of this kind need a set of correctly located landmarks for feature extraction, an additional step in the process. Until a few years ago, this could be considered a difficult task in real-world applications because it required the manual localization of landmarks for feature extraction or surface registration; to date, this step has been automated, as recently reported by Zeng [
55] and Li [
56]. Second, the performance of the method is directly related to the discriminative power of the adopted facial features rather than of the face shape. Most of the works found in the literature make use of the landmarks manually labeled in the BU-3DFE or Bosphorus datasets. Experiments have proven their excellent performance in recognizing universal expressions, but most of the features used have not yet shown enough discriminative power to distinguish subtle facial AUs.
Thanks to the technological development in imaging and scanning, nowadays it is possible to capture 3D scans and extract geometric characteristics from certain regions around facial landmarks. Examples of some popular expression features are the distances between 3D facial landmarks [
24,
57,
58,
59], 3D facial curves [
60], distances between locally extracted surface patches [
56], facial geometry images and normal maps [
61]. Another possible technique exploits 3D face descriptors, which derive from depth maps by using mathematical operators: first principal curvature, shape index, mean curvature, curvedness, etc. Descriptors are presented and fully explained in [
62,
63].
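As a small hedged sketch of the descriptors mentioned above, the Python snippet below computes mean curvature, Gaussian curvature, shape index, and curvedness from principal curvatures; `k1` and `k2` (with `k1 >= k2`) are assumed to have already been estimated from the 3D surface, and the shape index follows one common Koenderink-style convention, since sign conventions vary between papers.

```python
# Surface descriptors derived from principal curvatures (assumed: k1 >= k2 elementwise).
import numpy as np

def surface_descriptors(k1, k2, eps=1e-8):
    mean_curvature = (k1 + k2) / 2.0
    gaussian_curvature = k1 * k2
    # Shape index in [-1, 1]; the exact sign convention differs between papers.
    shape_index = (2.0 / np.pi) * np.arctan2(k1 + k2, k1 - k2 + eps)
    curvedness = np.sqrt((k1**2 + k2**2) / 2.0)
    return mean_curvature, gaussian_curvature, shape_index, curvedness
```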
Table 2 gives a comparison of selected 3D FER algorithms that use facial features to perform expressions recognition. The works are listed chronologically by year of publication, and alphabetically by author.
Model-based methods, on the other hand, make use of a generic face model created from the neutral expression and determine emotions by measuring the feature vector formed by the coefficients of shape deformation. This approach needs to bring the tracking model into correspondence with the 3D face scans by means of a registration step.
Zhao et al. [
30] presented an automatic 3D FER approach to perform expression prediction by combining a Bayesian Belief Net (BBN) and Statistical Facial Feature Models (SFAM). They tested the developed automatic method on the BU-3DFE database, applying the SFAM for the automatic recognition of the six universal expressions reaching an average recognition rate of over 82%. In [
64], Chen et al. proposed a real-time 3D model-based emotion recognition method. The 3D model of the face, which gives essential clues for improving robustness and allows managing large head rotations, rapid head movements, and partial facial occlusions, is recovered from 2D images. More recently, Fabiano et al. [
40] in 2018 developed a novel method for 3D facial expression recognition based on a statistical shape model with global and local constraints, showing that the global shape of the face, combined with local shape-index-based information, can be used to recognize a range of expressions, including the six basic emotions. The proposed method outperformed the state-of-the-art results, achieving a classification accuracy of 99.99% on non-spontaneous data and 99.69% on spontaneous data (BP4D database). Zhen et al. [
65,
66] presented in their papers a novel approach to the FER problem based on the Muscular Movement Model (MMM), combining the advantages of both feature-based and model-based methods. Without any manual landmarks, they formed 11 muscle regions, each described by a certain number of geometric features (e.g., coordinates, normals, and shape index) to capture shape characteristics. A genetic algorithm learns the weights of the several regions, and SVM and HMM classifiers are used for expression prediction in 3D and 4D FER, respectively.
Table 3 gives a comparison of selected 3D FER algorithms that use face model technique to perform expressions recognition. The works are listed chronologically by year of publication, and alphabetically by author.
3.2. Multi-Modal Algorithms Using 2D and 3D Data
Algorithms that combine results from 2D and 3D data did not appear until about the early 2000s. To date, the most straightforward approaches in this area combine the features obtained independently from two-dimensional and three-dimensional methods, such as texture information, landmark locations, facial shapes, and curvature, to recognize expressions and measure their intensity [
12]. Experiments show that merging features obtained in different modalities helps to catch the general characteristics of facial deformation and to enhance recognition accuracy, even though the texture information is affected by the illumination and pose variations.
In 2014 Hayat and Bennamoun [
76] used 2D texture information incorporated along with 3D features. After learning SVM models from the 2D and 3D data separately, they performed the classification and fused the separately obtained results to improve performance. In addition, a feature dimensionality reduction technique is used to reduce the size of the feature vector while retaining its quality. Jan and Meng [
49] in 2015 proposed to fuse the key features obtained from the geometric and texture domains to investigate how the overall performance is affected. In this work too, a feature dimensionality reduction method is used before applying machine learning techniques, since merging the many elements produced by the algorithms can result in a large feature vector that slows down the system. Accordingly, in 2018, they continued to use textured 3D face scans, since both detailed 3D geometric features and 2D texture information can provide key cues for FER. Experiments are conducted on the BU-3DFE database, demonstrating the effectiveness of combining texture and depth cues.
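The feature-level fusion with dimensionality reduction described above can be sketched as follows; `features_2d`, `features_3d`, and `labels` are assumed, pre-extracted inputs, and the PCA variance threshold is an illustrative choice rather than a value taken from the cited works.

```python
# Hedged sketch of 2D+3D feature-level fusion: concatenate, reduce, classify.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

fused = np.hstack([features_2d, features_3d])     # simple feature-level fusion
clf = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),                       # keep 95% of the variance to shrink the vector
    SVC(kernel="rbf"),
)
clf.fit(fused, labels)
```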
From a general point of view, multimodal techniques may deal with various data modalities originating from other sources of information, such as thermal image acquisitions, voice data, brain signals or cardiovascular activity, and context information. Other visual cues that could be combined with standard features to improve methods that detect facial expressions are eye gaze, head orientation, the motion of the head and body, mouth fidgeting, and facial expression frequency or duration. Some researchers have already demonstrated that context can improve emotion recognition, as can body posture, which becomes more important as the facial expression becomes more ambiguous. The voice, too, can provide indications of emotions through acoustic properties such as pitch range, rhythm, and amplitude or duration changes.
Marginal studies have been conducted on the combination of 3D facial data and physiological or acoustic cues. Investigations on the possible integration of visual and non-visual modalities, like physiological data coming from wearable devices, could be a possible branch of research for the coming years.
3.3. Deep Learning Applied to FER
Deep learning [
90,
91] is a machine learning technique theorized for the first time in the 1980s [
92] but only lately put into practice because it needs a large amount of labeled data and considerable processing power. In recent years, deep learning techniques have been employed successfully in a wide range of tasks, including the recognition of facial expressions, a difficult problem for machine learning since people can show their feelings in very different ways. Deep Neural Networks (DNNs) have been used to classify images of the human face into emotion categories in an end-to-end approach and to overcome the difficulties of traditional methods, reaching a recognition accuracy higher than human performance in some tasks.
Automatic deep FER includes three different steps: pre-processing, deep feature learning, and deep feature classification [
13]. Pre-processing is necessary before training the neural network to learn essential features for recognition; some examples are image cropping, rotation correction, data augmentation, and spatial normalization. Once this preliminary phase is complete, one of the deep learning techniques, typically a CNN or an RNN, is applied for FER, and the given face is classified into one of the basic emotion categories. Using deep networks, feature extraction and classification are performed in an end-to-end way [
93], while in traditional methods these last two phases are independent. Alternatively, the neural network can be used for feature extraction only, and then independent classifiers, such as a support vector machine (SVM), can be applied to the extracted representations.
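For the pre-processing step mentioned above, a minimal hedged sketch using torchvision is given below; the crop size, rotation range, and normalization statistics are illustrative assumptions rather than values prescribed by the surveyed works.

```python
# Example pre-processing/augmentation pipeline for face images before CNN training.
from torchvision import transforms

train_preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),                  # crop the face region to a fixed size
    transforms.RandomRotation(10),               # small rotation correction / augmentation
    transforms.RandomHorizontalFlip(),           # data augmentation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5],   # spatial/intensity normalization
                         std=[0.5, 0.5, 0.5]),
])
```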
3.3.1. Convolutional Neural Networks (CNN)
Most deep learning methods use neural network architectures, and one of the best known for recognizing objects or faces is the Convolutional Neural Network (CNN). The three main factors that have contributed to the use of CNNs for deep learning are the elimination of manual feature extraction, the cutting-edge recognition results, and the possibility of retraining existing networks for other recognition tasks [
1].
A CNN is composed of an input layer, an output layer and up to 150 hidden layers in the middle (for this reason, it is called “deep”) [
94], while traditional neural networks have only 2–3. These layers are intended to find and learn features, repeating convolution, activation or Rectified Linear Unit (ReLU), and pooling operations; each hidden layer increases the complexity of the features of the image [
95]. Below is a summary of the three main steps:
- -
Convolution is a mathematical operation that recalls the functioning of a scanner and consists of applying a smaller matrix, called a kernel, to the input matrices (the images). Each convolutional filter is translated across the input with a specific stride and activates certain features of the picture.
- -
After a convolutional layer, in most ConvNets, there is an activation function. The main purpose of this layer is to introduce a non-linearity into the system, using non-linear functions such as tanh, sigmoid, and ReLU. The Rectified Linear Unit returns 0 if it receives any negative input, while for any positive value x it returns that value back, so it can be written as f(x) = max(0, x); only activated features are transferred to the next layer.
- -
The pooling performs a non-linear subsampling, reducing the size of the output matrices and the number of parameters that must be learned by the network. The most common operations are MaxPooling and AveragePooling.
After the feature learning phase, the classification phase begins. To provide the classification output, a fully connected layer and a classification layer, such as SoftMax, are used. The penultimate layer generates a vector whose dimension equals the number of classes, computing the class scores over the entire original image. The last layer of the CNN architecture converts these scores into the probabilities assigned to each class in a multi-class problem.
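A minimal PyTorch sketch of such an architecture is shown below, mirroring the three steps above (convolution, ReLU, pooling) followed by a fully connected classification layer; the input size, channel counts, and number of classes are illustrative assumptions, and softmax probabilities are obtained implicitly when training with cross-entropy loss.

```python
# Minimal CNN classifier sketch for FER on 64x64 single-channel face images.
import torch.nn as nn

class SimpleFERNet(nn.Module):
    def __init__(self, num_classes=6, in_channels=1):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                      # 64x64 -> 32x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(64 * 16 * 16, num_classes)

    def forward(self, x):                         # x: (batch, in_channels, 64, 64)
        x = self.features(x)
        return self.classifier(x.flatten(1))      # class scores; softmax gives probabilities
```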
3.3.2. RNN-LSTM
A Recurrent Neural Network (RNN) is a type of advanced artificial neural network commonly used in speech recognition. While in a traditional neural network all inputs and outputs are independent of each other, RNNs use sequential information. This class of neural networks shows temporal dynamic behavior and uses its internal memory to calculate output depending on the earlier computations.
A variation of the recurrent net, the so-called Long Short-Term Memory (LSTM), was proposed in the mid-90s by the German researchers Hochreiter and Schmidhuber [
96]. These networks use three gates to regulate and control the cell state and thus overcome the vanishing and exploding gradient problems that are common in training RNNs. LSTMs can retain memory for a more extended period than plain RNNs, enabling them to model long-term dependencies in a sequence; common areas of application include speech recognition, language modeling, and video analysis (video-based expression recognition tasks).
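For video-based expression recognition, a common hybrid design feeds per-frame CNN features to an LSTM; the hedged PyTorch sketch below illustrates the idea, with layer sizes and input shapes chosen for illustration rather than taken from any specific surveyed work.

```python
# Hedged CNN-LSTM sketch: per-frame CNN features -> LSTM -> expression class.
import torch
import torch.nn as nn

class CnnLstmFER(nn.Module):
    def __init__(self, num_classes=6, feat_dim=128, hidden_dim=64):
        super().__init__()
        self.cnn = nn.Sequential(                               # per-frame feature extractor
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(32 * 4 * 4, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips):                                   # clips: (batch, frames, 1, H, W)
        b, t = clips.shape[:2]
        frame_feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(frame_feats)
        return self.fc(h_n[-1])                                 # classify the last hidden state
```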
3.3.3. State of the Art
In 1872 Darwin published the book “
The Expression of the Emotions in Man and Animals” [
97], which first gave rise to the recognition and study of emotions. Subsequently, the researchers Izard and Ekman, inspired by Darwin, conducted important studies on facial expressions, leading to significant advancements. The first study presenting an algorithm developed to perform Facial Emotion Recognition was conducted by Bartlett et al. in 2003. The paper [
98] aimed to develop a method capable of automatically detecting frontal faces in a video stream and classifying facial expressions as anger, disgust, fear, happiness, sadness, surprise, or neutral. The real-time face-detection system, a development of Viola-Jones’ work [
99], was tested on the Cohn-Kanade dataset of posed facial expressions. Later, several types of conventional approaches for automatic FER were developed. These algorithms are able to accomplish the task, detecting the face region and extracting geometric features (such as facial landmarks, face shape and its components), appearance features (such as pixel intensities and texture of the face), or a hybrid of geometric and appearance features [
1,
100]. The works presented in the literature can be divided according to the feature representations (static images or dynamic image sequences), the deep-learning-based algorithm (CNN or hybrid CNN-RNN), and the type of data (2D, 3D or 2D+3D). In this survey, the works of the last decades are presented, subdividing them into two-dimensional, three-dimensional, or multi-modal methods.
a. 2D FER
Matsugu et al. [
101] developed the first facial expression recognition model with the property of subject independence as well as translation, rotation, and scale invariance. They employed a CNN, more efficient and compact than Fasel’s model [
102] presented the year before, to “detect smiling or laughing faces based on differences in local features between a normal face and those not.” In his paper, Fasel proposed two independent CNNs, one for facial expressions and the other for face identity recognition, combined by a Multilayer Perceptron (MLP). FER systems are traditionally evaluated in a subject-independent way or with a cross-database approach, rarely in a subject-dependent way. This last technique, in which each classification model is built from a single person’s data, is used in limited cases [
103] where greater importance is given to recognition accuracy than generalization. The subject-independent manner, on the other hand, trains the classifier on a subset of images and evaluates it on the remaining part of the same database, while the cross-database method trains and evaluates the classifier on all of the pictures in different databases. Cross-database tasks are more complicated to satisfy as each database presents different settings of illumination, pose, resolution, etc. Deep neural networks are able to recognize subtle features, and for this reason, perform well in flexible learning tasks reaching high levels of accuracy in FER problems [
104].
In 2014, one of the largest and most challenging computer vision competitions, the ImageNet 2014 challenge, took place. GoogLeNet, a new CNN architecture inspired by LeNet [
105] and implemented by Google, won the classification and object recognition challenges, achieving a top-5 error rate of 6.67%. In the following years, this same network, trained initially to distinguish between 1000 objects, was re-trained for different objectives, including facial recognition [
106]. The performance of 2D facial expression recognition is degraded by pose variation and illumination problems, not present with a three-dimensional approach.
b. 3D FER
The technological progress in recent years has enabled us to achieve better performance in various areas, including facial expression recognition. The acquisition of high-quality 3D facial scans overcomes the problems of illumination and pose changes and contains more information about the movements of facial muscles induced by expressions.
In 2006 the first database including three-dimensional face models, the BU-3DFE, was made public, and many studies were performed. In addition to traditional approaches, categorized into model-based ones as well as feature-based ones, some researchers used deep networks in tasks of computer vision, often reaching higher accuracies.
The multimodal 2D+3D is a common approach for facial expression recognition, widely used in literature. Savran et al. [
33] showed that in general 3D data perform better than 2D data, especially for lower face AUs, but with the fusion of two modalities higher detection rates are achieved (97.1 %). Depth maps (3D meshes) are fused with other 2D maps, such as texture maps [
2,
51,
107], curvature maps [
2,
108] and normal maps [
2,
107].
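A hedged sketch of how such attribute maps can be fed to a single CNN is shown below: depth, texture, and normal maps of the same face are stacked as input channels; `depth_map`, `texture_map`, and `normal_map` are assumed to be pre-computed and registered to the same pixel grid, an assumption made for illustration rather than a detail of the cited works.

```python
# Stack registered 2D attribute maps of one face as channels of a single CNN input.
import numpy as np
import torch

maps = np.concatenate(
    [depth_map[..., None], texture_map, normal_map],  # (H, W, 1) + (H, W, 3) + (H, W, 3)
    axis=-1,
)
x = torch.from_numpy(maps).float().permute(2, 0, 1).unsqueeze(0)  # (1, 7, H, W)
# x can now be passed to a CNN whose first convolution expects 7 input channels.
```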
Li et al. conducted two studies with Deep Learning-based methods. In the first [
109], they generated the deep representation of the 3D face by putting the geometry map, normal maps, normalized curvature map, and texture map into a pre-trained CNN. Facial geometric attributes are then classified using SVM, achieving the facial expression prediction. Two years later, in their second work [
2], they built a new deep CNN model for subject-independent multimodal 2D+3D facial expression recognition and trained it on six types of facial attribute maps. The single end-to-end training framework increases the accuracy and achieves the best results. “This is the first work of introducing deep CNN to 3D FER and deep learning-based feature level fusion for multimodal 2D+3D FER” [
2].
Oyedotun et al. [
110] proposed a CNN model able to learn more discriminative features from both RGB and depth map latent representation for joint learning of facial expressions. Yang and Yin [
108] presented a 3D facial expression recognition algorithm using CNNs and landmark clues, with the sole use of 3D geometrical facial models.
Chen et al. [
111], in their paper, directly used 3D facial point clouds for FER based on a fast and light manifold CNN model, FLM-CNN. Compared to existing methods, the proposed one is better in speed and feature extraction, achieving state-of-the-art performance on BU-3DFE and showing a high tolerance to pose changes.
Some of the last year’s works are [
112,
113,
114,
115,
116,
117]. Others are those of Jan et al. [
51] and Zhu et al. [
118]. Jan et al. [
51] designed a novel system for 3D FER based on accurate facial parts extraction according to localized facial landmarks, and deep feature fusion of facial parts. With the use of facial parts, they achieved better performance than using the entire face. Finally, a multi-class SVM classifier is adopted for facial expression prediction. Zhu et al. [
118] introduced a novel deep learning approach to 3D FER, namely the Discriminative Attention-based Convolution Neural Network (DA-CNN), to capture more comprehensive expression-related representations. They conducted several experiments to prove the effectiveness of their method, and state-of-the-art results are achieved for both 3D FER and multi-modal 3D+2D FER.
Lately, some researchers have started using 4D (3D dynamic) data [
3,
119,
120]. Sandbach et al. [
9,
46] proposed a method that exploits 3D motion-based features for dynamic FER in the BU-4DFE database, performing a six-way classification over the six basic emotions. In the second paper, they surveyed the progress in 3D and 4D face acquisition and presented the available databases for static and dynamic 3D FER.
4. Facial Animation
Facial animation is an area of computer graphics that consists of methods and techniques for generating and animating models of a human, an animal, or a fantasy character face. Parke made the first efforts to represent and animate three-dimensional faces using computers in 1972 [
121]. Computer-based facial expression modeling and character animation are not a new endeavor, but there has been considerable growth of interest in recent years.
Following the success of the FACS and Action Units for describing the movements of facial muscles, developed by Ekman and Friesen in 1978 [
20], Platt [
122] and Brennan [
123] in the early 1980s produced, respectively, the first physically based muscle-controlled face model and techniques for facial caricatures. Their studies gave birth to the first animated human character able to express emotion through facial expressions and body movements.
Different techniques exist for the generation of facial animation data: marker-based motion capture [
124,
125], markerless motion capture, audio-driven, and keyframe animation. Marker-based techniques are widely used for real-time facial animation thanks to their robustness but are not useful for retrieving fine-scale dynamics and require specialized sensors. To simplify the motion capture process, techniques that do not require markers or specialized tracking hardware emerged, leveraging depth sensors and structured-light-based devices [
126]. The researchers demonstrated the ability to track detailed facial expressions from a 3D sensor in real-time, but the system required an extensive set of pre-processed facial expressions and consequently a lengthy training session. A year later, Li et al. [
127] used the same system replacing the Principal Component Analysis (PCA) model with an optimized rig, in order to reduce training poses and enable retargeting.
Real-time 3D low-cost sensors such as Microsoft’s Kinect favored the development of new methodologies that simplify the procedure [
128,
129]. In [
128], a user-specific dynamic expression model is created in an offline preprocessing step. The novel face tracking algorithm combines 2D color image and 3D depth map, simultaneously captured by Kinect, in a systematic way with user-specific blendshapes. This proposed method achieves high-quality performance-driven facial animation in real-time with a low-cost and markerless acquisition system, and more robust and accurate results than previous video-based methods. Li et al. [
129] proposed a real-time facial animation system with adaptive tracking without any training or expression calibrations, achieving superior tracking fidelity than existing state-of-the-art techniques.
Other works in facial animation are [
130,
131,
132]. Mousas and Anagnostopoulos [
130] presented in their paper a novel mesh deformation method to automatically transfer facial blendshapes from a reference to a source face model. Parameters such as elasticity and mesh curvature descriptors were not considered, but the presented method achieves a lower error rate than previous methodologies. Wei and Deng [
131] studied real-time speech animation based on live speech input, synthesizing facial animation while maintaining its realism. Ouzounis et al. [
132] presented a methodology that provides the ability to efficiently transfer facial animations to face models with different morphological variations.
Recently, Ma and Deng [
133,
134] presented a “complete pipeline to photo-realistically transform the facial expression for monocular video in real-time” [
133] and a “real-time, automatic, geometry-based method for capturing fine-scale facial performance from monocular RGB video” [
134]. Facial animation applications include the communication, education, and scientific simulation areas, even if the primary use remains animated films and computer games.
5. Databases
In the literature, there are numerous databases related to FER for comparative and extensive experiments; some of them are used in particular for conventional FER approaches with decision methods, others for FER systems based on deep learning with recognition algorithms. Traditionally, human facial emotions have been studied using 2D data, static or dynamic, with many difficulties attached. A 3D-based analysis of facial emotions will facilitate handling significant pose variations and subtle facial behaviors but has specific problems such as a high computational cost. The databases mainly used in this research field are described below and summarized in
Table 4.
To date, various databases for 3D facial expression recognition exist and have been employed by the research community for the evaluation of the developed algorithms. Of these, only three publicly available databases have been designed specifically for emotion analysis, containing datasets displaying the six basic emotions or different AUs of the FACS: BU-3DFE, BU-4DFE, and Bosphorus. The databases listed above are not the only ones to contain facial expressions; other public databases, for example FRGC v2.0 and GavabDB, present a set of expression variations, but incomplete or irregularly distributed. The main databases with 3D images and video sequences used for the study of facial expressions are described below.
The most common for 3D FER systems is the Binghamton University 3D Facial Expression (BU-3DFE) database [
138], the first one to become publicly available. It consists of raw 3D facial scans of 100 subjects, 56 females and 44 males, with different ethnic and racial ancestries and ranging in age from 18 to 70 years old. Each person performed the expressions corresponding to the six basic emotions, acquired at four levels of intensity (low, middle, high, and higher) using 3D sensors, along with a neutral expression. In total, therefore, the database contains 2500 3D facial expression models, 25 for each subject, plus the associated texture images captured at two views (about +45° and −45°). The manually annotated dense landmark set is provided with the release, contributing to its diffusion and use.
The BU-4DFE [
140] is a 3D dynamic facial expression database, built to analyze facial behavior in a dynamic 3D space. There are 58 female and 43 male subjects with a variety of ethnicities and an age range of 18–45, each of whom gradually performed the six universal emotions, starting and finishing with the neutral expression. The sequences last approximately four seconds and are captured at a video rate of 25 frames per second; the database contains 606 3D facial expression sequences, giving a total of over 60,000 frame models.
The Bosphorus database [
139] was designed for research on 2D and 3D FER tasks. There are 105 subjects, one-third of whom are professional actors and actresses, for a total of 4666 facial data acquired using a structured-light based 3D system. The Bosphorus is the only publicly available database to date that contains 3D face scans for AUs, including intensity and asymmetry codes for each AU. Moreover, there are scans in various poses, expressions and realistic occlusion conditions, such as glasses or hand around the mouth, or with mustache and beard. In this case, as in the previous ones, the data was collected in a controlled environment in which the subjects were instructed to perform specific emotions; therefore, the BU-3DFE, the BU-4DFE, and the Bosphorus are all non-spontaneous databases.
The Binghamton-Pittsburgh 3D Dynamic Spontaneous (BP4D), developed by Zhang et al. [
41,
42], is currently the only database that considers 3D dynamic spontaneous facial expressions [
147]. The data was collected in the course of social interactions between the participants and an interviewer, with the use of 8 specific tasks that elicit different emotional expressions: happiness or amusement, sadness, surprise or startle, embarrassment, fear or nervous, physical pain, anger or upset, and disgust. Hence, the BP4D is suitable to design and test methods dealing with real-world scenarios.
Some works use the EURECOM Kinect Face Dataset [
145] or other private RGB-D databases, explored extensively for various applications. The EURECOM Kinect Face Dataset consists of face images of 52 people captured in nine states, including various facial expressions, with a Kinect RGB-D camera, a sensor characterized by low data acquisition time and low-quality depth information. The data are acquired in different modalities (2-D, 2.5-D, 3-D), and all the images are provided in three sources of information: the RGB color image, the depth map, and the 3D data.
6. Landmarks
Facial landmarks are key points that are used to extract facial information, such as identity, expression, and emotion, in computer vision tasks. They are shared by all human faces and contain a geometric and biometric meaning.
Landmarks were initially introduced by Farkas [
148] and extensively applied to different disciplines involving the human face [
149]. Extensive handbooks and reference works have been written within the context of the anthropometry discipline, such as “
Anthropometry of the head and face” [
150], and “
Three-Dimensional Cephalometry: A Color Atlas and Manual” [
151], which report the exact medical meaning of each landmark. Up to 59 points can be identified on the human face, but only 20 of them are the most famous and widely used.
The localization of the landmark points is an essential step in many facial expression recognition [
152] and head pose estimation [
153] algorithms. Most of the studies in the literature used manual landmarks, usually exploiting the pool of landmark coordinates stored in the publicly available databases. The Bosphorus database, for example, provides 2D and 3D coordinates of 24 labeled facial landmarks, which are: outer, middle, and inner left/right eyebrow, outer and inner left/right eye corner, nose saddle left/right, left/right nose peak, nose tip, left/right mouth corner, upper and lower lip outer/inner middle, chin middle, and left/right ear lobe. More recently, some researchers developed algorithms to automatically detect the locations of the landmark points of the human face in images or videos without any assistance from the user [
48,
51,
55,
56,
68,
70].
However, facial key point detection is challenging due to significant variations in facial appearance under different facial expressions, head poses, environment and lighting conditions, and motion patterns, in particular for video sequences. Furthermore, facial occlusions can cause loss of landmark information. The mouth region, like the lips, mainly presents this difficulty because it has high degrees of freedom in its movements. In more rigid parts, like the eye corners, facial landmarks are more easily identified automatically. For these reasons, Zeng et al. in [
55] proposed a general and fully automatic framework for 3D FER that only needs three main facial landmarks: the nose tip and the left and right inner eye corners. Similarly, Xue et al. [
78] presented in 2014 a fully automatic method for 3D FER, including the detection of landmarks. The algorithm, based on depth features, detects the four eye corners and the nose tip in real-time and then, from these five points, defines another 25 heuristic points for facial expression analysis.
In order to support the researchers’ choices to use a set of landmarks instead of another, statistical analysis or assumptions are often used. For example, Derkach and Sukno [
84] decided to discard 15 of the 83 manually labeled landmarks in the BU-3DFE database, corresponding to the silhouette contour, because they have arguably little validity in a 3D setting. Hence, only a subset of 68 landmarks lying within the face area was considered.
The motivation to use a small number of facial landmarks is a significant reduction in recognition time, a critical issue in real-time applications, where the expression must be determined almost immediately. Instead, more features will result in added tracking time and greater complexity in the classifier.
7. Role of Time
Facial expressions play an important role in the recognition of human emotions. At first, facial expressions analysis interested only psychologists, but later it had a wide diffusion in the field of scientific research.
In the past, many techniques, including neural networks, have been applied to recognize facial expressions in still images. The strategies developed had some limitations; one of these is that only an instant of the expression can be viewed, and still images usually capture the moment at which it is most marked, without considering that in daily life people show the apex of their facial emotions only in particular cases and for very brief periods of time. The subtle facial expressions that are used for most communication activities are not identifiable in still images but become visible in video sequences. For this reason, dynamic facial expressions have become increasingly important in recent years. Temporal dynamics of facial expressions provide additional relevant information that is not available in static 2D or 3D images. Indeed, an emotion lasts from 250 milliseconds to 5 seconds [
154]; a dynamic method may be useful for evaluating the intensity level of muscle activities and for classifying emotions.
Referring to the work of Schmidt and Cohn [
155], a majority of spontaneous smiles reach onset faster and show more action units than the posed smiles. In contrast to the 18 types of smile described by Ekman, during spontaneous smiles “the appearance of AU 12 was either simultaneous with or closely followed by one or more associated action units”, such as AU 6 (cheek raiser), AU 15 (lip corner depressor) or AU 17 (chin raiser). The dynamic properties of spontaneous human facial expressions have significant implications for human-computer interaction to describe naturalistic interactive behavior.
Most of the works that choose video sequences as input use 2D FER algorithms; only a few attempts have been made to analyze facial expressions using 4D data (i.e., 3D videos), partly due to the lack of public databases with three-dimensional dynamic sequences. Nowadays, thanks to the incredible development of capturing, reconstruction, alignment, and tracking techniques, dynamic 3D recordings are increasingly used in expression analysis research [
9].
The public availability of the BU-4DFE database contributed to this innovation process. This database includes 3D sequences of the six universal emotions, but unfortunately, it does not provide AU annotations.
9. Discussion
9.1. Computational Problems
3D faces offer more granular cues but also impose a higher dimensionality than 2D faces. Indeed, multiple scans from slightly different viewpoints are typically necessary to convert the raw data into a clean data set and reconstruct the 3D geometrical model. This problem becomes even more significant when the resolution and frame rates increase, raising the amount of 3D data acquired and, consequently, the storage and computational costs. Moreover, after the data pre-processing step, post-processing, which includes registration and merging of the scans, hole filling, smoothing, and regularization, is needed, as well as a point detection step for head pose estimation and feature extraction.
For the reasons mentioned above and to reduce the dimensionality problem, 3D recognition tasks are often performed by mapping the 3D facial surface onto the 2D plane. In other words, a depth image is computed from the 3D mesh or the point cloud, and the value of the z-coordinate of every point in 3D space is assigned to the corresponding pixel in the 2D plane.
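A minimal hedged sketch of this 3D-to-2D mapping is given below: a face point cloud is projected onto a regular pixel grid and, for each cell, the z value of the point closest to the camera is kept; the grid resolution and the assumption that the camera looks along the z-axis are illustrative choices.

```python
# Project a face point cloud (N, 3) to a depth map by keeping one z value per pixel.
import numpy as np

def point_cloud_to_depth_map(points, resolution=128):
    xy = points[:, :2]
    xy_min, xy_max = xy.min(axis=0), xy.max(axis=0)
    cells = ((xy - xy_min) / (xy_max - xy_min + 1e-8) * (resolution - 1)).astype(int)
    depth = np.full((resolution, resolution), np.nan)           # NaN marks empty pixels
    for (col, row), z in zip(cells, points[:, 2]):
        if np.isnan(depth[row, col]) or z > depth[row, col]:    # keep the point nearest the camera
            depth[row, col] = z
    return depth
```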
In feature-based algorithms, the feature selection step is also crucial for the performance and accuracy of the overall recognition method. It aims to select the most relevant feature set from all the prospective features and to eliminate the non-relevant ones, consequently reducing the dimensionality. Hence, it needs to be well planned to deal with the high dimensionality problem inherent in the use of 3D face images.
Some research groups attempted to address this problem by proposing different strategies. For example, Azazi et al. [
48] tackle the significant dimensionality problem through a twofold solution. First, the algorithm transforms the 3D faces into the 2D plane using conformal mapping. Then, a Differential Evolution (DE) based optimization algorithm is undertaken to select the optimal facial feature set and the classifier parameters simultaneously. Only the features with the maximum discriminative power are chosen.
These steps may lead to an increase in time and computational complexity [
7]. Hence, when a real-time response is desired, the 3D solution is not always the right choice, even though such strategies to reduce the response time of the algorithm have been developed.
9.2. Illumination and Limitation of Acquisition Technology
Changes in lighting due to skin reflectance properties and to internal camera control can significantly affect the performance of 2D FER systems. Illumination alterations cause trouble for AU and facial expression recognition algorithms by producing shadows and significant 2D texture variations [
157]. To overcome these difficulties due to changes in the illumination conditions, image representations that are not very sensitive to these variations exist. Some examples are edge maps, image intensity derivatives, and images convolved with 2D Gabor-like filters, but Adini and Moses’ results [
158,
159] showed that none of these representations is sufficient by itself. In general, it remains an open question whether edge maps and the other representations provide an illumination-insensitive image for face recognition.
While all or most 2D methods present illumination problems, 3D systems are naturally robust to light variations. However, the technologies available for 3D data acquisition may be significantly affected by differences in illumination conditions during the acquisition phase [
160]. For example, artifacts, like holes or spikes, occur in oily facial regions or the proximity of eyebrows, mustache, or beard, even under ideal illumination conditions.
This problem is even more significant when three-dimensional dynamic sequences are acquired. In these cases, the illumination provided by 3D sensors (i.e., flash) is not enough to overpower the ambient light, and conspicuous lighting equipment is necessary to obtain constant illumination. Hence, developing a system robust to light variations for facial expression recognition remains an important goal. For this reason, in 2011, Stratou et al. [
161] introduced a novel 3D dataset, called the Relightable Facial Expression 3D database (ICT-3DRFE), which enables experimentation, providing photometric information for studying the effect of illumination on FER. The authors aimed to evaluate what kind of lighting can improve an algorithm’s performance.
Despite recent advances in 3D technologies and devices, the acquisition of facial data can be accomplished only in controlled environments, representing a common drawback of 3D recognition systems. These solutions also require that the person stay still in front of the 3D scanning device for a time ranging from a few seconds up to a few minutes. In addition, multiple scans from slightly different acquisition viewpoints are typically necessary to reconstruct parts of the face that can be affected by self-occlusion from a particular view.
9.3. Neutral Scan
The neutral scan, that is, the subject’s frontal face scan with a neutral expression, is used to normalize the facial features. Based on the assumption of the availability of the subject’s neutral scan during testing, it is possible to consider two different scenarios: person-dependent and person-independent. In a person-dependent approach, the feature distances of the neutral scan of a subject are subtracted from the features of his/her expressive scans.
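This person-dependent normalization can be written as a one-line operation; in the hedged sketch below, `expressive_features` (one row per expressive scan) and `neutral_features` (the same descriptors computed on the neutral scan) are assumed to come from the same feature extractor.

```python
# Person-dependent normalization: express features relative to the subject's neutral scan.
import numpy as np

def normalize_by_neutral(expressive_features, neutral_features):
    """Subtract the neutral-scan feature vector from each expressive-scan feature vector."""
    return np.asarray(expressive_features) - np.asarray(neutral_features)[None, :]
```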
The neutral scan of the input subject can be taken for granted in laboratory applications, but not in real-life ones. During case-studies, people gradually perform the six basic emotions from (and then go back to) the neutral in approximately four seconds. On the contrary, during real applications, the capture time is minimal, and facial expressions change very rapidly from one to another without necessarily passing through the neutral one, making it challenging to acquire it. Berretti et al. [
67] in 2010 proposed an approach capable of achieving state of the art results, without using neutral scans. The same strategy was used by Fang et al. [
8] in 2012. Three years later, Azazi et al. [
48] proposed a person-independent approach. Their algorithm can recognize the facial expression in any arbitrary 3D textured face image, without any prior information about the neutral face of the person. The strength of this choice lies in the fact that no neutral face is needed, which is an advantage for the real-time operation required by real-life applications.
9.4. 2D VS 3D: Pros and Cons
The use of 3D facial data overcomes some problems encountered on solutions based on 2D models, but, at the same time, it has some drawbacks that make three-dimensional techniques convenient only for a set of specific applications.
Table 7 lists the main advantages and disadvantages of 2D and 3D methods, next highlighting the common challenges.
The main common challenges that both 2D and 3D methods still have to overcome are listed below:
- -
Current publicly available databases include posed facial expressions; lack of spontaneous datasets
- -
Many features or facial points are marked by hand
- -
Algorithms usually depend on an excessive number of feature points
- -
Few works are fully automatic; many of the systems still require manual intervention
- -
Face images collected in commonly used facial expression databases ignore the effect of time and age
- -
Due to the large-scale dissimilarity in the nature of the facial data, systems can accurately recognize or classify expressions only for the human faces on which they are trained
- -
The detection of subtle AUs or of combinations of several co-occurring AUs is still challenging
- -
Future facial expression recognition methodologies need to handle more expressions, including arbitrary and spontaneous ones, as well as micro-expressions
- -
All the expressions have to be recognized with equal accuracy
9.5. Next Steps
Considering the different approaches analyzed in the paper, the highest recognition rates were obtained by working on three-dimensional data or with multi-modal algorithms using 2D and 3D data. These results confirm the advantages of using 3D images or videos compared to the most common 2D methods. Based on the amount of data available, it is possible to decide whether to apply a deep learning technique to FER. Neural networks need a large amount of labeled data and considerable processing power, but they allow feature extraction and classification to be performed in an end-to-end way, reaching a state-of-the-art level of recognition accuracy. Alternatively, a conventional approach is recommended; the most widespread is the feature-based algorithm, whereas fewer works focus on model-based or multimodal-based approaches. It is used both for automatic AU recognition and for the basic emotions, obtaining better performances for the latter. The AUs are independent of interpretation and therefore more suitable for describing spontaneous facial behaviors. For this reason, thanks also to technological development and the birth of the first databases that consider 3D dynamic and spontaneous facial expressions, in future works we can expect a more in-depth study of action units.
Technological advances have allowed the development of research in 3D facial expression analysis. A large number of works still use databases for static 3D FER, most with posed and exaggerated expressions of the six basic emotions, in contrast to real life. The analysis of facial expression dynamics (4D FER) is expected to grow in the near future, along with the other significant challenge of recording spontaneous behavior, “captured in a range of contexts having both high spatial and temporal resolution in real-time” [
9]. The next step will be to get closer to the real world. Nowadays, with the acquisition and processing tools available, accessible to all at affordable prices, it is easier to work with three-dimensional images and videos. The transition from 2D to 3D, despite some drawbacks mainly related to dimensional and computational costs, is almost complete. The next step will see the use of 4D with the introduction of the variable
time, with the need to speed up the recognition algorithms and the creation of new databases.
Once the initial problems related to the amount of data needed and the required computing power were overcome, combinations of deep-learning algorithms improved the performance of FER. In the future, it will be possible to integrate deep learning or other advanced algorithms with Internet-of-Things sensors, improving the current recognition rates and including spontaneous micro-expressions.
10. Conclusions
This survey explores, groups, and organizes the works and the various methods that have addressed the problem of facial expression recognition through the analysis of human emotions, focusing on 3D solutions. Traditional and deep-learning approaches for facial expression analysis are detailed, with a further distinction by data modality (2D, 3D, and multi-modal 2D+3D), expression granularity (prototypical facial expressions and facial action units), and temporal dynamics (still images and image sequences). With this study, we want to provide a guideline for newcomers who will address this topic and to take stock of neural networks, taking advantage of the golden age of AI. The most important works of recent years have been presented, highlighting the pros and cons and the best outcomes in the entire facial expression recognition field.
The overall recognition accuracy of the expressions falls in a range between 60% and 90%. Some expressions, like anger and fear, generally have the lowest recognition rates. Indeed, the motions of these expressions are moderate compared to happiness or surprise, and thus more challenging to recognize. Regarding the action units, the experiments reached recognition rates in a broader range, between 50% and 95%. The number of features for each AU to be detected will be increased to achieve more accurate results, and neural networks will gradually be used more and more in the field of facial expression recognition.
Despite the higher dimensional and computational costs, and the greater difficulty of working in real-time, 3D methods have achieved better recognition rates than the more common 2D methods. For this reason, it is essential to use a dataset that contains 3D facial models or 3D video sequences, such as BU-3DFE, Bosphorus, BU-4DFE, D3DFACS, UPM-3DFE, BP4D, or a private one. Predicting the expression of the human face in real-time requires recognition that is as accurate and as fast as possible, but this becomes quite complicated compared to static images, because a video is a collection of many frames rather than a single frame. Soon, the recognition of emotions in real-time will be beneficial in the field of artificial intelligence research, with the need to recognize the emotions of multiple people in a single frame and to detect mixed emotions. Many researchers developed algorithms that recognize the six basic expressions, but fewer contributions investigated other types of facial expressions or action units. For applications able to work in uncontrolled conditions, further improvements in FER analysis must be made, enlarging the set of facial expressions and considering spontaneous emotions. Reducing the computational time and memory usage are the other two objectives. The primary motivation behind this is the urgent need for robust and useful applications that can be used in the real world.
In our future work, we plan to include deep learning techniques, working on a private database containing three-dimensional videos and psychological validation of labeled emotions, to perform emotion recognition in the wild.