1. Introduction
As an essential element of human communication, facial expressions reveal emotions and thereby help convey people's intentions. For example, people generally infer emotional states such as happiness, sadness, fear, and anger from facial expressions and tone of voice. Therefore, facial emotion recognition (FER) [1,2,3,4] has been studied in computer vision and machine learning for the past several decades. Indeed, FER technology is rapidly emerging in the field of emotional Information and Communication Technology (ICT), including virtual reality, augmented reality, Advanced Driver Assistance Systems (ADAS), and human-computer interaction.
Facial expressions carry a great deal of emotional information, and emotions play several roles in human life, shaping people's psychological state, behavior, and reactions. Human judgment and behavior also depend on the emotions felt, so emotions are used as indicators to infer psychological states. Among the various emotions, negative emotions (sadness, anger, disgust, surprise, fear), particularly fear, cause stress and reduce concentration [5]. Therefore, accurately recognizing negative emotions can help identify the causes of stress. Furthermore, since negative emotions are risk factors that can adversely affect health, recognizing and categorizing them is essential for maintaining good health.
In general, humans feel fear when faced with dangerous situations or when threatened. Facial expressions in which the eyebrows are raised, the lower eyelids are tensed, and the lips are stretched horizontally backward are classified as fear. However, among the negative emotions, recognizing fear is a complicated process that is prone to errors [6]. According to John Cacioppo, a professor of psychology at the University of Chicago, negative emotions are directly linked to human survival instincts, and they are relatively easy to detect because they are strongly expressed. However, since fear underlies the other negative emotions, it is easily confused with them [7].
Most of the existing approaches [6,7,8,9,10,11,12,13,14] for recognizing emotions by classifying facial expressions have applied convolutional neural network (CNN) models to visible light images (hereafter referred to simply as visible images). Applying deep learning techniques such as CNNs has significantly improved emotion recognition compared with earlier facial expression classification studies. However, the recognition performance for fear has remained low. Facial expression classification based on visible light imaging in an uncontrolled environment (i.e., where lighting and background are not constant) shows low accuracy [15]. In contrast, thermal imaging is less affected by lighting conditions and can be used even in completely dark environments. Since thermal imaging captures temperature changes in the face region that are influenced by human emotions, it has potential for emotion recognition through facial expressions and is even considered an alternative means of compensating for the shortcomings of visible light imaging [16,17].
Therefore, this study attempts to take advantage of thermal imaging to supplement the emotion recognition performance of visible images, which show low recognition rates for fear. To do this, the face region extracted from each visible image was used to train a CNN model, while the face region extracted from the corresponding thermal image was used to train a residual neural network (ResNet); each network was chosen because it performed relatively well on its respective image type. After training each network on its own database of facial expressions, we confirmed that substituting the corresponding thermal images for visible images of fear improved overall performance.
The remainder of this study is organized as follows. Section 2 discusses existing research on emotion recognition using visible light and thermal images. Section 3 presents the proposed method for improving fear recognition performance. Section 4 details the construction process and characteristics of the database (DB) used in this study and analyzes the results of experiments with the proposed method. The conclusions drawn from the study are stated in Section 5.
2. Related Works
Much research on general facial expression classification (or FER) has been conducted based on visible images [18,19,20,21]. Facial expression classification technology based on visible light imaging, which acquires an object's image by measuring the light reflected from the object, is sensitive to changes in lighting. Furthermore, it is difficult to distinguish between real and fake emotions in images obtained from people who are good at disguising their emotions [4]. Nguyen, D.H. et al. [6] proposed a method of extracting facial features with an image classifier to obtain essential information about emotions. Treating the extracted facial features as temporal data, they assigned them to one of seven basic emotion types. Pitaloka, D.A. et al. [9] proposed a method to increase the classification performance for six facial expressions by applying various preprocessing steps such as face detection and cropping, resizing, data normalization, and histogram equalization.
Jung et al. [19] used two different types of CNNs: the first extracted facial features along the time axis from an image sequence, and the second extracted the geometric features of facial movements over time by receiving facial landmarks as input. They then proposed a method of integrating the two models to improve facial expression classification performance. Ahmed Fnaiech et al. [20] proposed a method to increase the fear recognition rate by projecting visible images from 3D to 2D and using angle deviation. However, the performance comparison was limited because the experiment classified emotions into only two categories: fear and other negative emotions. Samadiani, N. et al. [21] performed an emotion recognition experiment using the Acted Facial Expressions in the Wild (AFEW) dataset, obtained from real environments. With this data, the recognition performance for negative emotions was noticeably low, and although multi-modality was applied to improve recognition performance, the fear recognition performance remained low.
Figure 1 shows several emotion recognition results from the literature [6,10,11] in which the performance for fear was lower. The fear recognition rate was significantly lower than that of the other emotions, and it was relatively low even compared with the other negative emotions.
Unlike approaches that classify facial expressions using visible images, thermal imaging, which expresses an object as temperature according to the intensity of the infrared radiation energy it emits, is less sensitive to changes in lighting and can represent an object even in a completely dark environment. Thermal imaging can also be applied to distinguish between spontaneous emotions (real emotions) and deliberate emotions (fake emotions) by capturing changes in body temperature that are affected by human emotions. J.W. Seo et al. [16] proposed Thermal Face-CNN, a face liveness detection technique that can distinguish a real face from a fake face based on the fact that the average human face temperature is 36–37 °C. Priya et al. [22] proposed a method for recognizing emotions based on eigenfaces and principal component analysis (PCA) using thermal imaging.
Hung Nguyen et al. [23] studied integrating visible and thermal images to overcome the disadvantage of visible light imaging, which is highly dependent on illuminance. They located the region of interest (ROI) in the thermal image and integrated the feature vectors by applying a wavelet transform to the visible image. However, only the facial expression recognition accuracy for the entire image was evaluated.
As such, previous studies using visible images have suggested various techniques for recognizing emotions from facial expressions. Since fear recognition performance is lower than that of other emotions, there have been attempts to introduce thermal imaging to overcome the shortcomings of visible light imaging. However, studies that specifically improve the recognition performance of negative emotions are difficult to find. Therefore, this study uses thermal imaging to compensate for this disadvantage, particularly the low recognition performance for fear in visible images. A summary of related works is shown in Table 1.
In order to build a cooperative algorithm using visible and thermal images, we outline our contributions as follows:
Given a synchronized sequence of visible and thermal images, we try to find discriminative attributes to recognize emotions, especially a negative one (fear).
Based on the discriminative attributes, we design a framework containing appropriate classifiers for both visible and thermal images. One of these classifiers could be supplementary to the other to take advantage of thermal imaging information.
The cause of the low recognition performance of fear emotion is investigated to find conditions for utilizing thermal images.
A new algorithm is derived by statistically analyzing both classifiers’ interactions. There should be a significant factor to differentiate the attributes of each emotion. We try to formulate such characteristics for further development.
3. Method of Improving Recognition Performance for Negative Emotions
3.1. Neural Network Design for Emotion Recognition Based on Visible and Thermal Images
This study proposes a method that compensates for the low fear recognition performance of visible images by supplementing them with thermal images, using a database built for this study. Figure 2 shows the neural network structure of the proposed method.
An image of the face region with the background removed was used as the network input. It was obtained by sampling, at the same time interval, the visible and thermal images that had been acquired simultaneously for 30 s. The same image size of 224 × 224 was used for the network input and for training the proposed neural network. Among various neural network structures, we adopted a CNN block that takes visible images as input and has a simple structure consisting of four convolution layers, a pooling layer, and a dropout layer. ResNet, which takes thermal images as input, uses residual learning to reconnect features from the previous layer. Since the gradient maintains a value greater than 1 in all layers, the vanishing gradient problem is alleviated. This structure passes the input information on to the next layer, so changes in the input are detected well. In particular, since groups of 1 × 1 and 3 × 3 convolutions are used, extensive feature extraction is possible.
For the visible images, training was performed by repeating the CNN block three times on the extracted face regions, since the recognition accuracy for visible images was higher when trained with the CNN than with ResNet. For the thermal images, the recognition accuracy was higher when trained with ResNet than with the CNN, so training was conducted with ResNet on the self-constructed database consisting of sequence data. The ResNet model uses skip connections to capture data differences over time. The skip connection is similar to long short-term memory (LSTM) in the sense that it better transmits the gradient from the previous convolution block.
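For concreteness, a minimal PyTorch sketch of this two-branch setup is shown below. The layer widths, dropout rate, and classification head are assumptions made for illustration; the paper specifies only a four-convolution CNN block (repeated three times) for the visible branch, a ResNet for the thermal branch, 224 × 224 inputs, and four emotion classes.

```python
# Minimal sketch of the two-branch setup (not the authors' exact model).
# Layer widths and the dropout rate are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # neutral, happiness, sadness, fear


def conv_block(in_ch, out_ch):
    """One CNN block: four convolutions, pooling, and dropout."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Dropout(0.25),
    )


class VisibleCNN(nn.Module):
    """Visible-image branch: the CNN block repeated three times."""
    def __init__(self, num_classes=NUM_CLASSES):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 32), conv_block(32, 64), conv_block(64, 128))
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_classes))

    def forward(self, x):  # x: (B, 3, 224, 224)
        return self.head(self.features(x))


def thermal_resnet(num_classes=NUM_CLASSES):
    """Thermal-image branch: ResNet-18 with a four-class output layer."""
    net = models.resnet18()
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net


# Each branch outputs per-class similarity scores via softmax.
visible_net, thermal_net = VisibleCNN(), thermal_resnet()
vis_sim = torch.softmax(visible_net(torch.randn(1, 3, 224, 224)), dim=1)
thr_sim = torch.softmax(thermal_net(torch.randn(1, 3, 224, 224)), dim=1)
```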
After training on each database in this way, the results from the visible images were combined with those from the thermal images, followed by an emotion recognition step based on facial expressions. Visible images are sensitive to changes in lighting, whereas thermal images are robust to such changes. Therefore, emotions can be expected to be recognized better from thermal images under a specific condition, which turns out to be fear recognition. To confirm that the fear recognition performance using thermal images is better than that using visible images, the classification results of the neural networks shown in Figure 2 were compared and analyzed. More specifically, the outputs of the two networks, one trained with visible images and the other with thermal images, were compared with each other when recognizing the emotions in each subject's sequence data.
Figure 3 shows the training procedures. As the backbone was completely trained for facial feature extraction, only 97 epochs were required to achieve the best performance.
3.2. Proposed Emotion Recognition Method
Figure 4 shows graphs representing significant variations in the similarity to fear among temporally continuous visible images. The x-axis represents the sampled data number for each subject, and the y-axis represents the similarity value of the emotions. The solid blue line represents the change in the similarity to fear, and the yellow line represents the similarity to sadness. An emotion is incorrectly recognized when the similarity graph of another emotion lies above the solid blue line. As shown in Figure 4, when fear is misclassified as another emotion, the similarity to fear is less than or equal to a certain value. Considering these misclassified portions, a specific criterion can be established to overcome the low accuracy for fear. As a result, the similarity to fear from the visible images is replaced by that from the thermal images.
Let n denote the number of images whose recognition results are erroneously predicted as a different emotion when visible images representing fear are input, and let p denote a specific position value expressed as a percentage after arranging those images in order of their similarity to fear. This position is expressed by the following equation:

L_p = (p/100) × (n + 1)    (1)

When p was 25, 50, and 75, the entire data set was divided into four equal parts, and the similarities of the images corresponding to the boundaries were defined as Q1, Q2, and Q3, respectively; Q1 was the lower quartile, Q3 was the upper quartile, and Q2 was the median. The quartile range is expressed as the difference (IQR) between the Q3 and Q1 values, and the maximum value (max) and the minimum value (min) of the similarity to fear are defined in Equation (2):

IQR = Q3 − Q1,  max = Q3 + 1.5 × IQR,  min = Q1 − 1.5 × IQR    (2)
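As a concrete illustration of Equations (1) and (2), the following NumPy sketch computes the quartile boundaries, IQR, and whisker limits for a hypothetical set of fear-similarity values; the sample values and the 1.5 × IQR whisker convention follow the standard boxplot rule and are not taken from the paper's data.

```python
# Sketch of Equations (1)-(2): quartiles, IQR, and whisker limits of the
# fear-similarity values of false-negative visible images. The sample
# values below are hypothetical.
import numpy as np

# Hypothetical similarity-to-fear scores of fear images that the visible
# branch misclassified as another emotion (false negatives).
fn_fear_similarity = np.array([0.05, 0.12, 0.21, 0.30, 0.33, 0.38, 0.41, 0.45, 0.49])

q1, q2, q3 = np.percentile(fn_fear_similarity, [25, 50, 75])  # Equation (1)
iqr = q3 - q1                                                  # Equation (2)
whisker_max = q3 + 1.5 * iqr
whisker_min = q1 - 1.5 * iqr

print(f"Q1={q1:.4f}, Q2={q2:.4f}, Q3={q3:.4f}, IQR={iqr:.4f}")
print(f"whisker range: [{whisker_min:.4f}, {whisker_max:.4f}]")
```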
With respect to the emotion similarities of the visible and thermal images, a specific position value of the fear images classified as false negatives in the visible images is used as a threshold. For example, suppose the similarity of a visible image is below the threshold; in that case, the similarity of the visible image is reset to the similarity of the corresponding thermal image. When the threshold is set to each of the quartile boundaries and the maximum value in turn, the similarity value of the visible image is replaced by the value of the corresponding thermal image, and the emotion is predicted again from the thermal image. Subsequently, the overall recognition performance is updated with the results from the thermal images.
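The substitution rule can be sketched as follows. The class ordering, array layout, and example values are assumptions for illustration; only the threshold logic (replace the visible similarity vector with the thermal one when the fear similarity falls at or below the threshold, then re-predict) reflects the procedure described above.

```python
# Sketch of the substitution rule. Class order (neutral, happy, sad, fear)
# and the example values are assumptions for illustration.
import numpy as np

FEAR = 3            # index of the fear class (assumed ordering)
THRESHOLD = 0.4976  # maximum fear similarity among false negatives (Section 4.4)


def fuse_predictions(vis_sim, thr_sim, threshold=THRESHOLD):
    """vis_sim, thr_sim: (num_frames, 4) per-class similarity scores."""
    fused = vis_sim.copy()
    replace = vis_sim[:, FEAR] <= threshold   # frames handed over to thermal
    fused[replace] = thr_sim[replace]         # use thermal similarities there
    return fused.argmax(axis=1)               # re-predict the emotions


# Example: three frames of a fear sequence; the second frame has a low
# fear similarity in the visible branch and is therefore replaced.
vis = np.array([[0.1, 0.1, 0.2, 0.6],
                [0.1, 0.1, 0.5, 0.3],
                [0.0, 0.1, 0.2, 0.7]])
thr = np.array([[0.1, 0.1, 0.1, 0.7],
                [0.1, 0.1, 0.2, 0.6],
                [0.1, 0.1, 0.1, 0.7]])
print(fuse_predictions(vis, thr))  # -> [3 3 3], i.e., fear for all frames
```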
4. Experiments and Analysis
4.1. Building the Database
Databases composed of numerous visible images have been used in the field of FER [24,25,26,27]. However, emotion recognition can be affected by skin color or cultural differences when publicly available visible images are used. To make matters worse, there are few databases consisting of thermal images that express subjects' spontaneous or induced emotions, and the databases used in existing thermal image classification studies [16,22,28,29] are not publicly available. Therefore, in order to overcome these disadvantages, we constructed a database by acquiring visible and thermal images simultaneously. Visible light and thermal imaging cameras were installed in a space with constant lighting and background conditions to acquire images of 53 subjects. Four emotions, namely neutrality, happiness, sadness, and fear, were induced, and images were acquired simultaneously with each camera for 30 s. Finally, our original image database was constructed, as shown in Table 2, by dividing the recordings into frames and saving them as still images.
Visible images were saved at high-definition (HD) resolution (1280 × 720), 30 frames per second, in MPEG-4 file format. Subsequently, still images were extracted from the saved videos by sampling at regular intervals and removing the unnecessary background. As a result, a dataset was constructed that stores only the face regions.
The original thermal images were acquired using a forward-looking infrared (FLIR) thermal imaging camera. They were saved in MPEG-4 file format with an HD (1080 × 1440) display resolution, a thermal sensor resolution of 80 × 60, and 8.57 frames per second. A new database was constructed by removing the redundant background from each frame of the original videos and extracting only the face regions.
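The face-region extraction step is not described in detail; the sketch below shows one plausible implementation using OpenCV's Haar cascade detector and regular frame sampling, which are assumptions rather than the authors' actual pipeline.

```python
# Hypothetical face-cropping step: sample frames from a saved MPEG-4 video
# at a regular interval and keep only the detected face region. The Haar
# cascade detector and sampling interval are illustrative assumptions.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")


def extract_face_crops(video_path, every_nth_frame=10, size=(224, 224)):
    crops, cap, idx = [], cv2.VideoCapture(video_path), 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_nth_frame == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            faces = detector.detectMultiScale(gray, 1.1, 5)
            for (x, y, w, h) in faces[:1]:  # keep the first detected face
                crops.append(cv2.resize(frame[y:y + h, x:x + w], size))
        idx += 1
    cap.release()
    return crops
```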
Figure 5 shows a sample of the thermal and visible image database built for this study. Since the sampling rates differed, the temporally closest image pairs were extracted by manually synchronizing the visible and thermal images as closely as possible.
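Given the differing frame rates (30 fps for visible, 8.57 fps for thermal), the temporally closest pairs can be found by matching nearest timestamps. The following sketch illustrates this idea; the paper performed the synchronization manually, so this is only an approximation of that procedure.

```python
# Illustrative nearest-timestamp pairing of visible (30 fps) and thermal
# (8.57 fps) frames; an approximation of the manual synchronization.
def pair_frames(n_visible, n_thermal, vis_fps=30.0, thr_fps=8.57):
    """Return (visible_index, thermal_index) pairs, one per thermal frame."""
    pairs = []
    for j in range(n_thermal):
        t = j / thr_fps                              # timestamp of thermal frame j
        i = min(round(t * vis_fps), n_visible - 1)   # nearest visible frame
        pairs.append((i, j))
    return pairs


# Example: 30 s of recording -> 900 visible frames and ~257 thermal frames.
print(pair_frames(900, 257)[:3])  # [(0, 0), (4, 1), (7, 2)]
```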
4.2. Comparing Feature Attributes between Visible and Thermal Images
Seventy percent of the constructed DB was used as training data, 15% was used to validate the neural network proposed in this study, and the remaining 15% was used to evaluate the performance in classifying the four emotions.
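A simple way to realize such a 70/15/15 split is sketched below. Splitting by subject and the fixed random seed are assumptions; the paper states only the proportions.

```python
# Illustrative 70/15/15 split of the constructed DB. Splitting by subject
# (so one subject's frames never appear in two partitions) is an assumption.
import random

def split_subjects(subject_ids, seed=0):
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.70 * n), int(0.15 * n)
    return (ids[:n_train],                 # training subjects
            ids[n_train:n_train + n_val],  # validation subjects
            ids[n_train + n_val:])         # test subjects

train, val, test = split_subjects(range(53))
print(len(train), len(val), len(test))  # 37 7 9
```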
Figure 6 shows similarity graphs learned for the fear class of two different subjects. The x-axis represents the sampled data number for each subject, and the y-axis represents the similarity value of the emotions. In the similarity graph, a value closer to 1 indicates a greater similarity to the corresponding emotion, and a value closer to 0 indicates a smaller similarity. Figure 6a,c show the graphs for the visible images and Figure 6b,d for the thermal images, where the solid blue line represents the similarity variation of fear and the solid yellow line represents the similarity variation of sadness. The red boxes in Figure 6 mark significant differences in the similarity measurements between the visible and thermal images acquired simultaneously. In the visible images shown in Figure 6a,c, the similarity values classified as fear deviate radically; in particular, there are many points where the similarity values for fear intersect those for sadness. In other words, fear is often mistaken for sadness.
On the other hand, as shown in Figure 6b,d, the similarity values classified as fear show only a small deviation in the sequence data of the thermal images. In the thermal images, the similarity graph of fear does not intersect the others, and its values consistently remain higher, indicating that fear was correctly recognized in all sequence images. These results suggest that fear is recognized better from thermal images than from visible images.
4.3. Comparative Analysis of Classification Performance Using Four Emotions
Figure 7 shows the recognition performance for the four emotions in visible and thermal images. For neutral, happiness, and sadness, the recognition performance based on visible images is higher than that based on thermal images. However, the recognition accuracy for fear is 94.98% with thermal images, which is higher than that with visible images. In line with previous studies, the recognition performance with visible images is particularly low for fear. This trend was demonstrated consistently throughout the entire dataset.
When the recognition accuracy was calculated over all four emotions, the accuracy with thermal images was 94.61% and the accuracy with visible images was 96.52%, confirming that the overall recognition performance of visible images was higher than that of thermal images. Since visible images contain more feature information than thermal images, the performance with visible images is higher on average in terms of overall emotion recognition.
Figure 8 shows boxplots of the similarity distribution for each emotion, using the test data that account for about 15% of the database. Fifty percent of the data lie between the top and bottom sides of the box, around the median. The whiskers, the solid lines extending above and below the box, represent the maximum and minimum values within the quartile range of the data. Since the dotted lines include extreme values (outliers), they can be regarded as lying outside the valid range. A larger box represents a greater deviation of the data, and intersections between the effective value ranges indicate greater difficulty in distinguishing between the data. For example, in Figure 8a, the lower whisker of fear overlaps the upper whiskers of the other three emotions. Specifically, most of the effective range of sadness overlaps the effective range of fear; neutral and happiness follow in descending order of difficulty. On the contrary, in Figure 8b, the blue box and whiskers, which mark the effective range of fear, do not overlap the effective ranges of the other three emotions; only their dotted lines intersect, which degrades the recognition performance only minimally.
Checking the similarity distributions of visible and thermal images shows that the deviation of the similarity values is larger with visible images than with thermal images, as shown in Figure 8. In other words, the blue whiskers representing fear and the yellow whiskers representing sadness intersect severely, as marked by the red box in Figure 8, indicating more cases in which fear is mistaken for sadness. Conversely, with thermal images the deviation of the similarity values for fear is smaller, and the effective values intersect less with those of other emotions, leading to better fear recognition performance. This was also confirmed in Figure 6, which showed the similarity variation across the sequence data of individual subjects.
Comparing Figure 8 with the boxplot of neutral in Figure 9 reveals the difference between visible and thermal images even more clearly than the fear case alone. Both the visible and thermal images show little deviation in the similarity values for the neutral emotion across the entire DB. Accordingly, the neutral emotion was accurately recognized, as its similarity hardly intersected with that of other emotions. The small deviation of the similarity values indicates that the difference in recognition rate between the sequence data of each subject is small and that the recognition accuracy is high.
4.4. Improving the Recognition Performance of Fear Using Thermal Images
In this study, the recognition results from thermal images were used to compensate for the low recognition performance for fear, a negative emotion, from visible images. In the distribution of false-negative images, whose actual label was fear but which were recognized as another emotion based on visible images, the similarity values to fear ranged between 0 and 0.4976, as shown in Figure 10. Hence, to increase the recognition performance for the data predicted as false negatives, the recognition performance using visible images was evaluated with the quartile boundaries (0.3108 and 0.4073) and the maximum value (0.4976) calculated by Equations (1) and (2) as thresholds.
The emotion recognition accuracy was highest when the visible image data were replaced with thermal image data based on the maximum of the similarity distribution of the false negatives. Therefore, Table 3 shows the resulting recognition performance for each emotion after substituting thermal images for all visible images with a fear classification similarity of 0.4976 or less.
By selectively applying thermal images to compensate for the low performance of the visible images, the recognition performance for fear improves across the board, as shown in Figure 11. After re-evaluating the recognition performance by synchronizing the fear recognition results of the thermal images with the visible images, the fear recognition accuracy improved from 94.01% to 99.17%, with the recall, precision, and F1 score all improving as well, as shown in Figure 11.
Since there was no existing study using a DB of simultaneously acquired visible light and thermal images with which to compare the proposed method, the following indirect comparison with previous studies was used. First, both a publicly available visible image DB used in previous studies and the visible image DB constructed in this study were trained with the proposed method, and the fear recognition rates were compared. The visible image data of all DBs used for comparison were fed to the CNN with the input data resized to the same dimensions. The CK+ DB consists of images that include the upper body, and the FER2013 data consist of images obtained from the side and other angles in addition to frontal face images, which usually degrades fear recognition performance compared with the DB constructed in this study. The recognition performance for the other emotions was 75–93%. As shown in Figure 12, the fear recognition accuracy using the DB constructed in this study is 94.01%, whereas that using FER2013 and CK+ is 76.99% and 76.2%, respectively.
Among previous studies that performed emotion classification with a CNN using open DBs, the overall emotion recognition accuracy of the study [11] that used the FER2013 database was 61.7%, and that of the study [30] that used the CK+ DB was 80.3%. For the DB constructed in this study, the emotion recognition accuracy using visible images was 96.52%.
As shown in Figure 13, the overall emotion recognition accuracy using the method proposed in this study improved from 96.52% to 99.09%, with the other performance metrics also improving. This relative comparison demonstrates that the low fear recognition performance obtained with visible images can be improved by using thermal images as proposed in this study.
In order to conduct a fair comparison, an existing DB containing synchronized visible and thermal images acquired simultaneously would be required. Unfortunately, we have not found such a DB yet. Thus, we compared our system with others using several public DBs containing only visible images. This indirect comparison shows that the proposed system provides recognition performance comparable to other systems using visible images, and that the DB we constructed is of good quality, providing temporally synchronized visible and thermal images.
5. Conclusions
This study used thermal images to improve the low recognition performance for the fear emotion obtained with visible images. A DB was constructed by simultaneously acquiring visible and thermal images, and only the face regions were extracted from them. The CNN was trained using the visible images, while the database constructed by extracting only the face regions from the thermal images was used to train a ResNet-18 model. Subsequently, the learning results of the thermal image DB, which showed strength in classifying fear, were synchronized with the learning results of the visible image DB.
First, the emotion similarity was calculated from the fear images falsely recognized as another emotion in the visible images. For images with a similarity value lower than the threshold, the emotion recognition performance was re-evaluated by replacing the similarity of the visible image with that of the thermal image. For the visible images, the fear recognition performance improved from 78.08% to 98.44% in recall, from 97.97% to 98.25% in precision, from 94.01% to 99.17% in accuracy, and from 86.9% to 98.34% in F1 score. Overall, this amounts to an average improvement of 4.54% in classification performance compared with emotion classification using only visible images.
We confirmed that thermal imaging could complement visible images in emotion recognition, unlike the existing emotion recognition technology that utilized the features of visible and thermal images individually. The most important contribution of this study is that we found significant characteristics of thermal imaging that remarkably differentiated the fear emotion attributes and that it led to an efficient system integration yielding significant performance improvement in recognizing fear among negative emotions. As a result, we have found a potential application of thermal imaging in emotion recognition throughout this research. We are currently working on designing a more sophisticated system by elaborating the decision-making routines, whereby the other negative emotions will be dealt with by resolving false positive errors for a real-time process.
Future research is planned to improve facial expression classification and emotion recognition performance based on thermal images by using only part of the thermal face image data or by extracting new features through preprocessing. In addition, it may be possible to further improve emotion recognition performance by composing an ensemble network with visible images that extracts various correlations from the input data.