A Unified Framework of Deep Learning-Based Facial Expression Recognition System for Diversified Applications
Abstract
1. Introduction
- e-Healthcare:- A real-time FERS can be incorporated into the healthcare system to remotely analyze and detect patients’ feelings from images by identifying facial expressions [8], for patients of different ages, developmental stages, and genders, using data collected from large-scale cloud repositories and social networks. m-Health provides mobile-device-based services to patients to support their medicine administration and daily healthcare needs. e-Health is an electronic health service that uses information and communication technology to deliver facilities digitally and to connect patients and doctors through computers for drug administration. Both e-Health and m-Health provide immense support to healthcare industries in building e-Healthcare systems that benefit patients, doctors, medical professionals, and businesses, and that help establish a healthy society through technological advancements in smart cities. Electronic healthcare systems can physically localize and monitor patients by recognizing their voice, speech, gestures, and facial expressions. Our proposed facial expression recognition system (FERS) will improve the services of such healthcare systems. It is a significant challenge to obtain good results in the context of more efficient and less costly health services. Hence, while integrating the FERS into the healthcare framework, all healthcare requirements, such as automated intelligent sensors, sophisticated tools, security, authenticity, access, and privacy, should also be considered.
- Social IoT:- Social IoT systems represent an evolution of IoT-based systems. They establish a platform for interconnecting subjects or objects worldwide through social relationships and provide better services to users by building on the common interests shared between them. The services of social IoT are now exploited in emotion recognition, as these emotions relate to the social activities of humans in their daily life. Hence, the integration of social IoT services will make life easier by offering several social care facilities to people [9]. The proposed FERS is useful for developing IoT-based smart devices and appliances, and it can be applied in several domains such as education, marketing research, retail, government, media and content, gaming, and finance. During online teaching, the facial expressions of students can be used to gauge their interest in and understanding of the topics being taught. Sentiment derived from online trading and investment activity can inform an organization’s financial strategies, and emotion analysis based on customer reviews and shopkeepers’ experience supports sound marketing research for the organization.
- Emotion AI:- Emotion AI [10] has wide applications in human resource management across business organizations. It helps the human resource management system (HRMS) during the recruitment and selection of candidates by considering several cues, such as voice and text, to analyze candidates’ sentiments.
- Cognitive AI:- Cognitive AI [11] provides methods and technologies for building decision-making systems based on the behavior and reasoning ability of a person, helping a person to make decisions through a system. Job searching, salary prediction, career path selection for job seekers, AI-enabled cyber-security, and natural language processing for sentiment analysis all fall under the cognitive AI category. Thus, social interaction, planning, interpretation, decision-making, emotional competence, and self-learning capabilities are the processes of cognitive AI.
- We have designed a fast and efficient end-to-end deep learning-based framework using the convolutional neural network approach for learning face representation by adding some extra levels of feature representation schemes to improve the robustness and generalization of the model.
- The obtained predictive model detects and learns powerful high-level features from the input image and extracts more distinctive and discriminant features that provide effective results for the proposed FERS under various illumination changes as well as pose and age variation artifacts.
- To enhance the performance of the FERS, several experiments have been carried out with trade-offs between batch size and number of epochs, data augmentation, progressive image resizing, hyper-parameter tuning, and transfer learning techniques for better prediction of the expression type on the human face and for improving the performance and robustness of the proposed system.
- The proposed method addresses the challenging issues of FERS. At the same time, a series of experiments has been conducted to reduce the training loss and the over-fitting problems that arise due to inadequate training data and bias in the expressions’ variation.
2. Related Work
3. Proposed Methodology
3.1. Face Preprocessing
3.2. Feature Representation for Expression Classification
- Convolution:- The convolution layer is the core building block of a CNN model and performs most of its computation. Convolution is a linear operation involving a set of kernels or filters. A kernel is a small matrix of weights that slides over the input [50] and performs element-wise multiplication followed by summation. The convolution operation essentially computes dot products between a set of learnable filters and local regions of the input image, producing an output feature map whose spatial dimension is $o = \lfloor (n - f + 2p)/s \rfloor + 1$ for an $n \times n$ input and an $f \times f$ filter. Here, $s$ is the stride that governs how many cells the filter moves to the right and down, from the top-left corner to the bottom-right corner of the input image, to compute the next cell of the result, and $p$ is the padding that controls the height and width of the output volume. Mathematically, the convolution operation is denoted as follows [51]: for an input feature vector $x$ and a filter vector $w$, the convolution is obtained as $y = x \star w$ with $y_i = \sum_{j} x_{i+j}\,\tilde{w}_j$, where the operator $\star$ denotes the convolution operation and the sum is the sliding inner product between the input feature $x$ and the flipped kernel $\tilde{w}$. It measures the similarity between the two vectors. The primary benefits of the convolution operation are: (i) parameter or weight sharing, since a feature detector learned in one part of the image is reused in other parts; (ii) a reduction in the number of effective parameters and robustness to image translation; and (iii) sparsity of connections, i.e., each output depends only on a small local region of the previous layer’s input. (A NumPy sketch of this arithmetic follows this list.)
- Max-pooling:- A pooling operation is a mathematical operation that aggregates values over local patches (for example, taking the maximum or the average) to reduce the spatial size of the input, typically by half in each dimension. The practical advantages of pooling include removing noise, correcting images, and overcoming incidental occlusions [52]. The pooling layer reduces the size of the representation, which speeds up computation and makes some of the detected features more robust. There are different types of pooling operations, such as average pooling, fractional max-pooling, and max-pooling. Max-pooling is the most commonly used pooling operation and appears in most CNN models: it takes the maximum value over patches of a feature map to create a down-sampled feature map, and it is usually applied after a convolutional layer. The primary benefits of max-pooling are as follows: (i) translation invariance, i.e., translating the image by a small amount does not significantly change the values of most pooled outputs; (ii) reduced computational cost; (iii) faster matching; and (iv) improved accuracy. (A minimal pooling sketch follows this list.)
- Fully Connected Layers:- Fully connected layers and convolutional layers are often presented as distinct, but fully connected layers can be viewed as a special case of convolutional layers [53]. In our proposed CNN model, we used two fully connected layers, denoted $FC_1$ and $FC_2$, where each neuron in $FC_2$ has full connections to all activations in $FC_1$. The activations can be computed with a matrix multiplication followed by a bias offset. Let $x$ denote the output vector of layer $FC_1$ and let $W$ denote the weight matrix of $FC_2$, where each row of $W$ is the weight vector of the corresponding neuron of $FC_2$ [54]. Then, the output of $FC_2$ is obtained as $y = f(Wx + b)$, where $b$ is the bias vector and $f(\cdot)$ is the activation function. The output of the fully connected layers is independent of the input image size: the fully connected layers of a CNN architecture reduce the full image representation to a single vector of class scores of size $1 \times 1 \times C$, where $C$ is the number of expression classes. (A brief sketch after this list shows this computation and its equivalence to a $1 \times 1$ convolution.)
- Dense Layers:- The dense layer is a type of fully connected layer in deep neural networks [55]. In a dense layer, every input unit is connected to every output unit by a weight. It performs a linear operation on its input parameters and generates output parameters [56] that are passed to the next layer as inputs. It utilizes dense connections between layers with matching feature map sizes and applies a non-linear activation $f$, e.g., the ReLU function defined as $f(x) = \max(0, x)$.
- Batch Normalization:- Batch normalization is used to normalize the inputs of the previous layer over each mini-batch, keeping the values in a comparable range with a mean of 0 and a standard deviation of 1. This helps the CNN model avoid skew toward any particular range of values and increases computation speed. We applied batch normalization after every convolution layer and then passed the normalized values to the ReLU activation function. Batch normalization acts as a regularizer and allows the model to use higher learning rates [57]. It is used in various image classification problems and achieves higher accuracy with fewer training steps. Batch normalization also has a beneficial effect on the gradient flow through the network by reducing the dependence of gradients on the scale of their parameters or initial values, and it regularizes the model, reducing the need for dropout layers. Mathematically, batch normalization is calculated as follows. For a mini-batch $\mathcal{B} = \{x_1, \ldots, x_m\}$ of size $m$ over the values of an activation $x$ (the layer index is omitted for clarity), $\gamma$ and $\beta$ are the learnable parameters, $\hat{x}_i$ are the normalized values, and $y_i$ are their corresponding linear transformations, denoted by the batch normalizing transform $BN_{\gamma,\beta}: x_i \rightarrow y_i$. Thus, consider the following: mini-batch mean, $\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i$; mini-batch variance, $\sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_{\mathcal{B}})^2$; normalization, $\hat{x}_i = (x_i - \mu_{\mathcal{B}})/\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}$; and scale and shift, $y_i = \gamma \hat{x}_i + \beta \equiv BN_{\gamma,\beta}(x_i)$. (A NumPy sketch of this transform follows this list.)
- Regularization:- Regularization strategies are designed to reduce the test error of a machine learning algorithm, possibly at the expense of the training error [58]. Popular regularization methods in deep learning [59] include dropout, R1 regularization, and discriminative regularization, among others. We employed the dropout regularization technique on the penultimate layer of our proposed deep CNN model, together with a constraint on the $\ell_2$ norms of the weight vectors [60]. The dropout technique drops a unit during training with a specified probability. Dropout prevents co-adaptation of the network’s hidden units by randomly dropping out a portion of them, i.e., setting those hidden units to zero during forward and backward propagation, so that the network does not become too reliant on particular connections. Instead of computing the output of a hidden unit as $y = w \cdot z$ in forward propagation, dropout uses $y = w \cdot (z \otimes r)$, where the operator $\otimes$ performs element-wise multiplication and $r$ is a masking vector of Bernoulli random variables, each equal to 1 with probability $p$. At test time, all units are present and the learned weight vectors are scaled by $p$ such that $\hat{w} = p\,w$, where $\hat{w}$ is used to compute the class scores without dropout. The advantage of using dropout is that it prevents artificial neural networks from over-fitting. Intuitively, dropout can be thought of as creating an implicit ensemble of neural networks: for each training sample, a selected subset of units, including their incoming and outgoing connections, is temporarily removed from the network. Suppose a dropout probability of 0.5 is used; in this case, roughly half of the activations in each layer are deleted for every training sample, thus preventing hidden units from relying on other particular hidden units being present. (A short sketch of the training- and test-time rules follows this list.)
- Optimisation:- The proposed FERS problem has been solved using stochastic optimization methods to optimize our CNN models. In this study, we used the popular first-order gradient-based Adam optimizer of the stochastic objective function. Popular optimization methods used for solving FERS problems include Adagrad, SGD, RMSProp, SGD with momentum, AggMo, Demon, Demon CM, DFA, and Adadelta, each using its own stochastic mini-batch scheme. Adam estimates per-parameter learning rates based on lower-order moments: Adam [61] uses only the first two moments of the gradient and a learning rate or step size $\alpha$. The weight update for the Adam optimizer is calculated as $\theta_{t+1} = \theta_t - \alpha\,\hat{m}_t/(\sqrt{\hat{v}_t} + \epsilon)$, where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected first and second moment estimates of the gradient and $\epsilon$ is a small number. The primary advantages of the Adam optimizer are that it works well and is suitable for problems with large training data sets. Adam can handle non-stationary objective functions, as in RMSProp, while overcoming the sparse gradient issue that appears in RMSProp. Adam compares favorably to other stochastic optimizers, and its implementation is straightforward and computationally efficient with little memory required. (The update rule is sketched in code after this list.)
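As a concrete illustration of the convolution arithmetic described above, the following minimal NumPy sketch computes the output size and the sliding inner product for a single-channel input; the helper names and example values are ours, not part of the proposed system.

```python
import numpy as np

def conv2d_output_size(n, f, s=1, p=0):
    """Spatial output size o = ((n - f + 2p) / s) + 1 for an n x n input,
    an f x f kernel, stride s, and padding p (integer division models the floor)."""
    return (n - f + 2 * p) // s + 1

def conv2d_valid(x, w):
    """Naive single-channel 2D convolution: slide the flipped kernel over x
    and take the inner product with each local region."""
    w = np.flip(w)                                     # flip the kernel (true convolution)
    n, f = x.shape[0], w.shape[0]
    o = conv2d_output_size(n, f)
    y = np.zeros((o, o))
    for i in range(o):
        for j in range(o):
            y[i, j] = np.sum(x[i:i + f, j:j + f] * w)  # sliding inner product
    return y

# A 48 x 48 map with a 3 x 3 kernel, stride 1 and padding 1 keeps its spatial size,
# i.e., the 'same'-padding setting used by the convolution blocks of the model.
print(conv2d_output_size(48, 3, s=1, p=1))             # -> 48
```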
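A similar sketch for the 2 × 2 max-pooling operation applied after each convolution block; `max_pool2d` is an illustrative helper, not code from the system.

```python
import numpy as np

def max_pool2d(x, k=2, s=2):
    """Down-sample a single-channel feature map by taking the maximum over
    each k x k patch with stride s (here the common 2 x 2, stride-2 case)."""
    o = (x.shape[0] - k) // s + 1
    y = np.zeros((o, o))
    for i in range(o):
        for j in range(o):
            y[i, j] = np.max(x[i * s:i * s + k, j * s:j * s + k])
    return y

x = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(x))    # 2 x 2 output: each entry is the maximum of one 2 x 2 patch
```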
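The fully connected computation, and the equivalence to a 1 × 1 convolution noted in [53], can be sketched as follows; the shapes (64 inputs, 256 units) mirror the dense head of the proposed model, but the random values are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fully connected layer: y = f(W x + b), with W of shape (units_out, units_in).
x = rng.standard_normal(64)              # flattened 1 x 1 x 64 feature map
W = rng.standard_normal((256, 64))
b = rng.standard_normal(256)
y_fc = np.maximum(0, W @ x + b)          # ReLU(W x + b)

# The same computation as a 1 x 1 convolution over the 1 x 1 x 64 map, which is
# why fully connected layers are a special case of convolutional layers.
x_map = x.reshape(1, 1, 64)
y_conv = np.maximum(0, np.tensordot(x_map, W.T, axes=([2], [0])) + b)
print(np.allclose(y_fc, y_conv.reshape(-1)))   # True
```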
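A minimal NumPy sketch of the batch normalizing transform described above (mean, variance, normalization, scale and shift); the batch shape and the value of epsilon are illustrative choices.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization of a mini-batch x with shape (m, d): normalize each
    activation to zero mean and unit variance, then apply the learnable scale/shift."""
    mu = x.mean(axis=0)                       # mini-batch mean
    var = x.var(axis=0)                       # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalization
    return gamma * x_hat + beta               # scale and shift: BN_{gamma, beta}(x)

rng = np.random.default_rng(0)
x = rng.normal(5.0, 3.0, size=(32, 4))        # 32 samples, 4 activations each
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))   # per-feature mean ~0, std ~1
```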
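The training-time masking and test-time weight scaling used by dropout can be sketched directly; the keep probability p = 0.5 matches the example in the text, and the vector sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(z, p=0.5):
    """Training time: multiply the hidden activations z by a masking vector r of
    independent Bernoulli(p) 'keep' variables, i.e. z -> z * r."""
    r = rng.binomial(1, p, size=z.shape)
    return z * r

def dropout_test_weights(w, p=0.5):
    """Test time: keep every unit but scale the learned weights by p, so the
    expected pre-activation matches the training-time behaviour."""
    return p * w

z = rng.standard_normal(10)                    # penultimate-layer activations
w = rng.standard_normal(10)                    # weights of one output unit
y_train = w @ dropout_train(z)                 # roughly half the units are dropped
y_test = dropout_test_weights(w) @ z           # deterministic, scaled forward pass
```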
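A compact sketch of the Adam update with bias-corrected first and second moment estimates; the default decay rates follow [61], while the toy one-dimensional objective is only for illustration.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, and the step
    theta <- theta - alpha * m_hat / (sqrt(v_hat) + eps)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected estimates
    v_hat = v / (1 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v

# Minimize f(theta) = theta^2, whose gradient is 2 * theta, starting from 3.0.
theta, m, v = 3.0, 0.0, 0.0
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(round(theta, 3))    # theta has moved from 3.0 to near the minimum at 0
```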
3.3. Factors Affecting the Performance of the Proposed FERS
- Data Augmentation:- The data augmentation technique is used to expand the training samples in order to improve recognition performance and the generalization ability of the models. In machine learning, image augmentation techniques artificially increase the amount of training data by applying transformation methods to the existing data [63]. The classical augmentation techniques employed here are bilateral filtering, unsharp filtering, horizontal flip, vertical flip, Gaussian blur, additive Gaussian noise, image scaling, image cropping, translation, image rotation, shear mapping, image zooming, image filling, and contrast normalization, following [15]. The whole set of training images was also flipped horizontally as a simple image data augmentation step. In this work, we applied these techniques for each resolution of the images. (An augmentation-pipeline sketch follows this list.)
- Fine Tuning:- Fine-tuning adapts the higher-order feature representations in the base model to make them more relevant for the face recognition task. For example, VGG uses many layers and generates a high-dimensional feature vector, so inference is quite costly at run-time due to the huge number of parameters. In this case, fine-tuning was applied by freezing some layers (and hence their parameters) and retraining the remaining layers, which reduces the computational overhead. (A Keras-style sketch follows this list.)
- Progressive Resizing:- Progressive image resizing is a well-known technique that sequentially resizes the training images, starting the CNN training on smaller (tinier) images and moving to larger image sizes. The technique trains a CNN on images of size $n \times n$, saves the weights, and then retrains the CNN for further iterations on images of sizes larger than $n$. This technique has been used for super-resolution [64], where low-resolution images are gradually replaced by higher-resolution images during training. The advantages of progressive resizing are that it improves generalization and reduces over-fitting. (See the training-loop sketch after this list.)
- Transfer Learning:- The principal concept behind transfer learning for facial expression recognition and classification is that a model trained on large data sets for one problem can be effectively reused as a generic model for other related problems. The previously trained model is known as the pre-trained model. Our proposed deep convolutional neural network model uses a transfer learning technique in which the weights of the pre-trained model, and/or a set of layers from the pre-trained model, are reused in the new model to solve similar problems. Similarly, the weights learned by the smaller-input architecture have been adopted to initialize the larger-input models. The benefits of transfer learning are that it reduces the training time and can result in lower generalization error. (A weight-transfer sketch follows this list.)
- Scores Fusion:- In the proposed system, three CNN architectures have been proposed, which take images of different sizes ($48 \times 48$, $96 \times 96$, and $192 \times 192$) as inputs. Thus, during the recognition of facial expressions on a test sample $F$, there are three different classification score vectors, namely $S_{48}$, $S_{96}$, and $S_{192}$, where $S_k(j)$ is the classification score produced by the architecture with input size $k \times k$ for the $j$-th expression class. These classification scores are fused using score-level post-classification fusion approaches [14] to increase the performance of the recognition system. In this work, two score-level fusion techniques, namely the Sum-rule and the Product-rule, were employed. They are defined as $S_{sum}(j) = S_{48}(j) + S_{96}(j) + S_{192}(j)$ and $S_{prod}(j) = S_{48}(j) \cdot S_{96}(j) \cdot S_{192}(j)$, respectively, and the predicted expression class is the one that maximizes the fused score. (A fusion sketch follows this list.)
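A hedged sketch of a classical augmentation pipeline of the kind listed above, assuming a Keras workflow; the transform ranges here are illustrative assumptions rather than the exact settings used in the experiments.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    horizontal_flip=True,       # whole training set flipped horizontally
    rotation_range=15,          # small random rotations
    width_shift_range=0.1,      # translation
    height_shift_range=0.1,
    shear_range=0.1,            # shear mapping
    zoom_range=0.1,             # image zooming
    fill_mode="nearest",        # image filling at the borders
)

x_train = np.random.rand(32, 48, 48, 3)            # placeholder face crops
y_train = np.random.randint(0, 7, size=(32,))      # seven expression classes
x_aug, y_aug = next(augmenter.flow(x_train, y_train, batch_size=16))
```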
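A sketch of the fine-tuning strategy described above, assuming a Keras workflow: the convolutional layers of a large pre-trained backbone (VGG16, chosen purely as an example) are frozen, and only a small classification head is retrained, which greatly reduces the number of trainable parameters.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

base = VGG16(weights="imagenet", include_top=False, input_shape=(96, 96, 3))
base.trainable = False                        # freeze all convolutional layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),    # seven expression classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) now updates only the head, leaving the frozen backbone untouched.
```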
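A minimal sketch of a progressive-resizing loop, assuming a `build_model(size)` factory that returns one architecture per input size (such as those tabulated at the end of the paper); carrying weights over between stages only where layer shapes match is one simple possibility, not necessarily the exact mechanism used in the experiments.

```python
import tensorflow as tf

def transfer_matching_weights(src, dst):
    """Copy weights from src to dst for layers whose weight shapes match, so
    features learned at a lower resolution can seed the higher-resolution model."""
    for s_layer, d_layer in zip(src.layers, dst.layers):
        s_w, d_w = s_layer.get_weights(), d_layer.get_weights()
        if len(s_w) == len(d_w) and all(a.shape == b.shape for a, b in zip(s_w, d_w)):
            d_layer.set_weights(s_w)

def train_progressively(build_model, x_train, y_train, sizes=(48, 96, 192)):
    prev_model = None
    for size in sizes:
        model = build_model(size)             # assumed per-resolution factory
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        if prev_model is not None:
            transfer_matching_weights(prev_model, model)
        x_resized = tf.image.resize(x_train, (size, size)).numpy()
        model.fit(x_resized, y_train, epochs=5, batch_size=64, verbose=0)
        prev_model = model
    return prev_model
```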
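A sketch of reusing a pre-trained model's weights for a related expression-recognition problem, again assuming a Keras workflow; the file name `fers_pretrained.h5` is hypothetical, and re-initializing only the final classification layer is one common choice.

```python
from tensorflow.keras import models

source = models.load_model("fers_pretrained.h5")   # hypothetical pre-trained model
target = models.clone_model(source)                 # same architecture, fresh weights

# Transfer every layer's weights except the final classification layer,
# which is left freshly initialized for the new, related problem.
for src_layer, tgt_layer in zip(source.layers[:-1], target.layers[:-1]):
    tgt_layer.set_weights(src_layer.get_weights())

target.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])
# target.fit(new_x, new_y, ...) now starts from the transferred weights and
# typically needs far fewer epochs than training from scratch.
```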
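The Sum-rule and Product-rule fusion defined above can be sketched directly on the three per-architecture score vectors; the scores below are toy values for a seven-class problem.

```python
import numpy as np

def fuse_scores(score_list, rule="sum"):
    """Score-level post-classification fusion: add (sum rule) or multiply
    (product rule) the per-architecture class scores, then take the argmax."""
    scores = np.vstack(score_list)                    # shape (n_models, n_classes)
    fused = scores.sum(axis=0) if rule == "sum" else scores.prod(axis=0)
    return fused, int(np.argmax(fused))

# Toy class-score vectors from the 48 x 48, 96 x 96 and 192 x 192 architectures.
s48  = np.array([0.10, 0.05, 0.60, 0.05, 0.05, 0.10, 0.05])
s96  = np.array([0.15, 0.05, 0.55, 0.05, 0.05, 0.10, 0.05])
s192 = np.array([0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05])

_, cls_sum  = fuse_scores([s48, s96, s192], rule="sum")
_, cls_prod = fuse_scores([s48, s96, s192], rule="product")
print(cls_sum, cls_prod)       # both rules pick class index 2 for these scores
```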
4. Experimentation
4.1. Database Used
4.2. Results and Discussion
4.2.1. Effect of Data Augmentation Techniques
4.2.2. Effect of Progressive Image Resizing
4.2.3. Effect of Transfer Learning
4.2.4. Effect of Score Fusion
4.2.5. Comparison
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Mollahosseini, A.; Chan, D.; Mahoor, M.H. Going deeper in facial expression recognition using deep neural networks. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–10 March 2016; pp. 1–10.
- Jaimes, A.; Sebe, N. Multimodal human–computer interaction: A survey. Comput. Vis. Image Underst. 2007, 108, 116–134.
- Kaukomaa, T. Facial Expressions as an Interactional Resource in Everyday Face-to-Face Conversation; Helsingin Yliopisto: Helsinki, Finland, 2015.
- Tsai, H.H.; Chang, Y.C. Facial expression recognition using a combination of multiple facial features and support vector machine. Soft Comput. 2018, 22, 4389–4405.
- Ekman, P.; Friesen, W.V. Constants across cultures in the face and emotion. J. Personal. Soc. Psychol. 1971, 17, 124.
- Sun, X.; Lv, M. Facial expression recognition based on a hybrid model combining deep and shallow features. Cogn. Comput. 2019, 11, 587–597.
- Jack, R.E.; Sun, W.; Delis, I.; Garrod, O.G.; Schyns, P.G. Four not six: Revealing culturally common facial expressions of emotion. J. Exp. Psychol. Gen. 2016, 145, 708.
- Muhammad, G.; Alsulaiman, M.; Amin, S.U.; Ghoneim, A.; Alhamid, M.F. A facial-expression monitoring system for improved healthcare in smart cities. IEEE Access 2017, 5, 10871–10881.
- Jarwar, M.A.; Chong, I. Exploiting IoT services by integrating emotion recognition in Web of Objects. In Proceedings of the 2017 International Conference on Information Networking (ICOIN), Da Nang, Vietnam, 11–13 January 2017; pp. 54–56.
- Barrett, L.F. AI Weighs in On Debate about Universal Facial Expressions. Nature 2021, 589, 202–203.
- Lisetti, C.L.; Schiano, D.J. Automatic facial expression interpretation: Where human-computer interaction, artificial intelligence and cognitive science intersect. Pragmat. Cogn. 2000, 8, 185–235.
- Fernández-Dols, J.M.; Russell, J.A. The Psychology of Facial Expression; Cambridge University Press: Cambridge, UK, 1997.
- Ekman, P.; Rosenberg, E.L. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS); Oxford University Press: Oxford, UK, 1997.
- Umer, S.; Dhara, B.C.; Chanda, B. Face recognition using fusion of feature learning techniques. Measurement 2019, 146, 43–54.
- Umer, S.; Rout, R.K.; Pero, C.; Nappi, M. Facial expression recognition with trade-offs between data augmentation and deep learning features. J. Ambient. Intell. Humaniz. Comput. 2021, 1–15.
- Mavadati, S.M.; Mahoor, M.H.; Bartlett, K.; Trinh, P.; Cohn, J.F. Disfa: A spontaneous facial action intensity database. IEEE Trans. Affect. Comput. 2013, 4, 151–160.
- Pantic, M.; Rothkrantz, L.J.M. Automatic analysis of facial expressions: The state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1424–1445.
- Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA, USA, 20–25 June 2005; Volume 1, pp. 886–893.
- Jindal, R.; Vatta, S. Sift: Scale invariant feature transform. IJARIIT 2010, 1, 1–5.
- Ojala, T.; Pietikäinen, M.; Harwood, D. A comparative study of texture measures with classification based on featured distributions. Pattern Recognit. 1996, 29, 51–59.
- Liu, M.; Shan, S.; Wang, R.; Chen, X. Learning expressionlets on spatio-temporal manifold for dynamic facial expression recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 1749–1756.
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90.
- Krumhuber, E.G.; Küster, D.; Namba, S.; Shah, D.; Calvo, M.G. Emotion recognition from posed and spontaneous dynamic expressions: Human observers versus machine analysis. Emotion 2019, 21, 447–451.
- Le Cun, Y.; Bottou, L.; Bengio, Y. Reading checks with multilayer graph transformer networks. In Proceedings of the 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing, Munich, Germany, 21–24 April 1997; Volume 1, pp. 151–154.
- Shojaeilangari, S.; Yun, Y.W.; Khwang, T.E. Person independent facial expression analysis using Gabor features and Genetic Algorithm. In Proceedings of the 2011 8th International Conference on Information, Communications & Signal Processing, Singapore, 13–16 December 2011; pp. 1–5.
- Dhall, A.; Goecke, R.; Lucey, S.; Gedeon, T. Static facial expression analysis in tough conditions: Data, evaluation protocol and benchmark. In Proceedings of the 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), Barcelona, Spain, 6–13 November 2011; pp. 2106–2112.
- Zhao, X.; Liang, X.; Liu, L.; Li, T.; Han, Y.; Vasconcelos, N.; Yan, S. Peak-piloted deep network for facial expression recognition. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016; pp. 425–442.
- Tian, Y.I.; Kanade, T.; Cohn, J.F. Recognizing action units for facial expression analysis. IEEE Trans. Pattern Anal. Mach. Intell. 2001, 23, 97–115.
- An, L.; Yang, S.; Bhanu, B. Efficient smile detection by extreme learning machine. Neurocomputing 2015, 149, 354–363.
- Jain, A.K.; Li, S.Z. Handbook of Face Recognition; Springer: London, UK, 2011; Volume 1.
- Turk, M.; Pentland, A. Eigenfaces for recognition. J. Cogn. Neurosci. 1991, 3, 71–86.
- Sonka, M.; Hlavac, V.; Boyle, R. Image Processing, Analysis, and Machine Vision; Cengage Learning: Stamford, CT, USA, 2014.
- He, X.; Yan, S.; Hu, Y.; Niyogi, P.; Zhang, H.J. Face recognition using laplacianfaces. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 328–340.
- Delac, K.; Grgic, M.; Grgic, S. Independent comparative study of PCA, ICA, and LDA on the FERET data set. Int. J. Imaging Syst. Technol. 2005, 15, 252–260.
- Yang, M.; Zhang, L. Gabor feature based sparse representation for face recognition with gabor occlusion dictionary. In Proceedings of the European Conference on Computer Vision, Heraklion, Greece, 5–11 September 2010; pp. 448–461.
- Gutta, S.; Wechsler, H.; Phillips, P.J. Gender and ethnic classification of face images. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 194–199.
- Zhang, G.; Wang, Y. Multimodal 2D and 3D facial ethnicity classification. In Proceedings of the 2009 Fifth International Conference on Image and Graphics, Xi’an, China, 20–23 September 2009; pp. 928–932.
- Zhang, Z.; Lyons, M.; Schuster, M.; Akamatsu, S. Comparison between geometry-based and gabor-wavelets-based facial expression recognition using multi-layer perceptron. In Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, Nara, Japan, 14–16 April 1998; pp. 454–459.
- Bartlett, M.S.; Littlewort, G.; Frank, M.; Lainscsek, C.; Fasel, I.; Movellan, J. Recognizing facial expression: Machine learning and application to spontaneous behavior. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; Volume 2, pp. 568–573.
- Rose, N. Facial expression classification using gabor and log-gabor filters. In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Southampton, UK, 10–12 April 2006; pp. 346–350.
- Wu, T.; Bartlett, M.S.; Movellan, J.R. Facial expression recognition using gabor motion energy filters. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 42–47.
- Zavarez, M.V.; Berriel, R.F.; Oliveira-Santos, T. Cross-database facial expression recognition based on fine-tuned deep convolutional network. In Proceedings of the 2017 30th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Niteroi, Brazil, 17–20 October 2017; pp. 405–412.
- Gu, W.; Xiang, C.; Venkatesh, Y.; Huang, D.; Lin, H. Facial expression recognition using radial encoding of local Gabor features and classifier synthesis. Pattern Recognit. 2012, 45, 80–91.
- Almaev, T.R.; Valstar, M.F. Local gabor binary patterns from three orthogonal planes for automatic facial expression recognition. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland, 2–5 September 2013; pp. 356–361.
- Zhao, G.; Pietikainen, M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928.
- Zhu, X.; Ramanan, D. Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2879–2886.
- Mita, T.; Kaneko, T.; Hori, O. Joint haar-like features for face detection. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Beijing, China, 17–21 October 2005; Volume 2, pp. 1619–1626.
- Liu, Y.; Cao, Y.; Li, Y.; Liu, M.; Song, R.; Wang, Y.; Xu, Z.; Ma, X. Facial expression recognition with PCA and LBP features extracting from active facial patches. In Proceedings of the 2016 IEEE International Conference on Real-time Computing and Robotics (RCAR), Angkor Wat, Cambodia, 6–10 June 2016; pp. 368–373.
- Tuceryan, M.; Jain, A.K. Texture Analysis. In Handbook of Pattern Recognition & Computer Vision; World Scientific Publishing Company: Singapore, 1993.
- Dumoulin, V.; Visin, F. A guide to convolution arithmetic for deep learning. arXiv 2016, arXiv:1603.07285.
- Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6411–6420.
- Irani, M.; Peleg, S. Motion analysis for image enhancement: Resolution, occlusion, and transparency. J. Vis. Commun. Image Represent. 1993, 4, 324–335.
- Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. arXiv 2016, arXiv:1606.06160.
- Ma, W.; Lu, J. An equivalence of fully connected layer and convolutional layer. arXiv 2017, arXiv:1712.01252.
- Ferro-Pérez, R.; Mitre-Hernandez, H. ResMoNet: A Residual Mobile-based Network for Facial Emotion Recognition in Resource-Limited Systems. arXiv 2020, arXiv:2005.07649.
- Wang, C.Y.; Mark Liao, H.Y.; Wu, Y.H.; Chen, P.Y.; Hsieh, J.W.; Yeh, I.H. CSPNet: A new backbone that can enhance learning capability of cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 390–391.
- Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv 2015, arXiv:1502.03167.
- Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
- Mescheder, L.; Geiger, A.; Nowozin, S. Which training methods for GANs do actually converge? arXiv 2018, arXiv:1801.04406.
- Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv 2012, arXiv:1207.0580.
- Kingma, D.P. Adam: A Method for Stochastic Optimization. arXiv 2015, arXiv:1412.6980.
- Scherer, D.; Müller, A.; Behnke, S. Evaluation of pooling operations in convolutional architectures for object recognition. In Proceedings of the International Conference on Artificial Neural Networks, Thessaloniki, Greece, 15–18 September 2010; pp. 92–101.
- Hernández-García, A.; König, P. Further advantages of data augmentation on convolutional neural networks. In Proceedings of the International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; pp. 95–103.
- Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144.
- Jain, V.; Crowley, J.L. Smile detection using multi-scale gaussian derivatives. In Proceedings of the 12th WSEAS International Conference on Signal Processing, Robotics and Automation, Cambridge, UK, 20–22 February 2013.
- Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; pp. 94–101.
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
- Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
- Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017.
- Rao, Q.; Qu, X.; Mao, Q.; Zhan, Y. Multi-pose facial expression recognition based on SURF boosting. In Proceedings of the 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Xi’an, China, 21–24 September 2015; pp. 630–635.
- Zhang, X.; Mahoor, M.H.; Mavadati, S.M. Facial expression recognition using lp-norm MKL multiclass-SVM. Mach. Vis. Appl. 2015, 26, 467–483.
- Gao, Y.; Liu, H.; Wu, P.; Wang, C. A new descriptor of gradients self-similarity for smile detection in unconstrained scenarios. Neurocomputing 2016, 174, 1077–1086.
- Sun, X.; Xia, P.; Zhang, L.; Shao, L. A ROI-guided deep architecture for robust facial expressions recognition. Inf. Sci. 2020, 522, 35–48.
- Liu, M.; Li, S.; Shan, S.; Chen, X. Au-aware deep networks for facial expression recognition. In Proceedings of the 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; pp. 1–6.
Layer | Output Shape | Parameters
---|---|---
Block-1 | |
Conv2D (3 × 3)@32 | (48, 48, 32) | ((3 × 3 × 3) + 1) × 32 = 896
Batch Norm | (48, 48, 32) | 4 × 32 = 128
Activation ReLU | (48, 48, 32) | 0
Maxpool2D (2 × 2) | (24, 24, 32) | 0
Dropout | (24, 24, 32) | 0
Block-2 | |
Conv2D (3 × 3)@64 | (24, 24, 64) | ((3 × 3 × 32) + 1) × 64 = 18,496
Batch Norm | (24, 24, 64) | 4 × 64 = 256
Activation ReLU | (24, 24, 64) | 0
Maxpool2D (2 × 2) | (12, 12, 64) | 0
Dropout | (12, 12, 64) | 0
Block-3 | |
Conv2D (3 × 3)@96 | (12, 12, 96) | ((3 × 3 × 64) + 1) × 96 = 55,392
Batch Norm | (12, 12, 96) | 4 × 96 = 384
Activation ReLU | (12, 12, 96) | 0
Maxpool2D (2 × 2) | (6, 6, 96) | 0
Dropout | (6, 6, 96) | 0
Block-4 | |
Conv2D (3 × 3)@96 | (6, 6, 96) | ((3 × 3 × 96) + 1) × 96 = 83,040
Batch Norm | (6, 6, 96) | 4 × 96 = 384
Activation ReLU | (6, 6, 96) | 0
Maxpool2D (2 × 2) | (3, 3, 96) | 0
Dropout | (3, 3, 96) | 0
Block-5 | |
Conv2D (3 × 3)@64 | (3, 3, 64) | 55,360
Batch Norm | (3, 3, 64) | 256
Activation ReLU | (3, 3, 64) | 0
Maxpool2D (2 × 2) | (1, 1, 64) | 0
Dropout | (1, 1, 64) | 0
Classification head | |
Flatten | (1, 64) | 0
Dense | (1, 256) | (64 + 1) × 256 = 16,640
Batch Normalization | (1, 256) | 1024
Activation ReLU | (1, 256) | 0
Dropout | (1, 256) | 0
Dense | (1, 256) | (256 + 1) × 256 = 65,792
Batch Normalization | (1, 256) | 1024
Activation ReLU | (1, 256) | 0
Dropout | (1, 256) | 0
Dense | (1, 7) | (256 + 1) × 7 = 1799
Total parameters for the 48 × 48 × 3 input image size: | 300,871 |
Total number of trainable parameters: | 299,143 |
Non-trainable parameters: | 1728 |
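The tabulated 48 × 48 architecture can be written down compactly in Keras; the sketch below is our reading of the table (the 96 × 96 and 192 × 192 variants that follow add one or two extra blocks in the same pattern). Padding mode, dropout rates, and the final softmax are assumptions, but with `same` padding and 2 × 2 pooling the layer shapes and the total of 300,871 parameters match the table.

```python
from tensorflow.keras import layers, models

def build_model_48(num_classes=7):
    inputs = layers.Input(shape=(48, 48, 3))
    x = inputs
    for filters in (32, 64, 96, 96, 64):          # Blocks 1-5
        x = layers.Conv2D(filters, (3, 3), padding="same")(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)                       # 1 x 1 x 64 -> 64
    for _ in range(2):                            # two dense blocks of 256 units
        x = layers.Dense(256)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation("relu")(x)
        x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)

build_model_48().summary()    # parameter counts line up with the table above
```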
Layer | Output Shape | Parameters
---|---|---
Block-1 | |
Conv2D (3 × 3)@32 | (96, 96, 32) | ((3 × 3 × 3) + 1) × 32 = 896
Batch Norm | (96, 96, 32) | 4 × 32 = 128
Activation ReLU | (96, 96, 32) | 0
Maxpool2D (2 × 2) | (48, 48, 32) | 0
Dropout | (48, 48, 32) | 0
Block-2 | |
Conv2D (3 × 3)@32 | (48, 48, 32) | ((3 × 3 × 32) + 1) × 32 = 9248
Batch Norm | (48, 48, 32) | 4 × 32 = 128
Activation ReLU | (48, 48, 32) | 0
Maxpool2D (2 × 2) | (24, 24, 32) | 0
Dropout | (24, 24, 32) | 0
Block-3 | |
Conv2D (3 × 3)@64 | (24, 24, 64) | ((3 × 3 × 32) + 1) × 64 = 18,496
Batch Norm | (24, 24, 64) | 4 × 64 = 256
Activation ReLU | (24, 24, 64) | 0
Maxpool2D (2 × 2) | (12, 12, 64) | 0
Dropout | (12, 12, 64) | 0
Block-4 | |
Conv2D (3 × 3)@96 | (12, 12, 96) | ((3 × 3 × 64) + 1) × 96 = 55,392
Batch Norm | (12, 12, 96) | 4 × 96 = 384
Activation ReLU | (12, 12, 96) | 0
Maxpool2D (2 × 2) | (6, 6, 96) | 0
Dropout | (6, 6, 96) | 0
Block-5 | |
Conv2D (3 × 3)@96 | (6, 6, 96) | ((3 × 3 × 96) + 1) × 96 = 83,040
Batch Norm | (6, 6, 96) | 4 × 96 = 384
Activation ReLU | (6, 6, 96) | 0
Maxpool2D (2 × 2) | (3, 3, 96) | 0
Dropout | (3, 3, 96) | 0
Block-6 | |
Conv2D (3 × 3)@64 | (3, 3, 64) | 55,360
Batch Norm | (3, 3, 64) | 256
Activation ReLU | (3, 3, 64) | 0
Maxpool2D (2 × 2) | (1, 1, 64) | 0
Dropout | (1, 1, 64) | 0
Classification head | |
Flatten | (1, 64) | 0
Dense | (1, 256) | (64 + 1) × 256 = 16,640
Batch Normalization | (1, 256) | 1024
Activation ReLU | (1, 256) | 0
Dropout | (1, 256) | 0
Dense | (1, 256) | (256 + 1) × 256 = 65,792
Batch Normalization | (1, 256) | 1024
Activation ReLU | (1, 256) | 0
Dropout | (1, 256) | 0
Dense | (1, 7) | (256 + 1) × 7 = 1799
Total parameters for the 96 × 96 × 3 input image size: | 310,247 |
Total number of trainable parameters: | 308,455 |
Non-trainable parameters: | 1792 |
Layer | Output Shape | Parameters
---|---|---
Block-1 | |
Conv2D (3 × 3)@16 | (192, 192, 16) | ((3 × 3 × 3) + 1) × 16 = 448
Batch Norm | (192, 192, 16) | 4 × 16 = 64
Activation ReLU | (192, 192, 16) | 0
Maxpool2D (2 × 2) | (96, 96, 16) | 0
Dropout | (96, 96, 16) | 0
Block-2 | |
Conv2D (3 × 3)@32 | (96, 96, 32) | ((3 × 3 × 16) + 1) × 32 = 4640
Batch Norm | (96, 96, 32) | 4 × 32 = 128
Activation ReLU | (96, 96, 32) | 0
Maxpool2D (2 × 2) | (48, 48, 32) | 0
Dropout | (48, 48, 32) | 0
Block-3 | |
Conv2D (3 × 3)@32 | (48, 48, 32) | ((3 × 3 × 32) + 1) × 32 = 9248
Batch Norm | (48, 48, 32) | 4 × 32 = 128
Activation ReLU | (48, 48, 32) | 0
Maxpool2D (2 × 2) | (24, 24, 32) | 0
Dropout | (24, 24, 32) | 0
Block-4 | |
Conv2D (3 × 3)@64 | (24, 24, 64) | ((3 × 3 × 32) + 1) × 64 = 18,496
Batch Norm | (24, 24, 64) | 4 × 64 = 256
Activation ReLU | (24, 24, 64) | 0
Maxpool2D (2 × 2) | (12, 12, 64) | 0
Dropout | (12, 12, 64) | 0
Block-5 | |
Conv2D (3 × 3)@96 | (12, 12, 96) | ((3 × 3 × 64) + 1) × 96 = 55,392
Batch Norm | (12, 12, 96) | 4 × 96 = 384
Activation ReLU | (12, 12, 96) | 0
Maxpool2D (2 × 2) | (6, 6, 96) | 0
Dropout | (6, 6, 96) | 0
Block-6 | |
Conv2D (3 × 3)@96 | (6, 6, 96) | ((3 × 3 × 96) + 1) × 96 = 83,040
Batch Norm | (6, 6, 96) | 4 × 96 = 384
Activation ReLU | (6, 6, 96) | 0
Maxpool2D (2 × 2) | (3, 3, 96) | 0
Dropout | (3, 3, 96) | 0
Block-7 | |
Conv2D (3 × 3)@64 | (3, 3, 64) | 55,360
Batch Norm | (3, 3, 64) | 256
Activation ReLU | (3, 3, 64) | 0
Maxpool2D (2 × 2) | (1, 1, 64) | 0
Dropout | (1, 1, 64) | 0
Classification head | |
Flatten | (1, 64) | 0
Dense | (1, 256) | (64 + 1) × 256 = 16,640
Batch Normalization | (1, 256) | 1024
Activation ReLU | (1, 256) | 0
Dropout | (1, 256) | 0
Dense | (1, 256) | (256 + 1) × 256 = 65,792
Batch Normalization | (1, 256) | 1024
Activation ReLU | (1, 256) | 0
Dropout | (1, 256) | 0
Dense | (1, 7) | (256 + 1) × 7 = 1799
Total parameters for the 192 × 192 × 3 input image size: | 314,503 |
Total number of trainable parameters: | 312,679 |
Non-trainable parameters: | 1824 |
Database | Class | Training | Testing |
---|---|---|---|
KDEF | 7 | 1210 | 1213 |
GENKI | 2 | 2000 | 2000 |
CK+ | 6 | 663 | 146 |
SFEW | 7 | 346 | 354 |
Database | Input Size | Data Augmentation (%) | No Data Augmentation (%)
---|---|---|---
KDEF | 48 × 48 | 75.95 | 71.67
KDEF | 96 × 96 | 78.18 | 74.92
KDEF | 192 × 192 | 80.92 | 78.21
GENKI | 48 × 48 | 92.45 | 87.91
GENKI | 96 × 96 | 94.13 | 89.77
GENKI | 192 × 192 | 95.59 | 91.16
CK+ | 48 × 48 | 95.89 | 88.15
CK+ | 96 × 96 | 96.22 | 92.67
CK+ | 192 × 192 | 96.71 | 95.20
SFEW | 48 × 48 | 34.05 | 32.56
SFEW | 96 × 96 | 34.91 | 33.11
SFEW | 192 × 192 | 35.72 | 33.34
Method | KDEF | GENKI
---|---|---
Proposed CNN (48 × 48) | 75.95 | 92.45
Proposed CNN (96 × 96) | 78.18 | 94.13
Proposed CNN (192 × 192) | 80.92 | 95.59
Sum-Rule | 81.53 | 96.03
Product-Rule | 82.63 | 96.75

Method | CK+ | SFEW
---|---|---
Proposed CNN (48 × 48) | 95.89 | 34.05
Proposed CNN (96 × 96) | 96.22 | 34.91
Proposed CNN (192 × 192) | 96.71 | 35.72
Sum-Rule | 97.07 | 36.15
Product-Rule | 97.32 | 36.79
Method | Accuracy (%) | Remarks |
---|---|---|
Vgg16 [67] | 65.08 | Images used (980), expression class (7), train/test split |
ResNet50 [68] | 72.32 | Images used (980), expression class (7), train/test split |
Zavare et al. [42] | 72.55 | Images used (980), expression class (7) |
Images type (frontal), 10-fold cross validation | ||
Inception-v3 [69] | 75.04 | Images used (980), expression class (7), train/test split |
Rao et al. [70] | 74.05 | Images used (720), expression class (6) |
Images type (frontal), 10-fold cross validation | ||
Proposed | 82.63 | Images used (980), proposed CNN for seven expression classes |
Method | Accuracy (%) | Remarks |
---|---|---|
Vgg16 [67] | 72.08 | VGG16 CNN for seven expression classes |
ResNet50 [68] | 82.30 | ResNet 50 CNN for seven expression classes |
Inception-v3 [69] | 85.38 | Inception-v3 CNN for seven expression classes |
An et al. [29] | 88.50 | Feature (HOG), classifier (ELM) |
Zhang et al. [71] | 94.21 | Feature (CNN), classifier (Softmax) |
Gao et al. [72] | 94.33 | Feature (ensemble), classifier (ensemble) |
Proposed | 96.75 | Proposed CNN for seven expression classes |
Method | Accuracy (%) | Remarks |
---|---|---|
ResNet50 [68] | 91.87 | Images used (981), expression class (7), train/test split |
Inception-v3 [69] | 94.07 | Images used (981), expression class (7), train/test split |
Sun et al. [73] | 94.67 | Images used (510), expression class (7), k-fold cross-validation |
Proposed | 96.81 | Images used (981), proposed CNN for seven expression classes |
Method | Accuracy (%) | Remarks |
---|---|---|
Vgg16 [67] | 24.78 | Images used (700), expression class (7), train (346)/test (354) |
ResNet50 [68] | 24.98 | Images used (700), expression class (7), train (346)/test (354) |
Inception-v3 [69] | 29.52 | Images used (700), expression class (7), train (346)/test (354) |
Liu et al. [74] | 26.14 | Images used (700), expression class (7), train (346)/test (354) |
Proposed | 36.79 | Images used (700), expression class (7), train (346)/test (354)
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).