An Innovative Neighbor Attention Mechanism Based on Coordinates for the Recognition of Facial Expressions
Abstract
1. Introduction
- Our CNAM method makes use of the landmark information embedded in features extracted by a preprocessor, e.g., MobileFaceNet [19], together with image features extracted by a second preprocessing module, e.g., IR50 [18,23], which attempts to overcome the inherent label noise within a manually labeled training dataset, and it obtains very respectable results on the following datasets: RAF-DB, AffectNet(7cls), AffectNet(8cls), and CK+.
- In comparison with other SOTA methods, our CNAM method achieves results that are among the very best, if not the best, currently known.
- Using both qualitative and quantitative analysis techniques, e.g., confusion matrix analysis, grad-CAM, t-SNE, and three statistical indicators, namely the Silhouette Coefficient (SC), the Davies–Bouldin Index (DBI), and the Calinski–Harabasz Index (CHI), we obtain some insights into the behavior of the feature vectors after the insertion of CNAM: the mean intra-cluster distance decreases while the inter-cluster distance increases, i.e., both move in the right direction after the CNAM modules are inserted.
2. Related Work
2.1. Visual Attention Mechanism
2.2. Fine-Grained Visual Recognition
2.3. Facial Expression Recognition
3. Method
3.1. Overview
3.2. Preprocessing Unit
3.2.1. Coordinate Attention Module
3.2.2. Neighborhood Attention Module
3.3. Postprocessing Unit
3.3.1. MLP Classifier
3.3.2. Cross Entropy Loss
4. Experiments
4.1. Datasets
- RAF-DB (Real-world Affective Faces Database) [8] is a large-scale labeled facial expression dataset of real-world face images covering a wide range of expressions, such as smiling, giggling, crying, anger, fear, shock, surprise, disgust, and expressionlessness. The images were subsequently labeled manually by crowdsourcing, involving some 315 annotators who are university students or faculty members, into seven classes: neutral, happy, sad, surprise, fear, disgust, and anger.
- AffectNet [9] is presently one of the most extensive publicly accessible datasets in FER, containing approximately 1 million facial images paired with labels depicting the underlying emotion of the faces in the images. Two subsets, AffectNet(7cls) and AffectNet(8cls), are extracted from this dataset, containing seven and eight emotion classes, respectively. AffectNet(7cls) contains the labels neutral, happy, sad, surprise, fear, disgust, and anger, while AffectNet(8cls) adds the category contempt to those in AffectNet(7cls).
- The CK+ (Extended Cohn-Kanade) dataset [10] is a small facial expression classification dataset. Images in this dataset are divided into seven classes: neutral, happy, sad, surprise, fear, disgust, and anger. It is noted that this dataset is comparatively much smaller than the other three datasets.
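For reference, the class vocabularies of these datasets can be summarized compactly as below; the list ordering is purely illustrative (each dataset release defines its own label indices), so this is an assumption rather than the official encoding.

```python
# Illustrative class vocabularies for the four benchmark datasets.
# NOTE: the ordering is an assumption for illustration only; the actual
# integer labels depend on the specific dataset release.
BASIC_7 = ["neutral", "happy", "sad", "surprise", "fear", "disgust", "anger"]

DATASET_CLASSES = {
    "RAF-DB": BASIC_7,                          # 7 classes
    "AffectNet(7cls)": BASIC_7,                 # same 7 basic emotions
    "AffectNet(8cls)": BASIC_7 + ["contempt"],  # adds an 8th class
    "CK+": BASIC_7,                             # 7 classes, much smaller dataset
}

for name, classes in DATASET_CLASSES.items():
    print(f"{name}: {len(classes)} classes -> {classes}")
```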
4.2. Implementation Details
4.3. Ablation Studies
4.3.1. Size of the Neighborhood in the Neighborhood Attention Module
4.3.2. The Effects of Having a Coordinate Attention Module
4.4. Qualitative and Quantitative Analysis of CNAM Method
4.4.1. Confusion Matrices
4.4.2. Heatmap Visualizations of Applying CNAM on the RAF-DB Dataset
- In both Figure 6 and Figure 7, the second row shows the influence of the vertical direction of the coordinate attention, and the third row shows the influence of its horizontal direction. Due to the coarseness of grad-CAM, it is rather difficult to pinpoint what might have led the coordinate attention to assign an image correctly or wrongly to a certain category.
- The bottom rows of both Figure 6 and Figure 7 show the outcome of CNAM after the neighborhood attention module. Again, due to the coarseness of the grad-CAM visualizations, it is rather difficult to draw hard and fast rules. It appears that in Figure 6, correct identification of the landmarks could have contributed to the images being predicted in the correct category, while, in contrast, in Figure 7, incorrect identification of the landmarks might have contributed to their being predicted in the wrong category.
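For readers who wish to reproduce such visualizations, the following is a minimal Grad-CAM sketch using PyTorch hooks; `model` and `target_layer` are placeholders (e.g., the FER network and the block whose output is inspected, such as the CA or NA output), not the exact CNAM implementation, and it assumes a convolution-like feature map of shape (1, C, h, w).

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image, class_idx):
    """Minimal Grad-CAM sketch. `model` and `target_layer` are placeholders
    for the FER network and the inspected block (e.g., the CA or NA output)."""
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["value"] = output

    def bwd_hook(_, grad_in, grad_out):
        gradients["value"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    logits = model(image)                 # image: (1, 3, H, W)
    model.zero_grad()
    logits[0, class_idx].backward()       # gradient of the chosen class score

    h1.remove()
    h2.remove()

    acts = activations["value"]           # (1, C, h, w)
    grads = gradients["value"]            # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)        # channel importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam                            # coarse heatmap in [0, 1]
```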
4.4.3. Feature Visualizations Using t-SNE Method
4.4.4. Statistical Indicators
- All three indexes give consistent results, indicating that the clusters formed by the features after the CNAM are better than those before the CNAM.
- For CK+, both the SC after CNAM (0.886) and the DBI after CNAM (0.1197) signify that the clusters formed by features prior to their entry to the MLP classifier are well formed, and well separated. This is corroborated by the t-SNE plot in Figure 8(4b). Therefore, an MLP classifier could easily provide 100% accuracy.
- For RAF-DB, the SC after CNAM (0.3958) is higher than that before CNAM (0.2606), which indicates that the clusters are better formed than before CNAM. However, the DBI after CNAM is 1.4916, which is considerably far from 0, indicating that although the clusters are relatively well formed, the purity of some of the clusters might be less than ideal. This provides some quantitative support for the observation made in the t-SNE plot (see Figure 8(3b)). This is further confirmed in the confusion matrix (see Figure 5c); even after the MLP classifier, there are wrong predictions for some of the categories, e.g., there is a 0.14 probability that a “surprise” expression could be misclassified as “fear”.
- For AffectNet(7cls) and AffectNet(8cls), the SC scores are very close to 0, indicating that there are significant overlaps among the clusters. This could signify that it would be very difficult for the MLP classifier to correctly predict some of the samples. This is confirmed by the t-SNE plots (see Figure 8(1b) and Figure 8(2b), respectively) and further corroborated by the confusion matrices of the respective datasets (see Figure 5a,b).
- Note that the CHI does not add much value except that it confirms the observations made on the SC and DBI.
- Please note that the information conveyed by the confusion matrix, t-SNE, and the statistical indexes is statistical in nature; i.e., it cannot refer to a particular testing sample. For that, one would need to rely on the grad-CAM plot of an individual sample.
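The three indicators can be computed directly from the feature matrices and ground-truth labels with scikit-learn; a minimal sketch, assuming `features_before_cnam`, `features_after_cnam`, and `labels` are NumPy arrays extracted as described above (these names are placeholders, not variables from the original code).

```python
import numpy as np
from sklearn.metrics import (silhouette_score,
                             davies_bouldin_score,
                             calinski_harabasz_score)

def cluster_quality(features: np.ndarray, labels: np.ndarray) -> dict:
    """Report SC (higher is better), DBI (lower is better), and CHI (higher is
    better) for feature vectors grouped by their ground-truth expression labels."""
    return {
        "SC": silhouette_score(features, labels),
        "DBI": davies_bouldin_score(features, labels),
        "CHI": calinski_harabasz_score(features, labels),
    }

# Example usage with placeholder arrays of shape (num_samples, feature_dim):
# print(cluster_quality(features_before_cnam, labels))
# print(cluster_quality(features_after_cnam, labels))
```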
4.4.5. Summary
4.5. Comparison of the Performance of CNAM Method with Those Obtained by Other State-of-the-Art Methods
4.5.1. The Comparative Results on the RAF-DB, AffectNet(7cls), and AffectNet(8cls) Datasets
Methods | Years | RAF-DB | AffectNet(7cls) | AffectNet(8cls) |
---|---|---|---|---|
SCN [70] | CVPR2020 | 87.03 | - | 60.23 |
PSR [71] | CVPR2020 | 88.98 | 63.77 | 60.68 |
LDL-ALSG [72] | CVPR2020 | 85.53 | 59.35 | - |
RAN [73] | TIP2020 | 86.90 | - | - |
DACL [27] | WACV2020 | 87.78 | 65.20 | - |
KTN [74] | TIP2021 | 88.07 | 63.97 | - |
DMUE [75] | CVPR2021 | 89.42 | 63.11 | - |
FDRL [76] | CVPR2021 | 89.47 | - | - |
VTFF [77] | TAC2021 | 88.14 | 61.85 | - |
ARM [78] | 2021 | 90.42 | 65.20 | 61.33 |
TransFER [64] | ICCV2021 | 90.91 | 66.23 | - |
DAN [21] | 2023 | 89.70 | 65.69 | 62.09 |
EfficientFace [79] | AAAI2021 | 88.36 | 63.70 | 60.23 |
MA-Net [34] | TIP2021 | 88.42 | 64.53 | 60.29 |
Meta-Face2Exp [80] | CVPR2022 | 88.54 | 64.23 | - |
EAC [81] | ECCV2022 | 90.35 | 65.32 | - |
POSTER [14] | 2022 | 92.05 | 67.31 | 63.34 |
POSTER-V2 [15] | 2023 | 92.21 | 67.49 | 63.77 |
DDAMFN [22] | 2023 | 91.35 | 67.03 | 64.25 |
ARBex [82] | 2023 | 92.47 | - | - |
S2D [83] | 2024 | 92.57 | 67.62 | 63.06 |
DCJT [84] | 2024 | 92.24 | - | - |
BTN [85] | 2024 | 92.64 | 67.60 | 64.29 |
FMAE [86] | 2024 | 93.09 | - | 65.00 |
ours | - | 92.37 | 67.63 | 64.14 |
- It is observed that the leader of the pack for RAF-DB and AffectNet(8cls) is FMAE (facial masked autoencoder) [86]. FMAE [86] is the first to use the following strategy: train a robust model on a masked, augmented, large dataset and then fine-tune this robust model, through an optimization process, on a relatively small dataset, like RAF-DB or AffectNet(8cls). It first creates a large FER dataset, Face9M, by merging a number of existing FER datasets, giving approximately 9 million samples. For each sample, it uses the masked autoencoder (MAE) [87] to reconstruct the original image from a heavily masked input image, which can be trained in a self-supervised fashion. After training has been completed, the lightweight decoder is discarded, and the trained encoder is used in the fine-tuning stage to adapt to a smaller dataset, like RAF-DB, by minimizing a simple classification loss, e.g., the cross entropy loss. This does not involve any detection of landmark locations, as both RAF-DB and AffectNet(8cls) do not have any labeled landmark information. The success of this method may be attributed to two factors: a very large, high-resolution training dataset, and the ability of the MAE method to provide a good and robust way to extract features. Fundamentally, FMAE works at the input image level, while CNAM, and many other methods, e.g., BTN, work as a postprocessing module; i.e., they process the information extracted by applying some preprocessing steps to the input image. Therefore, in order to adopt an FMAE-like approach, one first needs to overcome this fundamental issue. In the CNAM method, this can be achieved if we do not use the CA module but instead directly use the NA module to process the incoming image. It is thus possible to conduct the following experiments: first, create a new large FER dataset by merging all the existing FER datasets; then, use an NA module as both the encoder and decoder in an MAE-like processing of this merged dataset with heavy masking, trained in a self-supervised fashion (the masking step is sketched in the code example after this list). This will produce a robust pretrained NA model, which could be adapted to smaller datasets, e.g., RAF-DB, by fine-tuning the pretrained model using a cross entropy loss, or other similar classification losses. Because the NA module processes information in a more sophisticated manner than a plain autoencoder, it is highly likely that this approach could produce new SOTA results.
- The second-best-performing method is BTN (batch transformer network) [85]. This is one of the few methods which recognize the importance of the information within a batch of size B. It uses the same preprocessing step as CNAM, i.e., IR50 for feature extraction and MobileFaceNet for landmark features. Instead of using CA as CNAM does, it processes the outputs from IR50 and MobileFaceNet in a multi-level fashion, i.e., a pyramidal, vision-like processing of the outputs, with the output from a lower level feeding into a higher level, and then combines these two multi-level outputs. It then processes this output using the batch transformer, which essentially exploits the information available in the batch, looking for the relationship between a particular query and the features of the predicted classes in the batch. The success of this method exposes two weak points in our CNAM method: we do not process the CA in a multi-level manner, and we do not make use of the “free” information which is available in a batch. As indicated, we use a reasonably large batch size of 144 with 7 or 8 classes, so each class in a batch could consist of more than 15 images. A simple idea to extend CNAM would be to process the CA in a multi-level framework and then to divide the features, according to their label information, into N classes, where N is the number of categories in the dataset. We would then treat the features in the same class as a neighborhood and use NA to obtain the relationships between one class of features and those of the other classes (a sketch of this batch-level grouping is given after this list). This is possible because in NA the neighborhoods do not need to be contiguous, i.e., following one another; they could be considered simply as regions in the feature space, and NA may be conducted on these discrete regions. As indicated above, FMAE uses a rather crude way of processing the input image and does not make use of the inherent information concerning the landmark features; there is a real possibility that this more refined way of using CNAM could yield an even better accuracy than that provided by FMAE or BTN.
- It is interesting to observe the efficiency of CA when compared with other direction-based methods, e.g., DAN (distract your attention network) [21] and DDAMFN (dual-direction attention mixed feature network) [22]. CA is more effective in discovering the relationships between the horizontal and vertical features, which characterize the human face, than the direction-based methods, because the direction-based methods use a directional convolutional kernel, while CA applies weighting along the horizontal and vertical directions (see, e.g., the grad-CAM visualizations in Section 4.4; a compact sketch of the coordinate attention operation is also given after this list).
- It is instructive to consider the influence of the idea of incorporating landmark information in FER studies. Prior to the popularization of this idea by POSTER [14], the best methods, like DAN [21], did not use this information explicitly. But POSTER [14] and POSTERv2 [15] show that, by using a pretrained OpenPose model to convey some rudimentary information concerning landmarks, the accuracy jumps by about 2%, which is a significant jump in this field. Since then, a number of methods, like S2D (static to dynamic) [83], BTN [85], and CNAM, further incorporate this landmark information and achieve SOTA results. This observation underlies one of the main reasons why we consider it important to incorporate landmark information into FER. With an extension of FMAE, FMAE-IAL (FMAE–identity adversarial learning) [86], it is possible to make use of datasets which have landmark labels (such datasets are important in studying the challenging problem of FER in the wild, i.e., unaligned face images, as opposed to the images in the four datasets used in this paper, which are largely aligned and center-cropped). Such labels are crucial to obtaining the best results on these landmark-labeled datasets.
- Methods like FMAE [86], BTN [85], S2D [83], and CNAM do not consider an important issue: label noise in the datasets. This issue arises because the labels on the images in the four datasets are obtained manually. Despite best efforts to eliminate label noise, e.g., by crowdsourcing as in RAF-DB [8], inevitably some label noise will remain in the datasets. There have been some attempts to address this important issue, such as Meta-Face2Exp [80] and EAC [81]. Probably because these were introduced before the idea that landmark information is important to FER, their results are not competitive when compared with later methods like BTN [85], FMAE [86], or CNAM. However, it might be possible to incorporate some existing ideas for minimizing the effect of label noise, like that of EAC [81], to improve models that are based on landmark location information, e.g., BTN [85] and CNAM.
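To make the FMAE-style proposal above more concrete, the following sketch shows the core of MAE-style self-supervised pretraining, i.e., randomly dropping a large fraction of patch tokens and reconstructing the full set. The `encoder`/`decoder` arguments are placeholders into which an NA-based module could be substituted, and the default masking ratio of 0.75 is the value popularized by MAE [87], not one tuned in this paper; for brevity, the loss is computed over all patches rather than only the masked ones.

```python
import torch
import torch.nn as nn

def random_patch_mask(patches: torch.Tensor, mask_ratio: float):
    """patches: (B, N, D) patch embeddings. Keep a random subset and return
    the kept tokens plus the indices needed to restore the original order."""
    B, N, D = patches.shape
    n_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)           # random permutation per sample
    ids_restore = ids_shuffle.argsort(dim=1)
    ids_keep = ids_shuffle[:, :n_keep]
    kept = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_restore, n_keep

def mae_style_step(encoder: nn.Module, decoder: nn.Module,
                   patches: torch.Tensor, mask_ratio: float = 0.75):
    """One MAE-style reconstruction step; `encoder`/`decoder` are placeholders
    (e.g., NA-based modules operating on token sequences)."""
    kept, ids_restore, n_keep = random_patch_mask(patches, mask_ratio)
    latent = encoder(kept)                                   # (B, n_keep, D)
    B, N, D = patches.shape
    mask_tokens = torch.zeros(B, N - n_keep, D, device=patches.device)
    full = torch.cat([latent, mask_tokens], dim=1)
    full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
    recon = decoder(full)                                    # (B, N, D)
    return nn.functional.mse_loss(recon, patches)            # reconstruction loss
```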
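The batch-level extension described verbally above (grouping the features of a batch by class and letting each sample attend to the class groups) could look roughly as follows; the scaled dot-product attention here merely stands in for a neighborhood attention block, and all names and shapes are illustrative assumptions.

```python
import torch

def class_prototype_attention(features: torch.Tensor, labels: torch.Tensor,
                              num_classes: int) -> torch.Tensor:
    """Sketch: summarize each class in the batch by a mean prototype, then let
    every sample attend to the class prototypes and add the result residually."""
    B, D = features.shape
    # Per-class mean prototypes; classes absent from the batch stay zero.
    prototypes = torch.zeros(num_classes, D, device=features.device)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            prototypes[c] = features[mask].mean(dim=0)

    # Each sample (query) attends over the class prototypes (keys/values).
    attn = torch.softmax(features @ prototypes.t() / D ** 0.5, dim=-1)  # (B, N_cls)
    context = attn @ prototypes                                         # (B, D)
    return features + context     # residual combination of sample and batch context

# Example: features = torch.randn(144, 512); labels = torch.randint(0, 7, (144,))
# out = class_prototype_attention(features, labels, num_classes=7)
```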
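For completeness, a compact sketch of the coordinate attention operation of Hou et al. [16], which pools the feature map along the horizontal and vertical directions and reweights it with direction-wise gates; the reduction ratio and layer choices are typical defaults, so this should be read as an illustration rather than the exact CA module used in CNAM.

```python
import torch
import torch.nn as nn

class CoordinateAttentionSketch(nn.Module):
    """Simplified coordinate attention: pool along H and W separately, mix the
    two pooled maps through a shared 1x1 conv, then apply direction-wise gates."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        hidden = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(hidden, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                        # (B, C, H, 1): pool over width
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (B, C, W, 1): pool over height
        y = torch.cat([x_h, x_w], dim=2)                         # (B, C, H+W, 1)
        y = self.act(self.bn(self.conv1(y)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                            # (B, C, H, 1) vertical weights
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))        # (B, C, 1, W) horizontal weights
        return x * a_h * a_w                                     # direction-wise reweighted features
```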
4.5.2. The Comparative Results on Class-Wise Classification in the RAF-DB, AffectNet(7cls), and AffectNet(8cls) Datasets
- As a general observation, the accuracies of some classes in the RAF-DB dataset are high, while for some other classes they are not so high. For AffectNet(7cls), they all lie within a small band, around 60%, except for the category “Happy”. To some extent, this observation also holds for AffectNet(8cls), except that the figures are lower; e.g., for the “Happy” category they are between 76% and 80%, while for the other categories they are lower than the corresponding figures for the AffectNet(7cls) dataset. This could signify that images classified as “Happy” are easier to recognize, while images in the “Fear” and “Disgust” categories are relatively harder to recognize by the methods represented in Table 5. Indeed, this table is a simplified presentation of the results contained in the confusion matrix; i.e., these values are the diagonal values of the confusion matrices. From this, we may conclude that if CNAM finds it difficult to make a classification, then the other methods will also find it difficult, though the degree of difficulty could be different. Had the confusion matrices been available for all the methods, it might have been possible to compare the relative accuracies for the wrongly classified images among the categories in the dataset. Alternatively, some simple metrics, like the false positive rate (FPR), F1, or AUC, might reveal much more concerning the behavior of the method at hand.
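Since the per-class accuracies in Table 5 are simply the diagonal entries of the row-normalized confusion matrices, such additional metrics can be derived with little effort; a minimal sketch using scikit-learn, assuming `y_true` and `y_pred` are the ground-truth and predicted label arrays of a test set (placeholder names, not variables from the original code).

```python
from sklearn.metrics import confusion_matrix, f1_score

def per_class_report(y_true, y_pred, class_names):
    """Per-class accuracy (recall) is the diagonal of the row-normalized
    confusion matrix; F1 additionally accounts for false positives."""
    label_ids = list(range(len(class_names)))
    cm = confusion_matrix(y_true, y_pred, labels=label_ids)
    per_class_acc = cm.diagonal() / cm.sum(axis=1).clip(min=1)
    f1 = f1_score(y_true, y_pred, average=None, labels=label_ids)
    for name, acc, f in zip(class_names, per_class_acc, f1):
        print(f"{name:>10s}  acc={acc:.4f}  F1={f:.4f}")
    print(f"mean class accuracy = {per_class_acc.mean():.4f}")
```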
Dataset | Method | Neutral | Happy | Sad | Surprise | Fear | Disgust | Anger | Contempt | Mean Acc (%)
---|---|---|---|---|---|---|---|---|---|---
RAF-DB | MViT [88] | 89.12 | 95.61 | 87.45 | 87.54 | 60.81 | 63.75 | 78.40 | - | 80.38 |
RAF-DB | VTFF [77] | 87.50 | 94.09 | 87.24 | 85.41 | 64.86 | 68.12 | 85.80 | - | 81.20 |
RAF-DB | TransFER [64] | 90.15 | 95.95 | 88.70 | 89.06 | 68.92 | 79.37 | 88.89 | - | 85.86 |
RAF-DB | POSTER [14] | 92.06 | 97.22 | 92.89 | 90.58 | 68.92 | 71.88 | 88.27 | - | 85.97 |
RAF-DB | POSTER++ [15] | 92.35 | 96.96 | 91.21 | 90.27 | 67.57 | 75.00 | 88.89 | - | 86.04 |
RAF-DB | APViT [89] | 92.06 | 97.30 | 88.70 | 93.31 | 72.97 | 73.75 | 86.42 | - | 86.36 |
RAF-DB | BTN [85] | 92.21 | 97.05 | 92.26 | 91.49 | 72.97 | 76.25 | 88.89 | - | 87.30 |
RAF-DB | CNAM | 91.84 | 96.71 | 93.24 | 91.19 | 68.92 | 75.00 | 88.89 | - | 86.54 |
AffectNet(7cls) | APViT [89] | 65.00 | 88.00 | 63.00 | 62.00 | 64.00 | 57.00 | 69.00 | - | 66.86 |
AffectNet(7cls) | POSTER [14] | 67.20 | 89.00 | 67.00 | 64.00 | 64.80 | 56.00 | 62.60 | - | 67.23 |
AffectNet(7cls) | POSTER++ [15] | 65.40 | 89.40 | 68.00 | 66.00 | 64.20 | 54.40 | 65.00 | - | 67.45 |
AffectNet(7cls) | BTN [85] | 66.80 | 88.40 | 66.20 | 64.20 | 64.00 | 60.60 | 63.00 | - | 67.60 |
AffectNet(7cls) | CNAM | 65.20 | 88.40 | 65.00 | 67.20 | 63.20 | 58.80 | 65.00 | - | 67.54 |
AffectNet(8cls) | POSTER [14] | 59.40 | 80.20 | 66.60 | 63.60 | 63.60 | 59.80 | 58.80 | 54.71 | 63.34 |
AffectNet(8cls) | POSTER++ [15] | 60.60 | 76.40 | 66.80 | 65.60 | 63.00 | 58.00 | 60.20 | 59.52 | 63.76 |
AffectNet(8cls) | BTN [85] | 61.60 | 77.40 | 68.80 | 65.60 | 65.60 | 54.80 | 63.80 | 57.00 | 64.32 |
AffectNet(8cls) | CNAM | 64.47 | 78.60 | 65.40 | 65.40 | 62.60 | 61.40 | 57.80 | 61.80 | 64.68 |
4.5.3. The Results on the CK+ Dataset
4.6. Limitations of CNAM
- CNAM may be considered a postprocessing module, as it depends on the preprocessing models IR50 and MobileFaceNet; it is not a model which can directly process image inputs. This is because the CA module can only handle postprocessing duties on the landmark features provided by MobileFaceNet. It is possible to remedy this situation: if we take out the CA module and keep only the NA module, the NA module would be able to process information directly from input images, e.g., by rendering an image as a series of patches (a minimal sketch of this patch-embedding step is given after this list). But in this case, one would be disposing of the advantages of the CA module in processing the landmark features, which other SOTA methods, e.g., POSTERv2 [15], found useful. So, this would only make sense if one wishes to explore the possibility of an FMAE-like approach [86] in the context of having access to a large composite FER dataset like Face9M [86].
- CNAM does not utilize the “free” information available in a batch, even though it uses a batch processing methodology. This “free” information within a batch could improve the CNAM performance further; how this batch information could best be exploited in a CNAM method would therefore be an interesting direction for future research.
- CNAM does not explore the challenges posed by label noise, which is present, to some extent, in FER datasets. This issue was highlighted when we applied CNAM to the CK+ dataset, which resulted in a generalization accuracy of 100%, something that could not be expected under normal circumstances. This could be explained as purely fortuitous: the creators of this dataset happened to select a testing set which did not contain any label noise. Usually, however, one expects that there will be label noise in the data, and therefore it should be impossible to obtain 100% accuracy. It might therefore be interesting to explore some ways of alleviating label noise and to incorporate them into CNAM.
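To illustrate the remedy mentioned in the first limitation, i.e., letting the NA module consume the image directly, the input can first be rendered as a grid of patch embeddings; a minimal sketch, in which the patch size and embedding width are illustrative choices only.

```python
import torch
import torch.nn as nn

class PatchEmbedSketch(nn.Module):
    """Turn an image into a grid of patch embeddings so that a neighborhood
    attention module can operate on raw inputs instead of backbone features.
    Patch size and embedding dimension are illustrative choices."""
    def __init__(self, in_channels: int = 3, embed_dim: int = 256, patch: int = 16):
        super().__init__()
        # A strided convolution is equivalent to a linear projection of patches.
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        x = self.proj(images)            # (B, embed_dim, H/patch, W/patch)
        return x.permute(0, 2, 3, 1)     # (B, h, w, embed_dim): layout expected
                                         # by a 2D neighborhood attention block

# Example: a 112x112 aligned face becomes a 7x7 grid of 256-d patch tokens.
# tokens = PatchEmbedSketch()(torch.randn(1, 3, 112, 112))   # -> (1, 7, 7, 256)
```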
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Darwin, C.; Prodger, P. The Expression of the Emotions in Man and Animals; Oxford University Press: Oxford, OH, USA, 1998. [Google Scholar]
- Kumari, J.; Rajesh, R.; Pooja, K. Facial Expression Recognition: A Survey. Procedia Comput. Sci. 2015, 58, 486–491. [Google Scholar] [CrossRef]
- Huang, Y.; Chen, F.; Lv, S.; Wang, X. Facial Expression Recognition: A Survey. Symmetry 2019, 11, 1189. [Google Scholar] [CrossRef]
- Dang, V.T.; Do, H.Q.; Vu, V.V.; Yoon, B. Facial Expression Recognition: A Survey and its Applications. In Proceedings of the 2021 23rd International Conference on Advanced Communication Technology (ICACT), PyeongChang, Republic of Korea, 7–10 February 2021; pp. 359–367. [Google Scholar] [CrossRef]
- Wang, Y.; Yan, S.; Liu, Y.; Song, W.; Liu, J.; Chang, Y.; Mai, X.; Hu, X.; Zhang, W.; Gan, Z. A Survey on Facial Expression Recognition of Static and Dynamic Emotions. arXiv 2024, arXiv:2408.15777. [Google Scholar]
- Wang, J.; Wang, Y.; Liu, Y.; Yue, T.; Wang, C.; Yang, W.; Hansen, P.; You, F. Experimental Study on Abstract Expression of Human-Robot Emotional Communication. Symmetry 2021, 13, 1693. [Google Scholar] [CrossRef]
- Masuyama, N.; Loo, C.K.; Seera, M. Personality affected robotic emotional model with associative memory for human-robot interaction. Neurocomputing 2018, 272, 213–225. [Google Scholar] [CrossRef]
- Li, S.; Deng, W.; Du, J. Reliable Crowdsourcing and Deep Locality-Preserving Learning for Expression Recognition in the Wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2584–2593. [Google Scholar]
- Mollahosseini, A.; Hasani, B.; Mahoor, M.H. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. In IEEE Transactions on Affective Computing; IEEE: Piscataway, NJ, USA, 2019; p. 1. [Google Scholar]
- Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The Extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 94–101. [Google Scholar]
- Fathallah, A.; Abdi, L.; Douik, A. Facial Expression Recognition via Deep Learning. In Proceedings of the 2017 IEEE/ACS 14th International Conference on Computer Systems and Applications (AICCSA), Hammamet, Tunisia, 30 October–3 November 2017. [CrossRef]
- Cheng, S.; Zhou, G. Facial Expression Recognition Method Based on Improved VGG Convolutional Neural Network. Int. J. Pattern Recognit. Artif. Intell. 2020, 34, 2056003. [Google Scholar] [CrossRef]
- Huang, Z.Y.; Chiang, C.C.; Chen, J.H.; Chen, Y.C.; Chung, H.L.; Cai, Y.P.; Hsu, H.C. A study on computer vision for facial emotion recognition. Sci. Rep. 2023, 13, 8425. [Google Scholar] [CrossRef] [PubMed]
- Zheng, C.; Mendieta, M.; Chen, C. Poster: A pyramid cross-fusion transformer network for facial expression recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision-Workshops, Paris, France, 2–6 October 2023; pp. 3146–3155. [Google Scholar]
- Mao, J.; Xu, R.; Yin, X.; Chang, Y.; Nie, B.; Huang, A. POSTER++: A simpler and stronger facial expression recognition network. arXiv 2023, arXiv:2301.12149. [Google Scholar] [CrossRef]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate Attention for Efficient Mobile Network Design. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
- Hassani, A.; Walton, S.; Li, J.; Li, S.; Shi, H. Neighborhood Attention Transformer. In Proceedings of the International Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
- Zhou, H.; Meng, D.; Zhang, Y.; Peng, X.; Du, J.; Wang, K.; Qiao, Y. Exploring Emotion Features and Fusion Strategies for Audio-Video Emotion Recognition. In Proceedings of the ICMI ’19: 2019 International Conference on Multimodal Interaction, Suzhou, China, 14–18 October 2019; pp. 562–566. [Google Scholar]
- Chen, S.; Liu, Y.; Gao, X.; Han, Z. MobileFaceNets: Efficient CNNs for Accurate Real-Time Face Verification on Mobile Devices. arXiv 2018, arXiv:1804.07573. [Google Scholar] [CrossRef]
- Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment Using Multitask Cascaded Convolutional Networks. IEEE Signal Process. Lett. 2016, 23, 1499–1503. [Google Scholar] [CrossRef]
- Wen, Z.; Lin, W.; Wang, T.; Xu, G. Distract your attention: Multi-head cross attention network for facial expression recognition. Biomimetics 2021, 8. [Google Scholar] [CrossRef] [PubMed]
- Zhang, S.; Zhang, Y.; Zhang, Y.; Wang, Y.; Song, Z. A Dual-Direction Attention Mixed Feature Network for Facial Expression Recognition. Electronics 2023, 12, 3595. [Google Scholar] [CrossRef]
- Deng, J.; Guo, J.; Yang, J.; Xue, N.; Kotsia, I.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 4685–4694. [Google Scholar] [CrossRef]
- Shan, C.; Gong, S.; McOwan, P.W. Facial expression recognition based on local binary patterns: A comprehensive study. Image Vis. Comput. 2009, 27, 803–816. [Google Scholar] [CrossRef]
- Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
- Ng, P.C.; Henikoff, S. SIFT: Predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003, 31, 3812–3814. [Google Scholar] [CrossRef] [PubMed]
- Farzaneh, A.H.; Qi, X. Facial expression recognition in the wild via deep attentive center loss. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 2402–2411. [Google Scholar]
- Marrero Fernandez, P.D.; Guerrero Pena, F.A.; Ren, T.; Cunha, A. Feratt: Facial expression recognition with attention net. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
- Zheng, H.; Fu, J.; Mei, T.; Luo, J. Learning multi-attention convolutional neural network for fine-grained image recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5209–5217. [Google Scholar]
- Li, J.; Jin, K.; Zhou, D.; Kubota, N.; Ju, Z. Attention mechanism-based CNN for facial expression recognition. Neurocomputing 2020, 411, 340–350. [Google Scholar] [CrossRef]
- Hu, H.; Gu, J.; Zhang, Z.; Dai, J.; Wei, Y. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3588–3597. [Google Scholar]
- Wei, X.; Zhang, Y.; Gong, Y.; Zhang, J.; Zheng, N. Grassmann pooling as compact homogeneous bilinear pooling for fine-grained visual classification. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 355–370. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
- Zhao, Z.; Liu, Q.; Wang, S. Learning deep global multi-scale and local attention features for facial expression recognition in the wild. IEEE Trans. Image Process. 2021, 30, 6544–6556. [Google Scholar] [CrossRef]
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1106–1114. [Google Scholar] [CrossRef]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
- Lu, L.; Wang, P.; Cao, Y. A novel part-level feature extraction method for fine-grained vehicle recognition. Pattern Recognit. 2022, 131, 108869. [Google Scholar] [CrossRef]
- Conde, M.V.; Turgutlu, K. CLIP-Art: Contrastive pre-training for fine-grained art classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 3956–3960. [Google Scholar]
- Lin, T.Y.; RoyChowdhury, A.; Maji, S. Bilinear CNN models for fine-grained visual recognition. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1449–1457. [Google Scholar]
- Gao, Y.; Beijbom, O.; Zhang, N.; Darrell, T. Compact bilinear pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 317–326. [Google Scholar]
- Kong, S.; Fowlkes, C. Low-rank bilinear pooling for fine-grained classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 365–374. [Google Scholar]
- Zheng, H.; Fu, J.; Zha, Z.J.; Luo, J. Learning deep bilinear transformation for fine-grained image representation. Adv. Neural Inf. Process. Syst. 2019, 32, 4277–4286. [Google Scholar]
- Sermanet, P.; Frome, A.; Real, E. Attention for fine-grained categorization. arXiv 2014, arXiv:1412.7054. [Google Scholar]
- Liu, X.; Xia, T.; Wang, J.; Yang, Y.; Zhou, F.; Lin, Y. Fully convolutional attention networks for fine-grained recognition. arXiv 2016, arXiv:1603.06765. [Google Scholar]
- Sun, M.; Yuan, Y.; Zhou, F.; Ding, E. Multi-attention multi-class constraint for fine-grained image recognition. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 805–821. [Google Scholar]
- Hu, T.; Qi, H.; Huang, Q.; Lu, Y. See better before looking closer: Weakly supervised data augmentation network for fine-grained visual classification. arXiv 2019, arXiv:1901.09891. [Google Scholar]
- Liu, H.; Feng, J.; Qi, M.; Jiang, J.; Yan, S. End-to-end comparative attention networks for person re-identification. IEEE Trans. Image Process. 2017, 26, 3492–3506. [Google Scholar] [CrossRef]
- Lan, X.; Wang, H.; Gong, S.; Zhu, X. Deep reinforcement learning attention selection for person re-identification. arXiv 2017, arXiv:1707.02785. [Google Scholar]
- Xu, J.; Zhao, R.; Zhu, F.; Wang, H.; Ouyang, W. Attention-aware compositional network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 2119–2128. [Google Scholar]
- Zhao, L.; Li, X.; Zhuang, Y.; Wang, J. Deeply-learned part-aligned representations for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 3219–3228. [Google Scholar]
- Liu, Y.; Yan, J.; Ouyang, W. Quality aware network for set to set recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5790–5799. [Google Scholar]
- Si, J.; Zhang, H.; Li, C.G.; Kuen, J.; Kong, X.; Kot, A.C.; Wang, G. Dual attention matching network for context-aware feature sequence based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5363–5372. [Google Scholar]
- Li, S.; Bak, S.; Carr, P.; Wang, X. Diversity regularized spatiotemporal attention for video-based person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 369–378. [Google Scholar]
- Chen, G.; Rao, Y.; Lu, J.; Zhou, J. Temporal coherence or temporal motion: Which is more critical for video-based person re-identification? In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part VIII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 660–676. [Google Scholar]
- Khorramshahi, P.; Kumar, A.; Peri, N.; Rambhatla, S.S.; Chen, J.C.; Chellappa, R. A dual-path model with adaptive attention for vehicle re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6132–6141. [Google Scholar]
- Ke, X.; Cai, Y.; Chen, B.; Liu, H.; Guo, W. Granularity-aware distillation and structure modeling region proposal network for fine-grained image classification. Pattern Recognit. 2023, 137, 109305. [Google Scholar] [CrossRef]
- Liu, X.; Zhang, L.; Li, T.; Wang, D.; Wang, Z. Dual attention guided multi-scale CNN for fine-grained image classification. Inf. Sci. 2021, 573, 37–45. [Google Scholar] [CrossRef]
- Wen, Y.; Zhang, K.; Li, Z.; Qiao, Y. A discriminative feature learning approach for deep face recognition. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part VII 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 499–515. [Google Scholar]
- Farzaneh, A.H.; Qi, X. Discriminant distribution-agnostic loss for facial expression recognition in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 406–407. [Google Scholar]
- Xue, F.; Wang, Q.; Guo, G. Transfer: Learning relation-aware facial expression representations with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3601–3610. [Google Scholar]
- Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar] [CrossRef]
- Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Rousseeuw, P.J. Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comput. Appl. Math. 1987, 20, 53–65. [Google Scholar] [CrossRef]
- Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227. [Google Scholar] [CrossRef]
- Calinski, T.; Harabasz, J. A Dendrite Method for Cluster Analysis. Commun. Stat. Theory Methods 1974, 3, 1–27. [Google Scholar] [CrossRef]
- Wang, K.; Peng, X.; Yang, J.; Lu, S.; Qiao, Y. Suppressing uncertainties for large-scale facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6897–6906. [Google Scholar]
- Vo, T.H.; Lee, G.S.; Yang, H.J.; Kim, S.H. Pyramid with super resolution for in-the-wild facial expression recognition. IEEE Access 2020, 8, 131988–132001. [Google Scholar] [CrossRef]
- Chen, S.; Wang, J.; Chen, Y.; Shi, Z.; Geng, X.; Rui, Y. Label distribution learning on auxiliary label space graphs for facial expression recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 13984–13993. [Google Scholar]
- Wang, K.; Peng, X.; Yang, J.; Meng, D.; Qiao, Y. Region Attention Networks for Pose and Occlusion Robust Facial Expression Recognition. IEEE Trans. Image Process. 2020, 29, 4057–4069. [Google Scholar] [CrossRef]
- Li, H.; Wang, N.; Ding, X.; Yang, X.; Gao, X. Adaptively learning facial expression representation via C-F labels and distillation. IEEE Trans. Image Process. 2021, 30, 2016–2028. [Google Scholar] [CrossRef] [PubMed]
- She, J.; Hu, Y.; Shi, H.; Wang, J.; Shen, Q.; Mei, T. Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
- Ruan, D.; Yan, Y.; Lai, S.; Chai, Z.; Shen, C.; Wang, H. Feature Decomposition and Reconstruction Learning for Effective Facial Expression Recognition. In Proceedings of the Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021. [Google Scholar]
- Ma, F.; Sun, B.; Li, S. Facial expression recognition with visual transformers and attentional selective fusion. IEEE Trans. Affect. Comput. 2021, 14, 1236–1248. [Google Scholar] [CrossRef]
- Shi, J.; Zhu, S.; Liang, Z. Learning to amend facial expression representation via de-albino and affinity. arXiv 2021, arXiv:2103.10189. [Google Scholar] [CrossRef]
- Zhao, Z.; Liu, Q.; Zhou, F. Robust lightweight facial expression recognition network with label distribution training. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event, 2–9 February 2021; Volume 35, pp. 3510–3519. [Google Scholar]
- Zeng, D.; Lin, Z.; Yan, X.; Liu, Y.; Wang, F.; Tang, B. Face2Exp: Combating Data Biases for Facial Expression Recognition. In Proceedings of the Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Zhang, Y.; Wang, C.; Ling, X.; Deng, W. Learn From All: Erasing Attention Consistency for Noisy Label Facial Expression Recognition. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
- Wasi, A.T.; Sërbetar, K.; Islam, R.; Rafi, T.H.; Chae, D.K. ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning. arXiv 2023, arXiv:2305.01486. [Google Scholar] [CrossRef]
- Chen, Y.; Li, J.; Shan, S.; Wang, M.; Hong, R. From Static to Dynamic: Adapting Landmark-Aware Image Models for Facial Expression Recognition in Videos. arXiv 2023, arXiv:2312.05447. [Google Scholar] [CrossRef]
- Yu, C.; Zhang, D.; Zou, W.; Li, M. Joint Training on Multiple Datasets With Inconsistent Labeling Criteria for Facial Expression Recognition. In IEEE Transactions on Affective Computing; IEEE: Piscataway, NJ, USA, 2024. [Google Scholar]
- Her, M.B.; Jeong, J.; Song, H.; Han, J.H. Batch Transformer: Look for Attention in Batch. arXiv 2024, arXiv:2407.04218. [Google Scholar] [CrossRef]
- Ning, M.; Salah, A.A.; Ertugrul, I.O. Representation Learning and Identity Adversarial Training for Facial Behavior Understanding. arXiv 2024, arXiv:2407.11243. [Google Scholar] [CrossRef]
- He, K.; Chen, X.; Xie, S.; Li, Y.; Dollar, P.; Girshick, R. Masked Autoencoders Are Scalable Vision Learners. In Proceedings of the Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
- Li, H.; Sui, M.; Zhao, F.; Zha, Z.; Wu, F. MVT: Mask Vision Transformer for Facial Expression Recognition in the wild. arXiv 2021, arXiv:2106.04520. [Google Scholar] [CrossRef]
- Xue, F.; Wang, Q.; Tan, Z.; Ma, Z.; Guo, G. Vision transformer with attentive pooling for robust facial expression recognition. IEEE Trans. Affect. Comput. 2023, 14, 3244–3256. [Google Scholar] [CrossRef]
- Gera, D.; Balasubramanian, S. Landmark guidance independent spatio-channel attention and complementary context information based facial expression recognition. Pattern Recognit. Lett. 2021, 145, 58–66. [Google Scholar] [CrossRef]
- Cai, J.; Meng, Z.; Khan, A.S.; O’Reilly, J.; Li, Z.; Han, S.; Tong, Y. Identity-free facial expression recognition using conditional generative adversarial network. In Proceedings of the 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 19–22 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1344–1348. [Google Scholar]
- Liu, Y.; Li, Y.; Ma, X.; Song, R. Facial expression recognition with fusion features extracted from salient facial areas. Sensors 2017, 17, 712. [Google Scholar] [CrossRef]
- Connie, T.; Al-Shabi, M.; Cheah, W.P.; Goh, M. Facial expression recognition using a hybrid CNN–SIFT aggregator. In Proceedings of the International Workshop on Multi-Disciplinary Trends in Artificial Intelligence, Gadong, Brunei, 20–22 November 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 139–149. [Google Scholar]
- Cornejo, J.Y.R.; Pedrini, H.; Flórez-Revuelta, F. Facial expression recognition with occlusions based on geometric representation. In Proceedings of the Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 20th Iberoamerican Congress, CIARP 2015, Montevideo, Uruguay, 9–12 November 2015; Proceedings 20. Springer: Berlin/Heidelberg, Germany, 2015; pp. 263–270. [Google Scholar]
- Aouayeb, M.; Hamidouche, W.; Soladie, C.; Kpalma, K.; Seguier, R. Learning vision transformer with squeeze and excitation for facial expression recognition. arXiv 2021, arXiv:2107.03107. [Google Scholar]
Datasets | Training Set Size | Testing Set Size | Classes |
---|---|---|---|
RAF-DB | 12,271 | 3068 | 7 |
AffectNet(7cls) | 280,401 | 3500 | 7 |
AffectNet(8cls) | 283,501 | 3999 | 8 |
CK+ | 327 | 266 | 7 |
CA | NA () | NA () | NA (5 × 5) | NA () | Accuracy (%)
---|---|---|---|---|---
90.71 | |||||
✓ | ✓ | 91.72 | |||
✓ | ✓ | 92.37 | |||
✓ | ✓ | 91.59 | |||
✓ | 90.87 | ||||
✓ | 90.25 |
Dataset | Phase | SC ↑ | DBI ↓ | CHI ↑ |
---|---|---|---|---|
RAF-DB | Before CNAM | 0.2606 | 1.6819 | 1392.0486 |
After CNAM | 0.3958 | 1.4916 | 1799.1164 | |
AffectNet(7cls) | Before CNAM | −0.0267 | 3.8635 | 295.1931 |
After CNAM | 0.0581 | 3.1627 | 662.3302 | |
AffectNet(8cls) | Before CNAM | −0.0596 | 4.5313 | 318.6951 |
After CNAM | −0.0002 | 3.7724 | 614.1517 | |
CK+ | Before CNAM | 0.7543 | 0.2979 | 1516.6460 |
After CNAM | 0.8860 | 0.1197 | 5420.1134 |