Review

Multimodal Emotion Recognition Using Visual, Vocal and Physiological Signals: A Review

by Gustave Udahemuka 1,*, Karim Djouani 1,2 and Anish M. Kurien 1
1 Department of Electrical Engineering, French South African Institute of Technology, Tshwane University of Technology, Private Bag X680, Pretoria 0001, Gauteng, South Africa
2 Laboratoire Images, Signaux et Systèmes Intelligents (LiSSi), Université de Paris-Est Créteil (UPEC), 94000 Créteil, France
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 8071; https://doi.org/10.3390/app14178071
Submission received: 8 August 2024 / Revised: 28 August 2024 / Accepted: 30 August 2024 / Published: 9 September 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract:
The dynamic expressions of emotion convey both the emotional and functional states of an individual’s interactions. Recognizing the emotional states helps us understand human feelings and thoughts. Systems and frameworks designed to recognize human emotional states automatically can use various affective signals as inputs, such as visual, vocal and physiological signals. However, emotion recognition via a single modality can be affected by various sources of noise that are specific to that modality and the fact that different emotion states may be indistinguishable. This review examines the current state of multimodal emotion recognition methods that integrate visual, vocal or physiological modalities for practical emotion computing. Recent empirical evidence on deep learning methods used for fine-grained recognition is reviewed, with discussions on the robustness issues of such methods. This review elaborates on the profound learning challenges and solutions required for a high-quality emotion recognition system, emphasizing the benefits of dynamic expression analysis, which aids in detecting subtle micro-expressions, and the importance of multimodal fusion for improving emotion recognition accuracy. The literature was comprehensively searched via databases with records covering the topic of affective computing, followed by rigorous screening and selection of relevant studies. The results show that the effectiveness of current multimodal emotion recognition methods is affected by the limited availability of training data, insufficient context awareness, and challenges posed by real-world cases of noisy or missing modalities. The findings suggest that improving emotion recognition requires better representation of input data, refined feature extraction, and optimized aggregation of modalities within a multimodal framework, along with incorporating state-of-the-art methods for recognizing dynamic expressions.

1. Introduction

Human emotions enrich human-to-human communication and interaction by allowing people to express themselves beyond the verbal domain. They influence personal decisions, given that emotion contributes to behavioral flexibility apart from the instinctive response and influences learning, knowledge transfer, perception and actions [1,2,3,4,5]. Positive emotions such as happiness can contribute to developing physical, intellectual and social resources, and expressing such emotion by interacting positively with others can increase one’s liking for others [6]. Negative emotions, on the other hand, can affect the immune system so that it cannot handle infections and tumor cells, and expressing hostility can increase anger experienced towards others [7]. The recurrence of some emotions might lead to an emotion becoming an individual’s personality trait [3], and emotion can be exploited to evaluate how an event and its consequences are relevant to an individual. In a human–machine interaction setup, machines or virtual entities equipped with emotional artificial intelligence recognize their human users’ emotional and attentional expressions. Efficient automatic emotion recognition may be useful in various sectors, such as healthcare (e.g., rehabilitation of individuals with emotion recognition deficits or with the inability to express feelings, such as in cases of flat affect), education, security, safety, consumer marketing and entertainment.
Human emotion is a dynamic process caused by an eliciting stimulus that triggers a particular emotion. The process outputs are signals of changes that illustrate changes in different reaction components of emotion, with three of the major components being involuntary physiological arousal, motor expression and subjective feeling, irrespective of the theory of emotion, whether physiological, neurological or cognitive. Motor expression fulfills a communicative function at the interindividual level. Differentiation of emotions relies on the output signal of the emotion process [8]. The other two reaction components of emotion are action tendency (a conative reaction that may involve a change in motor expression) and cognitive appraisal of the eliciting stimulus. However, according to some research psychologists, cognitive appraisal precedes an emotion instead of being one of the reaction components of an emotion. The changes in subjective feeling, motor expression, and cognitive and conative reactions involve neurophysiological activities [8]. Machines extract the emotional state of an individual from signal changes in motor expression such as speech signals (e.g., [9]), visual signals of dynamic facial expressions [10] and other visual emotion cues (postures, hand/body gestures and gesticulations, emotional eyes and eye movements, lip and body movements (e.g., [11])) and tactile signals. Machines also infer emotions from biosignal changes in the physiological reactions of various organ systems associated with emotional states. Some examples of biosignals/physiological signals are brain signals (electroencephalograph (EEG), functional magnetic resonance imaging (fMRI)), heart rate (ECG), blood pressure, blood sugar level, dilation/compression of blood vessels, respiratory rhythm/breathing rate, muscle contraction level (electromyography (EMG)), electrodermal activities, perspiration rate, and skin temperature (e.g., [12,13]). A particular felt emotion and its expression are often related. However, emotional reaction is commonly followed by emotional regulation. A felt emotion might then be suppressed in the motor expression reaction component of the emotion [14]. Noise in biosignals is the principal effect that constrains the recognition performance of physiologically based emotion recognition systems. Emotion has a high response synchronization of changes in its reaction components/output and has a short duration compared to other affect phenomena [15]. The efficacy of the extraction of emotions depends on the recognition performance of algorithms, emotion modality, and the type of information processed by the algorithms.
Each emotion modality provides unique information regarding an individual’s emotion that cannot be retrieved from a different modality. The extraction of emotion from body movements, which entails a full-body motion pattern, has the advantages of recognizing the nonverbal emotions of a person from any camera view or recognizing the emotions of a person at a long distance from the camera (e.g., [16]). Speech and facial expression can provide information that is not present in body movements and human body gestures, or vice versa. Humans employ different emotion modalities simultaneously and in combination (i.e., use one modality to complement and enhance another). Each modality has its limitations and advantages. Because they capture an emotion more completely, multimodal emotion recognition approaches potentially achieve a higher recognition performance than unimodal approaches (e.g., [17]). Moreover, in the case of an emotional change, facial expression signals, biosignals/physiological signals and speech signals tend to manifest before other signals. Hence, computer vision-based emotion recognition focuses primarily on facial emotional expression. Evidence has shown that emotional information from the face, voice and body interact, with body motion and posture often highlighting and intensifying the emotion expressed in the face and voice. However, the facial expression features might differ for the mute condition versus the speaking condition [18].
The parts of the limbic system and the systems and parts subject to the limbic system directly or indirectly impact, coordinate, and control emotional responses. The occurrence of a particular emotion and the intensity of an emotion with respect to an expression or with respect to how an individual is sensitive to a present stimulus/event (and the duration of an emotion episode) depend on past emotional experiences stored in a limbic area of the cerebral cortex. An emotion experience is a previous event that the individual associates with the present stimulus/event. The eventuality of an emotional state given a stimulus also depends on other effects that change over time, such as motivational aspects of behavior, personality, hormone concentration and release and situational variables. Therefore, emotion is dynamic and subject to change over time. Furthermore, cognitive and conative behaviors impact the emotional process through the appraisal of emotional stimuli [2,15,19]. For example, the emotional process has a regulator influenced by the social environment [2,19]. Society and group norms may induce emotions via emotional contagion that supersede the individual’s normal behavior. Other affective phenomena (i.e., preferences, attitudes, moods, affect dispositions, and interpersonal stances) also regulate and can affect emotion [4,15]. For the emotional recognition of an individual, subtle emotional expression is key to identifying the emotion that an individual feels. Subtle emotional expression has a universal component across all cultures (that has an evolutionary adaptation) and a particular sociocultural component (that is culturally specific) [20]. Recognition of the subtle emotional expression involves micro-emotional expressions hidden in the background of individual expression. Such emotions will depend, among others, on physical health conditions and individual cultures or demographic groups such as age [21]. Deep learning mechanisms have proven advantages in the extraction of fine details.
Human perception of emotion depends on the morphological features of expressions of emotion, the temporal context (appraisal of an emotion expression depends on the preceding perceived emotion expression), and the spatial contextual information (judgment of an emotional expression of an individual at a given time is with respect to the group the person is in at that time) [22]. Moreover, the expression judgment is conditional on the perceiver’s culture, life experiences (and training), physical and mental health conditions (e.g., impairments in facial emotion recognition in patients with neurological conditions such as stroke, traumatic brain injury, and frontotemporal dementia), mood, and demographic factors (age, gender, and education level) [22,23,24,25,26,27,28,29]. Agreement on an expresser’s emotion therefore differs across perceivers, given that people interpret emotion differently based on their backgrounds. Psychologists assess the emotion perception ability of individuals using different emotion recognition tests such as the Ekman 60 Faces Test, Emotion Recognition Task (ERT) and Emotion Evaluation Test (EET) [30]. Automatic emotion recognition using machine learning can rely on both motor expression and physiological data. The recognition method must overcome the obstacles encountered by human perceivers in recognizing the expressers’ emotions and must recognize the genuine emotions felt.
The recognition techniques extract static or dynamic emotional expressions from the emotion process. Static expressions are the static displays of posed emotional expressions/signals, such as a photograph for which the body emotion is depicted as a static display or still image (emotion is not active, but a mere frozen definition displayed). In contrast, the dynamic expression is retrieved from morphed photographs, videos, or motion history/energy images to form a dynamic display that includes a preceding expression and an end expression, similar to dynamic point-light and full-light displays for biological motion, given a stimulus. In real life, emotional expressions are typically dynamic, changing from one state to another [22,31]. On a human level, static information and dynamic information, which include movement cues, differ in how they are processed psychologically or neurologically by the perceiver. Therefore, a dynamic framework is needed to capture the emotional dynamics fully. Emotion recognition using dynamic expression is used for facial expressions and body motion. However, human emotion recognition systems currently employ largely static images. The dynamic facial expression reflects the emotion conveyed by an individual as it unfolds over time. Movement embedded in the dynamic facial expression enhances emotion recognition performance accuracy and increases perceptions of a weak emotional intensity [32]. Dynamic facial expression helps to extract hidden or micro-emotions that eyes can hardly detect by analyzing the motion of these essential features [33]. In addition, the dynamic facial expression enhances emotion recognition or the discrimination of emotions felt by an individual [31,34] and generalizes better (i.e., achieves a “high ecological validity”) than static emotion [35]. For a high accuracy, automatic emotion recognition must consider subtle emotion expressions and recognize an emotion rapidly. Emotion micro-expressions or subtle emotion expressions are commonly recognized based on movement cues, and such cues are embedded in the dynamic expression of an emotion.
Emotion is recognized based on either macro-expressions or subtle expressions (micro-expressions). For the former, the recognized emotion is explicitly expressed at a strong intensity (at the local level) and identified based on strong features. In contrast, for the latter, the recognized emotion is expressed in a suppressed manner that is non-localized and identified based on weak features. Subtle expressions are spontaneous expressions that cannot be feigned. Effectively extracting expression features from images and voices is a critical problem affecting the accuracy of subtle expression recognition. For a model to handle subtleties shaped, for instance, by culture, culture-specific fine-tuning should be applied after the main model is trained. By considering subtle features, recognition methods can capture the true feeling and achieve improved accuracy compared to approaches that ignore such features. In light of these considerations, this article reviews subtle emotion recognition and contrasts such recognition with macro-expression recognition. This article also reviews multimodal emotion recognition methods that consider spontaneous visual or vocal expressions or involuntary physiological arousal. These modalities exhibit an instantaneous nature and play a critical role in various fields, including security, healthcare (such as patient care and therapeutic support), and human–computer interactions, particularly in systems involving robots capable of perceiving and responding to human emotions. The main contributions of this study are as follows: we examine the state of the art in multimodal emotion recognition based on spontaneous emotion modalities and subtle emotion recognition to reveal the emotional state related to the true feeling of an individual. This paper also discusses the use of deep learning to improve emotion recognition and its current challenges. Subtle expressions are identified solely through dynamic displays/expressions, such as dynamic facial expressions and dynamic emotional body movements, that effectively emulate the 3D spatio-temporal processing style associated with the human brain when recognizing emotions. Given that the human brain activity associated with the interpretation of emotion has spatio-temporal dynamics, this review article discusses the state-of-the-art methodologies in dynamic expression. The challenges of deep learning for effective automatic emotion recognition are highlighted.
Numerous research review papers have recently discussed multimodal emotion recognition, highlighting its strengths and limitations [36,37,38,39,40,41,42,43,44,45,46]. Existing reviews often focus on multimodal emotion recognition in the context of audio–visual integration and, to a lesser extent, incorporate electroencephalogram (EEG) data while overlooking other physiological data such as heart rate and skin conductance. They do not address subtle expressions and dynamic expressions. They also give little scrutiny to the robustness and generalizability of methods, particularly with respect to bias. Although they present fusion methods, they do not investigate or compare them across varied contexts to identify the best practices. Therefore, we address these gaps and examine, in large part, the methodological characteristics unique to multimodal emotion recognition combining visual, vocal or physiological signals. The research question for this study is as follows:
  • RQ1: How can multimodal emotion recognition methods using visual, vocal and physiological signals be optimized to enhance robustness and accuracy, and what is the impact of deep learning techniques and dynamic expression analysis in overcoming these challenges?
In this study, we identify the limitations of current approaches to multimodal emotion recognition. We focus on the fine-grained recognition of emotional states and present emotion recognition in a way that enhances the explainability of a classifier’s output, thereby ensuring better user adoption and trust in emotional computing. We categorize multimodal and subtle emotion recognition techniques into distinct groups based on their implementations, providing concise explanations for each and comparing their performance. We conduct a comprehensive analysis of the current state of emotion recognition, similar to ref. [47], by evaluating the performance of various methods on the same dataset. The remaining sections of this paper are organized as follows. Section 2 discusses the research methods employed in conducting this review study. In Section 3, the human emotion categorizations are reviewed. In Section 4, automatic human emotion recognition methods are elaborated on and reviewed. Then, an overview of multimodal emotion analysis research is provided. Deep learning solutions and challenges are introduced in Section 5. Section 6 discusses the findings and current limitations of multimodal emotion recognition. Finally, the conclusion, including recommended future directions, is given in Section 7.

2. Materials and Methods

2.1. Search Strategy

In this study, we employed a comprehensive database search strategy to systematically identify, evaluate, and synthesize relevant research. We followed a review protocol outline that adheres to one of the established guidelines for conducting systematic literature reviews [48], as illustrated in Figure 1. Before initiating the search, we defined the research question (RQ1) specified in Section 1: “How can multimodal emotion recognition methods using visual, vocal and physiological signals be optimized to enhance robustness and accuracy, and what is the impact of deep learning techniques and dynamic expression analysis in overcoming these challenges?”
To refine our search query, we utilized a forward and backward snowballing strategy. We developed a single search expression comprising three subexpressions, linked by a Boolean AND, as detailed in Table 1. The first subexpression, listed in the first row of Table 1, includes terms related to the use of multiple modalities, and these terms are connected by a Boolean OR. The second subexpression, outlined in the second row, consists of terms associated with affective computing, and the terms are also linked by a Boolean OR. The final subexpression, detailed in the last row, encompasses terms related to various visual, vocal, and physiological modalities, also linked by a Boolean OR.
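As a purely illustrative sketch, the Python snippet below shows how such a query string can be assembled from three OR-joined term groups combined with a Boolean AND; the terms used here are hypothetical placeholders, not the actual terms listed in Table 1.

```python
# Illustrative only: the real search terms appear in Table 1 of this review;
# the lists below are hypothetical placeholders.
modality_terms = ["multimodal", "multi-modal", "fusion"]                        # Table 1, first row
affect_terms = ["emotion recognition", "affective computing"]                   # Table 1, second row
signal_terms = ["facial expression", "speech", "EEG", "physiological signal"]   # Table 1, last row

def or_group(terms):
    """Join terms with Boolean OR, quoting multi-word phrases."""
    return "(" + " OR ".join(f'"{t}"' if " " in t else t for t in terms) + ")"

# The three OR-joined subexpressions are combined with a Boolean AND.
query = " AND ".join(or_group(group) for group in (modality_terms, affect_terms, signal_terms))
print(query)
```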
This targeted search strategy enabled the compilation of a focused and up-to-date collection of records that accurately reflect the current state of research. This collection provides a solid foundation for drawing meaningful conclusions and suggesting directions for future studies.
We queried relevant databases with records on affective computing using the search function. We selected nine databases for this review study: Web of Science Core Collection (WoS), Scopus, EBSCOhost Database Collection, ProQuest Central (including the publicly available content database and Coronavirus Research database), ACM Digital Library (ACM Guide to Computing Literature section), IEEE Xplore, ScienceDirect, SpringerLink and PubMed. We searched within document titles, abstracts, and keywords when available. The keywords included author-specified terms or keyword plus (WoS). In other databases, we used subject terms (EBSCOhost), all subject and indexing terms (ProQuest), index terms (IEEE Xplore), and other terms (PubMed). In ProQuest, we searched both the summary text and the abstract. We searched within the full text for SpringerLink, since the abstract search was not available, and we exported all SpringerLink records, including chapters (which encompass conference papers), books (which include conference proceedings and reference works), articles, reference work entries and protocols. The EBSCOhost databases used in the search included E-journal, Academic Search Complete, MEDLINE (and MEDLINE with full text), APA PsycInfo, Business Source Complete, OpenDissertations, LISTA, ERiC, RILM Abstracts of Music Literature with full text, APA PsycArticles, Health Source: Nursing/Academic Edition, MLA International Bibliography with full text, Music Index with full text, CAB Abstracts with full text, MasterFILE Premier, newspaper sources, regional business news, SPORTDiscus with full text, eBook Academic Collection (EBSCOhost) and eBook Collection (EBSCOhost).
After exporting the search records from the databases, we removed duplicate entries and retained the remaining records for initial screening. We reviewed titles and abstracts during this initial screening and conducted a more detailed full-text assessment afterwards. We used the Rayyan platform to manage the duplication removal and screening process efficiently.

2.2. Study Selection

After removing duplicates, we proceeded to the screening phase, where we applied inclusion and exclusion criteria to assess the relevance of each record to this review study. We retained only those records that directly addressed the research question. The inclusion/exclusion criteria focused on relevance, study type, completeness, and publication language. We excluded records related to textual modality when this modality was part of a bimodal emotion recognition system, as this review does not address textual inputs. Additionally, we removed editorial reviews, internal reviews, incomplete studies, extended abstracts, and white papers. To capture relevant developments and recent advancements, we set the publication range from 2012. This time frame allowed us to provide a comprehensive overview of current trends, methods, and findings while maintaining the review’s relevance and up-to-date status. It also facilitated the identification of current research gaps.
Figure 2 presents the PRISMA flow diagram demonstrating the study selection process used in this review. The query initially yielded 15,371 records without duplications across all the databases. This number is inflated by the high count of SpringerLink records, because the SpringerLink query searched the full text. During the screening phase, the titles and abstracts underwent scrutiny: we excluded non-English articles, studies published before 2012, incomplete studies, and studies that utilize textual modality in a bimodal framework, reducing the record count to 983 for detailed examination. Upon a thorough full-text review of these papers, we further excluded 716 records. The exclusion criteria included the absence of significant insights (often found in short conference papers) or an exclusive focus on unimodal techniques. In the end, this study selection process led to the inclusion of 267 records in the final study. Our meticulous selection ensures that the records included in our review are highly relevant to our research questions and objectives, providing a comprehensive understanding of multimodal emotion recognition and addressing any existing gaps in the literature.
Finally, we thoroughly examined the selected papers to extract essential insights, identify trends, and draw conclusions. This examination explored various modalities, data collection methods, fusion techniques, and machine learning approaches related to multimodal emotion recognition. By doing so, we effectively addressed our research questions and better understood the field’s current state. This process also revealed potential gaps or shortcomings in the existing literature, paving the way for future research. It enabled us to delve deeper into the recognition of subtle expressions, spontaneous emotions, and dynamic displays and assess method accuracy and robustness. Examining various study types provided valuable insights into the practical implementation of multimodal emotion recognition, ensuring that our research remains relevant and applicable to real-world scenarios.

3. Human Emotion Categorization Models

The method of representing emotion is essential for understanding affective computing models. There is no common agreement on a unique categorization of emotions, and no ultimate emotion categorization covers all emotions. Psychologists have developed many different affect models. However, three groups of emotion representation models are so far applicable to emotion computing in applications such as affective human–computer interfaces, namely discrete emotion models, dimensional models, and componential models [49,50]. Some emotional states (e.g., shame or guilt) do not have expressions that humans can recognize in human-to-human communication, and they cannot be recognized using current technology.
Discrete emotion models or categorical classification approaches such as Ekman’s model [51,52] and Shaver’s model [53] define a discrete set of emotions. Ekman’s model represents six basic emotion states with universal facial expressions according to Ekman’s theory (the so-called universal emotion states are anger, disgust, happiness, sadness, fear and surprise). A neutral (non-emotion) state is often added to the emotion recognition task. Each facial emotion state is defined by a combination of action units (components of muscle movements). Another example of categorical representations of emotions is the 27 categories of emotions bridged with smooth gradients [54], where emotion is represented in a semantic space (categories, factor loading). Complex emotions are derived from a number of basic emotions. There are different componential models, such as Oatley’s model [55].
Dimensional emotion models characterize emotion as a multi-dimensional signal with several dimensions. They utilize continuous values instead of dividing emotions into several categories like the discrete emotion representation model [56]. The dimensional models classify emotions in detail using multiple dimensions of emotion. One example of these models is the wheel of emotions [57,58], which proposes eight basic bipolar emotions, with intensity presented on three levels. The wheel of emotions also defines combinations of emotions (e.g., love is a combination of joy and trust). A three-dimensional model of affect, called the valence, arousal, dominance (VAD) model (also known either as pleasure, arousal, dominance model (PAD) or evaluation, activation, power (EAP)) [59,60] represents a multi-dimensional emotion in three independent dimensions. The valence, pleasure or evaluation defines positive or negative emotion and expresses the pleasant or unpleasant. The arousal or activation ranges from sleep/passive to excitement/active and represents the degree of activation. Dominance or power varies from submissive to dominant and indicates the perceived level of control of an emotional state.
A two-dimensional valence–arousal model of emotions, also referred to as the circumplex model of affect [61], is the model most commonly used for recognition tasks. Since the values of each dimension can vary continuously, the subtle differences between different emotions can be distinguished, and the evolution of emotional states can be tracked via real-time labeling of emotional states. Whissel [62] used the valence–arousal (or evaluation–activation) representation across different scales. Other dimensional emotion models are the Ortony, Clore and Collins (OCC) model [63] and Lovheim’s model [64]. The hourglass model of emotions [65,66] argues that emotions are distributed in an hourglass space. It is an emotion categorization model optimized for polarity detection and has empirical support in the context of sentiment analysis. It is biologically inspired and psychologically motivated and represents affective states both using labels and using four independent but concomitant affective dimensions that can potentially describe the full range of human emotional experiences. Fontaine et al. proposed the addition of an unpredictability dimension to obtain a set of four dimensions (valence, potency, arousal, unpredictability) [67]. Dimensional representations provide a method of describing emotional states that is more tractable than their discrete counterparts (in the case of naturalistic data or applications where a wide range of emotional states occur). They are also better equipped to deal with non-discrete emotions and variations in emotional states over time (changing from one universal emotion label to another would be impractical in real-life scenarios) [68]. Psychological research shows that the value of some emotional dimensions is closely related to human cognitive behaviors such as memory and attention [69], which makes it easier for machines to understand and respond to users’ emotional behaviors based on the results of dimensional emotional predictions. Dimensional representations clarify the interrelation between different emotion states; however, the data used to train such models are very specialized and error-prone [70]. Dimensional models continue to be refined, and recognition based on dimensional models of emotion has improved. The activation and evaluation dimensions effectively discriminate between emotional states [71]. Some dimensional models do not enable the study of compound emotions or do not model the fact that two or more emotions may be experienced at the same time; the hourglass model of emotions overcomes such limitations. Reported accuracies of automatic emotion recognition typically apply to only a few emotions rather than the almost arbitrary range of emotions that dimensional representations can cover, and a recognition system that covers fewer emotions should more easily yield a high recognition accuracy.
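As an illustration of how a two-dimensional valence–arousal prediction can be related back to discrete labels, the following Python sketch maps a continuous (valence, arousal) estimate to the nearest of a few prototype emotions; the coordinates are rough assumptions for demonstration only, not values taken from the circumplex model literature.

```python
import math

# Rough, illustrative valence-arousal coordinates (both in [-1, 1]) for a few
# basic emotions; these placements are assumptions for demonstration.
PROTOTYPES = {
    "happiness": ( 0.8,  0.5),
    "anger":     (-0.6,  0.7),
    "sadness":   (-0.7, -0.4),
    "fear":      (-0.6,  0.6),
    "surprise":  ( 0.3,  0.8),
    "neutral":   ( 0.0,  0.0),
}

def nearest_label(valence: float, arousal: float) -> str:
    """Map a continuous (valence, arousal) prediction to the closest discrete label."""
    return min(PROTOTYPES, key=lambda k: math.dist((valence, arousal), PROTOTYPES[k]))

print(nearest_label(0.7, 0.4))  # -> "happiness"
```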

4. Human Emotional Expression Recognition

The performance of an emotion recognition system greatly depends on the dataset’s characteristics. Publicly available datasets for emotion recognition are used to identify specific emotions and serve as a benchmark to establish a standard comparison of emotion recognition algorithms. Each database has its advantages and disadvantages. Based on current methodologies for dataset collection for emotion recognition, data can be categorized as natural/naturalistic, induced and acted. Acted data are recordings of subjects acting based on pre-decided scripts—emotion data are collected while an individual feigns an emotion. Induced data are data collected when an individual observes a scene that is expected to induce a specific emotion—emotion data are collected while an individual experiences multimedia stimuli. Natural data, such as natural videos, are gathered from recordings taken without the expressers knowing that the emotional data are being extracted from them. Acted datasets can suffer from inaccurate actions by subjects, leading to corrupted samples or erroneous information for the training dataset (feigned expression). For the induced data, an individual might express an emotion we do not expect. In such a case, data validation will be the next step; for example, using facial action units (FAUs) in the case of facial expressions. The main obstacle is obtaining accurate data to train the recognition. Class imbalances exist in almost all the databases. Therefore, to address this problem, recognition methods implement different data augmentation mechanisms, and the evaluation of the recognition methods uses performance measures that are less sensitive to imbalance.
Emotion data and emotion recognition methods focus on a single modality (emotion state assessed either from vocal expression, visual expression or physiological cues) or multimodality (emotion state assessed, for example, from audio–visual expressions). The emotion recognition model based on single-modality data has some limitations that can be solved using effective emotional multimodality. However, since the multimodal methods take advantage of a particular unimodal method for each modality, the unimodal methods that are part of the multimodal method must also perform well on their own. Some studies have shown that multimodal methods can have lower performances, which happens when the necessary steps are not taken to ensure that the multimodal methods improve the unimodal methods [72]. Some unimodal methods can recognize subtle emotional expressions, and by combining subtle expression recognition methods from multiple modalities, multimodal methods for subtle expression recognition can be expected to perform better.
Emotion recognition methods can directly classify emotion states. However, they can also classify these states based on non-emotional categories, such as the objective states made of combinations of action units (AUs) for facial recognition [73]. However, the classification must be followed by emotion interpretation to derive the emotions from the objective states. Since emotion is dynamic, emotion has a start time and an end time. The period of emotion can be divided into stages: onset (start time of an emotion experience), apex (time of the maximum emotion intensity that can be inferred; for example, from the maximum muscle movement of the facial expression starting from the onset), and offset (end of an emotional experience). There are also two transition periods: onset to apex and apex to offset. Therefore, the recognition results will depend on the emotion stage of the input data frame (or image) to the recognizer for the case of static expression.
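As a minimal illustration of the onset/apex/offset structure, the sketch below picks an apex frame as the frame that deviates most from the onset frame; this is a deliberately naive proxy, assuming a grayscale clip whose first frame is the onset, and is not how published apex-spotting methods operate.

```python
import numpy as np

def find_apex_index(frames: np.ndarray) -> int:
    """Pick the frame with the largest mean absolute difference from the onset frame.

    frames: array of shape (T, H, W), a grayscale clip whose first frame is the onset.
    A deliberately naive proxy for apex spotting, used only to illustrate the
    onset/apex/offset structure described in the text.
    """
    onset = frames[0].astype(np.float32)
    deviation = np.abs(frames.astype(np.float32) - onset).mean(axis=(1, 2))
    return int(np.argmax(deviation))

# Random data standing in for a real expression clip:
clip = np.random.rand(30, 64, 64).astype(np.float32)
print("apex frame index:", find_apex_index(clip))
```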

4.1. Single Modality

4.1.1. Visual Modality

Visual emotion modalities such as facial expression, body posture, hand or head gestures and body movement play a significant role and are essential for maintaining human relationships. During face-to-face human interaction, facial expressions significantly affect the message received by the listener [74]. Poor recognition of visual emotion expression in humans is usually associated with an inability to interact effectively in social situations [75]. The visual emotion cues that clearly show a person’s positive or negative emotions are gestures, body movements, changes in body posture, and movement of a part of the body that includes the head, limbs or hands (e.g., scratching of the head). Other visual emotion cues, which form part of a facial expression but can be considered independently to assess the emotion conveyed, are changes in mouth length, pupil size or eye movements. In the case of face-to-face communication, spatial distance from the addressee can also be an emotional cue. In human-to-human interactions, facial expression contributes significantly to emotion perception, while other visual cues, such as posture, have a more subtle effect on emotion perception.
Visual emotion recognition uses video footage or images of visual expressions and exploits computer vision and machine learning algorithms to achieve automated expression recognition. Computer vision approaches are used to extract features from facial data and classify these data as one of the categories of emotional states. The performance of visual emotion recognition depends on both feature extraction and classification. Visual data, such as face data, are less sensitive to noise. Though there have been substantial advances in visual emotion recognition, the current methods do not achieve sufficient performance, given the high intraclass variation.
Facial emotional expression is commonly divided into macro-expressions and micro-expressions. Macro-expressions are more applicable to entertainment applications. The recognition of human emotion based on macro-expressions may be misleading in other applications, given that some people may hide their true emotions. Micro-expressions apply to healthcare, marketing, education, security and other applications. Facial micro-expressions are spontaneous and involve brief facial muscle movements that are not subject to people’s consciousness. They reveal the genuine emotion of the subject [76]. Their intensity is very subtle and occurs in only specific parts of the face [77,78]. They occur when people try to hide their true emotions, either via suppression (deliberate concealment) or repression (non-conscious concealment) [79]. These states occur in a number of video frames without significant recognizable facial motions. The subtlety of micro-expressions can result in spatial features that are insufficient for recognizing the expressions, even in the apex frames. A full facial macro-expression lasts between 1/2 and 4 seconds [80] and is easily identifiable by humans. Micro-expressions are fleeting and imperceptible, typically lasting less than 1/5 of a second [80]. Some micro-expressions are as short as 1/25 of a second [77,79], and some may last as long as 1/2 to 1 second [76,81]. Typically, a micro-expression lasts between 1/25 and 1/5 of a second. It is difficult for a human to notice or recognize micro-expressions; for example, Frank et al. [82] found that only highly trained individuals can distinguish between various micro-expressions, and even then the recognition accuracy is just 47%. There is a need to design effective methods to automatically recognize micro-expressions, given that human performance on micro-expression recognition remains considerably low.
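To put these durations in perspective, the short calculation below (an illustrative aside, not part of the cited studies) converts the 1/25–1/5 s range into frame counts at a few common camera frame rates, showing how few frames a micro-expression can occupy in ordinary video.

```python
# How many video frames a micro-expression spans at common camera frame rates,
# given the 1/25 s to 1/5 s duration range cited in the text.
for fps in (25, 30, 60, 200):
    shortest = fps * (1 / 25)
    longest = fps * (1 / 5)
    print(f"{fps:>3} fps: about {shortest:.0f} to {longest:.0f} frames")
# At 25-30 fps a micro-expression may occupy only a handful of frames,
# whereas a 200 fps high-speed camera captures roughly 8 to 40 frames.
```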
Macro-expression data are in the form of still images (static displays) or videos, and micro-expression data are in the form of videos. While some algorithms extract emotion from static displays (e.g., static image or apex frame in a visual expression video clip), other algorithms also exploit the temporal information embedded in the video (i.e., the use of the dynamic expression display). Recognition methods that add a time dimension (i.e., dynamic expression), especially micro-expression recognition, have been shown to recognize spontaneous subtle expressions. The exploitation of dynamic displays has been the core of facial micro-expression and dynamic emotional body expression recognition. The dynamic emotional body movements involve both changes in positions and displacements of body joints. Dynamic expression recognition depends on factor loading, the number of significant video co-loadings and categories.
The Facial Action Coding System (FACS) encodes facial muscle changes in response to emotion states [83,84]. The system establishes the ground truth of each action unit’s exact beginning and end time. According to FACS, each facial emotional expression is identified based on a combination of action units (AUs), also referred to as facial action units. Davison et al. [73] argued that using facial action units during emotion recognition instead of emotion labels can define micro-expressions more precisely, since the training process can learn based on specific facial muscle movement patterns. They further proved that this leads to higher classification accuracies. “Objective classes” based on FACS action units have been used as categories for micro-expression recognition. Traditionally, facial macro-expression recognition has been achieved through the extraction of handcrafted features, including Scale-Invariant Feature Transform (SIFT), Histogram of Oriented Gradients (HOG), and Local Binary Patterns (LBP). A classifier subsequently utilizes these features to perform the recognition task (see Table 2 for examples). LBP, with its simplicity of computation, is commonly used because of its robustness towards illumination changes and image transformations [85]. Classical LBP characterizes local textural information by encoding binary codes into histograms. Handcrafted feature extraction methods for identifying macro-expressions are categorized as either geometric-based or appearance-based. Geometric features are based on the action units or the Euclidean distances between them and are among the features most used in facial expression recognition (e.g., [86,87,88]). Geometric features lack information on some parts of the face. They are sensitive to noise, and the accumulated errors during tracking produce inaccurate features. Other types of features commonly used in facial expression recognition are appearance-based features. These features do not require detecting parts of a face; instead, they consider the entire face structure [89,90]. They can encode fine patterns in the facial image and are less sensitive to noise. However, a misalignment of the face reduces the recognition performance due to the extraction of features from unintended locations. Methods that combine geometric-based and appearance-based features for facial expression recognition (e.g., [91]) and deep learning-based methods (e.g., [92,93]) overcome the shortcomings of the geometric feature-based methods and appearance-based methods. However, none of the current classical methods consider the shape deformations related to each facial emotional expression.
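As a minimal sketch of the appearance-based, handcrafted pipeline described above, the following Python example extracts a uniform-LBP histogram from a face crop and feeds it to an SVM; in practice, histograms are computed per facial block and concatenated, and the synthetic data here merely stand in for a real, labeled expression dataset.

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray_face: np.ndarray, points: int = 8, radius: int = 1) -> np.ndarray:
    """Compute a uniform-LBP histogram for a grayscale face crop."""
    codes = local_binary_pattern(gray_face, points, radius, method="uniform")
    n_bins = points + 2  # uniform patterns plus one bin for non-uniform codes
    hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
    return hist

# Hypothetical stand-ins for aligned grayscale face crops and their emotion labels:
faces = [(np.random.rand(64, 64) * 255).astype(np.uint8) for _ in range(20)]
labels = np.arange(20) % 6  # six emotion classes

X = np.stack([lbp_histogram(f) for f in faces])
clf = SVC(kernel="linear").fit(X, labels)
print(clf.predict(X[:3]))
```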
Facial micro-expressions are characterized by small-scale facial motions that result in feature vectors with a low discriminative power. Micro-expression recognition aims to identify small-scale facial motions. The first successful recognition method for spontaneous facial micro-expressions was presented in ref. [96]. The method extended from texture features to spatial–temporal features. Researchers have mainly chosen the local binary pattern with three orthogonal planes (LBP-TOP) as the primary baseline feature extractor. The LBP-TOP is a spatiotemporal extension of the classic local binary pattern (LBP) descriptor. LBP-TOP extracts the histograms from the three planes, XY, XT and YT, and concatenates them into a single-feature histogram. Wang et al. [97] proposed a method to reduce the redundancies in the LBP-TOP by utilizing only six intersection points in the 3D plane to construct the feature descriptor. An integral projection technique was also proposed to preserve the property of micro-expressions and enhance the discrimination of micro-expressions [98]. Spatio-temporal LBP with integral projection (STLBP-IP) applies the LBP operator to horizontal and vertical projections based on difference images. The method is shape-preserving and robust against white noise and image transformations. Examples of other handcrafted features for micro-expression recognition are LBP-MOP [99] (with an improved integral projection). The LBP-TOP, with preprocessing using the temporal interpolation model (TIM) [100], uniformly samples a fixed number of image frames from the constructed data manifold. LBP-TOP (sparsity-promoting dynamic mode decomposition [DMDSP]) acts to select only the significant temporal dynamics when synthesizing a dynamically condensed sequence [21]. LBP-TOP (EVM + HIGO) magnifies the video (Eulerian video magnification) in an attempt to accentuate the subtle changes before feature extraction [101,102]. Zhao et al. [103] used the dynamic features to recognize macro-expressions; specifically, LBP-TOP.
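A simplified sketch of the LBP-TOP idea is given below: LBP histograms are computed on the XY, XT and YT planes of a video cube and concatenated. For brevity, only one central slice per plane is used here, whereas the published descriptor aggregates histograms over all slices and spatial blocks.

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_top(clip: np.ndarray, points: int = 8, radius: int = 1) -> np.ndarray:
    """Concatenate uniform-LBP histograms from the XY, XT and YT planes of a clip.

    clip: grayscale video cube of shape (T, H, W). Simplified: one central slice
    per plane instead of aggregating over all slices and spatial blocks.
    """
    t, h, w = clip.shape
    planes = [
        clip[t // 2, :, :],  # XY plane (appearance)
        clip[:, h // 2, :],  # XT plane (horizontal motion over time)
        clip[:, :, w // 2],  # YT plane (vertical motion over time)
    ]
    n_bins = points + 2
    hists = []
    for plane in planes:
        codes = local_binary_pattern(plane, points, radius, method="uniform")
        hist, _ = np.histogram(codes, bins=n_bins, range=(0, n_bins), density=True)
        hists.append(hist)
    return np.concatenate(hists)

clip = (np.random.rand(40, 64, 64) * 255).astype(np.uint8)
print(lbp_top(clip).shape)  # (30,) = 3 planes x 10 bins
```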
The case of adaptive motion magnification combined with the LBP-TOP method was proposed in ref. [104]. Adaptive motion magnification techniques are used to emphasize micro-expression motion. Although motion magnification improves expression class separability, the performance is sensitive to the magnification parameters, such as the frequency bands and magnification factors, which must be chosen carefully to achieve good results. Motion information is used to portray the subtle changes exhibited by micro-expressions. For example, the extraction of a derivative of optical flow called the optical strain was used originally for micro-expression spotting [105] but was later adopted as a feature descriptor for micro-expression recognition [106,107]. Leveraging the discriminativeness of optical flow, other approaches that exploit the optical flow have been proposed, such as bi-weighted oriented optical flow (Bi-WOOF) and FDM [108,109]. LBP-TOP-HOOF [110], a handcrafted approach that uses K-SVD, uses optical strain and aims to preserve the temporal dimension as its dynamics [107]. LBP-TOP + STM [111] also uses TIM, the same as in ref. [100]. Table 3 presents a comprehensive overview of micro-expression recognition methods that utilize handcrafted features.
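The optical-flow-based descriptors above start from a dense flow field between two frames. The following OpenCV sketch (an illustration, not the Bi-WOOF or optical-strain implementation itself) computes the horizontal and vertical flow between hypothetical onset and apex face crops and converts them to magnitude and orientation.

```python
import cv2
import numpy as np

def onset_apex_flow(onset_gray: np.ndarray, apex_gray: np.ndarray) -> np.ndarray:
    """Dense optical flow (horizontal and vertical fields) between onset and apex frames."""
    flow = cv2.calcOpticalFlowFarneback(
        onset_gray, apex_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
    )
    return flow  # shape (H, W, 2): flow[..., 0] horizontal, flow[..., 1] vertical

# Hypothetical stand-ins for real onset/apex face crops:
onset = (np.random.rand(128, 128) * 255).astype(np.uint8)
apex = (np.random.rand(128, 128) * 255).astype(np.uint8)
flow = onset_apex_flow(onset, apex)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(flow.shape, float(magnitude.mean()))
```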
Handcrafted feature extraction methods demand highly specialized knowledge of data, and the extraction can be highly computationally involved. The existing handcrafted features have limitations due to their reliance on prior knowledge and heuristics. Deep learning models have been used to extract emotion features and distinguish emotional states. Unlike the handcrafted feature extraction, deep learning models require less specialized data knowledge and minimal reliance on prior knowledge. Handcrafted feature extraction requires feature detection during the preprocessing stage, which affects the overall feature extraction and classification. The handcrafted features also fail to perform well on a dataset with more image variation and partial faces. Spontaneous micro-expression recognition using handcrafted features has a sophisticated design and limitations regarding the recognition rate for practical applications [114]. Deep learning overcomes the imbalanced dataset problem by effectively learning features from imbalanced data using data augmentation. Data augmentation is also used to avoid overfitting. Deep learning also has the property of translation invariance. With the success of the convolutional neural network (CNN) in visual object recognition and detection, visual emotion recognition methods have relied mainly on CNN. Although deep learning is still developing, it has already achieved good emotion recognition rates. However, further exploration of various deep learning frameworks is necessary to enhance these rates and improve the recognition of subtle movements in facial expressions. Table 4 provides an overview of different deep learning approaches used specifically for macro-expression extraction, illustrating the advancements and options available in this area.
The first deep learning-based method for facial micro-expression analysis was proposed in 2016. The basic micro-expression recognition algorithms use the standard 2D CNNs (see Figure 3) commonly used for static displays on the apex frame of the original video clips and extract spatial information (e.g., the use of ResNet [115]). However, these algorithms do not encode the muscle motions. Other methods first encode the muscle motion (e.g., subtle facial motion) from the original videos. The motion-encoded data, such as the optical flow data, are directly supplied to an off-the-shelf 2D CNN (e.g., ResNet and VGG). This 2D CNN accepts 2D input RGB color images using transfer learning (e.g., the use of ResNet [116]), retraining (e.g., the use of ResNet [117]) or modification of the off-the-shelf/backbone networks (e.g., the modification of ResNet for the custom 2D CNN presented in [116]). If the networks are retrained, they can be initialized using large datasets such as ImageNet2012 (e.g., in [117]). Optical flow provides features that are robust to the diversity of facial textures. Consequently, to reduce the computational complexity in terms of time, the optical flows are extracted between the onset and apex frames of micro-expression video clips—it is between these two moments that the motion is most likely to be the strongest. Commonly, one optical flow field (either vertical, horizontal, or magnitude) or a pair of horizontal and vertical fields are used as inputs to the 2D CNN. For facial micro-expressions, the vertical optical flow field that encodes the vertical motion has been proven to offer better results than the horizontal field. Furthermore, to reduce the computational complexity over time, the vertical optical flow field alone was used with a custom shallow 2D CNN [116] for fast training.
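A minimal PyTorch sketch of this design pattern, a single (vertical) optical-flow channel fed to a shallow 2D CNN, is shown below; the layer sizes and input resolution are illustrative assumptions and do not correspond to the specific architecture of any cited work.

```python
import torch
import torch.nn as nn

class ShallowFlowCNN(nn.Module):
    """A minimal shallow 2D CNN over a single (vertical) optical-flow field.

    Illustrative sketch of the pattern described in the text (one flow channel,
    few convolutional layers to limit overfitting); not a cited architecture.
    """
    def __init__(self, num_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 28 * 28, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Batch of 8 vertical-flow maps at a hypothetical 112x112 resolution:
flows = torch.randn(8, 1, 112, 112)
logits = ShallowFlowCNN()(flows)
print(logits.shape)  # torch.Size([8, 5])
```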
Table 4. Facial macro-expression recognition using deep learning.

| Ref | Year | Database | Features/Classifier | Best Performance |
|---|---|---|---|---|
| [118] | 1994 | in-house | Neural network | Acc: 68% to 89% |
| [119] | 2002 | JAFFE | Neural network | Acc: 73% |
| [120] | 2005 | DFAT-504 | Gabor filters + AdaBoost/SVM | Acc: 93.3% |
| [121] | 2016 | FER2013 + SFEW 2.0 | DNNRL | Acc: 71.33% |
| [122] | 2017 | CK+; Oulu-CASIA; MMI | PHRNN + MSCNN | Acc: 98.50%; 86.25%; 81.18% |
| [123] | 2018 | in-house | Wavelet entropy/Neural network | Acc: 96.80% |
| [124] | 2018 | KDEF; CK+ | CNN/SVM | Acc: 96.26%; 95.87% |
| [125] | 2020 | FACES; Lifespan; CIFE; FER2013 | VGG-16/RF | Acc: 97.21%; 98.08%; 84.00%; 71.50% |
| [126] | 2023 | CK+; FER2013 | CNN + DAISY/RF | Acc: 98.48%; 70.00% |
| [127] | 2023 | JAFFE; CK+; FER2013; SFEW 2.0 | VGG19 + GoogleNet + ResNet101/SVM | Acc: 97.62%; 98.80%; 94.01%; 88.21% |
| [128] | 2024 | Aff-Wild2 | Landmarks/GCN | Acc: 58.76% |
Peng et al. [114] proposed a two-stream 3D CNN model (see Figure 4). In contrast to a standard one-stream 3D CNN model (see Figure 5), which also captures the temporal dimension of facial expressions, the two-stream 3D CNN model used two datasets (CASME and CASME II) to increase the amount of data available for meaningful training of a deep learning network. Because the two datasets have different frame rates, each dataset was fed into a separate stream. The model uses a shallow network to avoid overfitting. The input of each stream receives the optical flow data to enrich the input data with motion information and to ensure that the shallow network acquires high-level features. After feature extraction using the 3D CNN, classification was implemented using a support vector machine (SVM) on four categories: negative, positive, surprise and others [114]. Using expression states (i.e., stages of a single expression) instead of an expression as the input increases the input data and improves the expression class separability of the learned micro-expression features [129]. The method proposed by Kim et al. [129] uses a 2D CNN to learn the spatial feature representation in the first stage and a consequent LSTM in the second stage (see Figure 6). The spatial model is learned, and the expression-state features learned from the trained spatial model are fed into an LSTM network to learn the temporal dependencies of expression-state spatial features for a given emotion [129]. The spatial features of an emotion category are learned at the expression-state level of a video emotion using a CNN. Three expression states (onset, apex, offset) and the transitions from onset to apex and apex to offset are counted as states. The aim is to cluster the expression states of each emotion category and, by doing so, to cluster the emotion categories. In other words, the spatial features of an emotion are learned regardless of the expression state. The loss function ensures continuity in consecutive states of an emotion category. Several objective functions are optimized during spatial learning to improve expression class separability, and the expression states are adopted in the objective function during spatial feature learning.
The temporal dimension represents the emotion dynamics over time, which is crucial for recognizing facial movements and discriminating between emotion classes. Spatial and temporal modules are needed to attain a good performance. Khor et al. [130] proposed a method that uses an LSTM temporal module to learn temporal dynamics, preserve the temporal dimension and characterize temporal dynamics. For the spatial learning stage, the input data are encoded by a CNN into a fixed-length vector ϕ(x_t) that represents the spatial features at time t. Subsequently, for the temporal learning stage, the 4096-dimensional fixed-length vector ϕ(x_t) is passed to a recurrent neural network (LSTM) to learn the temporal dynamics. This is similar to ref. [129], but with no consideration of expression state. Two variants are proposed, both having the following inputs: optical flow (a 3D flow image made of the horizontal flow dimension, vertical flow dimension and optical flow magnitude), optical strain and grayscale images. The spatial enrichment variant uses a single VGG-16 CNN applied to a large stacked input comprising the 3D optical flow image, a 2D optical strain image, and a grayscale image of the micro-expression data. The VGG-16 was trained from scratch, and the trained model gives a 4096-dimensional fixed-length feature vector at the last fully connected layer to be fed into the LSTM. The temporal enrichment variant uses three pre-trained VGG-Face models that were trained on the large-scale Labeled Faces in the Wild (LFW) data for face recognition [131] (i.e., transfer learning that allowed for fast convergence because both the micro-expression data and the LFW data involve faces and their components). A single VGG-Face model takes the data of a 3D optical flow image, a 3D optical strain image converted from the 2D optical strain or a 3D grayscale image converted from the 2D grayscale image, and each VGG-Face model outputs a 4096-dimensional fixed-length vector ϕ(x_t). The outputs of the last fully connected layers are fused before being fed into the LSTM (a 12,288-length spatial feature vector). Each video sequence was interpolated using the temporal interpolation model with the aim of keeping the length of the LSTM input constant. The learning rate is tuned to be smaller than typical rates because of the subtleness of micro-expressions, which poses difficulties for learning. The temporal dimension enrichment (TE) variant outperforms its spatial dimension enrichment (SE) counterpart, demonstrating the importance of fine-tuning separate networks for each type of data (for a small dataset). SE performs better than TE for larger training datasets composed of the CASME II and SAMM datasets. SE needs more data than TE, given that it has a large input dimension (curse of dimensionality). The use of optical flow is more beneficial than the use of raw pixel intensities in providing a proper characterization of the input data to the network, according to the results of the ablation study in ref. [130].
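The CNN-then-LSTM pattern used by these methods can be sketched as follows in PyTorch: a 2D CNN encodes each frame to a fixed-length spatial feature vector, and an LSTM models the temporal dynamics across the sequence. The feature sizes are arbitrary illustrative choices rather than those of the cited VGG-16/VGG-Face pipelines.

```python
import torch
import torch.nn as nn

class CNNLSTMRecognizer(nn.Module):
    """Sketch of the CNN-then-LSTM pattern: a 2D CNN encodes each frame to a
    fixed-length spatial feature vector, and an LSTM models the temporal
    dynamics of the sequence. Sizes are illustrative assumptions only."""
    def __init__(self, feat_dim: int = 256, hidden_dim: int = 128, num_classes: int = 5):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(), nn.Linear(32 * 4 * 4, feat_dim), nn.ReLU(),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 1, H, W)
        b, t = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1)).view(b, t, -1)  # per-frame spatial features
        _, (h_n, _) = self.lstm(feats)                            # temporal dynamics
        return self.head(h_n[-1])

clips = torch.randn(4, 10, 1, 64, 64)  # 4 clips of 10 interpolated frames each
print(CNNLSTMRecognizer()(clips).shape)  # torch.Size([4, 5])
```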
Instead of optical flow, neural spikes can be encoded from human facial and physiological data to capture emotion. Considering these spikes’ timing, a three-dimensional representation of human emotion is generated. This process uses brain-inspired spiking neural networks (SNNs), such as KEDRI’s NeuCube, which belong to the third generation of artificial neural networks (ANN). SNNs consist of 3D spatio-temporal structures applied to facial and physiological emotion recognition. This method achieves classification accuracy comparable to state-of-the-art deep learning approaches that utilize facial expressions and physiological signals. Table 5 presents a categorization of deep learning methods specifically designed for micro-expression extraction.
A deep learning model’s discriminative power is primarily determined by the quality of the features it generates. The model must also exhibit strong generalization across a wide range of subjects. Moreover, the classifier plays a crucial role in both discriminative power and generalization [130].
Data augmentation techniques are crucial to enhance generalization and improve model performance. Various methods of data augmentation exist, with temporal interpolation being particularly useful for spatio-temporal methods. Temporal linear interpolation can be applied to a sequence of frames to increase and balance data classes [114,129]. Additionally, a temporal interpolation model can be employed to fit the sample sequence into a recurrent model that expects a fixed temporal length [130].
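As a minimal illustration (an assumption for exposition, not the temporal interpolation model of ref. [130], which operates on a graph embedding of the sequence), a frame sequence can be resampled to a fixed temporal length by linear interpolation along the time axis:

```python
# Linear temporal interpolation of a frame sequence to a fixed length (illustrative sketch).
import numpy as np

def interpolate_sequence(frames: np.ndarray, target_len: int) -> np.ndarray:
    """Resample a (time, height, width[, channels]) clip to target_len frames."""
    src_len = frames.shape[0]
    src_times = np.linspace(0.0, 1.0, src_len)
    tgt_times = np.linspace(0.0, 1.0, target_len)
    flat = frames.reshape(src_len, -1).astype(np.float32)
    # Interpolate each pixel trajectory independently over normalized time.
    resampled = np.stack(
        [np.interp(tgt_times, src_times, flat[:, i]) for i in range(flat.shape[1])],
        axis=1,
    )
    return resampled.reshape((target_len,) + frames.shape[1:])

clip = np.random.rand(7, 32, 32)          # a 7-frame grayscale micro-expression clip
fixed = interpolate_sequence(clip, 10)    # stretched to the 10 frames the model expects
print(fixed.shape)                        # (10, 32, 32)
```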
The performance of new automatic facial micro-expression recognition methods is commonly compared against the baseline LBP-TOP method (using either the originally reported results or a reproduced LBP-TOP implementation) [130]. The recognition performance measures are the F1 score, the weighted average recall (WAR), or the accuracy. The unweighted average recall (UAR) provides a balanced accuracy (averaging the accuracy scores of each class without regard to class size), and the macro-averaged F1 score provides a balanced metric for highly imbalanced data [111].
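The following sketch (using scikit-learn; the label values are arbitrary placeholders) shows how these metrics can be computed; UAR is simply the recall averaged over classes with equal weight per class, while WAR coincides with overall accuracy.

```python
# Computing WAR, UAR and macro-F1 for an imbalanced label set (illustrative values).
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_true = [0, 0, 0, 0, 1, 1, 2]          # imbalanced ground truth
y_pred = [0, 0, 1, 0, 1, 0, 2]          # hypothetical predictions

war = accuracy_score(y_true, y_pred)                   # weighted average recall == accuracy
uar = recall_score(y_true, y_pred, average="macro")    # unweighted average recall
macro_f1 = f1_score(y_true, y_pred, average="macro")   # balanced F1 over classes

print(f"WAR={war:.3f}, UAR={uar:.3f}, macro-F1={macro_f1:.3f}")
```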
Micro-expression recognition methods are evaluated using spontaneous micro-expression databases, and publicly available visual datasets serve as the benchmark databases for visual emotion recognition. With the advent of spontaneous micro-expression databases, many micro-expression recognition methods have been proposed. These databases comprise micro-expression video sequences; the existing public facial expression datasets are given in Table 6. Some micro-expression databases (e.g., USF-HD [105] and Polikovsky et al.'s [142] databases) contain micro-expressions that are actually posed or acted rather than naturally spontaneous. Posed expressions are captured by requesting that the participants form a facial expression for a certain emotion (e.g., the facial image data in the entire CK dataset and a significant portion of the CK+ dataset). Acted expressions are collected when participants are instructed to perform certain reactions and make expressions in response to certain stimuli (e.g., the eNTERFACE dataset). Posed and acted expressions are induced based on different numbers of basic emotions, and they tend to hide the true state: it is quite common for the real emotion of a participant not to match the facial expression made. A further disadvantage of posed and acted datasets is that the duration of their micro-expressions might not match that of naturally spontaneous ones (e.g., the micro-expressions in USF-HD and Polikovsky et al.'s databases last longer (2/3 s) than Paul Ekman's definition (1/3 s)). Spontaneous natural expressions are labeled with expression classes by trained coders based on the presence of FACS action units (e.g., the micro-expression databases SMIC, CASME II and SAMM). There are also limitations on the size of the datasets, which can be insufficient for proper experimentation and analysis (YorkDDT [143] contained only 18 micro-expressions). The CASME II and SAMM micro-expression datasets have objective classes, as provided in ref. [143].
Emotions can also be observed from images and videos of hand/head and body gestures. Gesture-based emotion recognition has applications in realistic gaming environments and can help neuro-atypical individuals integrate better into society [154]. Emotion recognition from full-body movement (i.e., a set of body posture/gesture features) and other emotional cues can be used in rehabilitation [155]. Noroozi et al. [156] proposed a method to assess human emotions from body gestures with near-human (perception) accuracy. The method uses static expressions extracted from the position of the head and the inclination of the body. Using a neural network trained on 50 actors of varying body types and ethnicities in different poses, the method achieved more than 87% accuracy despite the varied input data; relative to human emotion recognition performance, this corresponds to roughly 95%. Santhoshkumar et al. [16] used CNNs and achieved an accuracy of over 95% from videos. Recent reviews are available on emotional body gesture recognition [156] and on emotion recognition based on body movement [157]. Table 7 provides an overview of body gesture-based recognition methods.
Gesture-based emotion recognition is challenged by the degree of subjectivity involved in establishing benchmarks. Although visual cues carry a large share of the information in human-to-human interactions, the main disadvantage of visual emotion cues is that they can be feigned or suppressed. Facial data must also be diversified across different ethnic groups and cultures [161].

4.1.2. Speech Modality

Speech/voice emotion recognition is preferred in applications such as e-learning, telemarketing, call centers and phone banking [162]. Psychological studies of emotion show that vocal parameters, especially pitch, intensity, speaking rate and voice quality, play an essential role in emotion recognition and sentiment analysis [163]. Vocal emotion cues, which play a crucial role in emotion recognition, include prosodic variables/features (such as pitch, pause duration, tone of voice, intonation, rate of speech and voice probabilities) and energy-related features. Other emotion recognition features used by some researchers are the zero-crossing rate, signal energy, entropy of energy, spectral centroid, spectral spread, spectral entropy, spectral flux, spectral roll-off, vocal formants (especially their frequencies), Teager energy operator (TEO)-based features (or Teager–Kaiser energy operator features), the beat histogram (beat sum, strongest beat), perceptual linear predictive coefficients (PLPs), linear prediction cepstral coefficients (LPCCs), log frequency power coefficients (LFPCs), Mel-frequency cepstral coefficients (MFCCs) and bark-frequency cepstral coefficients (BFCCs) [164,165]. Some features, such as prosodic features, carry little information about involuntary emotional expression. Various works have investigated which types of features are needed for better analysis [166,167]. Speech emotion data are difficult to obtain, and the available datasets contain only a few samples; public speech databases consist almost solely of acted data with no prosody control. Table 8 details the public datasets commonly used for speech emotion recognition.
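As a minimal illustration of how several of these acoustic features can be extracted in practice (a sketch assuming the librosa library and a synthetic signal, not a prescription of the feature sets used in the cited works):

```python
# Extracting a few common speech-emotion features with librosa (illustrative sketch).
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr * 2).astype(np.float32)   # stand-in for a 2 s utterance

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)            # MFCCs
zcr = librosa.feature.zero_crossing_rate(y)                   # zero-crossing rate
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)      # spectral centroid
rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)        # spectral roll-off
rms = librosa.feature.rms(y=y)                                # frame energy
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)                 # pitch (fundamental frequency)

# A simple utterance-level representation: mean and std of each frame-level feature.
features = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [zcr.mean(), centroid.mean(), rolloff.mean(), rms.mean(), np.nanmean(f0)],
])
print(features.shape)   # (31,)
```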
Recent speech emotion recognition methods are detailed in Table 9 and Table 10.
Morais et al. [178] used feature engineering for the upstream process and ECAPA for the downstream process and reported an accuracy of 77%. Balakrishnan et al. [184] used the Sustained Emotionally coloured Machine–human Interaction using Nonverbal Expression (SEMAINE) speech dataset to test two models: one utilizing a basic neural network, and another based on a CNN, to analyze speech emotions. The CNN exhibited better accuracy and allowed for learning throughout operation, thus minimizing the risk of forgetting and giving a better performance. Alu et al. [185] used CNNs and reported an accuracy of over 70%. Tzirakis et al. [186] used CNNs that took a raw speech waveform as input and identified the individual's emotional state with greater accuracy than other established methods. Schuller et al. [187] designed speech emotion recognition using LSTMs and restricted Boltzmann machines (RBMs).
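The following minimal PyTorch sketch (an illustrative assumption, not the architecture of ref. [186]) shows the general idea of a 1D CNN that classifies emotion directly from a raw speech waveform:

```python
# 1D CNN over a raw speech waveform for emotion classification (illustrative sketch).
import torch
import torch.nn as nn

class RawWaveformCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=80, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                     # utterance-level pooling
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, waveform):                         # waveform: (batch, 1, samples)
        x = self.features(waveform).squeeze(-1)          # (batch, 64)
        return self.classifier(x)

model = RawWaveformCNN()
batch = torch.randn(2, 1, 16000)                         # two 1 s clips at 16 kHz
print(model(batch).shape)                                # torch.Size([2, 4])
```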
Unlike facial affect recognition, voice affect recognition depends on the language and dialect of the speaker, and the voice signal is affected by external noise. The models for speech emotion recognition are trained on datasets appropriate to the applications they are designed for [188]. Further studies have shown that acoustic parameters change due to oral variations and depend on personality traits [166,167]. Therefore, the data must be diversified so that methods can deal with different language dialects (and speech accents). Speech emotion recognition methods must also consider the scenario of multiple speakers.

4.1.3. Physiological Modality

Emotion recognition using physiological data is one of the applications of brain–computer interfaces (BCIs), in which computers decode people’s emotional states from their brain signals. It has wide applications in healthcare.
Motor (physical) expressions of emotion, such as visual (e.g., facial expression and body gesture) and verbal expressions, can be deceptive to a certain degree. For example, people might feign an emotion they do not feel or express an unfelt emotion simply because it is socially desirable. Physiological data, such as electroencephalograms (EEGs), body temperature, electrocardiograms (ECGs) and electromyograms (EMGs), are less imitable, and their analysis is less affected by human subjectivity. They are therefore more reliable indicators of genuine emotional states [189] and of the mental state of the person involved. EEG signals are the most commonly used physiological data in emotion research: they can indicate people's immediate responses to emotional stimuli with a high temporal resolution and are not confounded by deliberately controlled expressions.
EEG signals are recorded across multiple channels, and using more channels can provide richer data for emotion recognition. The disadvantage, however, is that data collection becomes more expensive and more intrusive for the subjects.
EEG-based emotion recognition using handcrafted features exploits linear or nonlinear dynamic features extracted from EEG signals. Linear features include power spectrum and wavelet features, while nonlinear features include different types of entropies (approximate entropy, sample entropy), the Hurst exponent, fractal dimensions (FDs), correlation dimensions (CDs), the largest Lyapunov exponent (LLE) and higher-order statistics [190,191]. Nonlinear characteristics can reveal emotions better than linear characteristics.
The empirical mode decomposition (EMD) method was used to decompose EEG signals and derive the sample entropy of the derived intrinsic mode functions (IMFs) [192]. In addition, some studies have employed a divide-and-conquer strategy for feature extraction. For example, EEG signals were divided into segments according to the time windows in ref. [193]. Support vector machines (SVMs) are the classification method commonly used for EEG-based emotion recognition. Apart from SVMs, other classical classification methods that have been used for EEG-based emotion recognition are K-nearest neighbor (KNN) and linear discriminant analysis (LDA) [194]. The classification methods that have been used on handcrafted features for EEG-based emotion recognition are detailed in Table 11.
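As a minimal sketch of this classical pipeline (assuming synthetic EEG segments and arbitrary band definitions, not the exact feature sets of the cited studies), band powers can be computed per channel with Welch's method and fed to an SVM:

```python
# EEG band-power features (Welch PSD) + SVM classification (illustrative sketch).
import numpy as np
from scipy.signal import welch
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

FS = 128                                      # sampling rate (Hz)
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_power_features(segment: np.ndarray) -> np.ndarray:
    """segment: (channels, samples) -> per-channel mean power in each band."""
    freqs, psd = welch(segment, fs=FS, nperseg=FS * 2, axis=-1)
    feats = []
    for lo, hi in BANDS.values():
        mask = (freqs >= lo) & (freqs < hi)
        feats.append(psd[:, mask].mean(axis=-1))          # mean power per channel
    return np.concatenate(feats)

rng = np.random.default_rng(0)
segments = rng.standard_normal((60, 4, FS * 4))            # 60 segments, 4 channels, 4 s each
labels = rng.integers(0, 2, size=60)                       # binary valence labels (synthetic)

X = np.array([band_power_features(s) for s in segments])
clf = SVC(kernel="rbf", C=1.0)
print(cross_val_score(clf, X, labels, cv=5).mean())        # chance level on random data
```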
Deep learning has also been applied to EEG-based emotion recognition. Salama et al. [201] packed multiple channels of EEG signals into 3D data formed from segments of 2D signals within a specific period and then used a 3D CNN to classify the 3D data for emotion state recognition.
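A minimal PyTorch sketch of this general idea (an illustrative stand-in, not the network of ref. [201]) stacks EEG segments into a 3D volume and classifies it with 3D convolutions:

```python
# 3D CNN over stacked EEG segments (illustrative sketch, not the model of ref. [201]).
import torch
import torch.nn as nn

class EEG3DCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, x):        # x: (batch, 1, segments, channels, samples_per_segment)
        return self.net(x)

volume = torch.randn(2, 1, 6, 32, 128)   # 6 stacked segments of 32 channels x 128 samples
print(EEG3DCNN()(volume).shape)          # torch.Size([2, 4])
```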
The benefits of exploiting fewer EEG channels are reduced intrusiveness for the participants and a lower data processing cost. Two channels of EEG signals were used in a bispectral analysis to obtain bispectrum features [202]. In addition, energy band and FD features were extracted from two channels of EEG signals [203].
EEG signals are always affected by noise and artifacts (the signal-to-noise ratio of EEG signals is low), which degrades the robustness of the emotion recognition model and affects the emotion recognition accuracy. Because available data resources are limited, or because a specific bio-signal is extremely noisy [204,205], data augmentation (for example, employing a GAN) is used to generate high-quality simulated data and to enhance the model [206,207]. A preprocessing step is also undertaken to attenuate noise and remove artifacts; filtering (noise cancellation) approaches, such as surface Laplacian (SL) filtering [194], are used for this purpose. Despite the number of studies on emotion recognition using EEG features, each with its own advantages and drawbacks, the main limitation remains the availability and quality of data [204,205].
ECG-based emotion recognition also uses handcrafted features. A Bayesian NN was used to analyze heartbeat data obtained by a fitness tracker, captured as a photoplethysmograph, to identify the emotional state of the wearer [208]. With ECG data, two neural networks were proposed: one for signal transformation recognition and another for identifying emotions. The first network was trained on an extensive dataset to identify specific transformations in the input signal, and the second network identified the emotions associated with each transformation. This two-network model outperformed traditional approaches, and the authors reported accuracies close to 98% [209]. Canonical correlation relating these signals to emotion alone could predict the correct emotions from physiological data with over 85% accuracy [210]. Publicly available ECG datasets that have been used for emotion recognition are detailed in Table 12.
There is a need for better physiological datasets for training NNs to identify emotions. However, non-intrusive wearable devices equipped with sensors that provide physiological data, such as photoplethysmogram (PPG), electrodermal activity (EDA) and skin temperature (SKT) signals, yield noisy and rudimentary measurements. Therefore, they are used for specialized use cases (for example, in healthcare).

4.2. Multimodal Emotion Recognition

Multimodal emotion computing is useful in contexts such as e-learning, telehealth, automatic video content tagging and human–computer interaction, where assessing a person's emotional state requires high precision or where a human validator is not available in the loop of the computing system [211]. Multimodal emotion computing is also exploited in real-time multimodal interactions between humans and computer-generated entities (and multimodal interactive agents) [212].
A multimodal emotion analysis approach combines more than one type of human emotional cue (visual expression, speech expression or physiological signal) from an individual to identify and interpret the emotional state; it aggregates and infers the emotional information associated with user-generated multimodal data. The human brain considers multisensory information together for decision-making. In everyday social situations, the way humans express and perceive emotions is usually multimodal, and humans rely on multimodal information more than on unimodal information [213]. Emotions are conveyed through multiple channels, such as voice, facial expression, physiological responses (such as sweating) and body gestures/postures. During the perception of emotion, the audio, visual and possibly tactile modalities are concurrently and cognitively exploited to enable the effective extraction of emotion. For example, we understand speakers' emotions better when we see their facial expressions while they are speaking, and a decision can be made accordingly. Together, the vocal and visual channels provide more information than either provides alone. The brain relies on several sensory input sources to validate events, and using all the sources compensates for any incomplete information that could hinder decision processes. Research has been conducted to replicate this multimodal approach in emotion computing (i.e., to shift from unimodal to multimodal emotion computing in the expectation of a more effective emotion computing system). The ability of the multimodal framework to achieve significant performance improvements over unimodal systems has been confirmed in numerous studies [47,214].
Multimodal data can be used to extract event-related potential (ERP) information; the ERP technique involves, for example, segmenting the EEG signal based on the detection of short-term changes in facial landmarks. A multimodal framework can also be implemented by fusing facial expressions with physiological signals such as the ECG, heart rate, skin/body temperature, skin conductance, respiration signals and pupil size. Videos, for example, provide multimodal data covering the vocal and visual modalities of an audiovisual presentation. Vocal and facial expressions can provide essential cues to better identify true affective states (multimodal emotion recognition using visual and aural information): the aural data in a video express the tone of the speaker, whereas the visual data convey facial expressions, which further aid in understanding the user's affective state. Moreover, combining video data with physiological data should enhance the recognition of the human affective state and lead to a better emotion model.
Multimodal emotion computing integrates the outputs of unimodal systems by fusing information from different modalities using various multimodal fusion techniques applied to data collected from multimodal sources. Existing techniques for fusing information from different modalities for emotion computing fall into two groups: feature-based (feature-level) fusion, in which data are processed as a combined entity, and decision-based (decision-level) fusion, in which data are analyzed per modality and the outputs are then combined [47]. Some algorithms use a combination of both approaches or rely on specified rules to fuse modalities. The framework of a typical multimodal emotion recognition system is given in Figure 7. It consists of two fundamental steps: processing unimodal data separately and then fusing them, and both steps are equally important. Unimodal emotion recognition, an essential preprocessing step for multimodal fusion, must perform well to build an intelligent multimodal system: poor analysis of one modality can worsen the multimodal system's performance, while an inefficient fusion method can ruin the multimodal system's stability. Researchers can identify essential and appropriate visual, audio and physiological machine learning analysis methods from the literature and fuse them using state-of-the-art methods. Multimodal data, such as the data obtained from a video, can be a valuable source of information for emotion analysis, but major challenges need to be addressed. For example, the ways we convey and express opinions vary from one person to another: some people express their emotions more vocally, while others do so more visually. When people express their emotions with more vocal modulation, the audio data may contain the major cues for emotion mining; when a person uses more facial expressions, most of the cues needed for emotion mining can be assumed to reside in the facial expressions.
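The two fusion families can be sketched as follows (a minimal illustration with placeholder feature dimensions and classifiers; the systems listed in Table 13 use far richer encoders): feature-level fusion concatenates modality features before a single classifier, while decision-level fusion averages per-modality class probabilities.

```python
# Feature-level vs. decision-level fusion of two modalities (illustrative sketch).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, n_classes = 200, 3
audio_feats = rng.standard_normal((n, 32))     # placeholder audio features
visual_feats = rng.standard_normal((n, 64))    # placeholder visual features
labels = rng.integers(0, n_classes, size=n)

# Feature-level fusion: concatenate features, train one classifier.
fused = np.hstack([audio_feats, visual_feats])
feature_clf = LogisticRegression(max_iter=1000).fit(fused, labels)

# Decision-level fusion: train one classifier per modality, average their probabilities.
audio_clf = LogisticRegression(max_iter=1000).fit(audio_feats, labels)
visual_clf = LogisticRegression(max_iter=1000).fit(visual_feats, labels)
probs = 0.5 * audio_clf.predict_proba(audio_feats) + 0.5 * visual_clf.predict_proba(visual_feats)
decision_preds = probs.argmax(axis=1)

print(feature_clf.predict(fused)[:5], decision_preds[:5])
```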
Single modalities exhibit limitations in terms of robustness, accuracy and overall performance, which greatly restrict such systems' usefulness in practical, real-world applications. The key advantages of multimodal approaches are that they are likely to generate more realistic results and offer the possibility of realistic human-to-machine interactions in real time [47]. The aim of data fusion (i.e., the multimodal recognition approach) is to increase the accuracy and reliability of the estimates. However, multimodal analysis and processing are complex and still under development. Before real-time multimodal emotion recognition makes its way into our everyday lives (for example, enabled on smart devices), some major research challenges must be addressed to effectively integrate the various modalities. One is synchronization, which is hindered by the noisy characteristics of different sensors (e.g., the audio and visual streams of a video): synchronization involves integrating information from different sources across different time scales and measurement values, and the lack of synchronization of heterogeneous inputs hinders the modeling of temporal information. Multimodal approaches also require more data, given the increased input dimension, which raises the issues of limited data available for analysis and of handling the vast quantities of data involved. For real-time analysis of multimodal big data, an appropriately scalable big data architecture and platform must be designed to cope effectively with the heterogeneous challenges of growing space and time complexity. Moreover, although multiple modalities should complement each other and enhance recognition performance, an expression in one modality can distort an expression in a different modality [18]. Various studies have also been conducted using only visual features for multimodal emotion analysis, where a spatio-temporal 3D structure handles both the unimodal and multimodal cases.
Many studies have used visual and audio modalities for multimodal affect recognition (see Table 13). The human ability to recognize emotions from speech audio alone is about 60%. One study shows that sadness and anger are detected more easily from speech, whereas recognizing joy and fear is less reliable [215]. Audio data can be extracted from video emotional stimuli.
Temporal and spatial processing of EEG features involves event-related potentials. An adaptive multimodal system combines visual, auditory and autonomic nervous system signals from a user to recognize that user's emotions. Emotional features from brain EEG signals are fused with emotional features from the corresponding audio signal or facial expression images. Recognition is achieved by analyzing an individual's bio-signals, such as facial, EEG and eye data, using machine learning models; multimodal emotion recognition architectures have also been based on audio and EEG data, and eye data and EEG signals have been integrated as features in machine learning models, such as SVMs, to identify emotions [216,217,218,219,220,221,222].
Given that fusion techniques for multimodal emotion computing can be borrowed from multimodal methods that use the text modality and spoken/natural language analysis, mainly for sentiment analysis, such multimodal affective computing models are also considered in Table 13. Instead of fusing the different modalities at abstract levels that ignore time-dependent interactions between modalities, Gu et al. [223] used a hierarchical multimodal architecture (with attention and word-level fusion) to classify utterance-level sentiment and emotion from text and audio data. The two modalities are fused at the feature level, with a CNN analyzing the combined data. The accuracy of the multimodal approach was higher than that of the unimodal framework. In addition, the proposed method enabled visualization of each modality's contribution to the overall performance (interpretability), thanks to the proposed synchronized attention over modalities. Table 13 lists deep learning methods applied to multimodal emotion-related signals, while Table 14 lists multimodal emotion recognition methods that use handcrafted features.
Table 13. Multimodal emotion recognition using deep learning features.
Ref | Year | Multimodal Database | Elicitation | Features | Classifier | Average Accuracy | Fusion Method | Modalities
[223] | 2018 | IEMOCAP, EmotiW | - | BiGRU, attention layer | CNN | Best acc: 72.7%, WF1 = 0.726 | Word-level feature-level fusion, CNN | Audio/text
[164] | 2015 | eNTERFACE | - | - | (SVM), ELM | Acc: 88% | Feature-level fusion | Audio/facial/text
[224] | 2021 | RAVDESS | Acted | xlsr-Wav2Vec2.0; AUs/bi-LSTM | - | Acc: 86.70% | Decision-level fusion using multinomial logistic regression | Audio/facial
[225] | 2024 | eNTERFACE'05 | Induced | MobileNetV2; spectrogram/(2D CNN with a federated learning concept) | - | Acc: 93.29% (subject-dependent) | Decision-level fusion using average probability voting | Audio/facial
[226] | 2023 | WESAD; CASE; k-EmoCon | - | Temporal convolution-based modality-specific encoders | FC | Acc: 84.81%; val: 63.29, ar: 66.32; val: 64.07, ar: 50.42 | Feature-level fusion using a transformer | EDA/BVP/TEMP
[227] | 2022 | In-house dataset | Induced | CBAM and ResNet34 | MLP | Acc: 78.32% (subject-dependent) | Data-level fusion | EEG/facial
[228] | 2022 | RAVDESS; SAVEE | - | ConvLSTM2D and CNN; (MFCCs + MS + SC + TZ) CNN | MLP | Acc: 86%; Acc: 99% | Feature-level fusion | Audio/video
[229] | 2022 | SAVEE; RAVDESS; RML | - | 2-stream CNN and bi-LSTM; (ZC, EN, ENE) CNN | MLP | Acc: 99.75%; Acc: 94.99%; Acc: 99.23% | Feature-level fusion | Audio/video
[230] | 2024 | IEMOCAP (facial/audio) | - | AlexNet with contrastive adversarial learning (facial); MFCC, velocity and acceleration + VGGNet; convolutional autoencoder (teacher), CNN (student) | - | Acc: 62.5–85.8% per emotion state | Adaptive decision-level fusion | Facial/audio
[231] | 2023 | ASCERTAIN | Induced | FOX-optimized DDQ | - | Acc: 66.20 | Optimization-based model fusion | Facial/audio/GSR
[232] | 2023 | RAVDESS; CREMA-D | - | 3D CNN with attention mechanism; 2D CNN with attention mechanism | - | Acc: 89.25%; Acc: 84.57% | Cross-attention fusion system (feature-level fusion) | Audio/facial
[233] | 2023 | M-LFW-F (facial) and CREMA-D (audio) | - | Modified Xception model (spectrogram images extracted from the audio signal) | - | Acc: 79.81% | Feature-level fusion between entry flow and middle flow | Audio/facial
[234] | 2023 | AFEW; SFEW; MELD; AffWild2 | - | VGG19 for face; (spectrogram) ResNet50 for audio | - | Acc: 18.06%; Acc: 45.63%; Acc: 48.91% | EmbraceNet+, feature-level fusion | Facial/audio
[235] | 2024 | In-house dataset | Induced | CNN; (EEG topography) CNN | - | Acc: 91.21% | Decision-level fusion | Facial/EEG
[236] | 2024 | FEGE | Acted | 3D-CNN + FC | - | Acc: 89.16% | Model fusion through a shared encoder at the feature level and a Type-2 fuzzy decision system | Facial/gesture
EDA: electrodermal activity, BVP: blood volume pulse, TEMP: skin temperature, EOG: electrooculogram, EMG: electromyography, GSR: galvanic skin response, DDQ: double deep Q-learning, Acc: accuracy, AUs: action units of video, ZC: zero crossing, EN: energy, ENE: entropy of energy, MS: mel spectrograms, SC: spectral contrast, TZ: tonnetz.
Table 15 lists the widely used datasets for multimodal emotion recognition/analysis. To our knowledge, no publicly available datasets for multimodal emotion recognition focus solely on subtle expressions; however, some include a mix of macro and micro-expressions.
Different methods can be used for dimensionality reduction [238], and multiple kernel learning algorithms have been employed to analyze the data [164]. An HMM was used as a classifier to recognize emotion and to measure the statistical dependence across successive time segments [237]. Preprocessing methods have been proposed to reduce the noise in the modality data. CNNs perform well in modeling the spatio-temporal information from the data of the three modalities for emotion recognition. The spectral powers of the pupil diameter data and the blinking rate were extracted from the eye data, while the power spectral density (PSD) was extracted from the different frequency bands of the EEG; these features were then used as inputs for an SVM classifier with a radial basis function (RBF) kernel [219].
The recognition performance of a multimodal fusion classification model (fusing more than one mode of signal) is significantly better, in terms of accuracy, than recognition based on single-modal data; the fusion of multimodal information has proved beneficial in improving the accuracy achievable with a single modality. For the dimensional model, the performance gain can be larger in one of the dimensions [249]. The fusion is achieved using feature-level and decision-level fusion methods within the neural encoding algorithm. AdaBoost was employed to develop a fused method in which an SVM processed the PSD features from the EEG data and a CNN processed the face data [217].

5. Deep Learning Challenges and Solutions for High-Quality Emotion Recognition

In the literature, different methods have been proposed to improve the performance of facial micro-expression recognition tasks and, in particular, to obtain discriminative features. One such method is expression magnification, also referred to as motion magnification, a data processing method that magnifies the motion features of the original micro-expression video clips (examples of magnification for micro-expression recognition can be found in refs. [250,251,252]). In ref. [253], instead of feeding features extracted from the whole face into the classifier, the authors adopted part-informed features, where the motion features extracted from different parts of the face (e.g., eyes, eyebrows, nose and mouth) are treated and classified separately. The part-informed features are obtained by splitting the single feature map produced after the last convolutional layer to form a part-based classification; they can also be fused at a later stage. This method injects structural priors into the classification network. For example, in ref. [117], the feature map after the last convolutional layer is split into two parts representing the eye and mouth areas separately. Part-informed features offer fine-grained information from the input source and force the encoder to learn representations that focus on local facial motions, which are discriminative for expression recognition. To implement a part-based deep neural network (or part-based classification in general) successfully, variations in the natural scene that are not related to facial expressions, such as head posture and background, must be considered; image registration must be conducted during data preprocessing to normalize head posture and remove background effects [117]. For subtle emotions, recognition must also handle the dynamic emotion subtleties that depend on an individual's background or condition in order to better encode the fine patterns present in the facial image. Deep learning-based methods (e.g., [92,254]) overcome the shortcomings of geometric feature-based and appearance-based methods. Like handcrafted feature extraction, deep feature extraction requires face alignment during preprocessing and may require extracted facial landmarks to crop the region of interest if the facial data are not pre-cropped; however, public datasets such as CASME II provide pre-cropped video frames.
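The part-informed idea can be sketched as follows (an illustrative assumption about shapes and the split point, not the published architecture of ref. [117]): the feature map after the last convolutional layer is split into an upper (eye) part and a lower (mouth) part, each classified separately before the logits are fused.

```python
# Part-informed classification: splitting the last feature map into eye/mouth parts (sketch).
import torch
import torch.nn as nn

class PartInformedHead(nn.Module):
    def __init__(self, channels=64, num_classes=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.eye_classifier = nn.Linear(channels, num_classes)     # upper half of the face
        self.mouth_classifier = nn.Linear(channels, num_classes)   # lower half of the face

    def forward(self, feature_map):                    # feature_map: (batch, channels, H, W)
        h = feature_map.shape[2]
        eye_part = feature_map[:, :, : h // 2, :]      # rows covering the eye region
        mouth_part = feature_map[:, :, h // 2 :, :]    # rows covering the mouth region
        eye_logits = self.eye_classifier(self.pool(eye_part).flatten(1))
        mouth_logits = self.mouth_classifier(self.pool(mouth_part).flatten(1))
        return 0.5 * (eye_logits + mouth_logits)       # late fusion of the part predictions

feat = torch.randn(2, 64, 14, 14)      # feature map from some CNN backbone
print(PartInformedHead()(feat).shape)  # torch.Size([2, 3])
```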
Small sample sizes remain an impediment for deep learning-based approaches, given that the performance of deep learning models depends on the size of the dataset; larger training datasets therefore improve the performance of deep learning methods. In addition, class imbalance tends to bias supervised learning algorithms towards the majority classes and leads to poor classification: classes with more data in the training set are distinguished better and achieve better accuracies [130]. The temporal interpolation model (TIM) method, for example, can be used as a data augmentation approach to increase the amount of data, but it provides less information than real data [130]. Likewise, combining different data sources can degrade classification instead of improving it. For example, a model trained on a single domain (CASME II) had more salient locations (action units) than one trained on a cross-domain set (CASME II and SAMM), which can hinder facial action unit-based classifiers; combining datasets gathered from different sources may also mask important information [255]. To properly handle class imbalance, a well-accepted balanced performance metric for micro-expression recognition is the F1 score [111]. Although multimodal emotion analysis models should be trained on big data from diverse contexts and diverse people to build generalized models, robust techniques must also be designed to account for unseen contexts. For speech emotion recognition, for example, it has been shown that languages affect the emotional recognition of speakers; however, the datasets used to design speech emotion recognition systems cover only a limited number of languages, and this limitation degrades the performance of the designed model when it is deployed. Effective modeling of temporal information in big data can also be devised.
For the problem of small training sets, especially in micro-expression recognition tasks, and for the problem of class imbalance, the recognition task can use different deep learning strategies. One strategy is the transfer of domain knowledge from a macro-expression recognition task (the source, e.g., CK+ [148] or BU-3DFE [145]) to micro-expression recognition (the target) via transfer learning. Different domain adaptation techniques, also referred to as style-aggregated methods, have been used and help to enrich the available training samples. These methods include adversarial training [256,257], such as the CycleGAN used in refs. [115,117] to obtain domain-invariant features, and they can be combined with an attention transfer mechanism from a teacher model (for the macro-expression) to a student model (for the micro-expression) (e.g., [132]) to improve recognition accuracy. Better transfer learning techniques can be explored in further work targeting micro-expression recognition. Another domain adaptation technique, distinct from transfer learning methods, is the expression magnification and reduction (EMR) technique [117]. Data augmentation can also be implemented by increasing the optical flow data for each micro-expression video clip: for example, in ref. [116], for each micro-expression video clip used for training, a second flow computed between the onset and apex+1 frames was included in addition to the optical flow between the onset and apex frames.
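A minimal OpenCV sketch of this flow-based augmentation idea follows (illustrative only; the frame indices and the Farnebäck flow estimator are assumptions, since ref. [116] may use a different flow method):

```python
# Optical-flow augmentation between onset/apex and onset/apex+1 frames (illustrative sketch).
import cv2
import numpy as np

def flow_between(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    """Dense Farneback optical flow between two grayscale frames -> (H, W, 2)."""
    # Positional args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags.
    return cv2.calcOpticalFlowFarneback(frame_a, frame_b, None, 0.5, 3, 15, 3, 5, 1.2, 0)

# Stand-in clip: (frames, H, W) grayscale micro-expression sequence.
clip = (np.random.rand(12, 128, 128) * 255).astype(np.uint8)
onset_idx, apex_idx = 0, 6

training_samples = [
    flow_between(clip[onset_idx], clip[apex_idx]),       # onset -> apex flow
    flow_between(clip[onset_idx], clip[apex_idx + 1]),   # onset -> apex+1 flow (augmentation)
]
print(training_samples[0].shape)   # (128, 128, 2)
```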
The accuracy of emotion recognition is not yet satisfactory, given the high disagreement regarding the emotion state when different methods are assessed and approximate accuracies below 80%. Considering the context/situation as secondary data to enhance the recognition of the user's emotional state can improve recognition. The representation of the input data also contributes to recognition performance. For example, representing a face image as a 3D object on a manifold is useful for capturing the shape deformation and fine details of an emotional expression. A recognition system can therefore first extract the emotional features from the manifold through a deep learning-based approach, defining operations such as convolution on the manifold while keeping the computational complexity of feature extraction low. A deep learning classifier can then be designed to identify dynamic 3D emotional expressions. The classifier must also differentiate between the different phases of a single emotion (onset, apex, offset). Methods that exploit multi-resolution information or work at multiple scales reduce bias, and the emotion recognition system can use deep convolutional features to extract salient information at multiple scales to improve recognition. Dynamic expression recognition combined with single-instance and multi-resolution processing is a potential approach to improve emotion recognition, with the goal of reducing the misclassification errors of state-of-the-art methods. The classifier must be invariant with respect to an individual's identity attributes, such as age, gender or ethnicity, and must exhibit low computational complexity. A useful emotion recognition model for healthcare, education, security and other applications must identify the genuine emotion of the expresser. Different techniques have been designed to recognize the real emotion related to the real feeling of an individual, mainly based on facial micro-expressions; in this regard, techniques must be developed for multimodal emotion recognition to report genuine emotional states. In addition, the multimodal emotion recognition model must account for the fact that two or more emotions may be experienced at the same time (i.e., multi-label emotion recognition).
Real-world applications run in environments with varying levels of noise; speech and physiological data, in particular, are noisy. However, during the design of an emotion recognition system, data that are less affected by noise are usually collected, so even when high performance is registered, this performance does not represent reality. Therefore, noise-robust methods and features are needed for practical applications. There are also challenges involved in collecting multimodal data, whether acted, induced or natural. As noted earlier, real-time analysis of big multimodal data additionally requires an appropriately scalable big data architecture and platform to cope with the heterogeneous challenges of growing space and time complexity.
The synchronization of modalities in multimodal emotion computing is another challenge, which can be addressed by considering fusion of the modalities at the feature level or at the decision level. Another factor to consider during the design of fusion techniques is that the most informative modality varies per individual: as noted above, people who express their emotions more vocally carry the major emotion cues in the audio data, whereas people who rely more on facial expressions carry most of the cues in the visual data.

6. Discussion

The performance of emotion recognition depends on the type of training data and on whether the data are natural, acted or induced [214]. In addition, methods should be trained and tested on diversified or different datasets to assess their robustness and their ability to learn salient characteristics from the samples. The use of in-the-wild datasets during the development of an emotion recognition model is essential because such data closely mirror real-world scenarios.
The aim is an architecture for an intelligent sensing machine that can quickly learn a large amount of information with little prior knowledge and adapt in real time to accommodate new data. The model must be able to accurately classify emotion states that lie at the borderline between several other emotions (see the continuous/circumplex model). Progress has been made in developing advanced intelligent systems that detect and process emotional information contained in multimodal sources, and the additional complexity of the multimodal approach is sufficiently justified by the improved accuracy obtained. However, multimodal systems should be able to perform recognition continuously (e.g., track emotional changes) while online and operating in real time. A multimodal system with a specific set of required modalities should be versatile enough to function when some of the required modality inputs are unavailable. In addition to improving the classification rate, the model should report correct spontaneous emotional states. The presence of correct spontaneous emotional expression in both micro-expressions and macro-expressions can be examined to assess the suggestion of Porter [81], who argues that a spontaneous, genuine expression can be either a macro- or a micro-expression. The model should also be context-sensitive so that it adapts to any user [258] and gives consistent results in any real-world environment. The recognition should also consider the fact that a human can experience different emotions at the same time.
The exploitation of non-Euclidean spaces for data representation and machine learning operations is a technological advance that can help improve emotion recognition. Geometric deep learning methods use the non-Euclidean domain to retrieve the fine details of the face deformation associated with each facial expression, representing the input 3D image on a graph or manifold [259]. A facial emotional expression recognition method representing the input data on a Riemannian manifold was proposed and showed promise, although it exhibited low classification accuracy [260].
Significant technological advances have been achieved in deep learning design. However, obtaining the volume and diversity of data required for these approaches to reach the desired accuracy in real-world applications remains challenging. Incorporating knowledge representation and reasoning into models is essential for improving generalization across various contexts, scenarios and cultural diversity; this can improve the ability of emotion recognition to infer and adapt to new, previously unseen data, increasing the robustness of emotion recognition methods. Furthermore, leveraging explainability and interpretability is critical for model adaptation.
The inclusion of the temporal dimension of the data during model training can enhance recognition accuracy. Furthermore, considering dynamic emotions, which account for how emotions evolve over time, will considerably increase the effectiveness of emotion recognition methods.

7. Conclusions

An efficient emotion recognition system using video images will benefit the diagnosis and follow-up care of mental disorders (such as autism). Telerehabilitation, in particular, can use such a recognition system to enable people in areas with poor access to healthcare to receive quality healthcare remotely. The system can also be used for clinical analysis, serve as a therapeutic approach and support real-time monitoring of patients. A real-time emotion recognition system can be utilized for the safety and security of communities, and it can be used in education (for example, in e-learning to assess student learning). The system also plays a significant role in human–machine interactions: it enables an improved flow of communication, enriches the existing information on individuals and can help in delivering timely and adequate services from the public and private sectors to individuals and communities.
A decade of research has established that emotional body expressions are stimuli that are reliably perceived and have a solid neural basis. Future research stemming from these findings needs to address questions of specificity. Investigations of this neural basis increasingly provide evidence for the active component at the core of body expression perception: reflex-like actions and intentional actions are fundamentally different and are presumably subserved by different brain systems, and the challenge is to show that, already at the reflex stage and without dependence on conscious action intention, we are in the presence of meaningful behavior. Representing the input video data on a given manifold and designing an approach to extract features from it are required to improve classification accuracy and computational complexity; geometric deep learning is suited to the analysis of emotions, a use case where classical machine learning techniques exhibit drawbacks and limitations. Such a design can be evaluated and validated using emotional video datasets from hybrid data sources (e.g., the 4DFAB database, whose data are freely available upon request). The performance of the feature extraction task is commonly assessed through the discriminative power of the generated features, measured by the classification accuracy of a classifier that uses them. Ongoing and future research will also assess the generated features by reconstructing the input data and evaluating the reconstruction error in terms of the l2 norm, similar to the assessment approach used for autoencoders and generative adversarial networks. Finally, automatic emotion recognition must address the problem of emotions being falsely interpreted.
The development of emotion recognition systems must align with policies that address AI ethical considerations, ensuring that these technologies comply with current regulations. As these systems evolve, context-aware emotion recognition will play a crucial role, enabling adaptation to the nuances of diverse cultural contexts, thus enhancing their applicability across different user groups. Techniques can be developed for deep learning models to achieve more efficient and scalable performance for practical applications by reducing computational complexity without compromising accuracy and robustness.

Author Contributions

Conceptualization, G.U., K.D. and A.M.K.; methodology, G.U. and K.D.; validation, G.U., K.D. and A.M.K.; investigation, G.U.; resources, K.D. and A.M.K.; writing—original draft preparation, G.U.; writing—review and editing, G.U., K.D. and A.M.K.; supervision, K.D. and A.M.K.; project administration, K.D. and A.M.K.; funding acquisition, K.D. and A.M.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Research Foundation (NRF) of South Africa (Grant Number: 90604). Opinions, findings, and conclusions or recommendations expressed in any publication generated by the NRF supported research are those of the author(s) alone, and the NRF accepts no liability whatsoever in this regard.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Frijda, N.H. Passions: Emotion and Socially Consequential Behavior. In Emotion: Interdisciplinary Perspectives; Kavanaugh, R.D., Zimmerberg, B., Fein, S., Eds.; Lawrence Erlbaum Associates, Inc.: Mahwah, NJ, USA, 1996; Volume 1, pp. 1–27. [Google Scholar]
  2. Frijda, N.H. Emotions. In The International Handbook of Psychology; Pawlik, K., Rosenzweig, M.R., Eds.; Sage Publications: London, UK, 2000; pp. 207–222. [Google Scholar]
  3. Magai, C. Personality theory: Birth, death, and transfiguration. In Emotion: Interdisciplinary Perspectives; Kavanaugh, R.D., Zimmerberg, B., Fein, S., Eds.; Lawrence Erlbaum Associates, Inc.: Mahwah, NJ, USA, 1996; Volume 1, pp. 171–201. [Google Scholar]
  4. Keltner, D.; Oatley, K.; Jenkins, J.M. Understanding Emotions; Wiley: Hoboken, NJ, USA, 2014. [Google Scholar]
  5. Scherer, K.R. Emotion. In Introduction to Social Psychology: A European perspective, 3rd ed.; Hewstone, M., Stroebe, W., Eds.; Blackwell Publishing Ltd.: Oxford, UK, 2001; Chapter 6; pp. 151–195. [Google Scholar]
  6. Fredrickson, B.L. The role of positive emotions in positive psychology: The broaden-and-build theory of positive emotions. Am. Psychol. 2001, 56, 218–226. [Google Scholar] [CrossRef]
  7. Rosenzweig, M.R.; Liang, K.C. Psychology in Biological Perspective. In The International Handbook of Psychology; Pawlik, K., Rosenzweig, M.R., Eds.; Sage Publications: London, UK, 2000; pp. 54–75. [Google Scholar]
  8. Shuman, V.; Scherer, K.R. Psychological Structure of Emotions. In International Encyclopedia of the Social & Behavioral Sciences; Wright, J.D., Ed.; Elsevier Ltd.: Waltham, MA, USA, 2015; Volume 7, pp. 526–533. [Google Scholar]
  9. Koolagudi, S.G.; Rao, K.S. Emotion recognition from speech: A review. Int. J. Speech Technol. 2012, 15, 99–117. [Google Scholar] [CrossRef]
  10. Ko, B.C. A brief review of facial emotion recognition based on visual information. Sensors 2018, 18, 401. [Google Scholar] [CrossRef] [PubMed]
  11. Glowinski, D.; Dael, N.; Camurri, A.; Volpe, G.; Mortillaro, M.; Scherer, K. Toward a minimal representation of affective gestures. IEEE Trans. Affect. Comput. 2011, 2, 106–118. [Google Scholar] [CrossRef]
  12. Horlings, R.; Datcu, D.; Rothkrantz, L.J. Emotion recognition using brain activity. In Proceedings of the 9th International Conference on Computer Systems and Technologies and Workshop for PhD Students in Computing, Gabrovo, Bulgaria, 12–13 June 2008; p. II–1. [Google Scholar]
  13. Monajati, M.; Abbasi, S.H.; Shabaninia, F.; Shamekhi, S. Emotions states recognition based on physiological parameters by employing of fuzzy-adaptive resonance theory. Int. J. Intell. Sci. 2012, 2, 24190. [Google Scholar] [CrossRef]
  14. Kim, M.Y.; Bigman, Y.; Tamir, M. Emotional regulation. In International Encyclopedia of the Social & Behavioral Sciences, 2nd ed.; Wright, J.D., Ed.; Elsevier Ltd.: Waltham, MA, USA, 2015; Volume 7, pp. 452–456. [Google Scholar]
  15. Scherer, K.R. What are emotions? And how can they be measured? Soc. Sci. Inf. 2005, 44, 695–729. [Google Scholar] [CrossRef]
  16. Santhoshkumar, R.; Geetha, M.K. Deep learning approach for emotion recognition from human body movements with feedforward deep convolution neural networks. Procedia Comput. Sci. 2019, 152, 158–165. [Google Scholar] [CrossRef]
  17. Hassouneh, A.; Mutawa, A.; Murugappan, M. Development of a real-time emotion recognition system using facial expressions and EEG based on machine learning and deep neural network methods. Inform. Med. Unlocked 2020, 20, 100372. [Google Scholar] [CrossRef]
  18. Shah, M.; Cooper, D.G.; Cao, H.; Gur, R.C.; Nenkova, A.; Verma, R. Action unit models of facial expression of emotion in the presence of speech. In Proceedings of the 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Geneva, Switzerland, 2–5 September 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 49–54. [Google Scholar]
  19. Liebrucks, A. The Concept of Social Construction. Theory Psychol. 2001, 11, 363–391. [Google Scholar] [CrossRef]
  20. Scherer, K.R. Appraisal Theory. In Handbook of Cognition and Emotion; Dalgleish, T., Power, M.J., Eds.; John Wiley & Sons Ltd.: Chichester, West Sussex, UK, 1999; pp. 637–663. [Google Scholar]
  21. Le Ngo, A.C.; See, J.; Phan, R.C.W. Sparsity in Dynamics of Spontaneous Subtle Emotions: Analysis and Application. IEEE Trans. Affect. Comput. 2017, 8, 396–411. [Google Scholar] [CrossRef]
  22. Fang, X.; Sauter, D.A.; Van Kleef, G.A. Seeing Mixed Emotions: The Specificity of Emotion Perception From Static and Dynamic Facial Expressions Across Cultures. J. Cross-Cult. Psychol. 2018, 49, 130–148. [Google Scholar] [CrossRef] [PubMed]
  23. Tan, C.B.; Sheppard, E.; Stephen, I.D. A change in strategy: Static emotion recognition in Malaysian Chinese. Cogent Psychol. 2015, 2, 1085941. [Google Scholar] [CrossRef]
  24. Schmid, P.C.; Schmid Mast, M. Mood effects on emotion recognition. Motiv. Emot. 2010, 34, 288–292. [Google Scholar] [CrossRef]
  25. Jack, R.E.; Garrod, O.G.; Yu, H.; Caldara, R.; Schyns, P.G. Facial expressions of emotion are not culturally universal. Proc. Natl. Acad. Sci. USA 2012, 109, 7241–7244. [Google Scholar] [CrossRef] [PubMed]
  26. Grainger, S.A.; Henry, J.D.; Phillips, L.H.; Vanman, E.J.; Allen, R. Age deficits in facial affect recognition: The influence of dynamic cues. J. Gerontol. Ser. B: Psychol. Sci. Soc. Sci. 2017, 72, 622–632. [Google Scholar] [CrossRef]
  27. Martinez, A.M. Visual perception of facial expressions of emotion. Curr. Opin. Psychol. 2017, 17, 27–33. [Google Scholar] [CrossRef]
  28. Holland, C.A.; Ebner, N.C.; Lin, T.; Samanez-Larkin, G.R. Emotion identification across adulthood using the Dynamic FACES database of emotional expressions in younger, middle aged, and older adults. Cogn. Emot. 2019, 33, 245–257. [Google Scholar] [CrossRef]
  29. Barrett, L.F.; Adolphs, R.; Marsella, S.; Martinez, A.M.; Pollak, S.D. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychol. Sci. Public Interest 2019, 20, 1–68. [Google Scholar] [CrossRef]
  30. Khosdelazad, S.; Jorna, L.S.; McDonald, S.; Rakers, S.E.; Huitema, R.B.; Buunk, A.M.; Spikman, J.M. Comparing static and dynamic emotion recognition tests: Performance of healthy participants. PLoS ONE 2020, 15, e0241297. [Google Scholar] [CrossRef]
  31. Krumhuber, E.G.; Kappas, A.; Manstead, A.S. Effects of dynamic aspects of facial expressions: A review. Emot. Rev. 2013, 5, 41–46. [Google Scholar] [CrossRef]
  32. Kamachi, M.; Bruce, V.; Mukaida, S.; Gyoba, J.; Yoshikawa, S.; Akamatsu, S. Dynamic properties influence the perception of facial expressions. Perception 2013, 42, 1266–1278. [Google Scholar] [CrossRef] [PubMed]
  33. Bassili, J.N. Facial motion in the perception of faces and of emotional expression. J. Exp. Psychol. Hum. Percept. Perform. 1978, 4, 373. [Google Scholar] [CrossRef] [PubMed]
  34. Namba, S.; Kabir, R.S.; Miyatani, M.; Nakao, T. Dynamic displays enhance the ability to discriminate genuine and posed facial expressions of emotion. Front. Psychol. 2018, 9, 672. [Google Scholar] [CrossRef] [PubMed]
  35. Sato, W.; Krumhuber, E.G.; Jellema, T.; Williams, J.H. Dynamic emotional communication. Front. Psychol. 2019, 10, 2836. [Google Scholar] [CrossRef]
  36. Ghorbanali, A.; Sohrabi, M.K. A comprehensive survey on deep learning-based approaches for multimodal sentiment analysis. Artif. Intell. Rev. 2023, 56, 1479–1512. [Google Scholar] [CrossRef]
  37. Ahmed, N.; Al Aghbari, Z.; Girija, S. A systematic survey on multimodal emotion recognition using learning algorithms. Intell. Syst. Appl. 2023, 17, 200171. [Google Scholar] [CrossRef]
  38. Zhang, S.; Yang, Y.; Chen, C.; Zhang, X.; Leng, Q.; Zhao, X. Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects. Expert Syst. Appl. 2023, 237, 121692. [Google Scholar] [CrossRef]
  39. Pan, B.; Hirota, K.; Jia, Z.; Dai, Y. A review of multimodal emotion recognition from datasets, preprocessing, features, and fusion methods. Neurocomputing 2023, 561, 126866. [Google Scholar] [CrossRef]
  40. Gladys, A.A.; Vetriselvi, V. Survey on multimodal approaches to emotion recognition. Neurocomputing 2023, 556, 126693. [Google Scholar] [CrossRef]
  41. Ezzameli, K.; Mahersia, H. Emotion recognition from unimodal to multimodal analysis: A review. Inf. Fusion 2023, 99, 101847. [Google Scholar] [CrossRef]
  42. Singh, U.; Abhishek, K.; Azad, H.K. A Survey of Cutting-edge Multimodal Sentiment Analysis. ACM Comput. Surv. 2024, 56, 1–38. [Google Scholar] [CrossRef]
  43. Hazmoune, S.; Bougamouza, F. Using transformers for multimodal emotion recognition: Taxonomies and state of the art review. Eng. Appl. Artif. Intell. 2024, 133, 108339. [Google Scholar] [CrossRef]
  44. Liu, H.; Lou, T.; Zhang, Y.; Wu, Y.; Xiao, Y.; Jensen, C.S.; Zhang, D. EEG-based multimodal emotion recognition: A machine learning perspective. IEEE Trans. Instrum. Meas. 2024, 73, 4003729. [Google Scholar] [CrossRef]
  45. Khan, U.A.; Xu, Q.; Liu, Y.; Lagstedt, A.; Alamäki, A.; Kauttonen, J. Exploring contactless techniques in multimodal emotion recognition: Insights into diverse applications, challenges, solutions, and prospects. Multimed. Syst. 2024, 30, 115. [Google Scholar] [CrossRef]
  46. Kalateh, S.; Estrada-Jimenez, L.A.; Hojjati, S.N.; Barata, J. A Systematic Review on Multimodal Emotion Recognition: Building Blocks, Current State, Applications, and Challenges. IEEE Access 2024, 12, 103976–104019. [Google Scholar] [CrossRef]
  47. Poria, S.; Cambria, E.; Bajpai, R.; Hussain, A. A review of affective computing: From unimodal analysis to multimodal fusion. Inf. Fusion 2017, 37, 98–125. [Google Scholar] [CrossRef]
  48. Kitchenham, B.; Charters, S. Guidelines for Performing Systematic Literature Reviews in Software Engineering; EBSE Technical Report EBSE-2007-01; Keele University and University of Durham: Keele, UK; Durham, UK, 2007. [Google Scholar]
  49. Bosse, T. On computational models of emotion regulation and their applications within HCI. In Emotions and Affect in Human Factors and Human-Computer Interaction; Elsevier: Amsterdam, The Netherlands, 2017; pp. 311–337. [Google Scholar]
  50. Scherer, K.R. Psychological Structure of Emotions. In International Encyclopedia of the Social & Behavioral Sciences; Smelser, N.J., Baltes, P.B., Eds.; Elsevier Ltd.: Amsterdam, The Netherlands, 2001; pp. 4472–4477. [Google Scholar] [CrossRef]
  51. Ekman, P. An argument for basic emotions. Cogn. Emot. 1992, 6, 169–200. [Google Scholar] [CrossRef]
  52. Ekman, P.; Davidson, R.J. (Eds.) The Nature of Emotion: Fundamental Questions; Oxford University Press: Oxford, UK, 1994. [Google Scholar]
  53. Shaver, P.; Schwartz, J.; Kirson, D.; O’Connor, C. Emotion knowledge: Further exploration of a prototype approach. J. Personal. Soc. Psychol. 1987, 52, 1061–1086. [Google Scholar] [CrossRef]
  54. Cowen, A.S.; Keltner, D. Self-report captures 27 distinct categories of emotion bridged by continuous gradients. Proc. Natl. Acad. Sci. USA 2017, 114, E7900–E7909. [Google Scholar] [CrossRef]
55. Oatley, K.; Johnson-Laird, P.N. Towards a Cognitive Theory of Emotions. Cogn. Emot. 1987, 1, 29–50. [Google Scholar] [CrossRef]
  56. Zeng, Z.; Pantic, M.; Roisman, G.I.; Huang, T.S. A survey of affect recognition methods: Audio, visual, and spontaneous expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 31, 39–58. [Google Scholar] [CrossRef] [PubMed]
  57. Plutchik, R. The nature of emotions: Human emotions have deep evolutionary roots, a fact that may explain their complexity and provide tools for clinical practice. Am. Sci. 2001, 89, 344–350. [Google Scholar] [CrossRef]
  58. Plutchik, R. A psychoevolutionary theory of emotions. Soc. Sci. Inf. 1982, 21, 529–553. [Google Scholar] [CrossRef]
  59. Russell, J.A.; Mehrabian, A. Evidence for a three-factor theory of emotions. J. Res. Personal. 1977, 11, 273–294. [Google Scholar] [CrossRef]
  60. Mehrabian, A. Pleasure-arousal-dominance: A general framework for describing and measuring individual differences in temperament. Curr. Psychol. 1996, 14, 261–292. [Google Scholar] [CrossRef]
  61. Russell, J.A. A circumplex model of affect. J. Personal. Soc. Psychol. 1980, 39, 1161–1178. [Google Scholar] [CrossRef]
  62. Whissell, C.M. The dictionary of affect in language. In The Measurement of Emotions; Elsevier: Amsterdam, The Netherlands, 1989; Chapter 5; pp. 113–131. [Google Scholar]
  63. Ortony, A.; Clore, G.L.; Collins, A. The Cognitive Structure of Emotions; Cambridge University Press: Cambridge, UK, 1990. [Google Scholar]
  64. Lövheim, H. A new three-dimensional model for emotions and monoamine neurotransmitters. Med. Hypotheses 2012, 78, 341–348. [Google Scholar] [CrossRef]
  65. Cambria, E.; Livingstone, A.; Hussain, A. The Hourglass of Emotions. In Cognitive Behavioural Systems; Hutchison, D., Kanade, T., Kittler, J., Kleinberg, J.M., Mattern, F., Mitchell, J.C., Naor, M., Nierstrasz, O., Pandu Rangan, C., Steffen, B., et al., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7403, pp. 144–157. [Google Scholar] [CrossRef]
  66. Susanto, Y.; Livingstone, A.G.; Ng, B.C.; Cambria, E. The Hourglass Model Revisited. IEEE Intell. Syst. 2020, 35, 96–102. [Google Scholar] [CrossRef]
  67. Fontaine, J.R.; Scherer, K.R.; Roesch, E.B.; Ellsworth, P.C. The world of emotions is not two-dimensional. Psychol. Sci. 2007, 18, 1050–1057. [Google Scholar] [CrossRef]
  68. Cochrane, T. Eight dimensions for the emotions. Soc. Sci. Inf. 2009, 48, 379–420. [Google Scholar] [CrossRef]
  69. Liu, Y.; Fu, Q.; Fu, X. The interaction between cognition and emotion. Chin. Sci. Bull. 2009, 54, 4102–4116. [Google Scholar] [CrossRef]
  70. Lee, Y.; Seo, Y.; Lee, Y.; Lee, D. Dimensional emotions are represented by distinct topographical brain networks. Int. J. Clin. Health Psychol. 2023, 23, 100408. [Google Scholar] [CrossRef]
  71. Mauss, I.B.; Robinson, M.D. Measures of emotion: A review. Cogn. Emot. 2009, 23, 209–237. [Google Scholar] [CrossRef] [PubMed]
  72. Kahou, S.E.; Bouthillier, X.; Lamblin, P.; Gulcehre, C.; Michalski, V.; Konda, K.; Jean, S.; Froumenty, P.; Dauphin, Y.; Boulanger-Lewandowski, N.; et al. Emonets: Multimodal deep learning approaches for emotion recognition in video. J. Multimodal User Interfaces 2016, 10, 99–111. [Google Scholar] [CrossRef]
  73. Davison, A.K.; Merghani, W.; Yap, M.H. Objective Classes for Micro-Facial Expression Recognition. J. Imaging 2018, 4, 119. [Google Scholar] [CrossRef]
  74. Mehrabian, A. Communication without words. In Communication Theory; Routledge: London, UK, 2017; Chapter 13; pp. 193–200. [Google Scholar]
  75. Wolfkühler, W.; Majorek, K.; Tas, C.; Küper, C.; Saimed, N.; Juckel, G.; Brüne, M. Emotion recognition in pictures of facial affect: Is there a difference between forensic and non-forensic patients with schizophrenia? Eur. J. Psychiatry 2012, 26, 73–85. [Google Scholar] [CrossRef]
  76. Yan, W.J.; Wu, Q.; Liang, J.; Chen, Y.H.; Fu, X. How fast are the leaked facial expressions: The duration of micro-expressions. J. Nonverbal Behav. 2013, 37, 217–230. [Google Scholar] [CrossRef]
  77. Porter, S.; ten Brinke, L. Reading between the lies: Identifying concealed and falsified emotions in universal facial expressions. Psychol. Sci. 2008, 19, 508–514. [Google Scholar] [CrossRef]
  78. Ekman, P. Darwin, deception, and facial expression. Ann. N. Y. Acad. Sci. 2003, 1000, 205–221. [Google Scholar] [CrossRef]
  79. Ekman, P. Telling Lies: Clues to Deceit in the Marketplace, Politics, and Marriage; W.W. Norton & Company Inc.: New York, NY, USA, 2009. [Google Scholar]
  80. Ekman, P. Emotions Revealed: Recognizing Faces and Feelings to Improve Communication and Emotional Life; Times Books, Henry Holt and Company: New York, NY, USA, 2003. [Google Scholar]
  81. Porter, S.; Ten Brinke, L.; Wallace, B. Secrets and lies: Involuntary leakage in deceptive facial expressions as a function of emotional intensity. J. Nonverbal Behav. 2012, 36, 23–37. [Google Scholar] [CrossRef]
  82. Frank, M.; Herbasz, M.; Sinuk, K.; Keller, A.; Nolan, C. I see how you feel: Training laypeople and professionals to recognize fleeting emotions. In Proceedings of the Annual Meeting of the International Communication Association, Sheraton New York, New York, NY, USA, 21–25 May 2009; pp. 1–35. [Google Scholar]
83. Ekman, P.; Friesen, W.V.; Ancoli, S. Facial signs of emotional experience. J. Personal. Soc. Psychol. 1980, 39, 1125. [Google Scholar] [CrossRef]
  84. Rosenberg, E.L.; Ekman, P. What the Face Reveals: Basic and Applied Studies of Spontaneous Expression Using the Facial Action Coding System (FACS); Oxford University Press: Oxford, UK, 2020. [Google Scholar]
  85. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987. [Google Scholar] [CrossRef]
  86. Ghimire, D.; Lee, J. Geometric feature-based facial expression recognition in image sequences using multi-class Adaboost and support vector machines. Sensors 2013, 13, 7714–7734. [Google Scholar] [CrossRef] [PubMed]
  87. Murugappan, M.; Mutawa, A. Facial geometric feature extraction based emotional expression classification using machine learning algorithms. PLoS ONE 2021, 16, e0247131. [Google Scholar]
  88. López-Gil, J.M.; Garay-Vitoria, N. Photogram classification-based emotion recognition. IEEE Access 2021, 9, 136974–136984. [Google Scholar] [CrossRef]
  89. Rivera, A.R.; Castillo, J.R.; Chae, O.O. Local directional number pattern for face analysis: Face and expression recognition. IEEE Trans. Image Process. 2013, 22, 1740–1752. [Google Scholar] [CrossRef]
  90. Moore, S.; Bowden, R. Local binary patterns for multi-view facial expression recognition. Comput. Vis. Image Underst. 2011, 115, 541–558. [Google Scholar] [CrossRef]
  91. Mistry, K.; Zhang, L.; Neoh, S.C.; Jiang, M.; Hossain, A.; Lafon, B. Intelligent Appearance and shape based facial emotion recognition for a humanoid robot. In Proceedings of the 8th International Conference on Software, Knowledge, Information Management and Applications (SKIMA 2014), Dhaka, Bangladesh, 18–20 December 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1–8. [Google Scholar]
  92. Yang, G.; Ortoneda, J.S.Y.; Saniie, J. Emotion Recognition Using Deep Neural Network with Vectorized Facial Features. In Proceedings of the 2018 IEEE International Conference on Electro/Information Technology (EIT), Rochester, MI, USA, 3–5 May 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 0318–0322. [Google Scholar]
93. Nguyen, H.D.; Yeom, S.; Lee, G.S.; Yang, H.J.; Na, I.S.; Kim, S.H. Facial Emotion Recognition Using an Ensemble of Multi-Level Convolutional Neural Networks. Int. J. Pattern Recognit. Artif. Intell. 2019, 33, 1940015. [Google Scholar]
  94. Agrawal, E.; Christopher, J. Emotion recognition from periocular features. In Proceedings of the Second International Conference on Machine Learning, Image Processing, Network Security and Data Sciences (MIND 2020), Silchar, India, 30–31 July 2020; Springer: Berlin/Heidelberg, Germany, 2020. Part I. pp. 194–208. [Google Scholar]
  95. Dirik, M. Optimized ANFIS model with hybrid metaheuristic algorithms for facial emotion recognition. Int. J. Fuzzy Syst. 2023, 25, 485–496. [Google Scholar] [CrossRef]
  96. Pfister, T.; Li, X.; Zhao, G.; Pietikäinen, M. Recognising spontaneous facial micro-expressions. In Proceedings of the 2011 International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 1449–1456. [Google Scholar]
  97. Wang, Y.; See, J.; Phan, R.C.W.; Oh, Y.H. LBP with Six Intersection Points: Reducing Redundant Information in LBP-TOP for Micro-expression Recognition. In Computer Vision—Asian Conference on Computer Vision ACCV 2014; Cremers, D., Reid, I., Saito, H., Yang, M.H., Eds.; Springer International Publishing: Cham, Switzerland, 2015; Volume 9003, pp. 525–537. [Google Scholar] [CrossRef]
98. Huang, X.; Wang, S.J.; Zhao, G.; Pietikainen, M. Facial Micro-Expression Recognition Using Spatiotemporal Local Binary Pattern with Integral Projection. In Proceedings of the 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Santiago, Chile, 7–13 December 2015; pp. 1–9. [Google Scholar] [CrossRef]
  99. Wang, Y.; See, J.; Phan, R.C.W.; Oh, Y.H. Efficient spatio-temporal local binary patterns for spontaneous facial micro-expression recognition. PLoS ONE 2015, 10, e0124674. [Google Scholar]
  100. Li, X.; Pfister, T.; Huang, X.; Zhao, G.; Pietikäinen, M. A spontaneous micro-expression database: Inducement, collection and baseline. In Proceedings of the 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), Shanghai, China, 22–26 April 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1–6. [Google Scholar]
  101. Wang, Y.; See, J.; Oh, Y.H.; Phan, R.C.W.; Rahulamathavan, Y.; Ling, H.C.; Tan, S.W.; Li, X. Effective recognition of facial micro-expressions with video motion magnification. Multimed. Tools Appl. 2017, 76, 21665–21690. [Google Scholar] [CrossRef]
  102. Li, X.; Hong, X.; Moilanen, A.; Huang, X.; Pfister, T.; Zhao, G.; Pietikäinen, M. Towards Reading Hidden Emotions: A Comparative Study of Spontaneous Micro-Expression Spotting and Recognition Methods. IEEE Trans. Affect. Comput. 2018, 9, 563–577. [Google Scholar] [CrossRef]
  103. Zhao, G.; Pietikainen, M. Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 915–928. [Google Scholar] [CrossRef]
  104. Park, S.Y.; Lee, S.H.; Ro, Y.M. Subtle facial expression recognition using adaptive magnification of discriminative facial motion. In Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, 26–30 October 2015; pp. 911–914. [Google Scholar]
  105. Shreve, M.; Godavarthy, S.; Goldgof, D.; Sarkar, S. Macro- and micro-expression spotting in long videos using spatio-temporal strain. In Proceedings of the 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG), Santa Barbara, CA, USA, 21–25 March 2011; IEEE: Piscataway, NJ, USA, 2011; pp. 51–56. [Google Scholar]
  106. Liong, S.T.; See, J.; Phan, R.C.W.; Le Ngo, A.C.; Oh, Y.H.; Wong, K. Subtle Expression Recognition Using Optical Strain Weighted Features. In Computer Vision—ACCV 2014 Workshops; Jawahar, C., Shan, S., Eds.; Springer International Publishing: Cham, Switzerland, 2015; Volume 9009, pp. 644–657. [Google Scholar] [CrossRef]
  107. Liong, S.T.; See, J.; Phan, R.C.W.; Oh, Y.H.; Le Ngo, A.C.; Wong, K.; Tan, S.W. Spontaneous subtle expression detection and recognition based on facial strain. Signal Process. Image Commun. 2016, 47, 170–182. [Google Scholar] [CrossRef]
  108. Liong, S.T.; See, J.; Wong, K.; Phan, R.C.W. Less is more: Micro-expression recognition from video using apex frame. Signal Process. Image Commun. 2018, 62, 82–92. [Google Scholar] [CrossRef]
  109. Xu, F.; Zhang, J.; Wang, J.Z. Microexpression Identification and Categorization Using a Facial Dynamics Map. IEEE Trans. Affect. Comput. 2017, 8, 254–267. [Google Scholar] [CrossRef]
  110. Zheng, H.; Geng, X.; Yang, Z. A Relaxed K-SVD Algorithm for Spontaneous Micro-Expression Recognition. In PRICAI 2016: Trends in Artificial Intelligence; Booth, R., Zhang, M.L., Eds.; Springer International Publishing: Cham, Switzerland, 2016; Volume 9810, pp. 692–699. [Google Scholar] [CrossRef]
  111. Le Ngo, A.C.; Phan, R.C.W.; See, J. Spontaneous subtle expression recognition: Imbalanced databases and solutions. In Proceedings of the 12th Asian Conference on Computer Vision (ACCV), Singapore, 1–5 November 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 33–48. [Google Scholar]
  112. Oh, Y.H.; Le Ngo, A.C.; See, J.; Liong, S.T.; Phan, R.C.W.; Ling, H.C. Monogenic Riesz wavelet representation for micro-expression recognition. In Proceedings of the 2015 IEEE International Conference on Digital Signal Processing (DSP), Singapore, 21–24 July 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1237–1241. [Google Scholar]
  113. Huang, X.; Wang, S.J.; Liu, X.; Zhao, G.; Feng, X.; Pietikainen, M. Discriminative Spatiotemporal Local Binary Pattern with Revisited Integral Projection for Spontaneous Facial Micro-Expression Recognition. IEEE Trans. Affect. Comput. 2019, 10, 32–47. [Google Scholar] [CrossRef]
  114. Peng, M.; Wang, C.; Chen, T.; Liu, G.; Fu, X. Dual Temporal Scale Convolutional Neural Network for Micro-Expression Recognition. Front. Psychol. 2017, 8, 1745. [Google Scholar] [CrossRef] [PubMed]
  115. Zhou, L.; Mao, Q.; Xue, L. Cross-database micro-expression recognition: A style aggregated and attention transfer approach. In Proceedings of the 2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shanghai, China, 8–12 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 102–107. [Google Scholar]
  116. Belaiche, R.; Liu, Y.; Migniot, C.; Ginhac, D.; Yang, F. Cost-effective CNNs for real-time micro-expression recognition. Appl. Sci. 2020, 10, 4959. [Google Scholar] [CrossRef]
  117. Liu, Y.; Du, H.; Zheng, L.; Gedeon, T. A neural micro-expression recognizer. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–4. [Google Scholar]
  118. Avent, R.R.; Ng, C.T.; Neal, J.A. Machine vision recognition of facial affect using backpropagation neural networks. In Proceedings of the 16th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Baltimore, MD, USA, 3–6 November 1994; IEEE: Piscataway, NJ, USA, 1994; Volume 2, pp. 1364–1365. [Google Scholar]
  119. Gargesha, M.; Kuchi, P.; Torkkola, I. Facial expression recognition using artificial neural networks. Artif. Neural Comput. Syst. 2002, 8, 1–6. [Google Scholar]
  120. Bartlett, M.S.; Littlewort, G.; Frank, M.; Lainscsek, C.; Fasel, I.; Movellan, J. Recognizing facial expression: Machine learning and application to spontaneous behavior. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 2, pp. 568–573. [Google Scholar]
  121. Guo, Y.; Tao, D.; Yu, J.; Xiong, H.; Li, Y.; Tao, D. Deep neural networks with relativity learning for facial expression recognition. In Proceedings of the 2016 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Seattle, WA, USA, 11–15 July 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 1–6. [Google Scholar]
  122. Zhang, K.; Huang, Y.; Du, Y.; Wang, L. Facial expression recognition based on deep evolutional spatial-temporal networks. IEEE Trans. Image Process. 2017, 26, 4193–4203. [Google Scholar] [CrossRef]
  123. Wang, S.H.; Phillips, P.; Dong, Z.C.; Zhang, Y.D. Intelligent facial emotion recognition based on stationary wavelet entropy and Jaya algorithm. Neurocomputing 2018, 272, 668–676. [Google Scholar] [CrossRef]
  124. Ruiz-Garcia, A.; Elshaw, M.; Altahhan, A.; Palade, V. A hybrid deep learning neural approach for emotion recognition from facial expressions for socially assistive robots. Neural Comput. Appl. 2018, 29, 359–373. [Google Scholar] [CrossRef]
  125. Caroppo, A.; Leone, A.; Siciliano, P. Comparison between deep learning models and traditional machine learning approaches for facial expression recognition in ageing adults. J. Comput. Sci. Technol. 2020, 35, 1127–1146. [Google Scholar] [CrossRef]
  126. Khanbebin, S.N.; Mehrdad, V. Improved convolutional neural network-based approach using hand-crafted features for facial expression recognition. Multimed. Tools Appl. 2023, 82, 11489–11505. [Google Scholar] [CrossRef]
  127. Boughanem, H.; Ghazouani, H.; Barhoumi, W. Multichannel convolutional neural network for human emotion recognition from in-the-wild facial expressions. Vis. Comput. 2023, 39, 5693–5718. [Google Scholar] [CrossRef]
  128. Arabian, H.; Abdulbaki Alshirbaji, T.; Chase, J.G.; Moeller, K. Emotion Recognition beyond Pixels: Leveraging Facial Point Landmark Meshes. Appl. Sci. 2024, 14, 3358. [Google Scholar] [CrossRef]
  129. Kim, D.H.; Baddar, W.J.; Ro, Y.M. Micro-Expression Recognition with Expression-State Constrained Spatio-Temporal Feature Representations. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 382–386. [Google Scholar] [CrossRef]
  130. Khor, H.Q.; See, J.; Phan, R.C.W.; Lin, W. Enriched Long-Term Recurrent Convolutional Network for Facial Micro-Expression Recognition. In Proceedings of the 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), Xi’an, China, 15–19 May 2018; pp. 667–674. [Google Scholar] [CrossRef]
  131. Parkhi, O.M.; Vedaldi, A.; Zisserman, A. Deep face recognition. In Proceedings of the BMVC 2015-Proceedings of the British Machine Vision Conference 2015, Swansea, UK, 7–10 September 2015. [Google Scholar]
  132. Zhou, L.; Mao, Q.; Xue, L. Dual-inception network for cross-database micro-expression recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  133. Wang, C.; Peng, M.; Bi, T.; Chen, T. Micro-attention for micro-expression recognition. Neurocomputing 2020, 410, 354–362. [Google Scholar] [CrossRef]
  134. Gan, Y.S.; Liong, S.T.; Yau, W.C.; Huang, Y.C.; Tan, L.K. OFF-ApexNet on micro-expression recognition system. Signal Process. Image Commun. 2019, 74, 129–139. [Google Scholar] [CrossRef]
  135. Xia, Z.; Feng, X.; Hong, X.; Zhao, G. Spontaneous facial micro-expression recognition via deep convolutional network. In Proceedings of the 2018 Eighth International Conference on Image Processing Theory, Tools and Applications (IPTA), Xi’an, China, 7–10 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–6. [Google Scholar]
  136. Liong, S.T.; Gan, Y.S.; See, J.; Khor, H.Q.; Huang, Y.C. Shallow triple stream three-dimensional cnn (ststnet) for micro-expression recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  137. Li, J.; Wang, Y.; See, J.; Liu, W. Micro-expression recognition based on 3D flow convolutional neural network. Pattern Anal. Appl. 2019, 22, 1331–1339. [Google Scholar] [CrossRef]
  138. Wu, C.; Guo, F. TSNN: Three-Stream Combining 2D and 3D Convolutional Neural Network for Micro-Expression Recognition. IEEJ Trans. Electr. Electron. Eng. 2021, 16, 98–107. [Google Scholar] [CrossRef]
  139. Peng, M.; Wang, C.; Bi, T.; Shi, Y.; Zhou, X.; Chen, T. A novel apex-time network for cross-dataset micro-expression recognition. In Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, UK, 3–6 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–6. [Google Scholar]
  140. Van Quang, N.; Chun, J.; Tokuyama, T. CapsuleNet for micro-expression recognition. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–7. [Google Scholar]
  141. Xie, H.X.; Lo, L.; Shuai, H.H.; Cheng, W.H. AU-assisted graph attention convolutional network for micro-expression recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2871–2880. [Google Scholar]
  142. Polikovsky, S.; Kameda, Y.; Ohta, Y. Facial micro-expressions recognition using high speed camera and 3D-gradient descriptor. In Proceedings of the 3rd International Conference on Imaging for Crime Detection and Prevention (ICDP 2009), London, UK, 3 December 2009. [Google Scholar]
  143. Warren, G.; Schertler, E.; Bull, P. Detecting deception from emotional and unemotional cues. J. Nonverbal Behav. 2009, 33, 59–69. [Google Scholar] [CrossRef]
  144. Lyons, M.J. “Excavating AI” Re-excavated: Debunking a Fallacious Account of the JAFFE Dataset. arXiv 2021, arXiv:2107.13998. [Google Scholar] [CrossRef]
145. Yin, L.; Wei, X.; Sun, Y.; Wang, J.; Rosato, M.J. A 3D facial expression database for facial behavior research. In Proceedings of the 7th International Conference on Automatic Face and Gesture Recognition (FGR06), Southampton, UK, 10–12 April 2006; IEEE: Piscataway, NJ, USA, 2006; pp. 211–216. [Google Scholar]
  146. Li, S.; Deng, W.; Du, J. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2584–2593. [Google Scholar]
  147. Goeleven, E.; De Raedt, R.; Leyman, L.; Verschuere, B. The Karolinska directed emotional faces: A validation study. Cogn. Emot. 2008, 22, 1094–1118. [Google Scholar] [CrossRef]
  148. Lucey, P.; Cohn, J.F.; Kanade, T.; Saragih, J.; Ambadar, Z.; Matthews, I. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In Proceedings of the 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 13–18 June 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 94–101. [Google Scholar]
  149. Aifanti, N.; Papachristou, C.; Delopoulos, A. The MUG facial expression database. In Proceedings of the 11th International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS 10), Desenzano del Garda, Italy, 12–14 April 2010; IEEE: Piscataway, NJ, USA, 2010; pp. 1–4. [Google Scholar]
  150. Chen, L.F.; Yen, Y.S. Taiwanese Facial Expression Image Database; Brain Mapping Laboratory, Institute of Brain Science, National Yang-Ming University: Taipei, Taiwan, 2007. [Google Scholar]
  151. Langner, O.; Dotsch, R.; Bijlstra, G.; Wigboldus, D.H.; Hawk, S.T.; Van Knippenberg, A. Presentation and validation of the Radboud Faces Database. Cogn. Emot. 2010, 24, 1377–1388. [Google Scholar] [CrossRef]
  152. Yan, W.J.; Li, X.; Wang, S.J.; Zhao, G.; Liu, Y.J.; Chen, Y.H.; Fu, X. CASME II: An Improved Spontaneous Micro-Expression Database and the Baseline Evaluation. PLoS ONE 2014, 9, e86041. [Google Scholar] [CrossRef]
  153. Davison, A.K.; Lansley, C.; Costen, N.; Tan, K.; Yap, M.H. SAMM: A Spontaneous Micro-Facial Movement Dataset. IEEE Trans. Affect. Comput. 2018, 9, 116–129. [Google Scholar] [CrossRef]
  154. Piana, S.; Staglianò, A.; Odone, F.; Verri, A.; Camurri, A. Real-time Automatic Emotion Recognition from Body Gestures. arXiv 2014, arXiv:1402.5047. [Google Scholar]
  155. Piana, S.; Staglianò, A.; Camurri, A.; Odone, F. A set of full-body movement features for emotion recognition to help children affected by autism spectrum condition. In Proceedings of the IDGEI International Workshop, Chania, Greece, 14 May 2013; Volume 23. [Google Scholar]
  156. Noroozi, F.; Corneanu, C.A.; Kamińska, D.; Sapiński, T.; Escalera, S.; Anbarjafari, G. Survey on emotional body gesture recognition. IEEE Trans. Affect. Comput. 2018, 12, 505–523. [Google Scholar] [CrossRef]
  157. Zacharatos, H.; Gatzoulis, C.; Chrysanthou, Y.L. Automatic emotion recognition based on body movement analysis: A survey. IEEE Comput. Graph. Appl. 2014, 34, 35–45. [Google Scholar] [CrossRef] [PubMed]
  158. Ly, S.T.; Lee, G.S.; Kim, S.H.; Yang, H.J. Emotion recognition via body gesture: Deep learning model coupled with keyframe selection. In Proceedings of the 2018 International Conference on Machine Learning and Machine Intelligence (MLMI2018), Hanoi, Vietnam, 28–30 September 2018; pp. 27–31. [Google Scholar]
  159. Liu, X.; Shi, H.; Chen, H.; Yu, Z.; Li, X.; Zhao, G. iMiGUE: An identity-free video dataset for micro-gesture understanding and emotion analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10626–10637. [Google Scholar]
  160. Wu, J.; Zhang, Y.; Sun, S.; Li, Q.; Zhao, X. Generalized zero-shot emotion recognition from body gestures. Appl. Intell. 2022, 52, 8616–8634. [Google Scholar] [CrossRef]
  161. Ekman, P.; Keltner, D. Universal facial expressions of emotion. Calif. Ment. Health Res. Dig. 1970, 8, 151–158. [Google Scholar]
  162. Kerkeni, L.; Serrestou, Y.; Mbarki, M.; Raoof, K.; Mahjoub, M.A.; Cleder, C. Automatic speech emotion recognition using machine learning. In Social Media and Machine Learning; IntechOpen: Rijeka, Croatia, 2019. [Google Scholar]
  163. Murray, I.R.; Arnott, J.L. Toward the simulation of emotion in synthetic speech: A review of the literature on human vocal emotion. J. Acoust. Soc. Am. 1993, 93, 1097–1108. [Google Scholar] [CrossRef]
  164. Poria, S.; Cambria, E.; Hussain, A.; Huang, G.B. Towards an intelligent framework for multimodal affective data analysis. Neural Netw. 2015, 63, 104–116. [Google Scholar] [CrossRef] [PubMed]
  165. Kamińska, D.; Sapiński, T.; Anbarjafari, G. Efficiency of chosen speech descriptors in relation to emotion recognition. EURASIP J. Audio Speech Music Process. 2017, 2017, 3. [Google Scholar] [CrossRef]
  166. Vogt, T.; André, E. Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition. In Proceedings of the 2005 IEEE International Conference on Multimedia and Expo, Amsterdam, The Netherlands, 6 July 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 474–477. [Google Scholar]
  167. Devillers, L.; Vidrascu, L.; Lamel, L. Challenges in real-life emotion annotation and machine learning based detection. Neural Netw. 2005, 18, 407–422. [Google Scholar] [CrossRef]
  168. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. Interspeech 2005, 5, 1517–1520. [Google Scholar]
  169. Adigwe, A.; Tits, N.; Haddad, K.E.; Ostadabbas, S.; Dutoit, T. The emotional voices database: Towards controlling the emotion dimension in voice generation systems. arXiv 2018, arXiv:1806.09514. [Google Scholar]
  170. You, M.; Chen, C.; Bu, J. CHAD: A Chinese affective database. In Proceedings of the International Conference on Affective Computing and Intelligent Interaction, Beijing, China, 22–24 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 542–549. [Google Scholar]
  171. Palo, H.K.; Mohanty, M.N. Wavelet based feature combination for recognition of emotions. Ain Shams Eng. J. 2018, 9, 1799–1806. [Google Scholar] [CrossRef]
  172. Kerkeni, L.; Serrestou, Y.; Raoof, K.; Mbarki, M.; Mahjoub, M.A.; Cleder, C. Automatic speech emotion recognition using an optimal combination of features based on EMD-TKEO. Speech Commun. 2019, 114, 22–35. [Google Scholar] [CrossRef]
  173. Nagarajan, S.; Nettimi, S.S.S.; Kumar, L.S.; Nath, M.K.; Kanhe, A. Speech emotion recognition using cepstral features extracted with novel triangular filter banks based on bark and ERB frequency scales. Digit. Signal Process. 2020, 104, 102763. [Google Scholar] [CrossRef]
  174. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: Interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [Google Scholar] [CrossRef]
  175. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 5329–5333. [Google Scholar] [CrossRef]
  176. Kumawat, P.; Routray, A. Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition. In Proceedings of the Interspeech 2021 (ISCA), Brno, Czech Republic, 30 August–3 September 2021; pp. 3410–3414. [Google Scholar] [CrossRef]
  177. Zhou, S.; Beigi, H. A Transfer Learning Method for Speech Emotion Recognition from Automatic Speech Recognition. arXiv 2020, arXiv:2008.02863. [Google Scholar]
  178. Morais, E.; Hoory, R.; Zhu, W.; Gat, I.; Damasceno, M.; Aronowitz, H. Speech Emotion Recognition using Self-Supervised Features. arXiv 2022, arXiv:2202.03896. [Google Scholar]
  179. Ahmed, M.R.; Islam, S.; Islam, A.M.; Shatabda, S. An ensemble 1D-CNN-LSTM-GRU model with data augmentation for speech emotion recognition. Expert Syst. Appl. 2023, 218, 119633. [Google Scholar] [CrossRef]
180. Nam, H.J.; Park, H.J. Speech Emotion Recognition under Noisy Environments with SNR Down to −6 dB Using Multi-Decoder Wave-U-Net. Appl. Sci. 2024, 14, 5227. [Google Scholar] [CrossRef]
  181. Alkhamali, E.A.; Allinjawi, A.; Ashari, R.B. Combining Transformer, Convolutional Neural Network, and Long Short-Term Memory Architectures: A Novel Ensemble Learning Technique That Leverages Multi-Acoustic Features for Speech Emotion Recognition in Distance Education Classrooms. Appl. Sci. 2024, 14, 5050. [Google Scholar] [CrossRef]
  182. Sekkate, S.; Khalil, M.; Adib, A. A statistical feature extraction for deep speech emotion recognition in a bilingual scenario. Multimed. Tools Appl. 2023, 82, 11443–11460. [Google Scholar] [CrossRef]
  183. Huang, Y.; Tian, K.; Wu, A.; Zhang, G. Feature fusion methods research based on deep belief networks for speech emotion recognition under noise condition. J. Ambient Intell. Humaniz. Comput. 2019, 10, 1787–1798. [Google Scholar] [CrossRef]
  184. Balakrishnan, A.; Rege, A. Reading Emotions from Speech Using Deep Neural Networks; Technical Report; Computer Science Department, Stanford University: Stanford, CA, USA, 2017. [Google Scholar]
  185. Alu, D.; Zoltan, E.; Stoica, I.C. Voice based emotion recognition with convolutional neural networks for companion robots. Sci. Technol. 2017, 20, 222–240. [Google Scholar]
186. Tzirakis, P.; Zhang, J.; Schuller, B.W. End-to-end speech emotion recognition using deep neural networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 5089–5093. [Google Scholar]
  187. Schuller, B.W. Speech emotion recognition: Two decades in a nutshell, benchmarks, and ongoing trends. Commun. ACM 2018, 61, 90–99. [Google Scholar] [CrossRef]
  188. Shon, S.; Ali, A.; Glass, J. MIT-QCRI Arabic dialect identification system for the 2017 multi-genre broadcast challenge. In Proceedings of the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Okinawa, Japan, 16–20 December 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 374–380. [Google Scholar]
  189. Shu, L.; Xie, J.; Yang, M.; Li, Z.; Li, Z.; Liao, D.; Xu, X.; Yang, X. A review of emotion recognition using physiological signals. Sensors 2018, 18, 2074. [Google Scholar] [CrossRef] [PubMed]
  190. Wang, X.W.; Nie, D.; Lu, B.L. Emotional state classification from EEG data using machine learning approach. Neurocomputing 2014, 129, 94–106. [Google Scholar] [CrossRef]
  191. Hosseinifard, B.; Moradi, M.H.; Rostami, R. Classifying depression patients and normal subjects using machine learning techniques and nonlinear features from EEG signal. Comput. Methods Programs Biomed. 2013, 109, 339–345. [Google Scholar] [CrossRef]
  192. Zhang, Y.; Ji, X.; Zhang, S. An approach to EEG-based emotion recognition using combined feature extraction method. Neurosci. Lett. 2016, 633, 152–157. [Google Scholar] [CrossRef]
  193. Soroush, M.Z.; Maghooli, K.; Setarehdan, S.K.; Nasrabadi, A.M. A novel method of EEG-based emotion recognition using nonlinear features variability and Dempster–Shafer theory. Biomed. Eng. Appl. Basis Commun. 2018, 30, 1850026. [Google Scholar] [CrossRef]
  194. Murugappan, M.; Nagarajan, R.; Yaacob, S. Combining spatial filtering and wavelet transform for classifying human emotions using EEG Signals. J. Med. Biol. Eng. 2011, 31, 45–51. [Google Scholar] [CrossRef]
  195. Jie, X.; Cao, R.; Li, L. Emotion recognition based on the sample entropy of EEG. Bio-Med. Mater. Eng. 2014, 24, 1185–1192. [Google Scholar] [CrossRef]
  196. Lan, Z.; Sourina, O.; Wang, L.; Liu, Y. Real-time EEG-based emotion monitoring using stable features. Vis. Comput. 2016, 32, 347–358. [Google Scholar] [CrossRef]
  197. Song, T.; Zheng, W.; Song, P.; Cui, Z. EEG emotion recognition using dynamical graph convolutional neural networks. IEEE Trans. Affect. Comput. 2018, 11, 532–541. [Google Scholar] [CrossRef]
  198. Gao, Z.; Li, R.; Ma, C.; Rui, L.; Sun, X. Core-brain-network-based multilayer convolutional neural network for emotion recognition. IEEE Trans. Instrum. Meas. 2021, 70, 1–9. [Google Scholar] [CrossRef]
  199. Yao, L.; Lu, Y.; Wang, M.; Qian, Y.; Li, H. Exploring EEG Emotion Recognition through Complex Networks: Insights from the Visibility Graph of Ordinal Patterns. Appl. Sci. 2024, 14, 2636. [Google Scholar] [CrossRef]
  200. Álvarez-Jiménez, M.; Calle-Jimenez, T.; Hernández-Álvarez, M. A Comprehensive Evaluation of Features and Simple Machine Learning Algorithms for Electroencephalographic-Based Emotion Recognition. Appl. Sci. 2024, 14, 2228. [Google Scholar] [CrossRef]
  201. Salama, E.S.; El-Khoribi, R.A.; Shoman, M.E.; Shalaby, M.A.W. EEG-based emotion recognition using 3D convolutional neural networks. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 329–337. [Google Scholar] [CrossRef]
  202. Kumar, N.; Khaund, K.; Hazarika, S.M. Bispectral analysis of EEG for emotion recognition. Procedia Comput. Sci. 2016, 84, 31–35. [Google Scholar] [CrossRef]
  203. Quesada Tabares, R.; Molina Cantero, A.J.; Gómez González, I.M.; Merino Monge, M.; Castro García, J.A.; Cabrera Cabrera, R. Emotions Detection based on a Single-electrode EEG Device. In Proceedings of the 4th International Conference on Physiological Computing Systems (PhyCS 2017), Madrid, Spain, 27–28 July 2017; SciTePress: Setubal, Portugal, 2017; pp. 89–95. [Google Scholar]
  204. Van Dyk, D.A.; Meng, X.L. The art of data augmentation. J. Comput. Graph. Stat. 2001, 10, 1–50. [Google Scholar] [CrossRef]
  205. Khosrowabadi, R.; Quek, C.; Ang, K.K.; Wahab, A. ERNN: A biologically inspired feedforward neural network to discriminate emotion from EEG signal. IEEE Trans. Neural Netw. Learn. Syst. 2013, 25, 609–620. [Google Scholar] [CrossRef]
  206. Antoniou, A.; Storkey, A.; Edwards, H. Data augmentation generative adversarial networks. arXiv 2017, arXiv:1711.04340. [Google Scholar]
  207. Pan, Z.; Yu, W.; Wang, B.; Xie, H.; Sheng, V.S.; Lei, J.; Kwong, S. Loss functions of generative adversarial networks (GANs): Opportunities and challenges. IEEE Trans. Emerg. Top. Comput. Intell. 2020, 4, 500–522. [Google Scholar] [CrossRef]
  208. Harper, R.; Southern, J. End-to-end prediction of emotion from heartbeat data collected by a consumer fitness tracker. In Proceedings of the 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), Cambridge, UK, 3–6 September 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–7. [Google Scholar]
  209. Sarkar, P.; Etemad, A. Self-supervised learning for ECG-based emotion recognition. In Proceedings of the ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 3217–3221. [Google Scholar]
  210. Li, L.; Chen, J.H. Emotion recognition using physiological signals. In Proceedings of the International Conference on Artificial Reality and Telexistence, Hangzhou, China, 29 November–1 December 2006; Springer: Berlin/Heidelberg, Germany, 2006; pp. 437–446. [Google Scholar]
  211. Lisetti, C.; Nasoz, F.; LeRouge, C.; Ozyer, O.; Alvarez, K. Developing multimodal intelligent affective interfaces for tele-home health care. Int. J. Hum.-Comput. Stud. 2003, 59, 245–255. [Google Scholar] [CrossRef]
  212. Lisetti, C.L.; Nasoz, F. MAUI: A multimodal affective user interface. In Proceedings of the Tenth ACM International Conference on Multimedia, Juan-les-Pins, France, 1–6 December 2002; pp. 161–170. [Google Scholar]
  213. Shimojo, S.; Shams, L. Sensory modalities are not separate modalities: Plasticity and interactions. Curr. Opin. Neurobiol. 2001, 11, 505–509. [Google Scholar] [CrossRef] [PubMed]
  214. D’mello, S.K.; Kory, J. A review and meta-analysis of multimodal affect detection systems. ACM Comput. Surv. (CSUR) 2015, 47, 1–36. [Google Scholar] [CrossRef]
  215. Scherer, K.R. Adding the affective dimension: A new look in speech analysis and synthesis. In Proceedings of the ICSLP, Philadelphia, PA, USA, 3–6 October 1996. [Google Scholar]
  216. Lu, Y.; Zheng, W.L.; Li, B.; Lu, B.L. Combining eye movements and EEG to enhance emotion recognition. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015. [Google Scholar]
  217. Huang, Y.; Yang, J.; Liu, S.; Pan, J. Combining facial expressions and electroencephalography to enhance emotion recognition. Future Internet 2019, 11, 105. [Google Scholar] [CrossRef]
  218. Van Huynh, T.; Yang, H.J.; Lee, G.S.; Kim, S.H.; Na, I.S. Emotion recognition by integrating eye movement analysis and facial expression model. In Proceedings of the 3rd International Conference on Machine Learning and Soft Computing, Da Lat, Vietnam, 25–28 January 2019; pp. 166–169. [Google Scholar]
  219. Soleymani, M.; Pantic, M.; Pun, T. Multimodal emotion recognition in response to videos. IEEE Trans. Affect. Comput. 2011, 3, 211–223. [Google Scholar] [CrossRef]
220. Nguyen, H.D.; Yeom, S.; Oh, I.S.; Kim, K.M.; Kim, S.H. Facial expression recognition using a multi-level convolutional neural network. In Proceedings of the International Conference on Pattern Recognition and Artificial Intelligence, Montréal, QC, Canada, 14–17 May 2018; pp. 217–221. [Google Scholar]
  221. Li, T.H.; Liu, W.; Zheng, W.L.; Lu, B.L. Classification of five emotions from EEG and eye movement signals: Discrimination ability and stability over time. In Proceedings of the 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), San Francisco, CA, USA, 20–23 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 607–610. [Google Scholar]
  222. Zhao, L.M.; Li, R.; Zheng, W.L.; Lu, B.L. Classification of five emotions from EEG and eye movement signals: Complementary representation properties. In Proceedings of the 2019 9th International IEEE/EMBS Conference on Neural Engineering (NER), San Francisco, CA, USA, 20–23 March 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 611–614. [Google Scholar]
  223. Gu, Y.; Yang, K.; Fu, S.; Chen, S.; Li, X.; Marsic, I. Multimodal affective analysis using hierarchical attention strategy with word-level alignment. Proc. Conf. Assoc. Comput. Linguist. Meet. 2018, 2018, 2225. [Google Scholar]
  224. Luna-Jiménez, C.; Kleinlein, R.; Griol, D.; Callejas, Z.; Montero, J.M.; Fernández-Martínez, F. A proposal for multimodal emotion recognition using aural transformers and action units on ravdess dataset. Appl. Sci. 2021, 12, 327. [Google Scholar] [CrossRef]
  225. Simić, N.; Suzić, S.; Milošević, N.; Stanojev, V.; Nosek, T.; Popović, B.; Bajović, D. Enhancing Emotion Recognition through Federated Learning: A Multimodal Approach with Convolutional Neural Networks. Appl. Sci. 2024, 14, 1325. [Google Scholar] [CrossRef]
  226. Wu, Y.; Daoudi, M.; Amad, A. Transformer-based self-supervised multimodal representation learning for wearable emotion recognition. IEEE Trans. Affect. Comput. 2023, 15, 157–172. [Google Scholar] [CrossRef]
  227. Li, D.; Liu, J.; Yang, Y.; Hou, F.; Song, H.; Song, Y.; Gao, Q.; Mao, Z. Emotion recognition of subjects with hearing impairment based on fusion of facial expression and EEG topographic map. IEEE Trans. Neural Syst. Rehabil. Eng. 2022, 31, 437–445. [Google Scholar] [CrossRef]
  228. Middya, A.I.; Nag, B.; Roy, S. Deep learning based multimodal emotion recognition using model-level fusion of audio–visual modalities. Knowl.-Based Syst. 2022, 244, 108580. [Google Scholar] [CrossRef]
  229. Sharafi, M.; Yazdchi, M.; Rasti, R.; Nasimi, F. A novel spatio-temporal convolutional neural framework for multimodal emotion recognition. Biomed. Signal Process. Control 2022, 78, 103970. [Google Scholar] [CrossRef]
  230. Kang, D.; Kim, D.; Kang, D.; Kim, T.; Lee, B.; Kim, D.; Song, B.C. Beyond superficial emotion recognition: Modality-adaptive emotion recognition system. Expert Syst. Appl. 2024, 235, 121097. [Google Scholar] [CrossRef]
  231. Selvi, R.; Vijayakumaran, C. An Efficient Multimodal Emotion Identification Using FOX Optimized Double Deep Q-Learning. Wirel. Pers. Commun. 2023, 132, 2387–2406. [Google Scholar] [CrossRef]
  232. Mocanu, B.; Tapu, R.; Zaharia, T. Multimodal emotion recognition using cross modal audio-video fusion with attention and deep metric learning. Image Vis. Comput. 2023, 133, 104676. [Google Scholar] [CrossRef]
  233. Shahzad, H.; Bhatti, S.M.; Jaffar, A.; Rashid, M.; Akram, S. Multi-modal CNN Features Fusion for Emotion Recognition: A Modified Xception Model. IEEE Access 2023, 11, 94281–94289. [Google Scholar] [CrossRef]
  234. Aguilera, A.; Mellado, D.; Rojas, F. An assessment of in-the-wild datasets for multimodal emotion recognition. Sensors 2023, 23, 5184. [Google Scholar] [CrossRef]
  235. Roshdy, A.; Karar, A.; Kork, S.A.; Beyrouthy, T.; Nait-ali, A. Advancements in EEG Emotion Recognition: Leveraging Multi-Modal Database Integration. Appl. Sci. 2024, 14, 2487. [Google Scholar] [CrossRef]
  236. Han, X.; Chen, F.; Ban, J. FMFN: A Fuzzy Multimodal Fusion Network for Emotion Recognition in Ensemble Conducting. IEEE Trans. Fuzzy Syst. 2024. [Google Scholar] [CrossRef]
  237. Wang, Y.; Guan, L.; Venetsanopoulos, A.N. Kernel cross-modal factor analysis for information fusion with application to bimodal emotion recognition. IEEE Trans. Multimed. 2012, 14, 597–607. [Google Scholar] [CrossRef]
  238. Xie, Z.; Guan, L. Multimodal information fusion of audiovisual emotion recognition using novel information theoretic tools. In Proceedings of the 2013 IEEE International Conference on Multimedia and Expo (ICME), San Jose, CA, USA, 15–19 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 1–6. [Google Scholar]
  239. Bota, P.; Wang, C.; Fred, A.; Silva, H. Emotion assessment using feature fusion and decision fusion classification based on physiological data: Are we there yet? Sensors 2020, 20, 4723. [Google Scholar] [CrossRef]
  240. Arthanarisamy Ramaswamy, M.P.; Palaniswamy, S. Subject independent emotion recognition using EEG and physiological signals—A comparative study. Appl. Comput. Inform. 2022. [Google Scholar] [CrossRef]
  241. Douglas-Cowie, E.; Cowie, R.; Schröder, M. A new emotion database: Considerations, sources and scope. In Proceedings of the ISCA Tutorial and Research Workshop (ITRW) on Speech and Emotion, Newcastle, Northern Ireland, UK, 5–7 September 2000; pp. 39–44. [Google Scholar]
242. Martin, O.; Kotsia, I.; Macq, B.; Pitas, I. The eNTERFACE’05 Audio-Visual Emotion Database. In Proceedings of the 22nd International Conference on Data Engineering Workshops, Atlanta, GA, USA, 3–7 April 2006; p. 8. [Google Scholar]
  243. Douglas-Cowie, E.; Cowie, R.; Sneddon, I.; Cox, C.; Lowry, O.; McRorie, M.; Martin, J.C.; Devillers, L.; Abrilian, S.; Batliner, A.; et al. The HUMAINE Database: Addressing the Collection and Annotation of Naturalistic and Induced Emotional Data. In International Conference on Affective Computing and Intelligent Interaction; Paiva, A.C.R., Prada, R., Picard, R.W., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; Volume 4738, pp. 488–500. [Google Scholar] [CrossRef]
  244. McKeown, G.; Valstar, M.; Cowie, R.; Pantic, M.; Schroder, M. The SEMAINE Database: Annotated Multimodal Records of Emotionally Colored Conversations between a Person and a Limited Agent. IEEE Trans. Affect. Comput. 2012, 3, 5–17. [Google Scholar] [CrossRef]
  245. Livingstone, S.R.; Russo, F.A. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 2018, 13, e0196391. [Google Scholar] [CrossRef] [PubMed]
  246. Cao, H.; Cooper, D.G.; Keutmann, M.K.; Gur, R.C.; Nenkova, A.; Verma, R. Crema-d: Crowd-sourced emotional multimodal actors dataset. IEEE Trans. Affect. Comput. 2014, 5, 377–390. [Google Scholar] [CrossRef] [PubMed]
  247. Bänziger, T.; Mortillaro, M.; Scherer, K.R. Introducing the Geneva Multimodal expression corpus for experimental research on emotion perception. Emotion 2012, 12, 1161. [Google Scholar] [CrossRef]
  248. Busso, C.; Parthasarathy, S.; Burmania, A.; AbdelWahab, M.; Sadoughi, N.; Provost, E.M. MSP-IMPROV: An acted corpus of dyadic interactions to study emotion perception. IEEE Trans. Affect. Comput. 2017, 8, 67–80. [Google Scholar] [CrossRef]
  249. Ngai, W.K.; Xie, H.; Zou, D.; Chou, K.L. Emotion recognition based on convolutional neural networks and heterogeneous bio-signal data sources. Inf. Fusion 2022, 77, 107–117. [Google Scholar] [CrossRef]
  250. Liu, Y.J.; Zhang, J.K.; Yan, W.J.; Wang, S.J.; Zhao, G.; Fu, X. A main directional mean optical flow feature for spontaneous micro-expression recognition. IEEE Trans. Affect. Comput. 2016, 7, 299–310. [Google Scholar] [CrossRef]
  251. Xia, Z.; Hong, X.; Gao, X.; Feng, X.; Zhao, G. Spatiotemporal recurrent convolutional networks for recognizing spontaneous micro-expressions. IEEE Trans. Multimed. 2019, 22, 626–640. [Google Scholar] [CrossRef]
  252. Peng, W.; Hong, X.; Xu, Y.; Zhao, G. A boost in revealing subtle facial expressions: A consolidated eulerian framework. In Proceedings of the 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), Lille, France, 14–18 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1–5. [Google Scholar]
  253. Sun, Y.; Zheng, L.; Yang, Y.; Tian, Q.; Wang, S. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the 15th European Conference on Computer Vision (ECCV2018), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 501–518. [Google Scholar]
  254. Nguyen, H.D.; Yeom, S.; Lee, G.S.; Yang, H.J.; Na, I.S.; Kim, S.H. Facial emotion recognition using an ensemble of multi-level convolutional neural networks. Int. J. Pattern Recognit. Artif. Intell. 2019, 33, 1940015. [Google Scholar] [CrossRef]
  255. Hu, G.; Liu, L.; Yuan, Y.; Yu, Z.; Hua, Y.; Zhang, Z.; Shen, F.; Shao, L.; Hospedales, T.; Robertson, N.; et al. Deep multi-task learning to recognise subtle facial expressions of mental states. In Proceedings of the 15th European Conference on Computer Vision (ECCV2018), Munich, Germany, 8–14 September 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 106–123. [Google Scholar]
  256. Ganin, Y.; Lempitsky, V. Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning (PMLR 37), Lille, France, 7–9 July 2015; Volume 37, pp. 1180–1189. [Google Scholar]
  257. Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-Adversarial Training of Neural Networks. In Domain Adaptation in Computer Vision Applications; Csurka, G., Ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 189–209. [Google Scholar] [CrossRef]
  258. Bhattacharya, P.; Gupta, R.K.; Yang, Y. Exploring the contextual factors affecting multimodal emotion recognition in videos. IEEE Trans. Affect. Comput. 2021, 14, 1547–1557. [Google Scholar] [CrossRef]
  259. Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric deep learning: Going beyond euclidean data. IEEE Signal Process. Mag. 2017, 34, 18–42. [Google Scholar] [CrossRef]
  260. Zheng, J. Geometric Deep Learning with 3D Facial Motion. Master’s Thesis, Imperial College London, London, UK, 2019. [Google Scholar]
Figure 1. Outline of this study’s review protocol.
Figure 2. PRISMA flow diagram.
Figure 3. Two-dimensional convolutional network-based method.
Figure 4. Multistream convolutional neural network-based methods.
Figure 5. Three-dimensional convolutional neural network-based methods.
Figure 6. Recurrent convolutional network-based methods.
Figure 7. Typical multimodal emotion recognition framework.
Table 1. Search query/term.
Query Structure: Subexpression 1 AND Subexpression 2 AND Subexpression 3
subexpression 1: multimodal OR “multi modal” OR multi-modal OR “multiple modalities” OR multimodality OR multi-modality OR “multiple channels” OR multichannel OR multi-channel OR “multiple sensors” OR multisensor OR multi-sensor OR bimodal OR bi-modal OR bimodality OR bi-modality OR trimodal OR tri-modal OR trimodality OR tri-modality
subexpression 2: “emotion analysis” OR “emotion recognition” OR “emotion classification” OR “emotion* detection” OR “emotion computing” OR “emotion sensing” OR “emotion assessment” OR “affect recognition” OR “affective computing” OR “emotional state recognition” OR “affective state recognition”
subexpression 3: visual OR facial OR face OR “body movement*” OR “body motion*” OR gesture* OR posture* OR gesticulation* OR eye OR gaze OR “pupil* dilation” OR “pupil* reflex” OR “pupil* response” OR pupillometry OR pupillogra* OR oculogra* OR lip* OR video* OR audiovisual OR audio-visual OR vocal* OR speech OR audio* OR voice* OR physiological OR biological OR psychophysiological OR biosignal* OR “bio signal*” OR “bio-signal*” OR electroencephalogra* OR eeg OR magnetoencephalogra* OR electrocardiogra* OR ecg OR ekg OR “heart rate” OR “cardiac activit*” OR electromyogra* OR emg OR temg OR “muscle” OR “blood volume” OR bvp OR “blood pressure” OR “blood pulse” OR electrodermal OR eda OR “galvanic skin” OR gsr OR “skin conductance” OR psychogalvanic OR respiration OR accelerometer* OR acceleration* OR electrooculogra* OR eog OR heog OR photoplethysmogra* OR ppg OR “inter-beat interval” OR “interbeat interval” OR “inter beat interval” OR “brain wave*” OR “brain signal*” OR “brain activit*” OR temperature
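As a minimal illustrative sketch (not part of the published search protocol), the three subexpressions above could be assembled programmatically into a single boolean string before submission to a bibliographic database interface. The variable names and the abbreviated term lists below are hypothetical placeholders; the complete term lists are those given in Table 1.

# Illustrative only: combine the three subexpressions from Table 1 into one boolean query.
# The term lists are truncated here; the full lists appear in the table above.
subexpression_1 = ['multimodal', '"multi modal"', 'multi-modal', 'bimodal', 'trimodal']
subexpression_2 = ['"emotion recognition"', '"affect recognition"', '"affective computing"']
subexpression_3 = ['visual', 'facial', 'speech', 'physiological', 'eeg', 'ecg']

def or_group(terms):
    # Wrap a list of terms into a parenthesised OR group.
    return '(' + ' OR '.join(terms) + ')'

# AND the three OR groups together, following the query structure above.
query = ' AND '.join(or_group(g) for g in (subexpression_1, subexpression_2, subexpression_3))
print(query)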
Table 2. Facial macro-expression recognition using handcrafted features.
Ref | Year | Database | Features/Classifier | Best Performance
[90] | 2011 | BU3DFE | LGBP+LBP/SVM | Acc: 71.10%
[89] | 2013 | CK; CK+; JAFFE; MMI; CMU-PIE | LDN/SVM | Acc: 96.60%; 89.30%; 91.60%; 95.80%; 94.40%
[86] | 2013 | CK+ | Geometric/SVM | Acc: 97.35%
[91] | 2014 | CK+ | AAM/NN | Acc: 85.73%
[94] | 2020 | CK+ | AUs/RF | Acc: 75.61%
[87] | 2021 | In-house | ICAT/RF | Acc: 98.17%
[88] | 2021 | CK+; MUG | AUs/Ensemble | Acc: 72.55%; 88.37%
[95] | 2023 | MUG | AUs and geometric features/ANFIS | Acc: 99.6%
Table 3. Facial micro-expression recognition using handcrafted features.
Ref | Year | Database | Features/Classifier | Best Performance
[96] | 2011 | SMIC | LBP-TOP + TIM/MKL | Acc: 71.4%
[100] | 2013 | SMIC-HS | LBP-TOP + TIM/SVM | Acc: 48.78%
[111] | 2014 | SMIC; CASME II | STM/AdaBoost | F1: 0.4731, Acc: 44.34%; F1: 0.3337, Acc: 43.78%
[97] | 2015 | SMIC; CASME II | LBP-SIP/SVM | Acc: 62.80%; 66.40%
[99] | 2015 | SMIC; CASME II | LBP-MOP/SVM | Acc: 50.61%; 45.75%
[98] | 2015 | SMIC; CASME II | STLBP-IP | Acc: 57.93%; 59.51%
[112] | 2015 | CASME II | Monogenic Riesz Wav./SVM | F1: 0.4307
[104] | 2015 | CASME II | LBP-TOP with adapt. magnification/SVM | Acc: 69.63%
[106] | 2015 | SMIC; CASME II | OSW-LBP-TOP/SVM | Acc: 57.54%; 66.40%
[107] | 2016 | SMIC; CASME II | OSF + OSW/SVM | Acc: 52.44%; 63.16%
[110] | 2016 | CASME; CASME II | LBP-TOP/RK-SVD | Acc: 69.04%; 63.25%
[101] | 2017 | CASME II | LBP-TOP with EVM/SVM | Acc: 75.30%
[21] | 2017 | SMIC; CASME II | LBP-TOP with a sparse sampling/SVM | Acc: 58.00%; 49.00%
[109] | 2017 | SMIC-HS; CASME; CASME II | FDM/SVM | F1: 0.5380, Acc: 54.88%; F1: 0.2401, Acc: 42.02%; F1: 0.2972, Acc: 41.96%
[108] | 2018 | CAS(ME)2; CASME II; SMIC | Bi-WOOF/SVM | F1: 0.47, Acc: 59.26%; F1: 0.61; F1: 0.62
[102] | 2018 | SMIC-HS; CASME II | HIGO + Magnification/SVM | Acc: 75.00%; 78.14%
[113] | 2019 | SMIC; CASME; CASME II | DiSTLBP-RIP/SVM | Acc: 63.41%; 64.33%; 64.78%
Table 5. Microexpression recognition for spontaneous datasets using deep learning features.
Ref | Database | Features/Classifier | Best Performance
2D CNN
[117] | SMIC, CASME II and SAMM | EMR | UF1: 0.7885 and UAR: 0.7824
[132] | SMIC, CASME II and SAMM | Dual-Inception | UF1: 0.7322 and UAR: 0.7278
[133] | SMIC; CASME II; SAMM | ResNet, Micro-Attention | Acc: 49.4%; 65.9%; 48.5%
[134] | SMIC; CASME II; SAMM | OFF-ApexNet | Acc: 67.6%; 88.28%; 69.18%
RCNN
[135] | SMIC; CASME; CASME II | MER-RCNN | Acc: 57.1%; 63.2%; 65.8%
3D-CNN
[114] | CASME I/II | DTSCNN | Acc: 66.67%
[136] | SMIC, CASME II and SAMM | STSTNet | UF1: 0.7353 and UAR: 0.7605
[137] | SMIC; CASME; CASME II | 3D-FCNN | Acc: 54.49%; 54.44%; 59.11%
Combined 2D CNN and 3D CNN
[138] | SMIC; CASME II; SAMM + CASME II | TSNN-IF; TSNN-LF | F1: 0.6631, UAR: 0.6566, WAR: 0.7547; F1: 0.6921, UAR: 0.6833, WAR: 0.7632
Combined 2D-CNN and LSTM/GRU/RNN
[139] | SMIC; CASME II; SAMM | Apex–time network | UF1: 0.497 and UAR: 0.489; UF1: 0.523 and UAR: 0.501; UF1: 0.429 and UAR: 0.427
2D-CNN and then LSTM/GRU/RNN
[129] | CASME II | CNN-LSTM | Acc: 60.98%
[130] | CASME II; SAMM | ELRCN | F1 score: 0.5; 0.409
Spatial contextual
[140] | SMIC, CASME II and SAMM | CapsuleNet | UF1: 0.6520 and UAR: 0.6506
[141] | CASME II; SAMM | AU-GACN | Acc: 49.2%; 48.9%
Table 6. Facial emotion expression datasets.
Dataset | No. Samples | No. Characters | Comments
FER2013 | 35 K (small) [35,887] | 35,887 | Hard: Web/Montreal
JAFFE [144] | 213 | 10 actors | Medium
BU-3DFE [145] | 2500 | 100 | Video
RAF-DB [146] | 30 K | 100 | Medium
Oulu-CASIA | 30 K | 80 subjects | Medium
FERG | 55 K | 6 3D characters | Easy
KDEF [147] | 837 | 36 | Image
SFER | 30 K | – | Subtle
CK+ [148] | 327 (593) | 123 subjects | Video
MUG [149] | 75 K | 476 | Image
TFEID [150] | 368 | 4 | Image
RaFD [151] | 676 | 18 | Video
CASME II [152] | 247 | 26 subjects | Spontaneous subtle: video
SMIC [100] | 164 | 16 subjects | Spontaneous subtle
SAMM [153] | 159 | 32 subjects | Spontaneous subtle: video
Table 7. Body gesture-based recognition.
Ref | Year | Database | Elicitation | Features | Classifier | Average Accuracy
[158] | 2018 | FABO | Acted | Keyframes, HMI + CNN + convLSTM | MLP | 72.50%
[16] | 2019 | GEMEP | Acted | CNN | MLP | 95.40%
[159] | 2021 | iMiGUE (micro-gestures) | Natural | BiLSTM (encoder)/LSTM (decoder) | BiLSTM (encoder)/LSTM (decoder) | 55.00%
[160] | 2022 | MASR | Acted | BiLSTM with attention module | HPN/SAE | 92.42% (seen)/67.85% (unseen)
SAE: semantic auto-encoder, HPN: hierarchical prototype network.
Table 8. Speech emotion expression datasets.
Dataset | No. of Emotions | No. of Utterances | Persons | Comments | Difficulty
EMO-DB [168] | 7 | 535 | 10 | Acted | Medium + text
EVD [169] | 5 | 9750 | 5 | Acted | Medium
CHAD [170] | 7 | 6228 | 42 | Acted | Medium
Table 9. Speech emotion recognition methods using handcrafted features.
Ref | Year | Database | Features | Classifier | Average Accuracy
[171] | 2018 | EMO-DB; SAVEE | Combined WLPCCVQ and WMFCCVQ | RBFNN | 91.82%; 93.67%
[172] | 2019 | EMO-DB | AM–FM modulation features (MS, MFF), cepstral features (ECC, EFCC, SMFCC) from THT | SVM | 86.22%
[173] | 2020 | EMO-DB; SAVEE | Combination of either MFCC, HFCC, TBFCC-B or TBFCC-E | SVM | 77.08%; 55.83%
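The handcrafted speech pipelines in Table 9 reduce each utterance to statistics of cepstral or prosodic features before a conventional classifier. A minimal sketch of that recipe, using plain MFCC means and standard deviations with an RBF-kernel SVM on synthetic placeholder audio, is shown below; the cited works use richer weighted cepstral variants and tuned classifiers.

```python
# Minimal sketch of a handcrafted speech-emotion pipeline: utterance-level MFCC
# statistics fed to an SVM. Synthetic audio and binary labels are placeholders.
import numpy as np
import librosa
from sklearn.svm import SVC

def utterance_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, frames)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

sr = 16000
rng = np.random.default_rng(0)
utterances = [rng.standard_normal(sr).astype(np.float32) for _ in range(10)]  # 1 s of noise each
labels = rng.integers(0, 2, size=10)

X = np.stack([utterance_features(y, sr) for y in utterances])
clf = SVC(kernel="rbf").fit(X, labels)
print(clf.score(X, labels))
```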
Table 10. Speech emotion recognition methods using deep learning features.
Ref | Year | Database | Features | Classifier | Average Accuracy
[174] | 2008 | Crema-D | CNN | CNN | 60–70%
[175] | 2018 | AESDD | LSTM | LSTM | F1 score: 0.6523
[176] | 2021 | LSSED | GRU | GRU | Acc: 53.45%
[177] | 2020 | ESD | CNN | SVM | Acc: 63.78%
[178] | 2022 | AESDD | CNN | CNN | 77%
[179] | 2023 | TESS; EMO-DB; RAVDESS; SAVEE; CREMA-D | Ensemble of 1D CNN, LSTM and GRU | FCN | WAAcc: 99.46%; 95.42%; 95.62%; 93.22%; 90.47%
[180] | 2024 | IEMOCAP; NoiseX92 | Wave-U-Net | Wave-U-Net | 66.2%; 62.4% (at 0 dB SNR)
[181] | 2024 | EMO-DB; RAVDESS; SAVEE; IEMOCAP; SHEIE | Ensemble of Transformer, CNN and LSTM | FCN | 99.86%; 96.3%; 96.5%; 85.3%; 83%
[182] | 2023 | RAVDESS; EMOVO | Statistical mean of MFCCs | 3-1D CNN | Acc: 87.08%; Acc: 83.90% (speaker-dependent)
[183] | 2019 | EMO-DB | Feature fusion by DBNs: prosody features (fundamental frequency, power), voice quality features (the first, second and third formants with their bandwidths), spectral features (WPCC and W-WPCC) | DBNs/SVM | Acc: 86.60%
[172] | 2019 | EMO-DB | AM–FM modulation features (MS, MFF), cepstral features (ECC, EFCC, SMFCC) from THT | RNN | Acc: 91.16%
FCN: fully connected network; Acc: accuracy; WAAcc: weighted average accuracy.
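Deep speech-emotion models such as the CNN entries in Table 10 typically operate on time–frequency representations. The sketch below shows a small CNN over log-mel spectrograms; the mel resolution, frame count and four emotion classes are illustrative assumptions rather than any cited architecture.

```python
# Minimal sketch of a 2D CNN over log-mel spectrograms for speech emotion (illustrative only).
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # global pooling over time and frequency
        )
        self.fc = nn.Linear(32, num_classes)

    def forward(self, x):                            # x: (batch, 1, n_mels, frames)
        return self.fc(self.conv(x).flatten(1))

model = SpectrogramCNN()
logmels = torch.randn(8, 1, 64, 200)                 # 8 hypothetical utterances, 64 mel bands
print(model(logmels).shape)                          # torch.Size([8, 4])
```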
Table 11. EEG-based emotion recognition methods.
Ref | Year | Database | Elicitation | Features | Classifier | Accuracy
[190] | 2014 | GAMEEMO | Acted | PS; wavelet; ENT; HE; FD | SVM | 87.53%; 78.41%; 65.12%; 71.38%; 70.63%
[195] | 2014 | DEAP | Induced | Sample entropy | SVM | val: 80.43%, ar: 79.11%
[192] | 2016 | DEAP | Induced | Sample entropy of IMFs | SVM | 93.20%
[196] | 2016 | In-house dataset induced by IADS | Induced | FD, 5 statistics and 4 band powers (θ, α, β, θ/β ratio) | SVM | 35.76%
[194] | 2011 | In-house dataset induced by audio-visual stimuli | Induced | Wavelet | kNN; LDA | 83.04%; 80.52%
[193] | 2018 | DEAP | Induced | Entropy from temporal window | MLPs combined through DST | ar: 87.43%, val: 88.74%
[197] | 2018 | SEED; DREAMER | Induced | Differential entropy; PSD | DGCNN | 79.95%; val: 86.23%, ar: 84.54%, dom: 85.02%
[198] | 2021 | SEED | Induced | Differential entropy and brain network | CNN | 91.45%
[199] | 2024 | SEED | Induced | AND; NDE | SVM | 79.16–91.39%; 81.66–85.39%
[200] | 2024 | DEAP | Induced | Hybrid of time, frequency, time–frequency and location features | kNN; SVM; ANN | 82%; 83%; 96%
val: valence; ar: arousal; dom: dominance; AND: average node degree; NDE: node degree entropy; HE: Hurst exponent; FD: fractal dimension; IMF: intrinsic mode functions of an empirical mode decomposition; DST: Dempster–Shafer theory; PS: power spectrum; ENT: approximate entropy.
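Differential entropy, used by several of the EEG methods in Table 11, is usually computed per frequency band under a Gaussian assumption as DE = 0.5 ln(2πeσ²). A minimal sketch with band-pass filtering, per-channel DE features and an SVM on synthetic trials is given below; the band edges, sampling rate, channel count and binary valence labels are illustrative assumptions.

```python
# Minimal sketch of band-wise differential entropy (Gaussian assumption) plus an SVM.
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.svm import SVC

BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def differential_entropy(trial, fs=128):
    """Band-wise DE for one trial of shape (channels, samples)."""
    feats = []
    for low, high in BANDS.values():
        b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, trial, axis=1)
        var = filtered.var(axis=1) + 1e-12
        feats.append(0.5 * np.log(2 * np.pi * np.e * var))   # DE of a Gaussian signal
    return np.concatenate(feats)

rng = np.random.default_rng(0)
trials = rng.standard_normal((20, 32, 128 * 4))     # 20 trials, 32 channels, 4 s at 128 Hz
labels = rng.integers(0, 2, size=20)                # e.g., low vs. high valence

X = np.stack([differential_entropy(t) for t in trials])
clf = SVC(kernel="rbf").fit(X, labels)
print(X.shape, clf.score(X, labels))
```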
Table 12. Physiologically based emotion datasets.
Dataset | Modalities | No. of Emotions | No. of Subjects | Comments
WESAD | BVP, ECG, EDA, EMG, RESP, TEMP, ACC | 6 | 15 | Induced
SWELL | ECG | Multi-dimensional (val, ar, dom) | 25 | Induced
AMIGOS | ECG, EEG and GSR | 7 | 40 | Induced
GAMEEMO | EEG | Multi-dimensional (val, ar) | 28 | Induced
val: valence, ar: arousal, dom: dominance, EDA: electrodermal activity, EMG: electromyogram, RESP: respiration, TEMP: body temperature, BVP: blood volume pulse, ACC: three-axis acceleration.
Table 14. Multimodal emotion recognition using handcrafted features.
Ref | Year | Database | Modality | Features | Classifier | Average Accuracy | Fusion Method
[164] | 2015 | eNTERFACE | Audio/facial (visual)/text combined | - | SVM | Acc: 87.95% | Feature-level fusion (concatenation)
[237] | 2012 | RML; eNTERFACE | Visual/video and audio | - | HMM | Acc: 80–85%; 70–75% | Kernel-based feature-level fusion and decision-level fusion
[238] | 2013 | RML; eNTERFACE | Facial visual and audio | - | HMM | Acc: 70–80%; 70–85% | Combination of feature-level (KECA) and decision-level fusion
[239] | 2020 | ITMDER; WESAD | ECG, EDA, RESP and BVP | - | SVM (ar), RF (val); QDA (ar), SVM (val) | ar: 87.6%, val: 89.26%; ar: 87.6%, val: 92.9% | Feature-level fusion (concatenation)
[219] | 2011 | In-house dataset (induced) | Eye gaze and EEG | - | SVM | (FF) ar: 66.4%, val: 58.4%; (DF) ar: 76.4%, val: 68.5% | Feature-level (FF) and decision-level (DF) fusion
[217] | 2019 | MAHNOB-HCI; DEAP | Facial and EEG | CNN/PSD | CNN/SVM | ar: 74.17%, val: 75.21%; ar: 71.54%, val: 80.00% | AdaBoost for decision-level fusion
[240] | 2022 | DEAP | EOG and EMG | PSD, Hjorth activity and complexity | Logit boost | 59% | Data-level fusion
KECA: kernel entropy component analysis, ECG: electrocardiography, EDA: electrodermal activity, RESP: respiration, BVP: blood volume pulse, RF: random forest, QDA: quadratic discriminant analysis, ar: arousal, val: valence.
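Table 14 is dominated by two fusion strategies: feature-level fusion, which concatenates per-modality features before a single classifier, and decision-level fusion, which combines per-modality posteriors. The sketch below contrasts the two on synthetic audio and physiological feature blocks; the feature dimensions and binary arousal labels are placeholders, not data from the cited studies.

```python
# Minimal sketch contrasting feature-level (early) and decision-level (late) fusion.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((60, 20))          # e.g., prosodic/cepstral statistics
physio_feats = rng.standard_normal((60, 12))         # e.g., ECG/EDA statistics
labels = rng.integers(0, 2, size=60)                 # e.g., low vs. high arousal

# Feature-level fusion: concatenate modalities, then train one classifier.
fused = np.hstack([audio_feats, physio_feats])
early = SVC(kernel="rbf", probability=True).fit(fused, labels)

# Decision-level fusion: one classifier per modality, then average posteriors.
clf_audio = SVC(kernel="rbf", probability=True).fit(audio_feats, labels)
clf_physio = SVC(kernel="rbf", probability=True).fit(physio_feats, labels)
posterior = (clf_audio.predict_proba(audio_feats) + clf_physio.predict_proba(physio_feats)) / 2
late_pred = posterior.argmax(axis=1)

print("feature-level accuracy:", early.score(fused, labels))
print("decision-level accuracy:", (late_pred == labels).mean())
```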
Table 15. Multimodal emotion recognition datasets.
Dataset | No. Emotions | No. Utterances | No. Persons | Comments
Belfast [241] | 8 | 1440 | 60 | Dimensional + categorical
eNTERFACE [242] | 7 | 239 | 42 | -
HUMAINE [243] | 7 | 50 | 10 | Naturalistic and induced data + text
IEMOCAP [174] | 10 | 5K–10K | 10 | Actors' conversation + text
SEMAINE [244] | 7 | 959 | 150 | -
RAVDESS [245] | 8 | 1440 | 60 | -
CREMA-D [246] | 6 | 7442 | 91 | -
GEMEP [247] | 15 | 1260 | 10 | -
MSP-IMPROV [248] | - | - | 12 | -
DEAP | - | - | 25 (useful) | EEG and audio; dimensional emotion model description