
Sensor Based Multi-Modal Emotion Recognition

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: closed (30 November 2022) | Viewed by 63933
Please contact the Guest Editor or the Section Managing Editor at ([email protected]) for any queries.

Special Issue Editors


Guest Editor
Pattern Recognition Lab, Chonnam National University, Gwangju, Republic of Korea
Interests: deep-learning-based emotion recognition; medical image analysis; pattern recognition

Guest Editor
Chonnam National University, Gwangju, South Korea
Interests: image processing; computer vision; medical imaging

Special Issue Information

Dear Colleagues,

Emotion recognition is one of the most active topics in AI research. This Special Issue is being assembled to share in-depth research results related to emotion recognition, such as the classification of emotion categories (anger, disgust, fear, happiness, sadness, surprise, neutral, etc.), arousal/valence estimation, and the assessment of mental states such as stress, pain, cognitive load, engagement, curiosity, and humor. All of these problems deal with streams of data not only from individual sensors such as RGB-D cameras, EEG/ECG/EMG sensors, wearable devices, or smartphones, but also from the fusion of multiple sensors.

Please join this Special Issue entitled “Sensor-Based Multi-Modal Emotion Recognition”, and contribute your valuable research progress. Thank you very much.

Prof. Soo-Hyung Kim
Prof. Gueesang Lee
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multi-modal emotion recognition
  • audio-visual, EEG/ECG/EMG, wearable devices
  • emotion classification
  • arousal/valence estimation
  • stress, pain, cognitive load, engagement, curiosity, humor
  • related issues in emotion recognition or sentiment analysis

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies is available on the MDPI website.

Published Papers (13 papers)


Research

19 pages, 8273 KiB  
Article
Robust Human Face Emotion Classification Using Triplet-Loss-Based Deep CNN Features and SVM
by Irfan Haider, Hyung-Jeong Yang, Guee-Sang Lee and Soo-Hyung Kim
Sensors 2023, 23(10), 4770; https://doi.org/10.3390/s23104770 - 15 May 2023
Cited by 7 | Viewed by 3488
Abstract
Human facial emotion detection is one of the challenging tasks in computer vision. Owing to high inter-class variance, it is hard for machine learning models to predict facial emotions accurately. Moreover, a person with several facial emotions increases the diversity and complexity of classification problems. In this paper, we have proposed a novel and intelligent approach for the classification of human facial emotions. The proposed approach comprises customized ResNet18 by employing transfer learning with the integration of triplet loss function (TLF), followed by SVM classification model. Using deep features from a customized ResNet18 trained with triplet loss, the proposed pipeline consists of a face detector used to locate and refine the face bounding box and a classifier to identify the facial expression class of discovered faces. RetinaFace is used to extract the identified face areas from the source image, and a ResNet18 model is trained on cropped face images with triplet loss to retrieve those features. An SVM classifier is used to categorize the facial expression based on the acquired deep characteristics. In this paper, we have proposed a method that can achieve better performance than state-of-the-art (SoTA) methods on JAFFE and MMI datasets. The technique is based on the triplet loss function to generate deep input image features. The proposed method performed well on the JAFFE and MMI datasets with an accuracy of 98.44% and 99.02%, respectively, on seven emotions; meanwhile, the performance of the method needs to be fine-tuned for the FER2013 and AFFECTNET datasets. Full article
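To make the pipeline concrete, here is a minimal sketch (not the authors' released code) of triplet-loss feature learning followed by SVM classification, assuming PyTorch, torchvision, and scikit-learn; face detection and cropping (e.g., with RetinaFace) are omitted, and `anchor`, `positive`, `negative`, `X_train`, and `y_train` are hypothetical placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

# Backbone: ImageNet-pretrained ResNet18 (transfer learning) with the
# classification head replaced by an embedding layer.
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Linear(backbone.fc.in_features, 128)  # 128-D face embedding

triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

def train_step(anchor, positive, negative):
    """One triplet update: anchor/positive share an emotion class, negative does not."""
    optimizer.zero_grad()
    loss = triplet_loss(backbone(anchor), backbone(positive), backbone(negative))
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def embed(batch):
    """Deep features used downstream by the SVM."""
    backbone.eval()
    return backbone(batch).cpu().numpy()

# After training, fit an SVM on the learned embeddings and classify expressions
# (X_train/X_test: batches of cropped face tensors, y_train: emotion labels; hypothetical):
# from sklearn.svm import SVC
# svm = SVC(kernel="rbf").fit(embed(X_train), y_train)
# predictions = svm.predict(embed(X_test))
```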

43 pages, 9806 KiB  
Article
ADABase: A Multimodal Dataset for Cognitive Load Estimation
by Maximilian P. Oppelt, Andreas Foltyn, Jessica Deuschel, Nadine R. Lang, Nina Holzer, Bjoern M. Eskofier and Seung Hee Yang
Sensors 2023, 23(1), 340; https://doi.org/10.3390/s23010340 - 28 Dec 2022
Cited by 9 | Viewed by 7113
Abstract
Driver monitoring systems play an important role in lower to mid-level autonomous vehicles. Our work focuses on the detection of cognitive load as a component of driver-state estimation to improve traffic safety. By inducing single and dual-task workloads of increasing intensity on 51 subjects, while continuously measuring signals from multiple modalities, including physiological measurements such as ECG, EDA, EMG, PPG, respiration rate, skin temperature and eye tracker data, as well as behavioral measurements such as action units extracted from facial videos, performance metrics like reaction time, and subjective feedback from questionnaires, we create ADABase (Autonomous Driving Cognitive Load Assessment Database). As a reference method to induce cognitive load onto subjects, we use the well-established n-back test, in addition to our novel simulator-based k-drive test, motivated by real-world semi-autonomous vehicles. We extract expert features from all measurements and find significant changes in multiple modalities. Ultimately, we train and evaluate machine learning algorithms using single and multimodal inputs to distinguish cognitive load levels. We carefully evaluate model behavior and study feature importance. In summary, we introduce a novel cognitive load test, create a cognitive load database, validate changes using statistical tests, introduce novel classification and regression tasks for machine learning, and train and evaluate machine learning models. Full article
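As a rough illustration of subject-independent cognitive-load classification of the kind described above, the sketch below trains a random forest on a placeholder expert-feature matrix with scikit-learn's GroupKFold so that no subject appears in both training and test folds; the feature dimensions, window counts, and three load levels are invented, not ADABase's actual layout.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Hypothetical expert-feature matrix: one row per analysis window, columns pooled
# from ECG/EDA/EMG/PPG/respiration/eye-tracking features plus facial action units.
rng = np.random.default_rng(0)
X = rng.normal(size=(510, 64))          # placeholder features
y = rng.integers(0, 3, size=510)        # cognitive-load level (e.g., n-back 0/1/2)
groups = np.repeat(np.arange(51), 10)   # subject IDs keep each subject in one fold

clf = RandomForestClassifier(n_estimators=300, random_state=0)
scores = cross_val_score(clf, X, y, groups=groups, cv=GroupKFold(n_splits=5))
print("subject-independent accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```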

21 pages, 520 KiB  
Article
Feature Selection for Continuous within- and Cross-User EEG-Based Emotion Recognition
by Nicole Bendrich, Pradeep Kumar and Erik Scheme
Sensors 2022, 22(23), 9282; https://doi.org/10.3390/s22239282 - 29 Nov 2022
Viewed by 2068
Abstract
The monitoring of emotional state is important in the prevention and management of mental health problems and is increasingly being used to support affective computing. As such, researchers are exploring various modalities from which emotion can be inferred, such as through facial images or via electroencephalography (EEG) signals. Current research commonly investigates the performance of machine-learning-based emotion recognition systems by exposing users to stimuli that are assumed to elicit a single unchanging emotional response. Moreover, in order to demonstrate better results, many models are tested in evaluation frameworks that do not reflect realistic real-world implementations. Consequently, in this paper, we explore the design of EEG-based emotion recognition systems using longer, variable stimuli using the publicly available AMIGOS dataset. Feature engineering and selection results are evaluated across four different cross-validation frameworks, including versions of leave-one-movie-out (testing with a known user, but a previously unseen movie), leave-one-person-out (testing with a known movie, but a previously unseen person), and leave-one-person-and-movie-out (testing on both a new user and new movie). Results of feature selection lead to a 13% absolute improvement over comparable previously reported studies, and demonstrate the importance of evaluation framework on the design and performance of EEG-based emotion recognition systems. Full article
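A compact sketch of the leave-one-group-out evaluation frameworks discussed above, using scikit-learn's LeaveOneGroupOut; the feature matrix, labels, and user/movie IDs below are random placeholders rather than AMIGOS data, and the combined leave-one-person-and-movie-out setting would additionally require excluding both the test user and the test movie from each training fold.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

# Placeholder EEG feature matrix: rows are segments, each tagged with the user
# and movie that produced it (hypothetical IDs, not the AMIGOS metadata schema).
rng = np.random.default_rng(1)
X = rng.normal(size=(400, 32))
y = rng.integers(0, 2, size=400)          # e.g., high/low valence
users = rng.integers(0, 10, size=400)
movies = rng.integers(0, 4, size=400)

def evaluate(groups, name):
    """Leave-one-group-out: every fold tests on one unseen user (or movie)."""
    accs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model = SVC().fit(X[train_idx], y[train_idx])
        accs.append(model.score(X[test_idx], y[test_idx]))
    print(f"{name}: {np.mean(accs):.3f}")

evaluate(users, "leave-one-person-out")
evaluate(movies, "leave-one-movie-out")
```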

15 pages, 851 KiB  
Article
EEG-Based Emotion Classification Using Stacking Ensemble Approach
by Subhajit Chatterjee and Yung-Cheol Byun
Sensors 2022, 22(21), 8550; https://doi.org/10.3390/s22218550 - 6 Nov 2022
Cited by 31 | Viewed by 4209
Abstract
Rapid advancements in the medical field have drawn much attention to automatic emotion classification from EEG data. People’s emotional states are crucial factors in how they behave and interact physiologically. The diagnosis of patients’ mental disorders is one potential medical use. When feeling well, people work and communicate more effectively. Negative emotions can be detrimental to both physical and mental health. Many earlier studies that investigated the use of the electroencephalogram (EEG) for emotion classification have focused on collecting data from the whole brain because of the rapidly developing science of machine learning. However, researchers cannot understand how various emotional states and EEG traits are related. This work seeks to classify EEG signals’ positive, negative, and neutral emotional states by using a stacking-ensemble-based classification model that boosts accuracy to increase the efficacy of emotion classification using EEG. The selected features are used to train a model that was created using a random forest, light gradient boosting machine, and gradient-boosting-based stacking ensemble classifier (RLGB-SE), where the base classifiers random forest (RF), light gradient boosting machine (LightGBM), and gradient boosting classifier (GBC) were used at level 0. The meta classifier (RF) at level 1 is trained using the results from each base classifier to acquire the final predictions. The suggested ensemble model achieves a greater classification accuracy of 99.55%. Additionally, while comparing performance indices, the suggested technique outperforms as compared with the base classifiers. Comparing the proposed stacking strategy to state-of-the-art techniques, it can be seen that the performance for emotion categorization is promising. Full article
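A minimal sketch of an RLGB-SE-style stacking ensemble using scikit-learn's StackingClassifier and the lightgbm package; the hyperparameters are illustrative, and the feature matrix and labels are random placeholders standing in for the selected EEG features and the positive/negative/neutral classes.

```python
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.model_selection import cross_val_score

# Level-0 learners mirror the RLGB-SE description; the level-1 meta learner is a
# random forest trained on the base classifiers' out-of-fold predictions.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ("lgbm", LGBMClassifier(random_state=0)),
        ("gbc", GradientBoostingClassifier(random_state=0)),
    ],
    final_estimator=RandomForestClassifier(n_estimators=200, random_state=0),
    cv=5,
)

# Placeholder data; the paper's EEG feature-selection step is not reproduced here.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
y = rng.integers(0, 3, size=300)
print(cross_val_score(stack, X, y, cv=5).mean())
```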

15 pages, 5174 KiB  
Article
A Comparison of Machine Learning Algorithms and Feature Sets for Automatic Vocal Emotion Recognition in Speech
by Cem Doğdu, Thomas Kessler, Dana Schneider, Maha Shadaydeh and Stefan R. Schweinberger
Sensors 2022, 22(19), 7561; https://doi.org/10.3390/s22197561 - 6 Oct 2022
Cited by 12 | Viewed by 3599
Abstract
Vocal emotion recognition (VER) in natural speech, often referred to as speech emotion recognition (SER), remains challenging for both humans and computers. Applied fields including clinical diagnosis and intervention, social interaction research or Human Computer Interaction (HCI) increasingly benefit from efficient VER algorithms. Several feature sets were used with machine-learning (ML) algorithms for discrete emotion classification. However, there is no consensus for which low-level-descriptors and classifiers are optimal. Therefore, we aimed to compare the performance of machine-learning algorithms with several different feature sets. Concretely, seven ML algorithms were compared on the Berlin Database of Emotional Speech: Multilayer Perceptron Neural Network (MLP), J48 Decision Tree (DT), Support Vector Machine with Sequential Minimal Optimization (SMO), Random Forest (RF), k-Nearest Neighbor (KNN), Simple Logistic Regression (LOG) and Multinomial Logistic Regression (MLR) with 10-fold cross validation using four openSMILE feature sets (i.e., IS-09, emobase, GeMAPS and eGeMAPS). Results indicated that SMO, MLP and LOG show better performance (reaching to 87.85%, 84.00% and 83.74% accuracies, respectively) compared to RF, DT, MLR and KNN (with minimum 73.46%, 53.08%, 70.65% and 58.69% accuracies, respectively). Overall, the emobase feature set performed best. We discuss the implications of these findings for applications in diagnosis, intervention or HCI. Full article
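The comparison protocol can be approximated with scikit-learn analogues of the classifiers (the paper used Weka implementations such as SMO and J48) evaluated with 10-fold cross-validation; the sketch below runs on random placeholder features standing in for openSMILE functionals (e.g., the 88-dimensional eGeMAPS set), so the accuracies it prints are meaningless except as a demonstration of the loop.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder utterance-level feature vectors; in practice these would be
# openSMILE functionals (eGeMAPS, emobase, IS-09, ...) extracted per audio file.
rng = np.random.default_rng(0)
X = rng.normal(size=(535, 88))           # 535 utterances, as in Emo-DB, for scale
y = rng.integers(0, 7, size=535)         # seven emotion categories

classifiers = {
    "SVM (SMO analogue)": SVC(),
    "MLP": MLPClassifier(max_iter=1000),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
}
for name, clf in classifiers.items():
    scores = cross_val_score(make_pipeline(StandardScaler(), clf), X, y, cv=10)
    print(f"{name}: {scores.mean():.3f}")
```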

13 pages, 784 KiB  
Article
Subjective Evaluation of Basic Emotions from Audio–Visual Data
by Sudarsana Reddy Kadiri and Paavo Alku
Sensors 2022, 22(13), 4931; https://doi.org/10.3390/s22134931 - 29 Jun 2022
Cited by 3 | Viewed by 1982
Abstract
Understanding of the perception of emotions or affective states in humans is important to develop emotion-aware systems that work in realistic scenarios. In this paper, the perception of emotions in naturalistic human interaction (audio–visual data) is studied using perceptual evaluation. For this purpose, a naturalistic audio–visual emotion database collected from TV broadcasts such as soap-operas and movies, called the IIIT-H Audio–Visual Emotion (IIIT-H AVE) database, is used. The database consists of audio-alone, video-alone, and audio–visual data in English. Using data of all three modes, perceptual tests are conducted for four basic emotions (angry, happy, neutral, and sad) based on category labeling and for two dimensions, namely arousal (active or passive) and valence (positive or negative), based on dimensional labeling. The results indicated that the participants’ perception of emotions was remarkably different between the audio-alone, video-alone, and audio–video data. This finding emphasizes the importance of emotion-specific features compared to commonly used features in the development of emotion-aware systems. Full article

12 pages, 2420 KiB  
Article
EEG Connectivity during Active Emotional Musical Performance
by Mahrad Ghodousi, Jachin Edward Pousson, Aleksandras Voicikas, Valdis Bernhofs, Evaldas Pipinis, Povilas Tarailis, Lana Burmistrova, Yuan-Pin Lin and Inga Griškova-Bulanova
Sensors 2022, 22(11), 4064; https://doi.org/10.3390/s22114064 - 27 May 2022
Cited by 5 | Viewed by 2639
Abstract
The neural correlates of intentional emotion transfer by the music performer are not well investigated as the present-day research mainly focuses on the assessment of emotions evoked by music. In this study, we aim to determine whether EEG connectivity patterns can reflect differences in information exchange during emotional playing. The EEG data were recorded while subjects were performing a simple piano score with contrasting emotional intentions and evaluated the subjectively experienced success of emotion transfer. The brain connectivity patterns were assessed from the EEG data using the Granger Causality approach. The effective connectivity was analyzed in different frequency bands—delta, theta, alpha, beta, and gamma. The features that (1) were able to discriminate between the neutral baseline and the emotional playing and (2) were shared across conditions, were used for further comparison. The low frequency bands—delta, theta, alpha—showed a limited number of connections (4 to 6) contributing to the discrimination between the emotional playing conditions. In contrast, a dense pattern of connections between regions that was able to discriminate between conditions (30 to 38) was observed in beta and gamma frequency ranges. The current study demonstrates that EEG-based connectivity in beta and gamma frequency ranges can effectively reflect the state of the networks involved in the emotional transfer through musical performance, whereas utility of the low frequency bands (delta, theta, alpha) remains questionable. Full article
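A small sketch of band-limited Granger-causality estimation between two EEG channels, using SciPy for band-pass filtering and statsmodels' grangercausalitytests; this is a simplified stand-in for the connectivity analysis above, and the channel names, sampling rate, and lag order are assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt
from statsmodels.tsa.stattools import grangercausalitytests

def bandpass(x, low, high, fs=256, order=4):
    """Zero-phase band-pass filter for a single EEG channel."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

def granger_strength(source, target, low, high, fs=256, max_lag=5):
    """F-statistic of 'source Granger-causes target' within one frequency band."""
    data = np.column_stack([bandpass(target, low, high, fs),
                            bandpass(source, low, high, fs)])
    result = grangercausalitytests(data, maxlag=max_lag, verbose=False)
    return result[max_lag][0]["ssr_ftest"][0]

# Hypothetical usage with two EEG channels (1-D arrays) in the beta band (13-30 Hz):
# strength = granger_strength(eeg["F3"], eeg["P4"], 13, 30)
```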

25 pages, 1078 KiB  
Article
AttendAffectNet–Emotion Prediction of Movie Viewers Using Multimodal Fusion with Self-Attention
by Ha Thi Phuong Thao, B T Balamurali, Gemma Roig and Dorien Herremans
Sensors 2021, 21(24), 8356; https://doi.org/10.3390/s21248356 - 14 Dec 2021
Cited by 7 | Viewed by 4159
Abstract
In this paper, we tackle the problem of predicting the affective responses of movie viewers, based on the content of the movies. Current studies on this topic focus on video representation learning and fusion techniques to combine the extracted features for predicting affect. Yet, these typically, while ignoring the correlation between multiple modality inputs, ignore the correlation between temporal inputs (i.e., sequential features). To explore these correlations, a neural network architecture—namely AttendAffectNet (AAN)—uses the self-attention mechanism for predicting the emotions of movie viewers from different input modalities. Particularly, visual, audio, and text features are considered for predicting emotions (and expressed in terms of valence and arousal). We analyze three variants of our proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies the self-attention mechanism in an innovative way on the features extracted from the different modalities (including video, audio, and movie subtitles) of a whole movie to, thereby, capture the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account. In the Temporal AAN, self-attention is applied on the concatenated (multimodal) feature vectors representing different subsequent movie segments. In the Mixed AAN, we combine the strong points of the Feature AAN and the Temporal AAN, by applying self-attention first on vectors of features obtained from different modalities in each movie segment and then on the feature representations of all subsequent (temporal) movie segments. We extensively trained and validated our proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset. Our experiments demonstrate that audio features play a more influential role than those extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. The models that use all visual, audio, and text features simultaneously as their inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed other AAN variants on the above-mentioned datasets, highlighting the importance of taking different features as context to one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension. Full article
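In the spirit of the Feature AAN variant, the sketch below applies a self-attention encoder over one feature token per modality and regresses valence and arousal in PyTorch; the feature dimensionalities, number of layers, and mean pooling are assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class FeatureSelfAttentionFusion(nn.Module):
    """Self-attention over per-modality feature vectors (video, audio, text)."""
    def __init__(self, dims=(2048, 1582, 768), d_model=256, heads=4):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        layer = nn.TransformerEncoderLayer(d_model, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)   # valence, arousal

    def forward(self, video_feat, audio_feat, text_feat):
        # Stack one projected token per modality, then let attention relate them.
        tokens = torch.stack(
            [p(f) for p, f in zip(self.proj, (video_feat, audio_feat, text_feat))],
            dim=1)                           # (batch, 3 modalities, d_model)
        fused = self.encoder(tokens).mean(dim=1)
        return self.head(fused)

model = FeatureSelfAttentionFusion()
out = model(torch.randn(8, 2048), torch.randn(8, 1582), torch.randn(8, 768))
print(out.shape)  # torch.Size([8, 2])
```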

13 pages, 1387 KiB  
Article
EEG-Based Emotion Recognition by Convolutional Neural Network with Multi-Scale Kernels
by Tran-Dac-Thinh Phan, Soo-Hyung Kim, Hyung-Jeong Yang and Guee-Sang Lee
Sensors 2021, 21(15), 5092; https://doi.org/10.3390/s21155092 - 27 Jul 2021
Cited by 25 | Viewed by 4036
Abstract
Besides facial or gesture-based emotion recognition, Electroencephalogram (EEG) data have been drawing attention thanks to their capability in countering the effect of deceptive external expressions of humans, like faces or speeches. Emotion recognition based on EEG signals heavily relies on the features and their delineation, which requires the selection of feature categories converted from the raw signals and types of expressions that could display the intrinsic properties of an individual signal or a group of them. Moreover, the correlation or interaction among channels and frequency bands also contain crucial information for emotional state prediction, and it is commonly disregarded in conventional approaches. Therefore, in our method, the correlation between 32 channels and frequency bands were put into use to enhance the emotion prediction performance. The extracted features chosen from the time domain were arranged into feature-homogeneous matrices, with their positions following the corresponding electrodes placed on the scalp. Based on this 3D representation of EEG signals, the model must have the ability to learn the local and global patterns that describe the short and long-range relations of EEG channels, along with the embedded features. To deal with this problem, we proposed the 2D CNN with different kernel-size of convolutional layers assembled into a convolution block, combining features that were distributed in small and large regions. Ten-fold cross validation was conducted on the DEAP dataset to prove the effectiveness of our approach. We achieved the average accuracies of 98.27% and 98.36% for arousal and valence binary classification, respectively. Full article
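A minimal PyTorch sketch of the multi-scale-kernel idea: parallel 2D convolutions with different kernel sizes over a feature-by-electrode grid, concatenated and classified; the grid size, channel counts, and classification head are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    """Parallel 2D convolutions with kernel sizes 1, 3, and 5 over the electrode grid."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, k, padding=k // 2) for k in (1, 3, 5))

    def forward(self, x):
        # Concatenate small- and large-receptive-field responses along channels.
        return torch.cat([b(x) for b in self.branches], dim=1)

class EEGEmotionCNN(nn.Module):
    def __init__(self, feature_maps=8, n_classes=2):
        super().__init__()
        self.block = MultiScaleConvBlock(feature_maps, 16)
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                        nn.Linear(16 * 3, n_classes))

    def forward(self, x):          # x: (batch, time-domain features, 9, 9) scalp grid
        return self.classifier(torch.relu(self.block(x)))

model = EEGEmotionCNN()
print(model(torch.randn(4, 8, 9, 9)).shape)  # torch.Size([4, 2]), e.g. low/high arousal
```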

22 pages, 7290 KiB  
Article
Deep-Learning-Based Multimodal Emotion Classification for Music Videos
by Yagya Raj Pandeya, Bhuwan Bhattarai and Joonwhoan Lee
Sensors 2021, 21(14), 4927; https://doi.org/10.3390/s21144927 - 20 Jul 2021
Cited by 48 | Viewed by 8077
Abstract
Music videos contain a great deal of visual and acoustic information. Each information source within a music video influences the emotions conveyed through the audio and video, suggesting that only a multimodal approach is capable of achieving efficient affective computing. This paper presents an affective computing system that relies on music, video, and facial expression cues, making it useful for emotional analysis. We applied the audio–video information exchange and boosting methods to regularize the training process and reduced the computational costs by using a separable convolution strategy. In sum, our empirical findings are as follows: (1) Multimodal representations efficiently capture all acoustic and visual emotional clues included in each music video, (2) the computational cost of each neural network is significantly reduced by factorizing the standard 2D/3D convolution into separate channels and spatiotemporal interactions, and (3) information-sharing methods incorporated into multimodal representations are helpful in guiding individual information flow and boosting overall performance. We tested our findings across several unimodal and multimodal networks against various evaluation metrics and visual analyzers. Our best classifier attained 74% accuracy, an f1-score of 0.73, and an area under the curve score of 0.926. Full article
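The parameter saving behind the separable-convolution strategy can be seen in a short PyTorch sketch that factorizes a standard convolution into depthwise and pointwise steps; the channel counts are arbitrary examples, not the paper's networks.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise (per-channel spatial) step followed by a pointwise (1x1
    channel-mixing) step, cutting parameters and FLOPs versus a full convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def param_count(module):
    return sum(p.numel() for p in module.parameters())

standard = nn.Conv2d(64, 128, 3, padding=1)
separable = SeparableConv2d(64, 128)
print(param_count(standard), param_count(separable))  # separable is far smaller
```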

17 pages, 1660 KiB  
Article
Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion
by Baijun Xie, Mariia Sidulova and Chung Hyuk Park
Sensors 2021, 21(14), 4913; https://doi.org/10.3390/s21144913 - 19 Jul 2021
Cited by 56 | Viewed by 8057
Abstract
Decades of scientific research have been conducted on developing and evaluating methods for automated emotion recognition. With exponentially growing technology, there is a wide range of emerging applications that require emotional state recognition of the user. This paper investigates a robust approach for multimodal emotion recognition during a conversation. Three separate models for audio, video and text modalities are structured and fine-tuned on the MELD. In this paper, a transformer-based crossmodality fusion with the EmbraceNet architecture is employed to estimate the emotion. The proposed multimodal network architecture can achieve up to 65% accuracy, which significantly surpasses any of the unimodal models. We provide multiple evaluation techniques applied to our work to show that our model is robust and can even outperform the state-of-the-art models on the MELD. Full article
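A simplified sketch of EmbraceNet-style probabilistic fusion in PyTorch, in which each coordinate of the fused vector is drawn from exactly one modality; the docking dimension, modality feature sizes, and seven-class output are assumptions, and the transformer encoders used upstream in the paper are omitted.

```python
import torch
import torch.nn as nn

class EmbraceFusion(nn.Module):
    """EmbraceNet-style fusion (simplified): per-sample, each output coordinate
    is taken from one randomly chosen modality after 'docking' to a shared size."""
    def __init__(self, in_dims, d_embrace=256, n_classes=7):
        super().__init__()
        self.docking = nn.ModuleList(nn.Linear(d, d_embrace) for d in in_dims)
        self.classifier = nn.Linear(d_embrace, n_classes)

    def forward(self, feats, p=None):
        docked = torch.stack([d(f) for d, f in zip(self.docking, feats)], dim=1)
        batch, n_mod, dim = docked.shape
        if p is None:                                  # equal modality probabilities
            p = torch.full((batch, n_mod), 1.0 / n_mod, device=docked.device)
        choice = torch.multinomial(p, dim, replacement=True)        # (batch, dim)
        mask = nn.functional.one_hot(choice, n_mod).permute(0, 2, 1).float()
        return self.classifier((docked * mask).sum(dim=1))

fusion = EmbraceFusion(in_dims=(512, 768, 300))        # audio, text, video features
logits = fusion([torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 300)])
print(logits.shape)  # torch.Size([4, 7])
```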

29 pages, 2238 KiB  
Article
Context-Aware Emotion Recognition in the Wild Using Spatio-Temporal and Temporal-Pyramid Models
by Nhu-Tai Do, Soo-Hyung Kim, Hyung-Jeong Yang, Guee-Sang Lee and Soonja Yeom
Sensors 2021, 21(7), 2344; https://doi.org/10.3390/s21072344 - 27 Mar 2021
Cited by 4 | Viewed by 3629
Abstract
Emotion recognition plays an important role in human–computer interactions. Recent studies have focused on video emotion recognition in the wild and have run into difficulties related to occlusion, illumination, complex behavior over time, and auditory cues. State-of-the-art methods use multiple modalities, such as frame-level, spatiotemporal, and audio approaches. However, such methods have difficulties in exploiting long-term dependencies in temporal information, capturing contextual information, and integrating multi-modal information. In this paper, we introduce a multi-modal flexible system for video-based emotion recognition in the wild. Our system tracks and votes on significant faces corresponding to persons of interest in a video to classify seven basic emotions. The key contribution of this study is that it proposes the use of face feature extraction with context-aware and statistical information for emotion recognition. We also build two model architectures to effectively exploit long-term dependencies in temporal information with a temporal-pyramid model and a spatiotemporal model with “Conv2D+LSTM+3DCNN+Classify” architecture. Finally, we propose the best selection ensemble to improve the accuracy of multi-modal fusion. The best selection ensemble selects the best combination from spatiotemporal and temporal-pyramid models to achieve the best accuracy for classifying the seven basic emotions. In our experiment, we take benchmark measurement on the AFEW dataset with high accuracy. Full article
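As a rough stand-in for the spatiotemporal branch described above, the sketch below feeds per-frame CNN features to an LSTM and classifies seven emotions in PyTorch; layer sizes and input resolution are invented, and the 3D-CNN, temporal-pyramid, and best-selection-ensemble stages are omitted.

```python
import torch
import torch.nn as nn

class Conv2DLSTMClassifier(nn.Module):
    """Frame-level 2D-CNN features aggregated over time by an LSTM, then a
    seven-way classifier (a simplified 'Conv2D+LSTM+Classify' stand-in)."""
    def __init__(self, n_classes=7, hidden=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, clips):                    # clips: (batch, time, 3, H, W)
        b, t = clips.shape[:2]
        frame_feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        _, (h_n, _) = self.lstm(frame_feats)     # last hidden state summarizes the clip
        return self.fc(h_n[-1])

model = Conv2DLSTMClassifier()
print(model(torch.randn(2, 16, 3, 112, 112)).shape)  # torch.Size([2, 7])
```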

25 pages, 11541 KiB  
Article
CorrNet: Fine-Grained Emotion Recognition for Video Watching Using Wearable Physiological Sensors
by Tianyi Zhang, Abdallah El Ali, Chen Wang, Alan Hanjalic and Pablo Cesar
Sensors 2021, 21(1), 52; https://doi.org/10.3390/s21010052 - 24 Dec 2020
Cited by 38 | Viewed by 7872
Abstract
Recognizing user emotions while they watch short-form videos anytime and anywhere is essential for facilitating video content customization and personalization. However, most works either classify a single emotion per video stimuli, or are restricted to static, desktop environments. To address this, we propose a correlation-based emotion recognition algorithm (CorrNet) to recognize the valence and arousal (V-A) of each instance (fine-grained segment of signals) using only wearable, physiological signals (e.g., electrodermal activity, heart rate). CorrNet takes advantage of features both inside each instance (intra-modality features) and between different instances for the same video stimuli (correlation-based features). We first test our approach on an indoor-desktop affect dataset (CASE), and thereafter on an outdoor-mobile affect dataset (MERCA) which we collected using a smart wristband and wearable eyetracker. Results show that for subject-independent binary classification (high-low), CorrNet yields promising recognition accuracies: 76.37% and 74.03% for V-A on CASE, and 70.29% and 68.15% for V-A on MERCA. Our findings show: (1) instance segment lengths between 1–4 s result in highest recognition accuracies (2) accuracies between laboratory-grade and wearable sensors are comparable, even under low sampling rates (≤64 Hz) (3) large amounts of neutral V-A labels, an artifact of continuous affect annotation, result in varied recognition performance. Full article
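A toy sketch of the two feature families CorrNet combines: simple intra-modality statistics per fixed-length instance, plus a correlation-based cue computed across instances of the same stimulus; the signal, sampling rate, and hand-picked statistics below are placeholders, not the paper's learned representations.

```python
import numpy as np

def instance_features(signal, fs=64, instance_len=2.0):
    """Split one physiological signal into fixed-length instances and compute
    simple intra-modality statistics (a stand-in for CorrNet's learned encoder)."""
    step = int(fs * instance_len)
    instances = [signal[i:i + step] for i in range(0, len(signal) - step + 1, step)]
    return np.array([[seg.mean(), seg.std(), seg.min(), seg.max()] for seg in instances])

def correlation_features(feats):
    """Correlate each instance's feature vector with the average over all instances
    of the same video stimulus: one simple 'between-instance' cue."""
    reference = feats.mean(axis=0)
    return np.array([np.corrcoef(f, reference)[0, 1] for f in feats])

# Hypothetical electrodermal-activity trace sampled at 64 Hz for one 2-minute stimulus:
eda = np.random.default_rng(2).normal(size=64 * 120)
intra = instance_features(eda)                         # (n_instances, 4)
between = correlation_features(intra)                  # (n_instances,)
features = np.column_stack([intra, between])           # input to a V-A classifier
```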
