1. Introduction
The use of artificial intelligence (AI) in daily activities has become mainstream in recent years. Advances in computing have enabled powerful machine learning models that are laying the foundations for the future of the industrial and healthcare domains. The adoption of AI in the health sector holds considerable potential, from patient diagnostics to health monitoring and, in some cases, treatment itself [1].
Emotional intelligence strives to bridge the gap between human and machine interactions. The applications of such systems vary and are becoming more prominent as healthcare services work to provide more efficient care through smart digital health apps. One application in digital health is the incorporation of emotion recognition systems as a tool for therapeutic interventions. Emotion classification is currently being developed as a component of a closed-loop system [2] designed to aid in the therapeutic intervention of people with autism spectrum disorder (ASD).
ASD is a neuro-developmental condition that affects a person’s social skills by impairing their interaction, communication, behaviors, and interests [1,3,4]. The condition often results in additional health problems due to isolation and unemployment (or reduced employment), which can lead to depression and anxiety [4]. Estimates reveal that 1 out of 59 people is affected by ASD, comprising approximately 1–2% of the general population [4,5].
Emotions can be identified through three main components: 1—facial expressions; 2—speech and voice patterns; and 3—physiological signals. Emotion recognition perception is distributed as 55% facial, 35% speech, and 10% physiological signals [6]. Although facial expressions and speech patterns account for the majority of emotion perception, limited real-time access to these data in daily life makes them less convenient than physiological signals. Physiological signals can be accessed through electronic wearable devices (EWDs), such as smart watches, which are increasingly prevalent and directly associated with health management [7]. By comparison, screen time, including smart phone, TV, and computer usage, stands at 28.5 ± 11.6 h a week [8]. Even if a small portion of screen time were allocated to using a health app, the data collected would still be less than the amount obtainable from EWDs. Physiological signals often used to measure emotional and cognitive reactions include electrodermal activity (EDA) and the electrocardiogram (ECG) [9,10,11]. Hence, physiological signals were selected for emotion detection in this study.
For electrodermal activity, the frequency of non-specific skin conductance responses (NS.SCR) and the skin conductance level (SCL) are the parameters most frequently used. EDA is one of the most common measures in psychophysiology and covers a wide range of applications, such as the study of emotional reactions, attention, and information processing. It is measured by applying a small current through a pair of electrodes placed on the surface of the skin [12]. Two mechanisms contribute to the EDA measurement: 1—sweat secretion and 2—selective membrane activity in the epidermis. The more sweat produced, the more conductive the path becomes; as a result, the resistance decreases and a change is observed in the EDA.
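To make these two parameters concrete, the following is a minimal sketch of how SCL and the NS.SCR rate could be estimated from a raw EDA recording; the 4 Hz sampling rate, the smoothing window, and the 0.05 µS response-amplitude threshold are common choices in the literature and are assumptions, not values taken from this study.

```python
import numpy as np
from scipy.signal import find_peaks

def eda_features(eda, fs=4.0, scr_threshold=0.05, tonic_window_s=10.0):
    """Estimate SCL and NS.SCR frequency from a raw EDA signal (in microsiemens).

    Illustrative only: the smoothing window and SCR amplitude threshold
    are assumptions, not the values used in this study.
    """
    eda = np.asarray(eda, dtype=float)

    # Tonic component: moving-average smoothing over a long window.
    win = max(1, int(tonic_window_s * fs))
    tonic = np.convolve(eda, np.ones(win) / win, mode="same")

    # SCL: mean tonic skin conductance level over the analysis window.
    scl = float(np.mean(tonic))

    # Phasic component: fast fluctuations riding on the tonic level.
    phasic = eda - tonic

    # NS.SCR: count phasic peaks whose prominence exceeds the threshold,
    # expressed as responses per minute.
    peaks, _ = find_peaks(phasic, prominence=scr_threshold)
    ns_scr_per_min = len(peaks) / (len(eda) / fs / 60.0)

    return scl, ns_scr_per_min
```

More rigorous tonic/phasic decompositions exist (e.g., convex-optimization or deconvolution approaches); the simple moving-average split above is only meant to illustrate the two parameters.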
ECG is one of the most widely used non-invasive clinical diagnostic tools, providing a clear observation of the heart’s electrical behavior [13]. The ECG records the electrical activity transmitted through the body by means of electrodes attached to the skin; another relatively simple derivation option is a chest belt. This electrical activity results from the heart’s depolarization, which induces contraction at each beat [14]. The measurements are analyzed through the QRS complex, and the heart rate (HR) is subsequently derived from the peak-to-peak, i.e., RR, intervals of the ECG recording across a specific time frame. The use of ECG monitoring has increased in recent years, thanks in part to the advancement of wearable devices, such as smart watches and fitness trackers, and people’s often high adherence to their use for monitoring daily activity and workout routines in a lifestyle focused on well-being and healthy aging.
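As a concrete illustration of this derivation, the sketch below detects R peaks and converts the resulting RR intervals into HR and two common time-domain HRV descriptors; the peak-detection heuristic and sampling rate are illustrative assumptions and do not reflect the processing pipeline used in this study.

```python
import numpy as np
from scipy.signal import find_peaks

def hr_hrv_from_ecg(ecg, fs=250.0):
    """Derive HR and simple HRV descriptors from a single-lead ECG segment.

    Illustrative only: R peaks are located with a basic prominence/distance
    heuristic; the study's actual QRS detection and feature set differ.
    """
    ecg = np.asarray(ecg, dtype=float)

    # Rough R-peak detection: peaks at least 0.4 s apart with high prominence.
    min_distance = int(0.4 * fs)
    peaks, _ = find_peaks(ecg, distance=min_distance, prominence=np.std(ecg))

    # RR intervals in seconds, taken peak to peak.
    rr = np.diff(peaks) / fs

    # Mean heart rate in beats per minute: HR = 60 / RR.
    hr_bpm = 60.0 / np.mean(rr)

    # Two common time-domain HRV measures.
    sdnn = np.std(rr)                           # overall RR variability
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))  # short-term variability

    return hr_bpm, sdnn, rmssd
```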
The data used in this article were collected in a separate collaborative study on the influence of emotion induction methods on recognition [15]. The ground truth, defined as the subjectively perceived valence and arousal of each emotional category, was assessed using the self-assessment manikin (SAM) [15,16]. The data were gathered from EDA and ECG sensors attached to the non-dominant hand (thenar and hypothenar) and the chest, respectively.
In this study, the EDA—more specifically, the SCL—and the ECG-derived HR and heart rate variability (HRV) were analyzed for emotional stimulus trigger marks and assessed for the different emotional reaction stages and arousal intensities using signal processing and machine learning techniques. The features of interest required by the machine learning algorithm were extracted from the data by applying different signal processing methods, and several evaluation criteria were used to assess the predictions. The aim of this study was to establish the effectiveness of physiological signals—in this case, EDA and ECG—in characterizing reactions to emotional stimuli and identifying their stages and arousal strength.
The paper is organized with the following structure. Section 2 describes the methods used, including the data description, signal processing, network architecture, and analysis criteria. Key results are highlighted in Section 3, with their respective discussions rendered in Section 4. The conducted ablation studies are presented in Section 5, and a conclusion is drawn in Section 6.
1.1. Related Work
The challenges of detecting and recognizing human emotions have yielded different approaches and techniques, with a recent trend towards machine learning strategies. A recent search for “emotion recognition facial” and “emotion recognition physiological signal” on PubMed revealed a concentration of research on facial recognition (4825 articles) rather than physiological signals (191 articles) for emotion recognition, a ratio of ~25:1 over the last 5 years [17].
In Kakuba S. et al. (2022) [18], an attention-based multi-learning model (ABMD) utilizing residual dilated causal convolution (RDCC) blocks and dilated convolution (DC) with multi-head attention is proposed for emotion recognition from speech patterns, achieving 95.83% accuracy on the EMODB dataset, with notable robustness in distinguishing the emotion of happiness. In Yan Y. et al. (2022) [19], an AA-CBGRU network model is proposed for speech emotion recognition that combines spectrogram derivatives, convolutional neural networks with residual blocks, and a BGRU with attention layers, showing improved weighted and unweighted accuracy on the IEMOCAP sentiment corpus. In Khaireddin Y. et al. (2021) [20], the popular VGG network architecture was deployed with fine hyperparameter tuning to achieve state-of-the-art results on the FER2013 dataset [21]. A shallow dual-network architecture was introduced in Mehendale N. (2020) [22], with one framework removing background noise while the second generated point landmark features, achieving recognition accuracies of up to 96% on a combined dataset. Zhao X. et al. (2017) [23] proposed a novel peak-piloted GoogLeNet [24] architecture in which peak and non-peak emotional reactions were considered from an image sequence, with tests on the OULU-CASIA database [13] achieving up to 84.59% accuracy.
In Kim Y. et al. (2021) [25], a facial image threshing (FIT) machine for facial emotion recognition (FER) in autonomous vehicles is introduced, utilizing advanced features from pre-trained facial recognition and the Xception algorithm, resulting in a 16.95% increase in validation accuracy and a 5% improvement in real-time testing on the FER2013 dataset compared to conventional methods. In Canal F. et al. (2022) [26], a survey reviewed 94 methods from 51 papers on recognizing emotional expression from facial images, categorizing them into classical approaches and neural networks; the classical methods showed slightly better precision but lesser generalization, and the work also evaluated the strengths and weaknesses of popular datasets. In Karnati M. et al. (2023) [27], a thorough survey of deep learning-based methods for facial expression recognition (FER) is provided, discussing their components, performance, advantages, and limitations, while also examining relevant FER databases and considering the field’s future challenges and opportunities.
Although facial features provide a more distinguishable analysis of a person’s emotional response, acquiring the data is somewhat cumbersome. The extraction of relevant and appropriate features from facial expressions in images is also disputed; in particular, it is often not robust to differences in complexion, culture, and ethnicity.
Physiological signals provide more continuous real-time monitoring compared to facial expressions. Comparable studies [28,29,30,31,32,33,34,35] highlight the impact of using physiological signals for emotion detection and subsequent recognition. Shukla J. et al. (2021) [28] assessed and evaluated different feature extraction techniques for EDA signals and determined the optimal number of features required to yield high-accuracy, real-time emotion recognition. A finely hyperparameter-tuned convolutional neural network was developed in Al Machot F. et al. (2019) [29] for use in assisted living environments, using EDA signals to recognize emotions. The designed model improved the robustness of subject-independent recognition on two established datasets, achieving accuracies of 78% and 82% on the MAHNOB [36] and DEAP [37] datasets, respectively. In Veeranki Y. R. et al. (2021) [30], different time–frequency signal analysis methods were applied to the EDA signal and combined with machine learning techniques for emotion recognition, reaching area under the curve (AUC) values of 71.30% on the DEAP database [37]. In Wenqian L. et al. (2023) [38], a review was conducted on emotion recognition and judgment using physiological signals such as EEG, EDA, ECG, and EMG, discussing their technological applications and the effects achieved, and providing a comparative analysis of different signal applications, along with considerations for future research.
Heart rate (HR) monitoring with smart watches is often applied when following up on pre-existing health conditions or tracking athletes’ workout routines [7]. However, other applications, such as stress level detection and emotion recognition, have also been studied [31,39]. In Shu L. et al. (2020) [31], HR signals recorded by a smart wearable device were assessed for the recognition of paired emotions using machine learning models. The approach achieved 84% accuracy for the classification of three emotional states, using a gradient boosted decision tree algorithm on the collected dataset. Zhang Z. et al. (2016) [35] took a different approach to recognizing emotions, using accelerometer data from wearable devices. The results revealed 81.2% accuracy in classifying three emotional categories, using a support vector machine (SVM) with a radial basis function (RBF) kernel as the classifier.
A combination, more commonly known as fusion, of more than one signal for emotion recognition has also been studied, with promising results. Greco A. et al. (2019) explored the fusion of EDA signals and speech patterns to improve arousal level recognition, yielding a classifier improvement of 11.64% using an SVM classifier with recursive feature elimination [32]. Du G. et al. (2020) investigated the combination of facial expressions and HR for emotion recognition in gaming environments, increasing the recognition accuracy by 8.30% [33]. In Fernández-Aguilar L. et al. (2019) [34], the fusion of EDA and HRV signals was used for emotion classification, achieving 82.37% overall accuracy across seven emotion classes for young and elderly age groups combined, using an SVM classifier with a quadratic kernel.
Hence, both EDA and ECG signals were used in the present study for emotion identification and the subsequent determination of arousal level. This study is distinct from prior research in that it did not focus on identifying the specific emotional response but rather on identifying the physiological reaction itself and its arousal intensity. This approach offers a more detailed understanding of an individual’s level of engagement with the presented stimuli.
4. Discussion
As observed in Table 4, the selected dataset was smaller than the original, with a reduction of 20.83%. This reduction resulted from a first-stage signal analysis of the original ECG signal, where data from five subjects revealed inconsistencies in the recording. As a consequence, these samples were removed from further processing.
The distribution in Table 4 also demonstrates that there was no bias towards a particular class in the two-class system; thus, there was equal representation during the training process. However, Table 5 reveals a bias in the data towards the mid arousal strength class, which accounts for 45.49% of the total distribution, compared with 31.58% for low and 22.93% for high. This data imbalance was countered with a class-weighted loss function, as described in Section 2.5.2, which ensured the fair representation of each arousal strength class during model training.
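For illustration, a class-weighted loss of this kind can be set up as in the following sketch, which assumes inverse-frequency weights derived from the Table 5 proportions and a PyTorch cross-entropy loss; the exact weighting scheme used in this work is the one described in Section 2.5.2.

```python
import torch
import torch.nn as nn

# Class proportions reported in Table 5 (low, mid, high arousal).
class_fractions = torch.tensor([0.3158, 0.4549, 0.2293])

# Inverse-frequency weights, normalized so they average to 1.
weights = 1.0 / class_fractions
weights = weights / weights.mean()

# Cross-entropy loss that penalizes errors on rarer classes more heavily.
criterion = nn.CrossEntropyLoss(weight=weights)

# Example: logits for a batch of 4 samples over the 3 arousal classes.
logits = torch.randn(4, 3)
targets = torch.tensor([0, 1, 1, 2])
loss = criterion(logits, targets)
```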
The efficacy of the proposed model in distinguishing between the two classes of emotion and rest is highlighted in Figure 7. The results indicate that the selected features, and HRV in particular, embed suitable information for the task of distinguishing between an emotional state and a calm or resting state. The robustness of the model at this stage makes the subsequent processes in the workflow pipeline more efficient; thus, the overall error is more sensitive to the model’s ability to identify the arousal strength of a detected emotion.
The results in Figure 8 reveal the difficulty of identifying the different arousal strengths from the given dataset. One contributing factor to the comparatively high performance for the mid arousal strength could be the inherent human uncertainty or variability in how mid-range arousal is expressed. In contrast to real-life scenarios, where extreme emotions tend to offer clearer cues, the model appears particularly adept at navigating the nuances of these intermediate arousal strengths, possibly because of the complexities and ambiguities that humans exhibit when expressing them.
In addition, deep learning models address a high-dimensional problem and require significantly large datasets. Another contributing factor to the lower performance was the data imbalance, together with the limited number of total observations. The data augmentation technique of signal oversampling was not adopted, as it would have led to the model overfitting on the data.
The low representation of the high arousal strength class also indicates that the subjects were not strongly impacted by the experiment’s stimuli; thus, no significant change was present in their ECG signals. Indeed, when examining the recorded videos, which were synchronized with the physiological signal measurements, minimal to no change in the participants’ facial expressions was observed. It is therefore worth noting the need for more extensive tests, where possible, to ensure that this state is better represented in the data.
Further, the dataset used in this study was composed of real human reactions to stimuli intended to trigger the corresponding emotional response. As a result, the complexity of the classification increased, since each person behaved differently towards the same stimuli. Equally, the physiological signals differed from one person to another depending on a wide range of factors, which in turn influenced the acquired features.
In the broader context of emotion recognition, this research underscores the potential of physiological signals, specifically EDA and ECG data, in accurately detecting emotions and assessing arousal strength. The notable emotion detection accuracy of 94.19%, achieved by emphasizing key HRV descriptors, signifies a substantial advancement in the utilization of these physiological markers. The proposed pipeline, with its real-time application capability, highlights the emerging role of wearable devices in advancing digital health therapeutics. Additionally, by providing a system that can be integrated into therapeutic settings, the research paves the way for more personalized and adaptive therapeutic interventions. The methodology, especially when compared with previous works, showcases the efficacy of combining multiple physiological markers. Thus, this study adds a pivotal dimension to the ongoing discourse in emotion recognition by emphasizing real-time, wearable-device-driven insights, bridging the gap between laboratory findings and real-world therapeutic applications.
As with any research, certain limitations of the study should be noted. The signal window length for HRV feature extraction was not optimized, which could have influenced the accuracy of the derived HRV features. Additionally, the absence of hyperparameter tuning for the continuous wavelet transform (CWT) suggests that the decomposition of the signal into its constituent frequencies might not have been optimal, potentially impacting the precision of the feature extraction. Furthermore, without a detailed explicability analysis, the underlying rationale behind the model’s decisions remained challenging to decipher, which might limit its practical application. These factors collectively may constrain the generalizability of the findings.
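To make the CWT hyperparameters in question concrete, the following is a minimal sketch assuming the PyWavelets library and a Morlet mother wavelet; the wavelet family, scale range, and sampling rate shown here are illustrative choices, not those of the study’s pipeline.

```python
import numpy as np
import pywt

def cwt_scalogram(signal, fs=250.0, wavelet="morl", num_scales=64):
    """Compute a CWT scalogram of a physiological signal.

    The mother wavelet and the scale range are the hyperparameters that
    would be tuned; the values here are illustrative assumptions.
    """
    scales = np.arange(1, num_scales + 1)
    coeffs, freqs = pywt.cwt(signal, scales, wavelet,
                             sampling_period=1.0 / fs)
    # |coefficients| over (scale, time) can be fed to a model as an image.
    return np.abs(coeffs), freqs
```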
The focus of future work will be to address some of these limitations by performing an ablation study on the window length and implementing an optimization function to tune the CWT hyperparameters. To evaluate the explicability of the model, different techniques will be employed and an evaluation metric will be established for quantitative measurement.