1. Introduction
Emotion is a human affective state that arises in response to stimuli from the external environment or from interpersonal interaction. Understanding and quantifying human emotional states has major implications for intelligent human-machine systems. In 1997, Picard and Healey proposed equipping wearers with sensors to record physiological signals, identifying the wearer's emotional state from those signals, and improving the human-computer interaction experience through affective computing [1]; the paper predicted that sensors would eventually become small enough for a wearable device that performs real-time emotion recognition.
Because physiological signals are accessible, hard to fake, and continuously measurable [2], they are a hot topic of current research. Physiological signals fall into two categories: signals originating from the peripheral nervous system and signals from the central nervous system. Compared with Electroencephalogram (EEG) signals, the combination of Electrocardiogram (ECG) and Galvanic Skin Response (GSR) is less explored in the literature. ECG and GSR are rich in emotional information and can be acquired with low-cost, non-invasive devices, which makes them highly significant for affective computing. ECG has been shown to be a reliable source of information for emotion recognition systems [3,4,5]; ECG analysis can identify emotional states of users such as happiness, sadness, and stress. The GSR signal is a non-stationary signal, usually measured at the palm of the hand, and is composed of two components: a tonic component and a phasic component. The tonic component indicates the general level of skin conductance and varies slowly over time. The phasic component appears as sharper peaks riding on the tonic drift; it is usually caused by instantaneous sympathetic activation in response to a stimulus and can reflect changes in cognitive and emotional processes [6,7,8]. Several studies [9,10,11,12] have shown that an adequate combination of information extracted from multiple modalities can improve robustness to noisy inputs. Therefore, this paper focuses on analyzing GSR and ECG. In many real-life scenarios (e.g., healthcare), the classification model is a key factor in decision-making. For applications in these areas, affective computing systems must be able to describe the uncertainty of their emotional state outputs, and the arousal and valence dimensions are the best options [13,14]. Therefore, the binary high/low classification problem is considered in this study [15]. The affective computing task can be accomplished with two types of models: deep learning models and traditional machine learning models. Deep learning methods have had great success in the field of pattern recognition, and more and more researchers are applying them to affective computing tasks [16], for example with new deep learning models [17]; many innovative models have also emerged on the machine learning side. Affective computing plays an important role in healthcare [18], education [19], and entertainment [20], and its deeper value deserves to be explored.
Deep learning and machine learning methods for affective computing currently each have their own advantages, but the deeper reasons for the strengths and weaknesses of the two model families still need to be summarized. The effectiveness of feature selection directly affects the achievable accuracy in affective computing; using the joint mutual information (JMI) of multidimensional features as a direct measure of feature validity can effectively improve the rationality and effectiveness of feature selection. Recent research in affective computing has focused on improving accuracy while ignoring the time dimension, yet the time required for affective computing is an important factor in the human-computer interaction experience. In response to these deficiencies, this paper carries out the following work. We use a deep learning model and a machine learning model to process the ECG and GSR modalities of the AMIGOS dataset, respectively, focusing on the advantages and disadvantages of both models and deriving a model architecture with high recognition accuracy. A JMI-based greedy feature selection algorithm is proposed for feature-level fusion, to analyze which features extracted from ECG and GSR are best suited to the affective computing task. In addition, focusing on the time dimension of affective computing, we propose a new terminal-edge-cloud computing architecture. We organize a realistic scenario experiment based on the proposed architecture, using online education as the experimental scenario; the methods proposed above are used to analyze the experimentally collected physiological database, and promising results are obtained.
The paper is organized as follows:
Section 2 reviews the literature related to physiological signal-based emotion computing.
Section 3 describes the feature selection algorithm proposed in this paper and the machine learning and deep learning emotion classification methods, using the AMIGOS dataset to validate their effectiveness.
Section 4 describes the novel computing architecture, verifies its advantages in the time dimension of affective computing, and designs an online learning scenario experiment that builds an emotion database to verify the advantages of the proposed method and computing architecture. Finally,
Section 5 and
Section 6 present the discussion and conclusions drawn from the experiments in this study.
2. Related Work
Changes in physiological signals are influenced by human emotions, and since the advent of non-invasive devices that can collect human physiological signals in real time, many efforts have been made to analyze them. Public datasets such as DEAP [21], SEED (2015) [22], and AMIGOS [23] were proposed first, followed by a series of emotion recognition models to analyze them. Zheng [24] studied the arousal space in four quadrants and solved a four-class task using the graph regularized extreme learning machine (GELM) method, obtaining about 70% accuracy in the multi-class classification task. When data are incomplete, semi-supervised learning can be used: integrating a Stacked Auto-Encoder (SAE) with deep belief networks (DBN) through decision fusion and Bayesian-inference-based classification [25] yielded 73.1% accuracy for arousal and 78.8% for valence. Another recent GSR-based framework [26] used temporal and spectral features with an SVM (RBF kernel) on the AMIGOS dataset, reporting 83.9% and 65% recognition accuracy for arousal and valence, respectively. A newer trend in emotion recognition uses deep neural networks (DNNs) to process physiological signals and improve recognition rates. One of the earliest attempts was [27], which proposed a multimodal residual LSTM for emotion recognition (MMResLSTM) and obtained encouraging results, with classification accuracies of 92.87% for arousal and 92.30% for valence on the DEAP dataset. Ref. [28] processed ECG and GSR data from the AMIGOS dataset with both machine learning methods and a DCNN, obtaining accuracies of 0.76 for valence and 0.75 for arousal. A recent study by Yang [29] fused statistical features extracted from the EEG, ECG, and GSR of the AMIGOS dataset and reported recognition rates of 67% and 68.8% for valence and arousal, respectively, using an SVM classifier. An LSTM-RNN with an attention-based mechanism was recently proposed [30] for the AMIGOS dataset, reporting recognition rates of 79.4% and 83.3% for the binary classification of valence and arousal. Four-class emotion recognition results have also become progressively more common; however, the reported recognition rates drop considerably in the four-class case [31]. Granados [32] proposed a one-dimensional convolutional neural network to analyze the ECG and GSR signals in the AMIGOS dataset, with an accuracy of 65.25% on the four-class arousal-valence emotion recognition task.
The features extracted from physiological signals are the most important aspect of emotion recognition. Processing is carried out in the time domain, the frequency domain, or the nonlinear domain. Time domain methods use various mathematical/statistical features such as the mean [33] or median, or methods such as sample differences and zero crossings. In the frequency domain, the Fourier transform (FT) [34] and the wavelet transform [35] are widely used. The FT allows time-based features of the signal (e.g., its mean or DC component and its dominant frequency component) to be represented in the spectrum. Nonlinear domain approaches require converting the sensor signals into discrete symbolic strings, and the key to this conversion is the discretization process; once the signals are mapped to strings, exact or approximate matching and edit distances can be applied [36]. A comparison of three feature selection algorithms, Joint Mutual Information (JMI), Conditional Mutual Information Maximization (CMIM), and Dual Input Symmetric Correlation (DISR), on the AMIGOS dataset concluded that the three algorithms behave similarly and that the same number of features is needed to obtain the best accuracy for arousal and valence recognition. Which features are better therefore needs to be explored further.
To accelerate the response speed of model systems and use resources at the edge efficiently, edge computing has triggered a boom among researchers, and new computing architectures for emotion analysis are gradually attracting attention. Chen [37] designed a medical artificial intelligence framework based on data-width evolution and self-learning, which aims to provide medical services for skin diseases that meet requirements for real-time response, scalability, and personalization; this computational framework allows physicians to quickly obtain a patient's skin condition. Edge AI technology has been applied to analyze thermal imaging data of buildings for rapid analysis of house occupancy information [38]. In Ref. [39], the authors proposed Smart Edgent, a collaborative on-demand DNN co-inference framework with device-edge synergy that can split the network to run inference faster and efficiently use other node resources. Few studies have proposed methods that apply affective computing to education; in [40], a dynamic difficulty adjustment mechanism for computer games is proposed that provides a tailored gaming experience to individual users by analyzing ECG and GSR.
Our Contribution
We propose extracting features from ECG and GSR and using the proposed JMI-Score algorithm to compute the feature set best matched to the current emotion classification task. The machine learning model parameters were optimized to obtain the optimal model, the features extracted automatically by the CNN model were compared with manually extracted features, and the accuracy of the emotion classification results improved on the state of the art. We propose a new computing architecture that leverages both edge-side and terminal-side computing resources to speed up emotion recognition and reduce network bandwidth usage and recognition latency. We also organize field experiments to verify the effectiveness of the novel computing architecture and the proposed affective computing model in the context of online learning.
3. Method
3.1. Experimental Data Description
The following paragraph describes the AMIGOS dataset in condensed form. In this paper, the recently released AMIGOS dataset is used to validate the model, not only because it is widely used in the recent literature on physiological signal-based emotion elicitation, but also because it uses low-cost physiological signal acquisition devices and an entirely non-invasive collection process. AMIGOS used a 14-channel Emotiv Epoc wireless headset to acquire EEG signals; peripheral physiological signals (ECG Right, ECG Left, and GSR, pre-processed at a sampling frequency of 128 Hz) recorded with non-invasive devices such as the Shimmer 2R5 ECG sensor; and frontal video (RGB), with stimulus material drawn from the MAHNOB-HCI [41] dataset as emotionally evocative material. The dataset covers both individual and group scenarios: in the first, 40 participants watched 16 short videos (<250 s in length); in the second, 17 people in an individual setting and 5 groups of 4 people each watched long videos (>14 min in length). Each trial began with a 5 s baseline signal, with the remaining signal length depending on the duration of the video. After viewing each video, participants rated arousal, valence, liking, and dominance on a scale of 1 to 9 using the Self-Assessment Manikin (SAM) [42]. A total of 12,580 video clips were annotated in this way (340 clips from 37 participants across the short- and long-video experiments). The arousal and valence scales used for these annotations are continuous, ranging from 1 (low arousal or valence) to 9 (high arousal or valence), and there is a high degree of agreement between annotators. The dataset contains 800 records; 7 subjects (ID numbers 33, 24, 23, 22, 21, 12, 9) had missing data and were considered invalid.
The dataset can be divided into four classes: low arousal low valence (LALV), high arousal low valence (HALV), low arousal high valence (LAHV), and high arousal high valence (HAHV); the threshold value for the valence and arousal dichotomies is 5. The k-means algorithm is applied to cluster the distribution of the data, and Figure 1 shows the distribution of the emotion classes in AMIGOS: purple represents LALV, blue HALV, green LAHV, and yellow HAHV.
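As a minimal illustration of this clustering step, the sketch below (with hypothetical variable names and randomly generated placeholder ratings standing in for the AMIGOS annotations) recovers four clusters from valence-arousal self-assessments with k-means and applies the threshold of 5 used for the binary labels:

```python
# Minimal sketch: cluster valence/arousal self-assessments into four
# quadrant-like classes (LALV, HALV, LAHV, HAHV) with k-means, as in Figure 1.
# `ratings` is a hypothetical (N, 2) array of [valence, arousal] scores in 1-9.
import numpy as np
from sklearn.cluster import KMeans

ratings = np.random.uniform(1, 9, size=(340, 2))  # placeholder annotations

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(ratings)
labels = kmeans.labels_  # one of four clusters per annotated clip

# The simple threshold labelling used for the binary tasks (threshold = 5):
high_valence = ratings[:, 0] > 5
high_arousal = ratings[:, 1] > 5
```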
3.2. Preprocessing
In this paper, we use deep learning and machine learning separately to process the physiological signal data and achieve effective emotion detection. GSR is a non-stationary signal; in this study it is first decomposed by empirical mode decomposition (EMD) to obtain the effective frequency range, and then pre-processed with a low-pass Butterworth filter. Since the skin conductance signal changes slowly and its effective frequency lies between 0 and 0.3 Hz, the cutoff frequency of the low-pass filter is set to 0.5 Hz at a sampling frequency of 128 Hz; the signal is then decomposed into SCR and SCL components. The ECG signal frequency range is usually 0.05-100 Hz. First, baseline drift is removed by the discrete wavelet transform, which eliminates unwanted low-frequency noise in the 0.05-1 Hz range; a Butterworth high-pass filter with a cutoff frequency of 1 Hz (sampling frequency 128 Hz) is then applied to obtain the denoised ECG signal. The noise-reduced signal is normalized by the Z-score (Equation (1)) using a sliding window of 2 s with a 1 s offset, to capture subtle changes in emotion and derive the feature vectors:

$z = \frac{x - \mu}{\sigma}$    (1)

where $z$ is the standardized data, and $\mu$ and $\sigma$ are the mean and standard deviation of the data, respectively. The data then enter the explicit or implicit feature extraction phase: the first path extracts implicit features with convolutional networks through deep learning; the second path uses machine learning methods to extract manual time- and frequency-domain features, in three steps of preprocessing, classification, and multimodal fusion, respectively.
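A condensed sketch of the GSR part of this pipeline is shown below; the 0.5 Hz cutoff, 128 Hz sampling rate, and 2 s window with 1 s offset come from the text, while the function names and the filter order are our own illustrative choices:

```python
# Sketch of the GSR preprocessing described above: low-pass Butterworth
# filtering (cutoff 0.5 Hz, fs = 128 Hz) followed by Z-score normalization
# over 2 s windows with a 1 s offset (Equation (1)).
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # sampling frequency (Hz)

def lowpass_gsr(signal, cutoff=0.5, fs=FS, order=4):
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, signal)  # zero-phase filtering

def zscore_windows(signal, win_s=2, step_s=1, fs=FS):
    win, step = win_s * fs, step_s * fs
    out = []
    for start in range(0, len(signal) - win + 1, step):
        seg = signal[start:start + win]
        out.append((seg - seg.mean()) / seg.std())  # Equation (1)
    return np.array(out)

gsr_raw = np.random.randn(60 * FS)  # placeholder 60 s recording
windows = zscore_windows(lowpass_gsr(gsr_raw))
```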
3.3. Detailed Analysis
3.3.1. Deep Learning Methods
Deep learning is an algorithm-driven, hard-to-interpret branch of machine learning used to model high-dimensional features in datasets. Recent research on physiological signal-based emotion recognition increasingly uses deep learning models and has achieved good results [43]. The deep network structure used in this study is shown in Figure 2. The CNN can be viewed as a bank of learned filters that automatically discovers SCR peaks and SCL levels in the GSR signal and specific morphological patterns of the QRS complex in the ECG. The signal dimension after CNN processing is 2304 × 528. Because the resulting features may contain noise or invalid components, and SVD is often used for dimension reduction in deep learning [44], SVD is applied to the extracted features [45]; the signal dimension becomes 268 × 528 and is fed into the fully connected layer.
Max-pooling layers alternate with the CNN layers as a regularization technique to reduce overfitting in the neural network, and the output is finally used to evaluate emotion recognition. A cross-entropy loss function is set in the fully connected layer, which measures how well the target output vector $y$ corresponds to the predicted output vector $\hat{y}$, as shown in Equation (2):

$L = -\sum_{i} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right]$    (2)
Our multi-task signal conversion and recognition network consists of 3 convolutional blocks and 3 pooling layers. The convolutional layers are shared among the different tasks, while the dense layers are task-specific, as shown in Figure 3. Each convolutional block consists of 2 × 1-D convolutional layers with the ReLU activation function, followed by a max-pooling layer of size 8. Across the convolutional blocks, we gradually increase the number of filters from 32 to 64 and 128, while the kernel size decreases from 32 to 16 and 8, respectively. Finally, at the end of the convolutional stack, global max pooling is performed. The dense head that follows consists of 2 fully connected layers with 128 hidden nodes each, followed by a sigmoid layer.
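A minimal sketch of one branch of this network in Keras is given below; the layer sizes follow the description above, while the input length and the compile settings are illustrative assumptions:

```python
# Sketch of the 1-D CNN branch described above (Figure 3): three
# convolutional blocks (filters 32/64/128, kernels 32/16/8, two conv
# layers each, max pooling of size 8), global max pooling, and a
# task-specific dense head with a sigmoid output.
from tensorflow.keras import layers, models

def build_branch(input_len=2560, channels=1):
    inp = layers.Input(shape=(input_len, channels))
    x = inp
    for filters, kernel in [(32, 32), (64, 16), (128, 8)]:
        x = layers.Conv1D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.Conv1D(filters, kernel, padding="same", activation="relu")(x)
        x = layers.MaxPooling1D(pool_size=8)(x)
    x = layers.GlobalMaxPooling1D()(x)
    # Task-specific dense head: two fully connected layers of 128 units.
    x = layers.Dense(128, activation="relu")(x)
    x = layers.Dense(128, activation="relu")(x)
    out = layers.Dense(1, activation="sigmoid")(x)  # high/low binary label
    return models.Model(inp, out)

model = build_branch()
model.compile(optimizer="adam",
              loss="binary_crossentropy",  # Equation (2)
              metrics=["accuracy"])
```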
3.3.2. Machine Learning Methods
To design a reliable emotion recognition system, it is particularly important to select appropriate and effective signal features. When designing affective computing systems, among the most important considerations for application functionality are simplicity and acceptable computational speed, which make the system suitable for real-time applications. Therefore, we use simple time-domain and frequency-domain features that do not require complex transformations or heavy computations.
Most ECG features are based on analysis of the P, Q, R, S, and T waves of the recorded signal, including several statistical features calculated from the amplitudes and widths of the P, Q, R, S, and T wavelets. Heart rate variability (HRV) is then calculated from the detected R peaks, and further features are extracted from the resulting signal, including the mean and root-mean-square deviation of the HRV. In addition, the slope of a linear regression fitted to the occurrence times of the R peaks is calculated from the inter-beat intervals (IBI). Based on [46], wavelet decomposition coefficients were also extracted, using an 8th-order Daubechies wavelet applied to detect and align the R peaks.
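The sketch below illustrates R-peak-based HRV feature extraction under simplifying assumptions: a plain amplitude threshold replaces the wavelet-based R-peak alignment used in the paper, and the function and feature names are ours:

```python
# Sketch of R-peak-based ECG features: detect R peaks, derive inter-beat
# intervals (IBI), and compute simple HRV statistics and the IBI slope.
import numpy as np
from scipy.signal import find_peaks

FS = 128  # Hz

def hrv_features(ecg):
    peaks, _ = find_peaks(ecg, height=np.percentile(ecg, 98),
                          distance=int(0.4 * FS))  # ~0.4 s refractory period
    ibi = np.diff(peaks) / FS                      # inter-beat intervals (s)
    hr = 60.0 / ibi                                # instantaneous heart rate
    return {
        "mean_hr": hr.mean(),
        "sdnn": ibi.std(),                             # std of IBIs
        "rmssd": np.sqrt(np.mean(np.diff(ibi) ** 2)),  # successive-difference RMS
        "ibi_slope": np.polyfit(np.arange(len(ibi)), ibi, 1)[0],
    }
```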
For GSR, features include the signal mean, standard deviation, kurtosis, and skewness (e.g., [47,48]). In other cases, researchers have focused on event-related GSR features, which describe the properties of short-term responses, such as the presence or absence of an SCR in the seconds following a stimulus (such as an image or sound). In this sense, SCRs can be automatically detected and features extracted from longer time windows: the phasic skin conductance response (SCR), the sum of SCR amplitudes, the SCR peak count, and the mean SCR rise time [49,50], as well as the tonic skin conductance level (SCL). Power Spectral Density (PSD) estimation in the frequency domain uses Welch's method, one of the most commonly used algorithms for obtaining a frequency-domain representation of a signal. Previous studies have considered statistical aspects (variance, range, signal amplitude region, skewness, kurtosis, harmonic summation) and the spectral power of five frequency bands, as well as their minimum, maximum, and variance [51].
Physiological signals change without a specific pattern and are highly random. Much of the information cannot be judged in the time domain, so the signals are also analyzed in the frequency domain. The signal spectrum is generally divided into a very low frequency band (VLF = [0.0022-0.04] Hz), a low frequency band (LF = [0.04-0.15] Hz), and a high frequency band (HF = [0.15-0.40] Hz). The PSD method extracts the spectral power of each frequency band as a spectral characteristic of the original signal:

$P_{\mathrm{band}} = \int_{f_{\mathrm{low}}}^{f_{\mathrm{high}}} S(f)\, df$

where $S(f)$ is the Welch PSD estimate and $[f_{\mathrm{low}}, f_{\mathrm{high}}]$ are the band limits. The power is calculated in the VLF, LF, and HF bands, together with the total power over the entire frequency range (TP), the ratio of LF power to HF power (LF/HF), the ratio of LF power to total power (LF/TP), LF power normalized to the sum of LF and HF power (nLF), and HF power normalized to the sum of LF and HF power (nHF).
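The sketch below estimates these band powers with Welch's method; the band limits come from the text, while the segment length is an illustrative choice (long segments are needed to resolve the VLF and LF bands):

```python
# Sketch of the Welch-based spectral features: band power in the VLF, LF,
# and HF bands plus the ratio features described above.
from scipy.signal import welch
from scipy.integrate import trapezoid

BANDS = {"VLF": (0.0022, 0.04), "LF": (0.04, 0.15), "HF": (0.15, 0.40)}

def band_powers(signal, fs=128):
    # Long segments (here up to 60 s) keep the frequency resolution fine
    # enough to resolve the low-frequency bands.
    freqs, psd = welch(signal, fs=fs, nperseg=min(len(signal), 60 * fs))
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs < hi)
        powers[name] = trapezoid(psd[mask], freqs[mask])
    tp = trapezoid(psd, freqs)  # total power (TP)
    powers.update({
        "TP": tp,
        "LF/HF": powers["LF"] / powers["HF"],
        "LF/TP": powers["LF"] / tp,
        "nLF": powers["LF"] / (powers["LF"] + powers["HF"]),
        "nHF": powers["HF"] / (powers["LF"] + powers["HF"]),
    })
    return powers
```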
Nonlinear entropy-domain features reflect the complexity and uncertainty of physiological signals and have a wide range of applications in physiological signal-based studies of emotion. The extracted entropy values help quantify the regularity of the signal, which can be applied to emotion recognition. This section applies three types of entropy-domain features: information entropy, multiscale entropy, and refined composite multiscale dispersion entropy (RCMDE) [52]. The extracted features are shown in Table 1.
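As a simple example of the entropy-domain features, the sketch below computes the information (Shannon) entropy of a binned signal; multiscale entropy and RCMDE apply related ideas to coarse-grained versions of the signal (see [52] for the exact formulation), and the bin count here is an arbitrary choice:

```python
# Sketch of the information (Shannon) entropy feature: bin the signal
# into a histogram and compute the entropy of the resulting distribution.
import numpy as np

def information_entropy(signal, bins=32):
    hist, _ = np.histogram(signal, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]                    # drop empty bins
    return -np.sum(p * np.log2(p))  # entropy in bits
```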
In total, there are 33 time-domain, 60 frequency-domain, and 3 nonlinear ECG features, and 32 time-domain, 60 frequency-domain, and 3 nonlinear GSR features. The total number of physiological signal features per window is 191.
3.4. JMI-Based Greedy Feature Selection Algorithm (JMI-Score)
In the task of emotion feature classification and recognition, the obtained high-dimensional features must undergo dimension reduction to avoid the overfitting caused by excessively high dimensionality. Therefore, a greedy feature selection algorithm based on JMI is proposed here; the specific steps of the Joint Mutual Information (JMI)-based greedy feature selection algorithm are shown in Algorithm 1 below.
JMI Introduction
Mutual information is a measure between two (possibly multidimensional) random variables $X$ and $Y$ that quantifies the amount of information about one random variable obtained through the other. It is given by:

$I(X; Y) = \sum_{i=1}^{N} \sum_{j=1}^{M} p(x_i, y_j) \log \frac{p(x_i, y_j)}{p(x_i)\, p(y_j)}$

where $x_i$ and $y_j$ are the values of $X$ and $Y$, which take $N$ and $M$ values, respectively.
JMI provides the best trade-off in terms of accuracy, stability, and flexibility, based on two assumptions:
- (1)
After removing a given feature, any unselected feature is conditionally independent of the union of the selected features, given the removed feature itself;
- (2)
Any unselected feature is conditionally independent of the union of the selected features after removing any feature, given the class label and the removed feature itself.
Under the above two assumptions, the JMI score of a candidate feature $X_k$ is obtained from the mutual information formula as

$J_{\mathrm{JMI}}(X_k) = \sum_{X_j \in S} I\big((X_k, X_j); Y\big)$

This is the information between the target $Y$ and the joint random variable $(X_k, X_j)$, defined by pairing the candidate $X_k$ with each of the previously selected features $X_j$. The candidate feature that maximizes this mutual information is selected and added to the feature subset $S$.

The maximum joint mutual information criterion is defined as follows. Let $F$ be the full feature set and let $S \subset F$ be the subset of selected features. The selected candidate is the feature that maximizes the joint mutual information shared with the class label when paired with each feature already in the subset:

$X^{*} = \arg\max_{X_k \in F \setminus S} \sum_{X_j \in S} I\big((X_k, X_j); Y\big)$
Algorithm 1 JMI-Score
Input: full feature set F, class labels C, number of features D, a simple pre-trained classification model, and the selected feature subsets S.
JMI-Score(F, C, model, S, D):
1.  Score = []
2.  for i = 1 to D:
3.      S[i] = {F_i}
4.      for each F_j in F \ S[i]:
5.          Temp = S[i]
6.          S[i] = S[i] ∪ {F_j}
7.          score = model.score(S[i], C)
8.          if score > Score[i]:
9.              Score[i] = score        (keep F_j)
10.         else:
11.             S[i] = Temp             (discard F_j)
12.     end for
13. end for
14. Sort Score, select the ten largest, and record their subscripts IDX
15. for each idx in IDX: compute the joint mutual information JMI(S[idx]; C)
16. S = the subset with the largest joint mutual information
17. Output: S
The algorithm first iterates over each feature, using that single feature as the starting set, and then traverses the remaining features. The currently selected feature set is fed into the pre-trained model; if the model score improves on this feature set, the newly traversed feature is added to the set, otherwise it is not. After all features have been traversed, the feature sets with the highest model scores are retained, on the assumption that these combinations are the most relevant to the labels and that the features within each combination carry complementary information. The joint mutual information with the labels is then calculated for the ten feature sets with the highest model scores. Since JMI selects the candidate features that maximize the cumulative sum of joint mutual information, the subset with the largest joint mutual information is selected as the final optimal subset, and the algorithm ends. The method performs well in terms of classification accuracy and stability.
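A simplified Python sketch of this procedure is shown below. It is an illustration, not the paper's exact implementation: model scoring uses 3-fold cross-validated accuracy, the joint mutual information is estimated from discretized features, and the helper names are ours. Note that the double greedy pass costs O(D²) model evaluations and is expensive for large feature sets:

```python
# Sketch of the JMI-Score greedy selection (Algorithm 1).
import numpy as np
from sklearn.model_selection import cross_val_score

def mi(x, y):
    """Mutual information I(x; y) for discrete integer arrays."""
    joint = np.histogram2d(x, y, bins=(np.unique(x).size, np.unique(y).size))[0]
    pxy = joint / joint.sum()
    px, py = pxy.sum(1, keepdims=True), pxy.sum(0, keepdims=True)
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def jmi(X_disc, subset, y):
    """Sum of I((X_k, X_j); y) over feature pairs in the subset."""
    total = 0.0
    for i, k in enumerate(subset):
        for j in subset[i + 1:]:
            # Encode the pair (X_k, X_j) as a single discrete variable.
            pair = X_disc[:, k] * (X_disc[:, j].max() + 1) + X_disc[:, j]
            total += mi(pair, y)
    return total

def jmi_score_select(X, y, model, top=10, bins=8):
    # Discretize each feature column for the MI estimates.
    X_disc = np.array([np.digitize(c, np.histogram(c, bins)[1][1:-1])
                       for c in X.T]).T
    sets, scores = [], []
    for seed in range(X.shape[1]):          # one greedy pass per seed feature
        subset, best = [seed], 0.0
        for j in range(X.shape[1]):
            if j == seed:
                continue
            trial = subset + [j]
            s = cross_val_score(model, X[:, trial], y, cv=3).mean()
            if s > best:                    # keep j only if the score improves
                subset, best = trial, s
        sets.append(subset)
        scores.append(best)
    candidates = [sets[i] for i in np.argsort(scores)[-top:]]  # ten best sets
    return max(candidates, key=lambda s: jmi(X_disc, s, y))    # max joint MI
```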
The feature types proposed in the previous section are screened in this way to reduce the features used in subsequent processing. The final size of the manual feature matrix is 123 × 528.
3.5. Results and Verification
3.5.1. Feature Selection Algorithm Verification
All extracted multimodal physiological features are subjected to Principal Component Analysis (PCA) for feature dimension reduction and then input to XGBoost for classification. A total of 10 independent folds are carried out, with the samples randomly shuffled in each fold. Taking valence as the classification label, the recognition accuracy of the two feature dimension reduction methods is shown in Figure 3. It can be seen from Figure 3 that at the beginning of dimension reduction the recognition accuracies of the two algorithms differ little. As the feature dimension decreases further, the recognition rates of both improve. The PCA algorithm achieves its best recognition at a feature dimension of 150, with a recognition rate of 75.3%; the JMI-Score feature selection algorithm achieves its best recognition at a feature dimension of only 120, with a recognition rate of 81.8%.
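A sketch of this validation protocol, with placeholder data standing in for the real feature matrix and valence labels (and assuming scikit-learn and the xgboost package are available), might look as follows:

```python
# Sketch of the validation protocol: 10 independent runs with shuffled
# samples, PCA reduction to a given dimension, and XGBoost classification
# on the valence label.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X = np.random.randn(528, 191)      # placeholder: 528 windows, 191 features
y = np.random.randint(0, 2, 528)   # placeholder valence labels

def run_fold(dim, seed):
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                          shuffle=True, random_state=seed)
    pca = PCA(n_components=dim).fit(Xtr)
    clf = XGBClassifier(n_estimators=200, eval_metric="logloss")
    clf.fit(pca.transform(Xtr), ytr)
    return clf.score(pca.transform(Xte), yte)

for dim in (150, 120):
    acc = np.mean([run_fold(dim, s) for s in range(10)])  # 10 independent runs
    print(f"PCA dim={dim}: accuracy={acc:.3f}")
```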
3.5.2. Model Validation
Table 2 reports the computational results on the AMIGOS dataset. For the second method, comparison shows that the accuracy obtained with the XGBoost algorithm, 81.8%, is higher than that of the other algorithms, because XGBoost stacks multiple classifiers and can therefore achieve a better classification effect.
From Table 2, it can be concluded that deep neural networks take longer than machine learning methods but, owing to their model characteristics, achieve better accuracy than machine learning. The computing framework proposed in this paper shows clear advantages in reducing model running time and the determine-response latency rate, in decentralization, and in the rational use of edge resources.
3.6. Accuracy Description
Table 3 shows comparative results from studies similar to this one. The feature types, feature selection algorithm, and optimal model parameters proposed in this paper are applied to the physiological data, and the results are compared with those of other studies:
4. New Computing Architectures
New computing architecture to accelerate computing: when processing data, we often rely too heavily on cloud servers, which wastes network bandwidth and consumes time. Thanks to the development of the Tensor Processing Unit (TPU), which has become a conveniently portable computing device, we propose a novel computing architecture to accelerate emotion recognition and shorten recognition time; in contrast with inputting features directly into a cloud-hosted model, we use a TPU.
The computing framework of this study comprises three layers, terminal-side, edge-side, and cloud-side, which effectively integrate the computing resources of the terminal side and the edge devices so that they work together to complete the deep learning computation. This achieves accelerated data processing while ensuring data security, user experience, and system availability; it reduces the latency of human-computer interaction and supports decentralization, while making effective and reasonable use of idle terminal-side computing resources and nearby edge-side computing resources.
Terminal-side: when raw physiological data are obtained, a pre-processing decision algorithm is run over three values of Computing Resource Utilization (CRU, Equation (8)): the terminal-side computing resources, the current cloud-side computing resources, and the predicted cloud-side resource usage. When the terminal-side CRU exceeds 0.7, the raw physiological data are uploaded directly to the cloud server, and pre-processing and the decision algorithm are run in the cloud; conversely, when terminal-side computing resources are sufficient, the feature extraction step of pre-processing is performed on the terminal side.
Edge-side: on the edge side, we deploy the feature selection algorithms to process features coming from deep learning or machine learning, and pass the streamlined features to the cloud for model decisions.
Cloud-side: on the cloud side, we collect cloud server computing resources every three seconds and use machine learning models to predict the resource occupation in the next time period, then calculate the average CRU; the corresponding decision models, such as the CNN and XGBoost, are also deployed there. A minimal sketch of the terminal-side decision logic is given below.
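Since the CRU formula (Equation (8)) is not reproduced in the text, the sketch below assumes the utilization values are already computed and encodes only the 0.7 threshold rule stated above; the function and parameter names are ours:

```python
# Sketch of the terminal-side pre-processing decision (CRU threshold 0.7).
def decide_processing(terminal_cru: float,
                      cloud_cru: float,
                      cloud_cru_predicted: float) -> str:
    """Return where pre-processing / feature extraction should run."""
    if terminal_cru > 0.7:
        # Terminal is busy: ship raw physiological data to the cloud,
        # which then runs pre-processing and the decision algorithm.
        return "cloud"
    # Terminal has spare capacity: extract features locally, then pass
    # them through edge-side feature selection to the cloud model.
    # (cloud_cru and cloud_cru_predicted feed the edge-side decision of
    # Figure 4b about whether the edge participates.)
    return "terminal"
```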
The data flow is shown in Figure 4a. When the original data are on the terminal side, pre-processing has two cases: when computing resources are sufficient, pre-processing is performed on the terminal side, the result is passed to the edge side for feature selection, and finally to the cloud model to produce results; when terminal capacity is insufficient, the raw data are pre-processed directly on the cloud side, and after feature selection on the edge side the decision is made in the cloud without the participation of the terminal side.
As Figure 4b shows, the edge side decides whether to participate in affective computing based on the network situation and cloud-side computing resources. If cloud-side computing capacity is sufficient, processing directly on the cloud side is faster than shuttling between the cloud and the edge; but the cloud side is often heavily loaded, so the edge side is taken into account and the cloud and edge compute together. The edge side runs the feature selection algorithm and inputs the selection results to the cloud side.
Table 4 shows the time elapsed between data collection and input into the pre-trained model when analyzing emotion results in the same network environment, comparing the proposed computing architecture against a cloud-only setup on the same hardware (the hardware configuration is shown in Table 5) performing the same affective computing task. Time consumption is measured with two parameters: (1) Running time: the time needed to obtain analysis results from raw data in the same network environment; (2) Determine Response Latency Rate (DRLR): with the same emotion computation time and network transmission time, the proportion of the total elapsed time, from the moment the data are sent by the sensor until the emotion result is correctly identified and returned to the user at the network edge, that is taken up by emotion recognition.
From the data in Table 4, we can see that the new computing mode gives feedback on emotional results faster. Compared with the traditional cloud-centric computing mode, its advantages are: 1. It speeds up the computation without affecting model accuracy; 2. It not only ensures data security but also realizes a decentralized computing model and makes rational use of edge resources; 3. It reduces the use of network bandwidth and innovatively integrates cloud and edge resources.
Online Learning Experiment
To verify the effectiveness of the computing architecture and algorithms, we took online learning as the scenario, collected students' physiological data during online learning, and ran the analysis on the new computing architecture proposed in this paper, obtaining considerable results of significance for future online education and medical applications. We invited 30 subjects, as shown in
Figure 5b (age range 22-26 years; 17 males and 13 females), all of whom had received more than six years of formal EFL education. The experimental equipment was arranged as in
Figure 5a (Shimmer3 ECG device, E4 wristband, Windows Core i5 laptop, ASHU 603, Hi3559A TPU, Ubuntu 32 G/4 T server). Before participating, subjects signed the required process description and gave informed consent, and the acquisition process complied with the ethical requirements of the Human Biobanking Educational Exam. The experimental context was established offline.
The experimental flow is shown in
Figure 6, and the detailed procedure is as follows:
Make sure the subjects remain calm, take a five-minute baseline test, and have them fill in their familiarity with the test questions before the experiment; the difficulty level of the test is evaluated according to this familiarity;
Show multiple-choice questions to the subjects; after each answer, the participants self-assess their arousal level and valence, and the backend selects the difficulty level of the next question according to the subject's emotional score;
The test paper contains 30 questions, and 30 min of ECG (Shimmer3 ECG device) and GSR (E4 wristband) data are collected;
After the experiment, annotators performed annotations based on the video clips, first for valence and then for arousal.
The sampling frequency of the Shimmer3 ECG device is 256 Hz, giving 13,824,000 ECG data points per subject, and the E4 wristband collects GSR at 4 Hz, giving 216,000 data points per subject. The emotion annotation includes the user self-assessments of valence and arousal and external annotation. We performed statistical analysis on the collected data; the degree of association between the variables was measured using the Pearson correlation coefficient, defined as:

$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$

where $X$ carries the ECG or GSR physiological data vector and $Y$ represents the emotion decision. The correlation between ECG and affective state is usually lower than that between GSR and affective state, which shows that the factors controlling affective state differ between subjects. The system can adjust the difficulty of the questions according to the emotions fed back by different subjects. The scientific validity and rationality of analyzing affective states from ECG and GSR is illustrated by the Pearson coefficients in
Figure 7.
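For reference, per-feature Pearson coefficients such as those in Figure 7 can be computed as in the following sketch, where the feature series and decision labels are placeholders:

```python
# Sketch: Pearson correlation between a physiological feature series
# and the emotion decision labels.
import numpy as np
from scipy.stats import pearsonr

gsr_feature = np.random.randn(200)        # placeholder feature series
labels = np.random.randint(0, 2, 200)     # placeholder valence decisions
r, p_value = pearsonr(gsr_feature, labels)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")
```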
We used the optimal model method proposed above to analyze the data. The XGBoost model achieved the highest accuracy, 80.6%, for the binary classification of arousal. Under the operating model of the new computing architecture, affective computing took an average of 5 s less than under the usual cloud-centric architecture.
5. Discussion
After analyzing the results, we observed that the method using XGBoost performed better than the other method, for two reasons. First, EEG, ECG, and GSR are continuous time signals with large memory content, and with manual feature engineering better features can be obtained using the JMI-Score algorithm. Second, machine learning can remove irrelevant features from the feature set, which deep learning cannot do; in this study, machine learning therefore has clear advantages. In addition to basic interpretability, the combined use of user-device and edge-device resources can accelerate computing without affecting accuracy, reduce processing time, and achieve a decentralized processing approach. The time reduction is not very significant; the reason is that the main purpose of the new computing architecture is to reduce the load on the cloud center and effectively use the computing resources on the edge and the terminal, while the overall computing resources have not increased significantly. Of course, the advantages of deep learning are also obvious: it avoids complex feature extraction, extracts high-dimensional features, and obtains good results. In order to seek better features, this paper extracts many features, including time-domain, frequency-domain, and nonlinear features. According to the Spearman correlation coefficient, the features are more stable for GSR, while the ECG signal has higher inter-class variability; some ECG features have low correlation coefficients and jump severely, so feature selection is especially necessary for ECG.
Compared with other studies, the amount of data collected in this experiment has increased, and the uniqueness of the decision labels needs further verification. The proposed method and framework obtained promising results, which are expected to address a problem of the epidemic era: the lack of interaction channels that most teachers and students encounter in distance education.
This study is an experiment in moving physiological signal-based affective computing toward real life. Of course, when the research method was applied to the actual scene, subjective factors such as the subjects' different educational backgrounds and different answering contexts were not considered. These are important factors, but because they are difficult to express mathematically, they were not considered in the training data and need to be studied in the future.
The model selection in machine learning is also critical. This article selects several representative models. JMI-Score is an iterative variant of JMI and relatively new; XGBoost is a widely used stacking ensemble algorithm that can overcome the limitations of a single model; Naïve Bayes is a traditional, foundational algorithm, one of the origins of machine learning, and is very representative. In this study XGBoost performed better, indicating that traditional, well-optimized machine learning is more appropriate when the amount of data is not large.
6. Conclusions
This work shows that emotion recognition can be performed with high accuracy from ECG and GSR signals. In addition, using RCMDE, a new MSE-based feature, we found that the derived features of GSR, along with the energy and zero-crossing rate of its EMD modes, allow the correct classification of the target emotional states. For the GSR signal, its stability characteristics can be used to predict the stress value, while the ECG shows strong transients and its frequency characteristics are more important for emotion recognition. Several classification models were trained in the machine learning method to select the model that maximizes accuracy. In practical applications, an emotion recognition model should focus not only on accuracy but also on timeliness: only faster feedback can improve the human interaction experience.
In this paper, the public multi-physiological-signal database AMIGOS is used as experimental data for preprocessing, feature extraction, and feature selection to verify the effectiveness of the proposed method. The stimulation materials and acquisition process of the physiological signal dataset collected in this paper are briefly introduced, and the acquired dataset is used to verify the effectiveness of the proposed method in real scenarios. For analyzing the physiological data, we first propose using discrete wavelet analysis, Butterworth filtering, and empirical mode decomposition to denoise the data. Feature engineering is divided into two categories: manual feature extraction and automatic extraction by a deep network. The machine learning path uses time-domain, frequency-domain, and nonlinear feature analysis to perform traditional feature extraction for the ECG and GSR signals with a 3 s sliding window. The deep network path extracts features automatically with convolutional neural networks, but its interpretability is low. The shallow emotional features extracted by machine learning and the deep emotional features obtained by deep learning are then processed separately.
We present a novel computational framework for affective computing; the proposed system helps make affective computing applicable to problems in daily life and helps bridge the gap between low-level physiological sensor representations and high-level, contextually relevant interpretations of human emotions. The experimental results obtained with the two optimal algorithms on the public dataset show that when the feature selection process uses the JMI-Score algorithm proposed in this paper, the dimension reduction effect is obvious, and the resulting feature sets and model parameters outperform state-of-the-art recognition rates; in fact, we observed an average 0.85% improvement in accuracy. The paper also provides an extensive analysis of feature selection, model selection, and the time dimension. The physiological signals are processed separately with deep learning and machine learning. It turns out that, after feature selection and parameter tuning, the two architectures running on the new computing system are effective for emotion recognition, that is, better than previous methods, and optimized in the time dimension as well as the computational space dimension.
Future work includes optimizing the protocols in the cloud-edge computing system, taking security and coordination further into account, applying more intelligent algorithms to the new computing architecture, and developing wearable systems for real-time emotion detection.