1. Introduction
AI-assisted teaching, particularly on online distance learning platforms, has gained popularity during and after the COVID-19 pandemic. The automatic recognition of negative emotions is important for studying cognitive outcomes in these educational scenarios.
In past emotion recognition studies, basic emotion types were extensively studied in controlled, isolated laboratory environments. The relationship between emotion and cognition from a computational perspective has not, however, been thoroughly explored: only a limited number of researchers have investigated specific emotions associated with the learning and cognitive process.
Pessoa [1] examined emotions from the perspective of brain organization, proposing that the conventional categorization of affective and cognitive regions is overly simplified. By emphasizing the intricate interplay between emotion and cognition, this view underscores the necessity of obtaining a more comprehensive understanding of how the brain functions and how complex cognitive behaviors emerge. From the computational perspective, Huang et al. [2] studied the practical problem of speech emotion recognition, employing various machine learning models to capture emotion types with practical bearing, such as confidence, anxiety, and fidgety emotion. Zepf et al. [3] studied drivers’ practical emotions, including stress and other emotions related to cognitive performance. However, in previous studies, the relationship between emotions and cognitive outcomes has not been fully explored. In particular, methods for leveraging emotions to aid cognitive prediction and enhance teaching, as well as ways in which cognitive factors can improve emotion recognition, remain open questions.
Various classification methods have been investigated, including support vector machines, Gaussian mixture models, LSTM (long short-term memory), transformers, and other deep neural networks [4,5,6,7,8,9]. Shirian et al. [9] investigated a deep graph approach to speech emotion recognition, comparing various algorithms with graph learning and achieving promising results. It is worth noting, however, that only some of the fundamental emotion types were addressed in their research. Wang et al. [10,11] investigated the challenges posed by group differences and sample imbalances in emotion recognition. Their research considered distinctions associated with age and gender, and they further explored the application of deep neural networks to modeling age and gender differences for speech emotion recognition.
Feature analysis and extraction for emotion recognition have received relatively extensive research attention. Farooq et al. [12] studied feature selection algorithms for speech emotion, selecting and optimizing various features to achieve the best possible results with deep learning. Gat et al. [13] suggested speaker normalization techniques to improve emotion recognition rates; speaker normalization and self-supervised learning were investigated in detail, with experiments carried out on different databases. Tiwari et al. [14] investigated noisy speech emotion, a practical topic that has not been extensively studied. They suggested using data augmentation to improve the modeling of speech emotions and examined a generative noise model for common emotion types. Nevertheless, certain practical emotions were not addressed in their work, and further discussion is needed to explore the practical applications of noisy speech emotion recognition. Lu et al. [15] investigated the generalization of feature extraction, in contrast to domain-specific features, and successfully addressed speaker-dependent issues; the validation of effectiveness was conducted on commonly observed emotion types.
Language-dependent features are also crucial for emotion recognition. Costantini et al. [16] investigated cross-linguistic features of speech emotion, using various datasets to enable a universal comparison of speech features; they also explored different machine learning algorithms for emotion modeling and analyzed their generalization capabilities. Saad et al. [17] explored language-independent emotion recognition, focusing on cross-database and cross-language recognition challenges. They examined fundamental speech features, including pitch frequency, formant frequency, and intensity, and extended their analysis to compare these features between English and Bangla. However, feature normalization issues could have been addressed further in their work, and the authors did not investigate the relationships between these features, cognition, and personality.
In summary, conventional emotion recognition studies are still limited in methodology, focusing solely on acoustic and computational aspects, while overlooking the intricate relationship between emotion and cognitive processes. Our approach involves employing multi-scale CNNs for modeling and, from a computational perspective, studying the connection between “fidgety” emotion and cognitive processes. We explore how to leverage emotion recognition results to enhance cognitive prediction and improve online teaching.
Fidgety emotion is an important emotional category that differs from the basic categories on which traditional emotion research focuses. Fidgety is a complex emotion with practical value: it holds particular significance in the processes of learning and cognition, as it strongly influences cognitive abilities, behavioral control, and psychological stability. While traditional speech emotion recognition (SER) research extensively covers the six basic emotions, such as happiness, anger, surprise, sadness, fear, and disgust, there has been relatively limited research on complex emotions.
2. The Eliciting Experiment and Data Collection
In this section, we introduce our eliciting experiment [18], which involves math problem solving as a cognitive task. During this task, subjects (students) are required to verbally report their outcomes, allowing us to collect speech containing various emotions.
In the Schachter–Singer two-factor theory [19], also known as cognitive arousal theory, it is suggested that emotions are the result of a two-step process. First, individuals experience physiological arousal in response to a stimulus, which can be a general state of physiological excitation. Then, they use cognitive appraisal and external cues to label or interpret that arousal as a specific emotion. According to this theory, the cognitive interpretation is critical in determining which emotion is experienced. Based on the cognitive arousal theory, we assume that the generation of negative emotions, such as feeling fidgety, frustrated, or nervous, may interfere with other cognitive processes, such as math calculations. When a student becomes distracted due to these emotions, it can lead to lower performance in math learning.
Eliciting fidgety emotion using repeated and complex math calculations as an external stimulus aligns with the first factor of the cognitive arousal theory, which involves triggering physiological changes during math tasks. In the following sub-sections, we provide a detailed description of the elicitation and data collection process.
Fidgety emotion is an important practical emotion related to cognition. It often emerges in situations where our minds are engaged, seeking stimulation, or grappling with complex thoughts. This emotion can manifest in cognitive performance as well as physical behavior. Fidgety emotion can have a range of negative impacts on cognitive functioning and overall well-being. When excessive, it can disrupt one’s ability to concentrate and complete tasks efficiently. Persistent fidgeting can be distracting to both the individual and those around them, making it challenging to engage in activities that require sustained attention, such as studying or participating in meetings.
2.1. The Cognitive Task
The cognitive task consists of engaging with sequences of mathematical calculation topics. As illustrated in Figure 1, participants undertook cognitive tasks by solving a series of mathematical problems. Throughout this learning process, we captured voice data from the participants using a voice interface, utilized in particular during repetitive math calculations to elicit fidgety emotions and enable the collection of high-quality, naturalistic speech data. We systematically observed and annotated the emotions expressed in each oral report (speech data), while also documenting test scores and individual improvements. Additionally, we recorded the associated mathematical topics as part of the learning history data.
2.2. Data Annotation
Data annotation for emotion recognition necessitates precise emotion labeling across diverse contexts, accounting for factors like cultural nuances, personality, and environmental stimuli. Ensuring inter-annotator agreement through guidelines, training, and regular quality checks is crucial. When selecting data for annotation, diversity is prioritized to train robust models capable of recognizing emotions in various cognitive scenarios and across different demographics.
We employed the Self-Emotion Assessment Scale before and after the math task to monitor emotions. We also conducted a listening test with 12 annotators to label emotions as fidgety, stressed, happy, or neutral. If speech proved challenging to categorize under any of these emotions, we assigned it an “other” label.
After annotation, labels from different annotators can be consolidated using the analytic hierarchy process (AHP) [20]. This method helps weigh and prioritize the annotations, facilitating the assignment of a consensus or aggregated label that reflects the collective judgment of the annotators.
Each emotion annotation divides the intensity of the specific emotion into five levels: 1, 3, 5, 7, and 9, from which the pairwise comparison matrix $P$ is constructed. The principal eigenvalue $\lambda_{\max}$ of $P$ is computed, and the weight vector $W$ is the corresponding eigenvector, normalized so that its components sum to one. The consistency index is $CI = (\lambda_{\max} - n)/(n - 1)$, where $n = 5$ is the order of the matrix, and the consistency ratio is $CR = CI/RI$, where $RI$ is the random index. Since $CR < 0.1$, the comparison matrix satisfies the consistency requirement.
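This AHP consolidation step can be sketched in a few lines of NumPy. The pairwise matrix below is illustrative only, built on the 1/3/5/7/9 scale; the paper's actual matrix $P$ is not reproduced here, and the `ahp_weights` helper is a hypothetical name.

```python
import numpy as np

# Hypothetical 5x5 pairwise comparison matrix on the 1/3/5/7/9 scale
# (illustrative values only; not the paper's actual matrix P).
P = np.array([
    [1,   3,   5,   7,   9],
    [1/3, 1,   3,   5,   7],
    [1/5, 1/3, 1,   3,   5],
    [1/7, 1/5, 1/3, 1,   3],
    [1/9, 1/7, 1/5, 1/3, 1],
])

def ahp_weights(P, RI=1.12):
    """Return the AHP weight vector W, consistency index CI, and ratio CR.

    RI = 1.12 is the standard Saaty random index for a matrix of order 5.
    """
    n = P.shape[0]
    eigvals, eigvecs = np.linalg.eig(P)
    k = np.argmax(eigvals.real)            # principal eigenvalue lambda_max
    lam_max = eigvals.real[k]
    W = np.abs(eigvecs[:, k].real)
    W /= W.sum()                           # normalize weights to sum to 1
    CI = (lam_max - n) / (n - 1)           # consistency index
    CR = CI / RI                           # consistency ratio; CR < 0.1 is acceptable
    return W, CI, CR

W, CI, CR = ahp_weights(P)
```

For this illustrative matrix, $\lambda_{\max} \approx 5.24$, giving $CR \approx 0.05 < 0.1$, i.e., an acceptable level of inconsistency.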
Finally, we collected a dataset comprising 36 subjects (18 females, 18 males) who volunteered to take part in the data collection, with a total of 4389 annotated emotional speech samples. Among these, there were 1082 labeled as “fidgety”, 858 as “stressed”, 855 as “happy”, 929 as “neutral”, and 665 as “others”.
The distribution of samples is further illustrated in Figure 2. The utterances had a relatively balanced distribution across different ages and genders. All speakers were native Chinese speakers, and the oral test was carried out in standard Chinese.
An example of fidgety emotional speech (female) is shown in Figure 3; the spectrogram and pitch frequency are plotted.
3. Methodology
3.1. Multi-Scale CNN for Emotion Recognition
We propose an end-to-end speech emotion model based on a multi-scale one-dimensional (1-D) residual convolutional neural network. The data input to the network is the raw waveform, and the output is the probability corresponding to various emotion categories (including the fidgety emotion).
A multi-scale CNN [21] was used to model and identify the emotional categories. We adopted a time-series modeling method to perform 1-D convolution at the scale of the emotional speech signal and extended the model to recognize fidgety speech emotion. The network architecture is shown in Figure 4.
Dilated convolution carries out local feature processing, which suits representation learning on time-series signals: it extracts temporal features through convolution and is well suited to modeling sequential data such as emotional speech.
Given that emotions are expressed over varying durations, and that time-domain changes are crucial features of arousal and valence, increasing the dilation rate to a value greater than one introduces gaps between the values in the filter. With a larger dilation rate, these gaps become wider, allowing the filter to capture information from a broader receptive field within the emotional speech signal.
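The effect of the dilation rate on the receptive field can be illustrated with a naive NumPy sketch (the actual model uses learned filters inside the residual network; both helper functions below are hypothetical names for illustration).

```python
import numpy as np

def dilated_conv1d(x, w, dilation=1):
    """Naive 1-D dilated convolution (valid padding) over a raw waveform.

    With dilation d, the filter taps are spaced d samples apart, so a
    kernel of size k covers a span of (k - 1) * d + 1 input samples.
    """
    k = len(w)
    span = (k - 1) * dilation + 1
    n_out = len(x) - span + 1
    y = np.empty(n_out)
    for t in range(n_out):
        y[t] = sum(w[j] * x[t + j * dilation] for j in range(k))
    return y

def receptive_field(kernel_sizes, dilations):
    """Total receptive field of stacked dilated conv layers (stride 1)."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

x = np.arange(16, dtype=float)
y = dilated_conv1d(x, np.array([1.0, 1.0, 1.0]), dilation=2)
# span = (3 - 1) * 2 + 1 = 5, so the output length is 16 - 5 + 1 = 12
```

Stacking layers with exponentially growing dilation rates (e.g., kernel size 3 with dilations 1, 2, 4) widens the receptive field to 15 samples with only three layers, which is why dilation is attractive for raw-waveform emotion modeling.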
Each network block comprises a dilated convolution layer, batch normalization, a residual shortcut connection, and a ReLU layer, arranged to extract emotional features from the raw time signal, as shown in Figure 5.
The residual network was proposed by He et al. [22]. By introducing shortcut connections to mitigate problems such as vanishing and exploding gradients, the network depth can be greatly increased, so that very deep networks still converge well in training. The residual module is the basic unit of the residual network; cascading many residual modules improves representation learning and enables the construction of effective speech emotional features.
In our model, the ReLU function is used as the activation function, and we use 1, 3, 8, or 12 residual blocks for the multi-scale branches. The optimizer is Adam, the learning rate is set to 0.01, and the loss function is cross-entropy.
3.2. Emotion Recognition and Cognitive Outcome Prediction
As suggested by the Schachter–Singer two-factor theory [19], our eliciting experiments serve as the stimulus to the subjects, provoking physiological changes. Subsequently, during the second, cognitive stage, emotions such as the fidgety emotion are generated. In our computational model, we assume that both this second stage and the presence of negative emotions exert an influence on the cognitive outcome.
Through the stimulation of cognitive tasks, which involve mathematical calculations, it is probable that the underlying two-factor process that triggers emotions can also influence cognitive processes. We observe and record changes in cognitive processes from an external perspective, including the problem-solving speed, the question difficulty, and the answer accuracy, which together form a cognitive vector.
As shown in Figure 6, by leveraging these cognitive vectors, we assist emotion recognition, assuming that a certain relationship exists between cognitive processes and negative emotions (such as the fidgety emotion). Modeling this conditional relationship can potentially enhance the results of emotion recognition.
Conversely, based on the outcomes of emotion recognition, as well as the historical data on problem-solving speed and answer accuracy rates, it is possible to predict the probability of correctly answering the next question.
The cognitive vector is defined as a set of metrics:
$c_i = (\mathrm{speed}_i,\ \mathrm{diff}_i,\ \mathrm{rate}_i)$,
where speed denotes the measure of the average time spent on one problem (1/time spent), diff denotes the difficulty level, and rate denotes the accumulated percentage of correct answers.
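Building this vector from a problem-solving log can be sketched as follows. The field names follow the paper's definitions; the `cognitive_vector` helper itself and the example numbers are illustrative.

```python
def cognitive_vector(times_spent, difficulties, correct_flags):
    """Build the cognitive vector (speed, diff, rate) from a problem log.

    speed = 1 / (average time spent per problem),
    diff  = difficulty level of the current problem,
    rate  = accumulated fraction of correct answers.
    """
    avg_time = sum(times_spent) / len(times_spent)
    speed = 1.0 / avg_time
    diff = difficulties[-1]
    rate = sum(correct_flags) / len(correct_flags)
    return (speed, diff, rate)

# Two problems: 20 s (correct) and 30 s (incorrect), current difficulty 2
v = cognitive_vector(times_spent=[20.0, 30.0], difficulties=[1, 2],
                     correct_flags=[1, 0])
# average time 25 s -> speed 0.04; difficulty 2; rate 0.5
```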
As depicted in Figure 7, we utilize a cognitive vector to enhance emotion recognition. This cognitive vector is created by incorporating cognitive metrics, specifically the accuracy of mathematical calculations and the historical score record, as outlined earlier. The resulting cognitive vector is then passed to the machine learning classifier for emotion recognition, in conjunction with an emotion vector generated from the probability outputs of the residual network. The machine learning algorithm chosen for combining cognitive and emotional data is the decision tree, which offers a comprehensible representation of the relationship between emotional states and cognitive outcomes.
Specifically, we use the C5.0 decision tree, a refined decision tree algorithm renowned for its adaptability and effectiveness in classification tasks. It partitions the data into subsets by selecting the most informative features, rendering it a valuable tool in our context.
In our specific case, we employ the C5.0 decision tree to amalgamate cognitive and emotional information. The tree is configured with a maximum depth of 10, and at least five samples are required to split a node; the quality of splits is evaluated with the information gain criterion. Decision trees of this kind are particularly valued for their capacity to reveal the significance of features within a dataset. Given our objective of combining cognitive and emotional data for the recognition of emotional states, understanding which features or metrics exert the most influence is paramount. C5.0 exposes the relative importance of the cognitive and emotional metrics, which, in turn, underpins the accuracy of our predictive model.
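The fusion step feeding the decision tree can be sketched as simple feature concatenation. This is a minimal sketch under the assumption that the two vectors are joined end to end; the `fuse_features` helper and the example numbers are hypothetical.

```python
import numpy as np

def fuse_features(cognitive_vec, emotion_probs):
    """Concatenate the cognitive vector with the residual network's
    emotion-probability outputs into one feature vector for the
    downstream decision tree (illustrative; not the paper's exact code)."""
    return np.concatenate([np.asarray(cognitive_vec, dtype=float),
                           np.asarray(emotion_probs, dtype=float)])

# cognitive vector (speed, diff, rate) + 5 emotion probabilities
# (fidgety, stressed, happy, neutral, other), which sum to 1
features = fuse_features((0.04, 2, 0.86), (0.62, 0.2, 0.05, 0.08, 0.05))
# -> an 8-dimensional input row for the classifier
```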
In this paper, the formulation of the cognitive vector signifies the state observed during the question-solving process. Emotions undeniably exert a noticeable impact on the precision of responses and the pace of question-solving. With this underlying hypothesis, we devised a statistical model to establish statistical connections among the variables, thereby increasing the prediction accuracy. Subsequently, the prediction of the question results’ precision serves as a means to corroborate the hypothesis concerning the influence of fidgety emotional states on cognitive speed and cognitive reasoning processes.
where $i$ stands for the current index number. The emotion vector is $e_i = (p_1, p_2, \ldots, p_K)$, where $p_k$ denotes the probability of each emotion type output by the recognition network. We focus on the negative emotions, e.g., the fidgety emotion, and their impacts on cognitive outcomes. The input to the predictor comprises the cognitive vectors, the emotion vectors, and the cognitive difficulty level; the output is the predicted probability that the next answer is correct.
As illustrated in Figure 8, in contrast to the emotion recognition process, predicting the cognitive outcomes also involves the outcomes of emotion recognition. When considering the accuracy of mathematical calculations, emotional states play an influential part. In the algorithmic flow presented, the emotion category and the preceding cognitive vector are closely related, jointly facilitating the prediction of cognitive outcomes, encompassing both the correctness rate, i.e., the cumulative percentage of correct answers, and the speed of math problem solving.
The predictive model here is also a decision tree. The parameter settings are adjusted to strike an equilibrium between model complexity and generalization: the maximum depth of the tree is set to 15, enabling the tree to explore the data more comprehensively, and the minimum number of samples per node is set to 8, ensuring that nodes split only when a sufficient number of data points are present, thus promoting a more robust and generalized model.
4. Experimental Results
The statistics of the sample distribution within the dataset used for this experiment are presented in Table 1. Our dataset comprises a total of 4389 samples, with each mathematical assignment item associated with approximately 5–7 oral report utterances, and it contains 665 math assignment questionnaires. The train/validation/test split ratio is 7:1:2, resulting in 878 samples allocated for testing. The training samples are randomly selected and mixed across speakers; the evaluation is thus speaker-independent, and the model does not depend on any specific speaker, generalizing well to different speakers.
The emotion recognition classifier is trained as a single task, separately, using the emotion labels. The speed and rate for cognitive prediction are first estimated independently; they can then be improved by the emotion recognition results.
The results for emotion recognition are displayed in Table 2. The confusion matrix highlights the performance of our proposed method, built upon a multi-scale 1-D residual network. It is evident from the matrix that the fidgety emotion and other cognition-related emotions are accurately identified.
In order to demonstrate the advantage of our proposed method, we compare it with a basic 1-D convolution model, LSTM (long short-term memory) [23], and SVM. As shown in Figure 9, the four emotion classes, fidgety, stressed, happy, and neutral, plus the “other” type are modeled and compared. The recognition rates observed show that our proposed multi-scale 1-D residual convolutional network outperformed the rest.
Parameter Settings
In order to better compare the different classifiers and to more easily reproduce the models, we describe the parameters used for the basic 1-D convolution model, the LSTM model, and the SVM model. For the basic 1-D convolutional model, identical residual blocks are employed. We maintain a fixed number of residual blocks at three, in contrast to the multi-scale model where the scale may vary. We further choose the ReLU function as the activation function. For LSTM, we employ the ReLU activation function, the cross-entropy loss function, and set the dropout rate to 0.2. The Adam optimizer is utilized for training the model. For SVM, the radial basis function (RBF) kernel is used as the kernel function, after being compared with the polynomial kernel and the linear kernel.
As shown in Table 3, using the proposed computational model described in the methodology section, we can improve the emotion recognition results by merging the cognitive vector into the recognition process. The results show that fidgety and other cognition-related emotions are improved considerably; the recognition rate for the fidgety emotion rises from 85.1% to 94.6%.
As depicted in Figure 10, the utilization of cognitive vectors enhanced recognition rates, particularly for negative emotions like “fidgety”, which exhibited a more pronounced improvement than emotions less closely associated with cognitive processes. Across the emotional categories, incorporating cognitive vectors consistently outperforms recognition that relies solely on emotion-related features. The most substantial improvements were evident in the “fidgety” and “stress” categories, where recognition rates increased by 9.5 and 7.9 percentage points, respectively, suggesting that cognitive vectors excel at capturing subtleties of the fidgety emotional state. Even in the “happy” and “neutral” categories, there were noteworthy improvements of 4.2 and 6.0 percentage points, underscoring the versatility of cognitive vectors in enhancing recognition accuracy across a spectrum of emotional classifications.
In our cognitive outcome prediction, we determine the prediction accuracy as the percentage of correct predictions for both right and wrong answers. The math problems’ difficulty levels are categorized into “easy” and “hard”. Along with the difficulty level, different math topics are incorporated as features in our prediction model.
By leveraging peer performance in the math assignment correctness results, we employ an XGBoost classifier to predict future math problem outcomes (as a base prediction model, without considering the emotional states). The model takes as input the current math topics, encoded as one-hot vector IDs, along with the difficulty levels and the historical assignment results, encompassing both correct and incorrect answers for each past topic, as well as the corresponding time spent on each. The classifier’s parameters are configured as follows: 500 for the number of estimators, 0.01 for the learning rate, and 4 for the maximum tree depth.
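The input encoding for this base model can be sketched as follows. The `encode_question` helper and the exact feature layout are assumptions for illustration; the paper specifies which signals are used (topic one-hot, difficulty, historical correctness and times), not their ordering.

```python
import numpy as np

def encode_question(topic_id, n_topics, difficulty, history):
    """Encode one math question as a base-model input row.

    topic_id   : integer ID of the current topic (one-hot encoded)
    difficulty : 0 = easy, 1 = hard
    history    : list of (correct_flag, time_spent) pairs for past items
    """
    one_hot = np.zeros(n_topics)
    one_hot[topic_id] = 1.0
    correct = [c for c, _ in history]
    times = [t for _, t in history]
    # summarize the history as accumulated accuracy and mean time spent
    hist_feats = [sum(correct) / len(correct), sum(times) / len(times)]
    return np.concatenate([one_hot, [difficulty], hist_feats])

row = encode_question(topic_id=2, n_topics=5, difficulty=1,
                      history=[(1, 20.0), (0, 40.0)])
# 5 one-hot dims + difficulty + accuracy + mean time = 8 features
```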
In this experiment, the cognitive prediction results without using emotional information are shown in Table 4, where “speed” denotes the measure of the average time spent on one problem (1/time spent). By using an emotion vector, we can improve the cognitive outcome prediction results. As shown in Table 5, both the rate and speed predictions were improved, and the improvements were considerable when the fidgety and stress emotions were present. In the “easy” category, the rate prediction accuracy improved from 80.1% to 87.7% for the fidgety emotion; in the “hard” category, it improved from 81.5% to 89.5%. We can conclude that using the emotional state labels contributes to the prediction of cognitive outcomes.
5. Discussion
In our emotion recognition and cognitive prediction experiments, the dataset, composed of 4389 samples, was systematically divided into training, validation, and test sets, with a significant allocation for testing (878 samples). The results for emotion recognition underscore the effectiveness of the proposed multi-scale 1-D residual convolutional network. Notably, the confusion matrix facilitated accurate identification, particularly when discerning fidgety emotion and other cognitive processes.
Comparative analysis with traditional models, such as basic 1-D convolution, LSTM, and SVM, showed the superior performance of the proposed multi-scale 1-D residual network across the four emotion classes. The marked improvements observed, particularly in the recognition of the fidgety emotion (from 85.1% to 94.6%), underscore the model’s ability to capture nuanced variations in emotional states.
Furthermore, the integration of cognitive vectors in the emotion recognition process highlights significant enhancement in identifying fidgety and other cognitive-related emotions. These improvements span various emotional categories, with the most substantial gains observed in the recognition of the fidgety emotional state.
The incorporation of emotion vectors in cognitive prediction aligns with the two-factor model, considering physiological arousal and cognitive processes. This approach, attuned to understanding and adapting to students’ emotional states during cognitive tasks, offers a nuanced perspective on enhancing the learning experience.
6. Conclusions
In this paper, we present a computational model for emotion recognition and cognitive prediction based on the well-known two-factor model, which takes into account physiological arousal and cognitive processes. We approach the problem of emotion recognition from the perspective that its generation is closely intertwined with cognitive processes. Our methodology began with the creation of an eliciting experiment for collecting emotional speech data during a mathematical cognitive task. Subsequently, we developed a computational model employing a 1-D residual network.
Through comparative analysis among various machine learning classifiers, we established that our proposed approach excels in recognizing emotions with cognitive relevance. Furthermore, we demonstrated the potential utility of emotion recognition in assisting cognitive outcome prediction. This development has promising implications for applications in AI-assisted teaching.