1. Introduction
Lecturers have a significant impact on student learning and success [1], and there is clear variation in how effectively lecturers foster positive educational outcomes [2]. Feedback is the primary factor influencing achievement, but its effects vary significantly, highlighting the intricate nature of optimizing the advantages derived from feedback [3]. Feedback mechanisms in higher education are frequently misunderstood, challenging to execute proficiently, and often fall short of their intended goal of substantially influencing student learning [4].
Strong emotion frequently serves as a catalyst for achieving educational milestones. A student who is filled with enthusiasm is more likely to learn successfully than one who is disinterested, because enthusiasm fosters a readiness that motivates the body and mind to engage in learning and scrutinize information. Consequently, educational approaches should align with the emotional state of the student.
Numerous methods exist for identifying emotions. Among the most widely used is the capture of facial characteristics [5]. In physical classrooms, closed-circuit television systems can record these facial features, while in online learning, webcams serve the same purpose. After acquisition, the images are processed further to produce the analysis results.
The Viola–Jones algorithm [6,7] is a widely acknowledged and extensively used method for real-time face detection in image processing [8,9,10,11]. Devised by Paul Viola and Michael Jones, the algorithm identifies faces within digital images swiftly and effectively [12]. Its functionality hinges on a cascade of classifiers trained on a diverse array of positive and negative samples. Its efficacy spans a spectrum of applications, from real-time video analysis to facial recognition systems, rendering it a fundamental pillar in the domain of computer vision.
In recent years, the advent of online learning platforms has transformed the educational landscape, necessitating innovative approaches to monitor and enhance student engagement. Recognizing students' facial expressions in real time during virtual classes offers a promising avenue for assessing their emotional and cognitive states, which are crucial indicators of their learning experiences. This research aims to develop and implement a robust system for facial detection in a Zoom-based online learning environment using the Viola–Jones algorithm and to analyze students' facial expressions using convolutional neural networks.
Another central advancement in this context is convolutional neural networks (CNNs), a class of deep learning algorithms specifically designed for processing structured grid data, such as images [
13]. CNNs have revolutionized various fields by automating feature extraction and classification processes, which are crucial for analyzing visual data. The architecture of CNNs is inspired by the biological visual cortex and consists of several key components: convolutional layers, pooling layers, and fully connected layers. Convolutional layers apply filters to input data, detecting essential features like edges, textures, and patterns [
14,15,16,17]. These layers are followed by pooling layers, which reduce the dimensionality of the data while preserving critical information, thereby minimizing computational complexity. Finally, fully connected layers integrate these features to perform tasks such as classification or regression.
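To make the roles of these layers concrete, the following is a minimal sketch of such a stack in MATLAB's Deep Learning Toolbox; the filter counts and sizes are illustrative only and are not the architecture trained in this study.

```matlab
% Minimal illustrative CNN stack (MATLAB Deep Learning Toolbox).
% Layer sizes are arbitrary examples, not the network used in this study.
layers = [
    imageInputLayer([48 48 1])                    % grayscale input image
    convolution2dLayer(3, 16, 'Padding','same')   % 3x3 filters detect edges/textures
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)             % downsample, keep dominant activations
    convolution2dLayer(3, 32, 'Padding','same')   % deeper layer captures larger patterns
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(7)                        % e.g., one output per emotion class
    softmaxLayer
    classificationLayer];
```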
One of the primary strengths of CNNs lies in their ability to learn hierarchical representations. Lower layers capture basic features, while higher layers identify more complex patterns and objects, enhancing the model’s ability to generalize across various tasks. This hierarchical learning, combined with techniques such as weight sharing and local connectivity, significantly improves CNNs’ efficiency and effectiveness. The robust performance and adaptability of CNNs have made them foundational to numerous advanced applications, including image recognition, object detection, and facial recognition.
In the context of educational technology, CNNs can be employed to analyze students’ facial expressions and body language to infer their engagement and comprehension levels. Such applications are instrumental in providing real-time feedback to educators, enabling them to tailor their teaching strategies to meet the needs of their students better. The educational potential of recognizing students’ facial expressions in online learning platforms like Zoom is substantial, offering notable benefits for both educators and learners. By monitoring facial expressions, educators can gain real-time insights into student engagement levels, allowing for timely interventions that maintain focus and participation. This capability supports personalized learning, enabling instructors to adapt teaching methods to meet individual needs, thereby enhancing learning effectiveness and student satisfaction.
The choice of the Viola–Jones algorithm and CNNs for this research was motivated by their simplicity, computational efficiency, and real-time processing capabilities, which align well with the current capabilities of photonic hardware [18,19,20]. The Viola–Jones algorithm, being less resource-intensive, is well-suited for photonic neural networks utilizing multimode fibers and supports real-time face detection, crucial for immediate results. Starting with simpler algorithms provides a robust foundation for future advancements, enabling the implementation of more complex algorithms like You Only Look Once (YOLO), CenterFace, and CornerNet as photonic technology evolves. Both algorithms have demonstrated proven effectiveness in various applications, particularly in face detection and image recognition.
By leveraging advanced deep learning algorithms and computer vision techniques, this system seeks to provide educators with valuable insights into students' engagement levels, comprehension, and emotional well-being. The ultimate goal is to create a more responsive and personalized educational experience that adapts to the needs of each student, thereby improving overall learning outcomes in remote education. The Viola–Jones algorithm, employed here for real-time face detection, is a prevalent object detection technique in computer vision. Its widespread adoption stems from its efficacy in object identification, particularly in discerning facial features within images and videos. Renowned for its rapid detection capabilities and modest computational demands, the algorithm is well suited to real-time operation, and its effectiveness has led to its use in diverse applications such as facial recognition systems, object tracking, and beyond.
Furthermore, real-time analysis provides immediate feedback on teaching efficacy, prompting adjustments to improve comprehension and retention. Recognizing emotional states also addresses the crucial aspect of student well-being, allowing educators to provide necessary support and referrals. Data collected over time offer valuable insights into engagement trends, informing curriculum development and teaching strategies. This research is particularly relevant for optimizing remote learning, making it more interactive and responsive, and bridging the gap between traditional and digital classrooms. Additionally, in hybrid learning models, the system ensures equal attention and support for remote students, promoting inclusivity. Lastly, insights from facial expression analysis can enhance teacher training, improving their ability to manage classroom dynamics effectively, even virtually. Thus, this research has the potential to significantly enhance the quality and effectiveness of online education.
In this research, an advanced deep learning system is proposed to enhance the real-time assessment of students' understanding during lessons. This system leverages Viola–Jones object detection for face detection, combined with a CNN that analyzes facial expressions to continuously evaluate and monitor students' comprehension levels. By implementing this innovative approach, the system provides real-time support to educators, enabling them to adapt their teaching strategies dynamically to improve overall classroom understanding. This real-time feedback mechanism facilitates immediate instructional adjustments, thereby fostering a more responsive and effective learning environment. The integration of Viola–Jones techniques with CNNs represents a significant advancement in educational technology, offering a robust tool for enhancing pedagogical efficacy and student engagement.
2. Methods and Materials
In this project, the general data protection regulation (GDPR) is strictly adhered to in the handling of participant data. Informed consent is obtained from all participants, ensuring that consent is explicit, freely given, specific, and informed. Data collection is conducted in accordance with the principles of purpose specification and retention, meaning that personal data are collected only for specified, legitimate purposes and are not processed further for incompatible purposes. Data minimization is applied, ensuring that only the essential data required for the study are collected. Participants are afforded data subject rights, which include the ability to view, correct, or delete their data, as well as the “right to be forgotten”. Strict data security measures are in place to prevent unauthorized access or accidental loss of data. Additionally, when sensitive data are involved and large-scale data processing occurs, a data protection officer (DPO) is appointed to oversee compliance. These safeguards ensure that the system complies with GDPR regulations and can be integrated into the European educational system while maintaining participant privacy.
The inquiry into the functional role of emotions spans centuries, yet contemporary advancements in behavioral and neurological research have facilitated a more nuanced understanding of this fundamental question [
21]. Emotions play a crucial role in how achievement and goals are pursued and accomplished. Academic emotions constitute a multifaceted spectrum of affective experiences within educational contexts, influenced by a complex interplay of situational determinants and individual psychological factors [
22]. These emotions manifest across various academic domains, including lecture attendance, examination scenarios, solitary and collaborative study sessions, and scholarly discourse participation. The genesis of these emotions is twofold, arising from both intrinsic task engagement and extrinsic outcome anticipation, such as performance evaluations. A taxonomic approach to academic emotions typically employs a dual-axis classification system, distinguishing between valence (positive versus negative) and situational context (task-related versus social).
Exemplifying this classification, positive-activating emotions (e.g., enjoyment and hope) stand in contrast to negative-activating emotions (e.g., anxiety and anger), each exerting distinct influences on academic engagement and subsequent outcomes. Empirical investigations have elucidated the heterogeneity of emotional responses among learners, with notable subject-specific variations suggesting the necessity for domain-tailored interventional strategies. Methodologically, the assessment of academic emotions encompasses a diverse array of approaches, including psychometric surveys, qualitative interviews, and physiological measurements. Theoretical frameworks, such as the control-value theory, provide conceptual scaffolding for understanding the antecedents and consequences of emotions in learning contexts, thereby informing the development of interventions aimed at optimizing academic achievement through efficacious emotional regulation strategies [
23]. The affective dimensions of students’ educational experiences and academic outcomes are profoundly influenced by the perceived significance of learning activities and the contextual factors in which these occur. Emotions, which can be systematically categorized along dimensions of valence (positive or negative) and their relationship to task-oriented or social–interactive contexts, play a pivotal role in this process.
Emotions also serve an adaptive function by facilitating situational responses and mediating between reflexive and higher-order cognitive processes. Within educational environments, they exert a significant influence on mnemonic processes, cognitive strategy selection, attentional allocation, and motivational orientations.
Furthermore, the emotional states experienced during learning processes have been demonstrated to enhance memory consolidation and exert a considerable impact on academic performance metrics. Positive affective states are associated with enhanced divergent thinking and creative problem-solving capabilities. Conversely, negative emotional states tend to promote analytical processing modes but may concurrently deplete cognitive resources, potentially compromising performance on complex cognitive tasks [
24].
The Viola–Jones algorithm and CNN were chosen for this study due to their specific advantages in the context of real-time facial expression recognition using photonic hardware, despite the availability of more advanced algorithms. The Viola–Jones algorithm is renowned for its simplicity and efficiency in face detection, utilizing Haar-like features and a cascade of classifiers. This design results in reduced computational intensity, making it ideal for the initial face detection stage in an optical neural network setup, which emphasizes low-volume 3D connectivity and large bandwidth.
While more advanced algorithms like YOLO, CenterFace, and CornerNet offer superior results, they often require significantly higher computational resources, which can be challenging to implement effectively on photonic hardware currently in the nascent stages of managing complex deep learning models. The straightforward implementation of the Viola–Jones algorithm is better suited to the current capabilities of photonic neural networks, promoting more reliable and efficient performance.
Similarly, CNNs are chosen for their exceptional ability to process and analyze visual data. Inspired by the human visual cortex, CNNs excel in automatically extracting hierarchical features from images, enabling precise classification and recognition of facial expressions. Their deep learning nature ensures that complex patterns and nuances in facial expressions are effectively captured and analyzed.
Photonic hardware offers several significant advantages over traditional electronic hardware, making it a promising alternative for implementing neural networks and other computational tasks. High-speed data processing is one of the primary benefits, as photonic systems can process data at the speed of light, significantly increasing computational speed. This is particularly advantageous for real-time applications requiring fast data processing, such as facial recognition and neural network operations. Furthermore, photonic hardware inherently supports parallel processing, allowing multiple optical signals to be processed simultaneously without interference. This capability enables efficient handling of massive data sets and complex computations in real time.
Another advantage of photonic hardware is its low heat production. Unlike electronic components that generate significant heat during operation, photonic devices produce minimal heat, reducing the need for extensive cooling systems and improving energy efficiency and reliability in high-performance computing environments. Additionally, photonic systems offer significantly larger bandwidth compared to electronic systems, facilitating the transmission and processing of large volumes of data at high speeds. This makes photonic hardware ideal for applications requiring high data throughput.
Photonic systems also experience minimal latency due to the high speed of light transmission. This is crucial for applications where quick response times are essential, such as telecommunications and real-time data analysis. Moreover, photonic hardware can be scaled up to accommodate increasing computational demands. The use of optical fibers and other advanced photonic structures enables the creation of highly scalable systems capable of handling complex and large-scale computations.
By leveraging these advantages, photonic hardware presents a promising alternative to traditional electronic systems, offering improved performance, efficiency, and scalability for a wide range of applications, including neural networks, telecommunications, and high-performance computing. These benefits make photonic hardware an ideal choice for integrating the Viola–Jones algorithm and CNNs for efficient and reliable real-time facial expression recognition. Utilizing simpler, yet highly effective algorithms like Viola–Jones and CNNs ensures that the system can operate efficiently within the current technological constraints while providing a robust foundation for future advancements in photonic computing and more complex algorithm integration. In this study, an advanced computational system was developed and implemented using MATLAB 23.2 software, integrating CNNs with the Viola–Jones algorithm to detect and analyze student emotional states in an academic environment, as illustrated in
Figure 1.
The methodological framework encompassed several critical phases: image preprocessing, convolutional layer processing, and emotion classification, with a specific focus on differentiating between negative and positive comprehension levels.
The study was conducted during a 72 min instructional session involving 45 students. To enhance the system’s accuracy and robustness, a substantial dataset comprising 183,624 training images was employed. The image preprocessing phase involved the normalization of input images to ensure consistent scale and format across the dataset. Subsequently, the convolutional layers extracted salient features from these preprocessed images, enabling the CNN to effectively learn and identify patterns associated with various emotional states.
The final classification stage utilized these learned features to categorize the emotions into predefined classes indicative of students’ comprehension levels. To assess the system’s efficacy, a comparative analysis was performed between the emotional responses detected by the system and those self-reported by the students through comprehensive questionnaires. This evaluation aimed to validate the system’s accuracy in emotion detection and its potential application in educational settings for real-time assessment of student engagement and understanding.
The Viola–Jones algorithm is highly efficient for real-time face detection due to several key factors. Firstly, it employs an integral image representation that allows for rapid computation of rectangle features, significantly speeding up the feature extraction process. This technique reduces computational complexity, enabling quick image processing and face detection. Secondly, the AdaBoost [25] learning algorithm is used to select a small number of critical features from a larger set, enhancing detection accuracy while maintaining computational efficiency. By combining weak classifiers into a strong classifier, AdaBoost ensures both speed and accuracy. Thirdly, the algorithm's cascaded classifier architecture quickly eliminates non-face regions, focusing computational resources on promising face-like regions. Each stage of the cascade is designed to reject a significant percentage of non-face sub-windows while retaining most face sub-windows, ensuring efficient handling of large images and video frames.
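As an illustration of cascade-based detection, the following MATLAB sketch uses the Computer Vision Toolbox's vision.CascadeObjectDetector, which implements a Viola–Jones-style cascade; the image file name is a placeholder, and the snippet is not the exact pipeline used in this study.

```matlab
% Illustrative Viola-Jones face detection (MATLAB Computer Vision Toolbox).
detector  = vision.CascadeObjectDetector();        % default frontal-face cascade
frame     = imread('classroom_frame.png');         % hypothetical input image
bboxes    = detector(frame);                       % one [x y width height] row per face
annotated = insertShape(frame, 'Rectangle', bboxes, 'LineWidth', 3);
imshow(annotated);                                 % visualize the detected faces
```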
In the context of machine and deep learning, the term "weight" refers to a numerical value assigned to each training example that reflects its importance or influence on the learning process during each iteration. Initially, every training example is assigned an equal weight, meaning each example is considered equally important. Weights are then updated based on whether the examples are correctly or incorrectly classified by the current weak classifier. The goal is to give more focus (higher weight) to the misclassified examples so that subsequent classifiers pay more attention to these harder examples.
This process helps the algorithm improve its performance on difficult examples: the weights of misclassified examples are increased, while the weights of correctly classified examples are decreased. The updated weights are then normalized so that they sum to 1, ensuring they remain a valid probability distribution.
The importance of the weights is twofold. First, they govern how weak classifiers are combined: each weak classifier's contribution to the final strong classifier is weighted by its accuracy, which is determined by the weights of the examples it correctly classifies. Second, they focus learning on difficult examples: by increasing the weights of misclassified examples, AdaBoost forces subsequent classifiers to concentrate on the harder cases, thereby improving overall accuracy.
The AdaBoost algorithm iteratively combines weak classifiers to create a strong classifier. Each weak classifier focuses on different aspects of the data, adjusting its importance based on the errors of its predecessors. By sequentially adjusting the weights of misclassified instances, AdaBoost effectively emphasizes challenging examples, refining its predictions with each iteration.
This adaptive boosting technique has demonstrated robustness across various domains, achieving notable success in tasks such as face detection and object recognition. AdaBoost’s ability to improve classification accuracy through its iterative learning process makes it a fundamental tool in modern machine learning applications.
Weak classifiers have limited predictive power, performing slightly better than random guessing. In contrast, strong classifiers are highly accurate, reliably predicting class labels. Strong classifiers are typically composed of multiple weak classifiers, combined using ensemble methods such as AdaBoost, which improves overall prediction performance significantly.
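The weight-update logic described above can be sketched as follows. This is a generic AdaBoost outline in which trainStump and applyStump are hypothetical helper functions standing in for the Haar-feature weak learners; it is not the implementation used in this work.

```matlab
% Schematic AdaBoost weight update.
% Assumed given: X (features), y (labels in {-1,+1}), M (number of boosting rounds),
% plus hypothetical weak-learner helpers trainStump / applyStump.
n = numel(y);
w = ones(n,1) / n;                               % start with equal weights
for m = 1:M
    stump    = trainStump(X, y, w);              % weak learner fitted to weighted data
    pred     = applyStump(stump, X);             % predictions in {-1,+1}
    miss     = (pred ~= y);                      % misclassified samples ("misses")
    err      = sum(w(miss)) / sum(w);            % weighted error of this weak classifier
    alpha(m) = 0.5 * log((1 - err) / err);       % classifier weight: accurate stumps count more
    w        = w .* exp(-alpha(m) .* y .* pred); % raise weights of misses, lower the rest
    w        = w / sum(w);                       % renormalize so the weights sum to 1
end
```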
The functionality of the Viola–Jones algorithm relies on the AdaBoost algorithm to create a cascaded series of rectangular-feature classifiers [26]. Central to its operation is the integral image, illustrated in Figure 2, which allows the rectangle features to be recomputed rapidly each time the detection window is shifted.
The integral image is obtained by accumulating the pixel values above and to the left of each coordinate (x, y); its mathematical formulation is given by Equation (1):

II(x, y) = Σ_{x′ ≤ x, y′ ≤ y} I(x′, y′),   (1)

where I(x′, y′) denotes the pixel intensity at coordinate (x′, y′). The normalized weight of each training sample is determined by Equation (2), where W denotes the maximum weight. In the context of AdaBoost, a "miss" refers to an instance where the weak classifier incorrectly predicts the label of a training sample; for example, if the true label is positive (+1) but the classifier predicts it as negative (−1), it is considered a miss. AdaBoost increases the weights of these misclassified samples, making them more significant in the next iteration. This adjustment directs the algorithm's focus toward harder-to-classify examples, improving the overall performance of the classifier through iterative refinement. To delineate the facial region, the Viola–Jones algorithm applies AdaBoost classifiers arranged serially as a cascade of m filters applied in sequence. At each stage, the current weak classifier is discarded while the classifier weights are retained, and the stage threshold is computed using Equation (3).
The threshold updated in iteration m + 1 is compared with the weight of each sub-window: if the sub-window weight is lower than the threshold, the sub-window is discarded and not classified as a face; if it is greater than the threshold, the sub-window is classified as a face.
Using the Viola–Jones algorithm for face detection involves calculating the integral image and performing classification, as depicted in Figure 2. The total weight is defined in Equation (4).
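For illustration, the integral image of Equation (1) can be computed with two cumulative sums; the file name below is a placeholder, and the rectangle-sum line assumes interior indices greater than 1.

```matlab
% Integral image of a grayscale frame via cumulative sums (cf. Equation (1)).
% The Image Processing Toolbox function integralImage gives an equivalent (zero-padded) result.
I  = im2double(im2gray(imread('classroom_frame.png')));  % hypothetical input frame
II = cumsum(cumsum(I, 1), 2);        % II(x,y) = sum of pixels above and to the left of (x,y)

% Sum of an example rectangle spanning rows r1..r2 and columns c1..c2 (r1, c1 > 1),
% recovered from only four array references:
r1 = 10; r2 = 20; c1 = 10; c2 = 20;
rectSum = II(r2,c2) - II(r1-1,c2) - II(r2,c1-1) + II(r1-1,c1-1);
```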
In this research, a CNN model is used for facial expression detection. For N × N input images, convolution is performed with an f × f filter, producing an (N − f + 1) × (N − f + 1) feature map when applied with unit stride and no padding. Through this process, the network learns to recognize and emphasize important features.
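As a quick check of the dimensions involved, the following sketch convolves a 48 × 48 stand-in image with a 3 × 3 filter in 'valid' mode, yielding the expected 46 × 46 feature map.

```matlab
% Valid convolution of an N x N image with an f x f filter yields an
% (N - f + 1) x (N - f + 1) feature map; e.g., 48 x 48 with a 3 x 3 filter -> 46 x 46.
img    = rand(48, 48);            % stand-in for a 48 x 48 grayscale input
kernel = rand(3, 3);              % stand-in for a learned 3 x 3 filter
fmap   = conv2(img, kernel, 'valid');
size(fmap)                        % returns [46 46]
```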
Figure 3a,b demonstrates the feasibility of multimode fibers for optical neural networks.
Figure 3a illustrates a neural network with four layers, showcasing how optical fibers can be leveraged to achieve efficient data processing with minimal heat generation compared to electronic implementations.
Figure 3b presents a conceptual design for a neural network embedded within multimode fibers. This design also includes four layers, highlighting the scalability and potential complexity achievable with this approach. In this network, neurons and synapses are represented by individual silica cores in a multi-core fiber. Optical signals transfer transversely between these cores through optical coupling, while pump-driven amplification in erbium-doped cores mimics synaptic interactions.
This unique photonic CNN architecture showcases the potential for integrating optical technologies into CNN designs. The structure utilizes optical fiber components such as combiners [27,28], splitters [29,30], and erbium-doped fiber amplifiers, as demonstrated in previous research [31], to process information using light instead of traditional electronic signals. This innovative approach significantly enhances both speed and energy efficiency, making it possible for the CNN to handle complex data patterns with improved performance.
While our study primarily addresses digital computation, this architecture holds significant potential for future real-time neural detection applications, such as emotion recognition [
32]. By leveraging optical-based components, the system offers a scalable and energy-efficient solution for high-performance neural processing. This approach promises faster and more efficient data handling, making it particularly suited for advanced neural detection tasks in fields like cognitive science and brain–computer interfaces.
The max-pooling layer is used to reduce the spatial dimensions of input feature maps, thereby decreasing computational complexity and promoting translational invariance. It operates by dividing the input into regions and selecting the maximum value from each region. For instance, a 3 × 3 max-pooling window applied with a stride of 3 reduces an n × n input to an (n/3) × (n/3) output, retaining the most significant feature in each region.
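A small worked example of this pooling operation, here with a 2 × 2 window and stride 2 for brevity:

```matlab
% 2 x 2 max pooling with stride 2: each output value is the maximum of one
% non-overlapping 2 x 2 region, halving both spatial dimensions.
fmap   = magic(4);                                % 4 x 4 example feature map
pooled = [max(fmap(1:2,1:2),[],'all'), max(fmap(1:2,3:4),[],'all');
          max(fmap(3:4,1:2),[],'all'), max(fmap(3:4,3:4),[],'all')];   % 2 x 2 output
% Inside a network, the same operation is expressed as maxPooling2dLayer(2,'Stride',2).
```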
Evaluation: in this phase, the test set is classified using the previously constructed model, and the accuracy percentage of the model is evaluated.
Prediction: images from the test set are selected to assess the accuracy achieved by the model. During this evaluation, the model’s detected emotion and the true emotion of each image are compared. The results show the probability of correctly predicting the emotion using the model, as illustrated in simulation results.
Additionally, max-pooling and dropout layers are placed between each block to mitigate overfitting. Finally, a dense layer utilizing the Softmax function [33,34,35] is employed to classify the images into their respective classes (emotions).
Figure 4 shows the block diagram of an emotion recognition system and the levels of understanding on a facial expression screen. This figure outlines the various stages involved in the system, from the initial input to the final output, illustrating how facial expressions are analyzed and interpreted to recognize emotions.
Dataset: the selection of an appropriate dataset is paramount in facilitating the optimal training of a model for the identification of emotions predicated on facial expressions. The dataset under consideration encompasses 229,528 images (training and validation only) meticulously categorized into seven distinct emotional states. The delineated emotions within the dataset encompass the spectrum of human affect, specifically including Anger, Disgust, Fear, Happiness, Neutrality, Sadness, and Surprise. The dataset is based on 45 students’ facial expressions during lectures.
Data loading and preprocessing: the selected dataset is loaded, and the image dimensions are standardized to 48 × 48 × 1 to expedite processing. The dataset is split into training and validation sets, with 80% allocated for training and 20% for validation, resulting in 183,624 training images and 45,904 validation images. Following training and validation, a separate test set comprising 56,528 images is prepared for final evaluation. The combined dataset, including training, validation, and test sets, consists of 286,056 images. The training set accounts for 64.19%, the validation set represents 16.05%, and the test set constitutes 19.76% of the total dataset. These percentages clarify the overall distribution across the dataset segments, ensuring a balanced approach to model training, validation, and evaluation. Display Dataset Samples: at this stage, samples are extracted from the training set to illustrate each emotion alongside its corresponding label. The segmentation of the dataset is presented in
Table 1:
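A possible sketch of the loading and splitting step in MATLAB is shown below; the folder name is hypothetical, and the sketch assumes one subfolder per emotion class.

```matlab
% Sketch of dataset loading and the 80%/20% training/validation split described above.
imds = imageDatastore('emotion_dataset', ...          % hypothetical dataset folder
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');  % 80% train / 20% validation
% Standardize every image to the 48 x 48 x 1 grayscale format used by the model.
imdsTrain.ReadFcn = @(f) imresize(im2gray(imread(f)), [48 48]);
imdsVal.ReadFcn   = @(f) imresize(im2gray(imread(f)), [48 48]);
```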
To ensure that the emotions detected on students’ faces were directly related to the lecture content and not influenced by external factors, several steps were implemented in the experimental design. A lecturer delivering their first session to the students was selected, ensuring no prior familiarity that could affect emotional responses. Additionally, students were screened for any pre-existing emotional burdens or external factors that might skew results, and those identified as feeling overwhelmed or carrying emotional baggage were excluded from the sample. After this screening process, 45 participants remained, allowing for accurate system calibration. This approach aimed to isolate detected emotions solely in relation to the educational content, minimizing potential noise from external influences. Emotional feedback sessions were also conducted to confirm that the observed emotions were indeed linked to the learning material. Additionally, the data were labeled based on three key elements: the overall feeling reported by the students, the grade they received on specific questions corresponding to the 5 s sample interval, and the emotion recognized by the system. This multi-layered labeling approach allowed us to create a more detailed correlation between emotional states, comprehension, and performance during the lecture [
36].
In large datasets, individual variations [37,38,39] or spontaneous data [40,41], such as emotions influenced by personal issues, become statistically diluted. Big data analysis emphasizes overall trends rather than isolated instances, effectively reducing the impact of random fluctuations or external emotional factors on final results. In analyzing emotional responses within educational settings, the patterns observed typically reflect general engagement levels, as the volume of data helps balance individual deviations. By examining multiple students across various sessions, the model can reliably identify patterns associated with comprehension, even if some students experience personal influences. Furthermore, correlating facial expressions with post-lecture comprehension assessments validates the accuracy of the emotional data collected. Discrepancies between a student's emotional responses and their comprehension level may indicate moments of mind wandering or temporary disengagement. Overall, leveraging big data enables a reliable and generalizable measure of student engagement, consistent despite occasional individual deviations.
Data pre-processing (augmentation): in this phase, the training data are pre-processed and augmented. Augmentation techniques are valuable for enhancing the performance and outcomes of deep learning models. They provide diverse variations of the training data, thereby making the model more robust against overfitting. By applying various transformations to the training images, the model learns to prioritize the essential features necessary for accurate classification.
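A sketch of such augmentation using MATLAB's imageDataAugmenter is shown below; the specific transformation ranges are assumptions rather than the settings used in this study, and imdsTrain refers to the hypothetical datastore from the loading sketch above.

```matlab
% On-the-fly augmentation: random flips, shifts, and rotations so the network
% sees varied versions of each training image in every epoch.
augmenter = imageDataAugmenter( ...
    'RandXReflection',  true, ...
    'RandRotation',     [-10 10], ...   % degrees
    'RandXTranslation', [-3 3], ...     % pixels
    'RandYTranslation', [-3 3]);
augTrain = augmentedImageDatastore([48 48 1], imdsTrain, 'DataAugmentation', augmenter);
```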
CNN model: the CNN model, combined with the Viola–Jones algorithm, identifies facial expressions by leveraging cascaded classifiers for efficient face detection and convolutional layers for accurate emotion recognition.
Evaluation: in this model, the CNN's performance is evaluated using four key metrics: accuracy, precision, recall, and F1-score. These metrics are derived from four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The first term indicates whether the prediction agrees with the label (true or false), and the second term indicates the predicted value itself (positive or negative). Accuracy captures the proportion of correctly predicted instances, including both true positives and true negatives, out of the total instances, as represented in Equation (5):

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (5)

Precision measures the accuracy of the positive predictions, as represented in Equation (6):

Precision = TP / (TP + FP)   (6)

Recall evaluates the model's ability to identify all relevant instances, as represented in Equation (7):

Recall = TP / (TP + FN)   (7)

The F1-score balances precision and recall into a single metric, providing a comprehensive assessment of the model's overall effectiveness in detecting facial expressions, as represented in Equation (8):

F1-score = 2 × (Precision × Recall) / (Precision + Recall)   (8)
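These four metrics can be computed per emotion from a confusion matrix in a one-vs-rest fashion, as in the following sketch (trueLabels and predLabels are assumed to be categorical vectors of ground-truth and predicted emotions):

```matlab
% Per-class metrics from a confusion matrix (Equations (5)-(8)), one-vs-rest.
C = confusionmat(trueLabels, predLabels);   % rows: true classes, columns: predicted classes
for k = 1:size(C,1)
    TP = C(k,k);
    FP = sum(C(:,k)) - TP;                  % predicted as class k but actually another class
    FN = sum(C(k,:)) - TP;                  % class k instances predicted as something else
    TN = sum(C(:)) - TP - FP - FN;
    accuracy(k)  = (TP + TN) / (TP + TN + FP + FN);
    precision(k) = TP / (TP + FP);
    recall(k)    = TP / (TP + FN);
    f1(k)        = 2 * precision(k) * recall(k) / (precision(k) + recall(k));
end
```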
Prediction: the CNN predicts facial expressions by detecting facial features, analyzing their arrangement, and recognizing patterns associated with different emotions. It utilizes cascading classifiers for efficient face detection and subsequent analysis to accurately identify and classify expressions based on these features.
Load class recording and pre-processing: in this phase, a 72 min class recording is loaded, followed by the pre-processing of the recording. Samples are taken at 5 s intervals, focusing on a particular student. The images are resized to 48 × 48 × 1, suitable for the model. This process results in a collection of images of the student captured throughout the lesson.
Face recognition using the Viola–Jones algorithm: immediately after pre-processing, the Viola–Jones algorithm is employed to recognize the student’s face in each sampled image. The identified faces are saved, creating a dataset of the student’s images taken at 5 s intervals throughout the lesson.
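A minimal sketch of this sampling-and-detection pipeline is given below; the recording file name is a placeholder, and taking the first detected face per frame is a simplifying assumption.

```matlab
% Sample one frame every 5 s from the recording, detect the face with the
% Viola-Jones cascade, and crop/resize it to the 48 x 48 grayscale model input.
v        = VideoReader('lesson_recording.mp4');   % hypothetical class recording
detector = vision.CascadeObjectDetector();
faces    = {};
for t = 0:5:v.Duration - 1
    v.CurrentTime = t;
    frame = readFrame(v);
    bbox  = detector(frame);
    if ~isempty(bbox)
        crop = imcrop(frame, bbox(1,:));                  % first detected face only
        faces{end+1} = imresize(im2gray(crop), [48 48]);  %#ok<SAGROW>
    end
end
```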
Classification and conversion to level of understanding: it is well-established that positive emotions, such as engagement and interest, are indicators of active participation in the classroom. To further explore this, an experiment was conducted where students were asked to raise their hands whenever they did not understand the material, while simultaneously scanning their emotional responses in real time. After the lecture, a comprehension assessment was conducted during a post-lecture feedback session, where students answered questions to gauge their understanding of the content. By correlating the emotional data gathered during the lecture with the comprehension levels obtained from the post-lecture session, a more accurate model was developed to link emotional states with real-time understanding. To validate the mapping presented in
Table 1, a multi-layered approach was employed. Real-time emotional responses were continuously monitored throughout the lecture, and at the end of each session, comprehension assessments were used to verify the correlation between emotional states and comprehension levels. These findings were further reinforced by cross-referencing them with traditional statistical methods and comparing the results to existing statistical approaches, ensuring the robustness and accuracy of the model. By integrating emotional responses and comprehension metrics, this study provides a novel contribution to the field of emotion recognition, particularly in educational contexts, offering valuable insights into how emotional states can reliably reflect real-time understanding [
42,43,44].
In this phase, the students’ photos, sampled from the lesson recording, are classified into one of the seven emotions recognized by the model. After identifying the emotion using the model, the detected emotion is then mapped to the students’ level of understanding at that specific point in time during the lesson, as shown in
Table 2:
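For illustration only, such a mapping could be coded as a lookup table; the emotion-to-level pairs below are placeholders rather than the mapping of Table 2, net denotes a hypothetical trained network, and faces refers to the cropped face images from the sampling sketch above.

```matlab
% Placeholder mapping from detected emotion to a level of understanding.
% The actual mapping used in this study is given in Table 2.
levelOf = containers.Map( ...
    {'Happiness','Neutrality','Surprise','Sadness','Fear','Anger','Disgust'}, ...
    {'high',     'high',      'medium',  'low',    'low', 'low',  'low'});
detected      = classify(net, faces{1});     % CNN prediction for one sampled face
understanding = levelOf(char(detected));     % mapped level at that time point
```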
In the analysis of student engagement and understanding during a class session, several crucial steps are undertaken to ensure accurate and meaningful insights. This involves the classification of emotions, calculation of comprehension rates, and subsequent analysis. Here is a detailed breakdown of these steps:
Calculation of the Average Comprehension Rate: this phase involves averaging the level of comprehension among all sampled students throughout the lesson. Initially, the comprehension level of each student is calculated separately. These individual comprehension levels are then averaged to determine the overall class comprehension level during the lesson. In this project, tests and analyses were conducted on five different students.
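A minimal sketch of this averaging, assuming a placeholder numeric scale for the understanding levels and a hypothetical studentLevels cell array holding each student's sampled levels:

```matlab
% Average comprehension per student and across the class (placeholder scale).
score      = containers.Map({'high','medium','low'}, {1.0, 0.5, 0.0});  % illustrative values
% studentLevels: hypothetical cell array, one cell of level labels per student.
perStudent = cellfun(@(lv) mean(cellfun(@(s) score(s), lv)), studentLevels);
classAvg   = mean(perStudent);             % overall class comprehension for the lesson
```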
Plotting Results: at this stage, the results are presented, providing a visual or tabular representation of the data collected and analyzed.
Results Analysis: finally, a thorough analysis of the results is performed. This involves interpreting the data to draw meaningful conclusions regarding the students’ comprehension and engagement levels throughout the lesson. The detailed findings are presented subsequently.
Figure 5a,b illustrates the CNN model integrated with the Viola–Jones algorithm.
Figure 5a describes the size of the feature map of each convolutional layer, which is highlighted by the frame colors in
Figure 5b. As shown in
Figure 5b, the CNN model utilizes an input image size of 48 × 48 × 1 and employs the ReLU activation function for the two-dimensional convolutions. After each group of layers, a max-pooling layer is used to select the maximum value from the preceding layer, thereby preserving prominent image features. Additionally, the filter in each convolution is chosen to be 3 × 3, as recommended in several articles. The model consists of several distinct components. The first part is indicated in blue and consists of 3 two-dimensional convolutional layers with a size of 64 × 64 and a 3 × 3 filter. The second part is marked in purple and includes 3 two-dimensional convolutional layers with a size of 128 × 128 and a 3 × 3 filter. The third part is highlighted in red and contains 4 two-dimensional convolutional layers with a size of 256 × 256 and a 3 × 3 filter. The fourth part is in light blue and consists of 4 two-dimensional convolutional layers with a size of 512 × 512 and a 3 × 3 filter. The fifth part is in green and comprises 4 two-dimensional convolutional layers with a size of 1024 × 1024 and a 3 × 3 filter.
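A layer stack consistent with this block structure can be sketched as follows; the padding, pooling, and dropout settings are assumptions, and the snippet is not the exact trained model.

```matlab
% Sketch of a block-structured CNN matching the description of Figure 5:
% five blocks of 3x3 convolutions (64/128/256/512/1024 filters, 3/3/4/4/4 layers),
% each followed by max pooling and dropout, ending in a seven-class Softmax classifier.
filt = [64 128 256 512 1024];   % filters per block
reps = [ 3   3   4   4    4];   % conv layers per block
layers = imageInputLayer([48 48 1]);
for b = 1:numel(filt)
    for c = 1:reps(b)
        layers = [layers; convolution2dLayer(3, filt(b), 'Padding','same'); reluLayer]; %#ok<AGROW>
    end
    layers = [layers; maxPooling2dLayer(2, 'Stride', 2); dropoutLayer(0.25)]; %#ok<AGROW>
end
layers = [layers; fullyConnectedLayer(7); softmaxLayer; classificationLayer];
```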
3. Results
The learning model's evaluation consists of three main parts: training, validation, and testing. The training set, which constitutes 80% of the combined training-and-validation portion of the dataset, enables the model to learn how to identify seven emotions; it achieved a success rate of 92%.
Following training, the validation set, comprising the remaining 20% of that portion, evaluates the model's learning progress. Validation was conducted every 50 training cycles and achieved a success rate of 65%.
After several iterations, the accuracy levels stabilize. The training set stabilizes at 92% accuracy, while the validation set stabilizes at 65% accuracy. This pattern suggests that the model is learning effectively from the training data but struggles to generalize to new, unseen data, indicating a potential overfitting issue.
The discrepancy between the 65% validation accuracy and 83% testing accuracy can be attributed to factors such as hyperparameter adjustments, dataset characteristics, and model initialization.
For the test set, the CNN model achieved 83% accuracy.
Figure 6 demonstrates that the model’s success and accuracy improve significantly with a higher frequency of iterations. In the initial stages of running the model, there is a noticeable sharp increase in the accuracy levels for both the training and validation sets.
To determine the optimal number of convolutional layers that achieved the highest accuracy in our CNN model, an accuracy optimization study was conducted. This optimization is illustrated in
Figure 7. It can be observed from this figure that the highest accuracy of 83% is achieved with 17 convolutional layers, with the model requiring 60:34 h for training.
In
Table 3, the confusion metrics are presented, detailing the breakdown of true positives, false positives, true negatives, and false negatives for each emotion, representing the test segment.
The confusion metrics in
Table 3 indicate that the model performs well in detecting the Neutral emotion, as shown by the high number of true positives, reflecting strong accuracy in this category. For emotions like Happy, Surprise, and Angry, the model demonstrates a balanced performance between true positives and true negatives, indicating reliable classification for these emotions. However, improvement is needed in detecting Sad and Disgust, where higher false negative counts suggest difficulties in correctly identifying these emotions.
Table 4 shows the model evaluation segmented by emotions, including accuracy, precision, recall, and F1-score for each emotion in the test segment.
The table demonstrates strong performance across all metrics, with accuracy ranging from 0.7618 for Happy to 0.83 for several emotions, and recall spanning from 0.842 for Happy to 0.9167 for Fear. Precision values are consistently high, ranging from 0.874 to 0.877, and F1-scores show a robust range from 0.858 to 0.896. Although Happy and Surprise show slightly lower values in comparison to other emotions, the overall performance remains strong. Even in the lowest cases, such as the accuracy of 0.7618 for Happy, the model maintains a high level of reliability. These results indicate that the model is highly effective in detecting and classifying emotions, with consistent precision and recall across different emotional states.
The objective of this research is to employ facial recognition technology to analyze and process emotions, categorizing these emotions based on the students’ levels of understanding at specific time intervals.
The study was conducted over a 72 min lesson, during which the facial expressions of 45 students were recorded. Samples were taken every five seconds for each student. By applying the CNN model, both the emotions and comprehension levels of the students were detected. For example, the results for two students demonstrate real-time detection of comprehension during the lecture.
The computational costs of training the CNN model were a key consideration in the development of this system. The training process, conducted on an HP Z4 Rack G5 Workstation with an Intel Xeon W-2245 processor and NVIDIA Quadro P5000 GPU, initially required approximately 85 h to complete for 60 epochs. To optimize this process, an additional NVIDIA Tesla V100 GPU was introduced, which reduced the total training time to around 60 h. The model processes approximately 200 images per second, and compute unified device architecture (CUDA) was utilized to distribute the workload efficiently across both GPUs. This optimized setup ensures the model can handle real-time emotion detection and comprehension analysis in a classroom setting while maintaining computational efficiency.
Figure 8a,b illustrates the comprehension level over time, including the average comprehension levels of student A and student B as detected by the model. In this figure, the blue line depicts the changes in comprehension level, while the red line represents the students' average understanding of the material taught during the lesson.
Figure 9a,b shows a histogram summarizing the frequency of specific emotions observed throughout the lesson of student A and student B, categorized according to the levels of understanding.
Figure 10a,b shows two sample student images with the corresponding face detections, each cropped to a 48 × 48 gray-scale box using the Viola–Jones algorithm.
To evaluate the effectiveness of the CNN-based deep learning system, its results were compared with student feedback collected at the end of the 72 min lesson. The feedback asked students to rate their level of understanding.
Figure 11 shows that the average results from the CNN deep learning system (blue line) align closely with the student feedback results (red line), with an average accuracy of 91.7%. Notably, the error margin between individual student feedback (red circles) and the system’s predictions (blue circles) varies from 0% to 16%.