1. Introduction
Lecturers have a significant impact on student learning and success [1], and there is clear variation in how effectively lecturers foster positive educational outcomes [2]. Feedback is the primary factor influencing achievement, but its effects vary significantly, highlighting the intricate nature of optimizing the advantages derived from feedback [3]. Feedback mechanisms in higher education are frequently misunderstood, challenging to execute proficiently, and often fall short of their intended goal of substantially influencing student learning [4].
Strong emotion frequently serves as a catalyst for achieving educational milestones. A student who is filled with enthusiasm is more likely to learn successfully than one who is disinterested, because enthusiasm fosters a readiness that motivates the body and mind to engage in learning and scrutinize information. Consequently, educational approaches should align with the emotional state of the student.
Numerous methods exist for identifying emotions. Among the most widely used is the capture of facial characteristics [5]. In physical classrooms, closed-circuit television systems can record these facial features, while in online learning, webcams serve the same purpose. After acquisition, the images are processed further to produce the analysis results.
The Viola–Jones algorithm [6,7] is a widely acknowledged and extensively used method for real-time face detection in image processing [8,9,10,11]. Devised by Paul Viola and Michael Jones, the algorithm identifies faces within digital images swiftly and effectively [12]. Its functionality hinges on a cascade of classifiers trained on a diverse array of positive and negative samples. Its efficacy spans a spectrum of applications, from real-time video analysis to facial recognition systems, rendering it a fundamental pillar in the domain of computer vision.
In recent years, the advent of online learning platforms has transformed the educational landscape, necessitating innovative approaches to monitor and enhance student engagement. Recognizing students' facial expressions in real time during virtual classes offers a promising avenue for assessing their emotional and cognitive states, which are crucial indicators of their learning experiences. This research aims to develop and implement a robust system for facial detection in a Zoom-based online learning environment using the Viola–Jones algorithm and to analyze students' facial expressions using convolutional neural networks.
Another central advancement in this context is convolutional neural networks (CNNs), a class of deep learning algorithms specifically designed for processing structured grid data, such as images [
13]. CNNs have revolutionized various fields by automating feature extraction and classification processes, which are crucial for analyzing visual data. The architecture of CNNs is inspired by the biological visual cortex and consists of several key components: convolutional layers, pooling layers, and fully connected layers. Convolutional layers apply filters to input data, detecting essential features like edges, textures, and patterns [
14,15,16,17]. These layers are followed by pooling layers, which reduce the dimensionality of the data while preserving critical information, thereby minimizing computational complexity. Finally, fully connected layers integrate these features to perform tasks such as classification or regression.
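To make the roles of these layers concrete, the following is a minimal sketch of such a stack in MATLAB's Deep Learning Toolbox; the filter counts and sizes are illustrative only and are not the architecture trained in this study.

```matlab
% Minimal illustrative CNN stack (MATLAB Deep Learning Toolbox).
% Layer sizes are arbitrary examples, not the network used in this study.
layers = [
    imageInputLayer([48 48 1])                    % grayscale input image
    convolution2dLayer(3, 16, 'Padding','same')   % 3x3 filters detect edges/textures
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)             % downsample, keep dominant activations
    convolution2dLayer(3, 32, 'Padding','same')   % deeper layer captures larger patterns
    reluLayer
    maxPooling2dLayer(2, 'Stride', 2)
    fullyConnectedLayer(7)                        % e.g., one output per emotion class
    softmaxLayer
    classificationLayer];
```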
One of the primary strengths of CNNs lies in their ability to learn hierarchical representations. Lower layers capture basic features, while higher layers identify more complex patterns and objects, enhancing the model’s ability to generalize across various tasks. This hierarchical learning, combined with techniques such as weight sharing and local connectivity, significantly improves CNNs’ efficiency and effectiveness. The robust performance and adaptability of CNNs have made them foundational to numerous advanced applications, including image recognition, object detection, and facial recognition.
In the context of educational technology, CNNs can be employed to analyze students’ facial expressions and body language to infer their engagement and comprehension levels. Such applications are instrumental in providing real-time feedback to educators, enabling them to tailor their teaching strategies to meet the needs of their students better. The educational potential of recognizing students’ facial expressions in online learning platforms like Zoom is substantial, offering notable benefits for both educators and learners. By monitoring facial expressions, educators can gain real-time insights into student engagement levels, allowing for timely interventions that maintain focus and participation. This capability supports personalized learning, enabling instructors to adapt teaching methods to meet individual needs, thereby enhancing learning effectiveness and student satisfaction.
The choice of the Viola–Jones algorithm and CNNs for this research was motivated by their simplicity, computational efficiency, and real-time processing capabilities, which align well with the current capabilities of photonic hardware [18,19,20]. The Viola–Jones algorithm, being less resource-intensive, is well-suited for photonic neural networks utilizing multimode fibers and supports real-time face detection, crucial for immediate results. Starting with simpler algorithms provides a robust foundation for future advancements, enabling the implementation of more complex algorithms like You Only Look Once (YOLO), CenterFace, and CornerNet as photonic technology evolves. Both algorithms have demonstrated proven effectiveness in various applications, particularly in face detection and image recognition.
By leveraging advanced deep learning algorithms and computer vision techniques, this system seeks to provide educators with valuable insights into students' engagement levels, comprehension, and emotional well-being. The ultimate goal is to create a more responsive and personalized educational experience that adapts to the needs of each student, thereby improving overall learning outcomes in remote education. The Viola–Jones algorithm, employed here for real-time face detection, is a prevalent object detection technique in computer vision. Its widespread adoption stems from its efficacy in object identification, particularly in discerning facial features within images and videos. Renowned for its rapid detection capabilities and modest computational demands, the algorithm is well suited to real-time operation, and its effectiveness has led to its use in diverse applications such as facial recognition systems, object tracking, and beyond.
Furthermore, real-time analysis provides immediate feedback on teaching efficacy, prompting adjustments to improve comprehension and retention. Recognizing emotional states also addresses the crucial aspect of student well-being, allowing educators to provide necessary support and referrals. Data collected over time offer valuable insights into engagement trends, informing curriculum development and teaching strategies. This research is particularly relevant for optimizing remote learning, making it more interactive and responsive, and bridging the gap between traditional and digital classrooms. Additionally, in hybrid learning models, the system ensures equal attention and support for remote students, promoting inclusivity. Lastly, insights from facial expression analysis can enhance teacher training, improving their ability to manage classroom dynamics effectively, even virtually. Thus, this research has the potential to significantly enhance the quality and effectiveness of online education.
In this research, an advanced deep learning system is proposed to enhance the real-time assessment of students' understanding during lessons. This system leverages Viola–Jones object detection for face detection, combined with a CNN that analyzes facial expressions to continuously evaluate and monitor students' comprehension levels. By implementing this innovative approach, the system provides real-time support to educators, enabling them to adapt their teaching strategies dynamically to improve overall classroom understanding. This real-time feedback mechanism facilitates immediate instructional adjustments, thereby fostering a more responsive and effective learning environment. The integration of Viola–Jones techniques with CNNs represents a significant advancement in educational technology, offering a robust tool for enhancing pedagogical efficacy and student engagement.
2. Methods and Materials
In this project, the general data protection regulation (GDPR) is strictly adhered to in the handling of participant data. Informed consent is obtained from all participants, ensuring that consent is explicit, freely given, specific, and informed. Data collection is conducted in accordance with the principles of purpose specification and retention, meaning that personal data are collected only for specified, legitimate purposes and are not processed further for incompatible purposes. Data minimization is applied, ensuring that only the essential data required for the study are collected. Participants are afforded data subject rights, which include the ability to view, correct, or delete their data, as well as the “right to be forgotten”. Strict data security measures are in place to prevent unauthorized access or accidental loss of data. Additionally, when sensitive data are involved and large-scale data processing occurs, a data protection officer (DPO) is appointed to oversee compliance. These safeguards ensure that the system complies with GDPR regulations and can be integrated into the European educational system while maintaining participant privacy.
The inquiry into the functional role of emotions spans centuries, yet contemporary advancements in behavioral and neurological research have facilitated a more nuanced understanding of this fundamental question [
21]. Emotions play a crucial role in how achievement and goals are pursued and accomplished. Academic emotions constitute a multifaceted spectrum of affective experiences within educational contexts, influenced by a complex interplay of situational determinants and individual psychological factors [
22]. These emotions manifest across various academic domains, including lecture attendance, examination scenarios, solitary and collaborative study sessions, and scholarly discourse participation. The genesis of these emotions is twofold, arising from both intrinsic task engagement and extrinsic outcome anticipation, such as performance evaluations. A taxonomic approach to academic emotions typically employs a dual-axis classification system, distinguishing between valence (positive versus negative) and situational context (task-related versus social).
Exemplifying this classification, positive-activating emotions (e.g., enjoyment and hope) stand in contrast to negative-activating emotions (e.g., anxiety and anger), each exerting distinct influences on academic engagement and subsequent outcomes. Empirical investigations have elucidated the heterogeneity of emotional responses among learners, with notable subject-specific variations suggesting the necessity for domain-tailored interventional strategies. Methodologically, the assessment of academic emotions encompasses a diverse array of approaches, including psychometric surveys, qualitative interviews, and physiological measurements. Theoretical frameworks, such as the control-value theory, provide conceptual scaffolding for understanding the antecedents and consequences of emotions in learning contexts, thereby informing the development of interventions aimed at optimizing academic achievement through efficacious emotional regulation strategies [
23]. The affective dimensions of students’ educational experiences and academic outcomes are profoundly influenced by the perceived significance of learning activities and the contextual factors in which these occur. Emotions, which can be systematically categorized along dimensions of valence (positive or negative) and their relationship to task-oriented or social–interactive contexts, play a pivotal role in this process.
Emotions also serve an adaptive function by facilitating situational responses and mediating between reflexive and higher-order cognitive processes. Within educational environments, they exert a significant influence on mnemonic processes, cognitive strategy selection, attentional allocation, and motivational orientations.
Furthermore, the emotional states experienced during learning processes have been demonstrated to enhance memory consolidation and exert a considerable impact on academic performance metrics. Positive affective states are associated with enhanced divergent thinking and creative problem-solving capabilities. Conversely, negative emotional states tend to promote analytical processing modes but may concurrently deplete cognitive resources, potentially compromising performance on complex cognitive tasks [
24].
The Viola–Jones algorithm and CNN were chosen for this study due to their specific advantages in the context of real-time facial expression recognition using photonic hardware, despite the availability of more advanced algorithms. The Viola–Jones algorithm is renowned for its simplicity and efficiency in face detection, utilizing Haar-like features and a cascade of classifiers. This design results in reduced computational intensity, making it ideal for the initial face detection stage in an optical neural network setup, which emphasizes low-volume 3D connectivity and large bandwidth.
While more advanced algorithms like YOLO, CenterFace, and CornerNet offer superior results, they often require significantly higher computational resources, which can be challenging to implement effectively on photonic hardware currently in the nascent stages of managing complex deep learning models. The straightforward implementation of the Viola–Jones algorithm is better suited to the current capabilities of photonic neural networks, promoting more reliable and efficient performance.
Similarly, CNNs are chosen for their exceptional ability to process and analyze visual data. Inspired by the human visual cortex, CNNs excel in automatically extracting hierarchical features from images, enabling precise classification and recognition of facial expressions. Their deep learning nature ensures that complex patterns and nuances in facial expressions are effectively captured and analyzed.
Photonic hardware offers several significant advantages over traditional electronic hardware, making it a promising alternative for implementing neural networks and other computational tasks. High-speed data processing is one of the primary benefits, as photonic systems can process data at the speed of light, significantly increasing computational speed. This is particularly advantageous for real-time applications requiring fast data processing, such as facial recognition and neural network operations. Furthermore, photonic hardware inherently supports parallel processing, allowing multiple optical signals to be processed simultaneously without interference. This capability enables efficient handling of massive data sets and complex computations in real time.
Another advantage of photonic hardware is its low heat production. Unlike electronic components that generate significant heat during operation, photonic devices produce minimal heat, reducing the need for extensive cooling systems and improving energy efficiency and reliability in high-performance computing environments. Additionally, photonic systems offer significantly larger bandwidth compared to electronic systems, facilitating the transmission and processing of large volumes of data at high speeds. This makes photonic hardware ideal for applications requiring high data throughput.
Photonic systems also experience minimal latency due to the high speed of light transmission. This is crucial for applications where quick response times are essential, such as telecommunications and real-time data analysis. Moreover, photonic hardware can be scaled up to accommodate increasing computational demands. The use of optical fibers and other advanced photonic structures enables the creation of highly scalable systems capable of handling complex and large-scale computations.
By leveraging these advantages, photonic hardware presents a promising alternative to traditional electronic systems, offering improved performance, efficiency, and scalability for a wide range of applications, including neural networks, telecommunications, and high-performance computing. These benefits make photonic hardware an ideal choice for integrating the Viola–Jones algorithm and CNNs for efficient and reliable real-time facial expression recognition. Utilizing simpler, yet highly effective algorithms like Viola–Jones and CNNs ensures that the system can operate efficiently within the current technological constraints while providing a robust foundation for future advancements in photonic computing and more complex algorithm integration. In this study, an advanced computational system was developed and implemented using MATLAB 23.2 software, integrating CNNs with the Viola–Jones algorithm to detect and analyze student emotional states in an academic environment, as illustrated in
Figure 1.
The methodological framework encompassed several critical phases: image preprocessing, convolutional layer processing, and emotion classification, with a specific focus on differentiating between negative and positive comprehension levels.
The study was conducted during a 72 min instructional session involving 45 students. To enhance the system’s accuracy and robustness, a substantial dataset comprising 183,624 training images was employed. The image preprocessing phase involved the normalization of input images to ensure consistent scale and format across the dataset. Subsequently, the convolutional layers extracted salient features from these preprocessed images, enabling the CNN to effectively learn and identify patterns associated with various emotional states.
The final classification stage utilized these learned features to categorize the emotions into predefined classes indicative of students’ comprehension levels. To assess the system’s efficacy, a comparative analysis was performed between the emotional responses detected by the system and those self-reported by the students through comprehensive questionnaires. This evaluation aimed to validate the system’s accuracy in emotion detection and its potential application in educational settings for real-time assessment of student engagement and understanding.
The Viola–Jones algorithm is highly efficient for real-time face detection due to several key factors. Firstly, it employs an integral image representation that allows for rapid computation of rectangle features, significantly speeding up the feature extraction process. This technique reduces computational complexity, enabling quick image processing and face detection. Secondly, the AdaBoost [25] learning algorithm is used to select a small number of critical features from a larger set, enhancing detection accuracy while maintaining computational efficiency. By combining weak classifiers into a strong classifier, AdaBoost ensures both speed and accuracy. Thirdly, the algorithm's cascaded classifier architecture quickly eliminates non-face regions, focusing computational resources on promising face-like regions. Each stage of the cascade is designed to reject a significant percentage of non-face sub-windows while retaining most face sub-windows, ensuring efficient handling of large images and video frames.
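As an illustration of cascade-based detection, the following MATLAB sketch uses the Computer Vision Toolbox's vision.CascadeObjectDetector, which implements a Viola–Jones-style cascade; the image file name is a placeholder, and the snippet is not the exact pipeline used in this study.

```matlab
% Illustrative Viola-Jones face detection (MATLAB Computer Vision Toolbox).
detector  = vision.CascadeObjectDetector();        % default frontal-face cascade
frame     = imread('classroom_frame.png');         % hypothetical input image
bboxes    = detector(frame);                       % one [x y width height] row per face
annotated = insertShape(frame, 'Rectangle', bboxes, 'LineWidth', 3);
imshow(annotated);                                 % visualize the detected faces
```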
In the context of machine and deep learning, the term "weight" refers to a numerical value assigned to each training example that reflects its importance or influence on the learning process during each iteration. Initially, every training example is assigned an equal weight, meaning each example is considered equally important. Weights are then updated based on whether the examples are correctly or incorrectly classified by the current weak classifier. The goal is to give more focus (higher weight) to the misclassified examples so that subsequent classifiers pay more attention to these harder examples.
This process helps the algorithm improve its performance on difficult examples: the weights of misclassified examples are increased, while the weights of correctly classified examples are decreased. The updated weights are then normalized so that they sum to 1, ensuring they remain a valid probability distribution.
The importance of the weights is twofold. First, they govern how weak classifiers are combined: each weak classifier's contribution to the final strong classifier is weighted by its accuracy, which is determined by the weights of the examples it correctly classifies. Second, they focus learning on difficult examples: by increasing the weights of misclassified examples, AdaBoost forces subsequent classifiers to concentrate on the harder cases, thereby improving overall accuracy.
The AdaBoost algorithm iteratively combines weak classifiers to create a strong classifier. Each weak classifier focuses on different aspects of the data, adjusting its importance based on the errors of its predecessors. By sequentially adjusting the weights of misclassified instances, AdaBoost effectively emphasizes challenging examples, refining its predictions with each iteration.
This adaptive boosting technique has demonstrated robustness across various domains, achieving notable success in tasks such as face detection and object recognition. AdaBoost’s ability to improve classification accuracy through its iterative learning process makes it a fundamental tool in modern machine learning applications.
Weak classifiers have limited predictive power, performing slightly better than random guessing. In contrast, strong classifiers are highly accurate, reliably predicting class labels. Strong classifiers are typically composed of multiple weak classifiers, combined using ensemble methods such as AdaBoost, which improves overall prediction performance significantly.
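The weight-update logic described above can be sketched as follows. This is a generic AdaBoost outline in which trainStump and applyStump are hypothetical helper functions standing in for the Haar-feature weak learners; it is not the implementation used in this work.

```matlab
% Schematic AdaBoost weight update.
% Assumed given: X (features), y (labels in {-1,+1}), M (number of boosting rounds),
% plus hypothetical weak-learner helpers trainStump / applyStump.
n = numel(y);
w = ones(n,1) / n;                               % start with equal weights
for m = 1:M
    stump    = trainStump(X, y, w);              % weak learner fitted to weighted data
    pred     = applyStump(stump, X);             % predictions in {-1,+1}
    miss     = (pred ~= y);                      % misclassified samples ("misses")
    err      = sum(w(miss)) / sum(w);            % weighted error of this weak classifier
    alpha(m) = 0.5 * log((1 - err) / err);       % classifier weight: accurate stumps count more
    w        = w .* exp(-alpha(m) .* y .* pred); % raise weights of misses, lower the rest
    w        = w / sum(w);                       % renormalize so the weights sum to 1
end
```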
The functionality of the Viola–Jones algorithm relies on the AdaBoost algorithm to create a cascaded series of rectangular-feature classifiers [26]. Central to its operation is the integral image, illustrated in Figure 2, which allows the rectangle features to be recomputed rapidly each time the detection window is shifted.
The integral image is obtained by accumulating the pixel values above and to the left of each coordinate (x, y); its mathematical formulation is given by Equation (1):

II(x, y) = Σ_{x′ ≤ x, y′ ≤ y} I(x′, y′),   (1)

where I(x′, y′) denotes the pixel intensity at coordinate (x′, y′). The normalized weight of each training sample is determined by Equation (2), where W denotes the maximum weight. In the context of AdaBoost, a "miss" refers to an instance where the weak classifier incorrectly predicts the label of a training sample; for example, if the true label is positive (+1) but the classifier predicts it as negative (−1), it is considered a miss. AdaBoost increases the weights of these misclassified samples, making them more significant in the next iteration. This adjustment directs the algorithm's focus toward harder-to-classify examples, improving the overall performance of the classifier through iterative refinement. To delineate the facial region, the Viola–Jones algorithm applies AdaBoost classifiers arranged serially as a cascade of m filters applied in sequence. At each stage, the current weak classifier is discarded while the classifier weights are retained, and the stage threshold is computed using Equation (3).
The threshold updated in iteration m + 1 is compared with the weight of each sub-window: if the sub-window weight is lower than the threshold, the sub-window is discarded and not classified as a face; if it is greater than the threshold, the sub-window is classified as a face.
Using the Viola–Jones algorithm for face detection involves calculating the integral image and performing classification, as depicted in Figure 2. The total weight is defined in Equation (4).
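For illustration, the integral image of Equation (1) can be computed with two cumulative sums; the file name below is a placeholder, and the rectangle-sum line assumes interior indices greater than 1.

```matlab
% Integral image of a grayscale frame via cumulative sums (cf. Equation (1)).
% The Image Processing Toolbox function integralImage gives an equivalent (zero-padded) result.
I  = im2double(im2gray(imread('classroom_frame.png')));  % hypothetical input frame
II = cumsum(cumsum(I, 1), 2);        % II(x,y) = sum of pixels above and to the left of (x,y)

% Sum of an example rectangle spanning rows r1..r2 and columns c1..c2 (r1, c1 > 1),
% recovered from only four array references:
r1 = 10; r2 = 20; c1 = 10; c2 = 20;
rectSum = II(r2,c2) - II(r1-1,c2) - II(r2,c1-1) + II(r1-1,c1-1);
```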
In this research, a CNN model is used for facial expression detection. For N × N input images, convolution is performed with an f × f filter, producing an (N − f + 1) × (N − f + 1) feature map when applied with unit stride and no padding. Through this process, the network learns to recognize and emphasize important features.
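As a quick check of the dimensions involved, the following sketch convolves a 48 × 48 stand-in image with a 3 × 3 filter in 'valid' mode, yielding the expected 46 × 46 feature map.

```matlab
% Valid convolution of an N x N image with an f x f filter yields an
% (N - f + 1) x (N - f + 1) feature map; e.g., 48 x 48 with a 3 x 3 filter -> 46 x 46.
img    = rand(48, 48);            % stand-in for a 48 x 48 grayscale input
kernel = rand(3, 3);              % stand-in for a learned 3 x 3 filter
fmap   = conv2(img, kernel, 'valid');
size(fmap)                        % returns [46 46]
```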
Figure 3a,b demonstrates the feasibility of multimode fibers for optical neural networks.
Figure 3a illustrates a neural network with four layers, showcasing how optical fibers can be leveraged to achieve efficient data processing with minimal heat generation compared to electronic implementations.
Figure 3b presents a conceptual design for a neural network embedded within multimode fibers. This design also includes four layers, highlighting the scalability and potential complexity achievable with this approach. In this network, neurons and synapses are represented by individual silica cores in a multi-core fiber. Optical signals transfer transversely between these cores through optical coupling, while pump-driven amplification in erbium-doped cores mimics synaptic interactions.
This unique photonic CNN architecture showcases the potential for integrating optical technologies into CNN designs. The structure utilizes optical fiber components such as combiners [27,28], splitters [29,30], and erbium-doped fiber amplifiers, as demonstrated in previous research [31], to process information using light instead of traditional electronic signals. This innovative approach significantly enhances both speed and energy efficiency, making it possible for the CNN to handle complex data patterns with improved performance.
While our study primarily addresses digital computation, this architecture holds significant potential for future real-time neural detection applications, such as emotion recognition [
32]. By leveraging optical-based components, the system offers a scalable and energy-efficient solution for high-performance neural processing. This approach promises faster and more efficient data handling, making it particularly suited for advanced neural detection tasks in fields like cognitive science and brain–computer interfaces.
The max-pooling layer is used to reduce the spatial dimensions of input feature maps, thereby decreasing computational complexity and promoting translational invariance. It operates by dividing the input into regions and selecting the maximum value from each region. For instance, a 3 × 3 max-pooling window applied with a stride of 3 reduces an n × n input to an (n/3) × (n/3) output, retaining the most significant feature in each region.
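A small worked example of this pooling operation, here with a 2 × 2 window and stride 2 for brevity:

```matlab
% 2 x 2 max pooling with stride 2: each output value is the maximum of one
% non-overlapping 2 x 2 region, halving both spatial dimensions.
fmap   = magic(4);                                % 4 x 4 example feature map
pooled = [max(fmap(1:2,1:2),[],'all'), max(fmap(1:2,3:4),[],'all');
          max(fmap(3:4,1:2),[],'all'), max(fmap(3:4,3:4),[],'all')];   % 2 x 2 output
% Inside a network, the same operation is expressed as maxPooling2dLayer(2,'Stride',2).
```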
Evaluation: in this phase, the test set is classified using the previously constructed model, and the accuracy percentage of the model is evaluated.
Prediction: images from the test set are selected to assess the accuracy achieved by the model. During this evaluation, the model’s detected emotion and the true emotion of each image are compared. The results show the probability of correctly predicting the emotion using the model, as illustrated in simulation results.
Additionally, max-pooling and dropout layers are placed between each block to mitigate overfitting. Finally, a dense layer utilizing the Softmax function [33,34,35] is employed to classify the images into their respective classes (emotions).
Figure 4 shows the block diagram of an emotion recognition system and the levels of understanding on a facial expression screen. This figure outlines the various stages involved in the system, from the initial input to the final output, illustrating how facial expressions are analyzed and interpreted to recognize emotions.
Dataset: the selection of an appropriate dataset is paramount in facilitating the optimal training of a model for the identification of emotions predicated on facial expressions. The dataset under consideration encompasses 229,528 images (training and validation only) meticulously categorized into seven distinct emotional states. The delineated emotions within the dataset encompass the spectrum of human affect, specifically including Anger, Disgust, Fear, Happiness, Neutrality, Sadness, and Surprise. The dataset is based on 45 students’ facial expressions during lectures.
Data loading and preprocessing: the selected dataset is loaded, and the image dimensions are standardized to 48 × 48 × 1 to expedite processing. The dataset is split into training and validation sets, with 80% allocated for training and 20% for validation, resulting in 183,624 training images and 45,904 validation images. Following training and validation, a separate test set comprising 56,528 images is prepared for final evaluation. The combined dataset, including training, validation, and test sets, consists of 286,056 images. The training set accounts for 64.19%, the validation set represents 16.05%, and the test set constitutes 19.76% of the total dataset. These percentages clarify the overall distribution across the dataset segments, ensuring a balanced approach to model training, validation, and evaluation. Display Dataset Samples: at this stage, samples are extracted from the training set to illustrate each emotion alongside its corresponding label. The segmentation of the dataset is presented in
Table 1:
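A possible sketch of the loading and splitting step in MATLAB is shown below; the folder name is hypothetical, and the sketch assumes one subfolder per emotion class.

```matlab
% Sketch of dataset loading and the 80%/20% training/validation split described above.
imds = imageDatastore('emotion_dataset', ...          % hypothetical dataset folder
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');  % 80% train / 20% validation
% Standardize every image to the 48 x 48 x 1 grayscale format used by the model.
imdsTrain.ReadFcn = @(f) imresize(im2gray(imread(f)), [48 48]);
imdsVal.ReadFcn   = @(f) imresize(im2gray(imread(f)), [48 48]);
```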
To ensure that the emotions detected on students’ faces were directly related to the lecture content and not influenced by external factors, several steps were implemented in the experimental design. A lecturer delivering their first session to the students was selected, ensuring no prior familiarity that could affect emotional responses. Additionally, students were screened for any pre-existing emotional burdens or external factors that might skew results, and those identified as feeling overwhelmed or carrying emotional baggage were excluded from the sample. After this screening process, 45 participants remained, allowing for accurate system calibration. This approach aimed to isolate detected emotions solely in relation to the educational content, minimizing potential noise from external influences. Emotional feedback sessions were also conducted to confirm that the observed emotions were indeed linked to the learning material. Additionally, the data were labeled based on three key elements: the overall feeling reported by the students, the grade they received on specific questions corresponding to the 5 s sample interval, and the emotion recognized by the system. This multi-layered labeling approach allowed us to create a more detailed correlation between emotional states, comprehension, and performance during the lecture [
36].
In large datasets, individual variations [37,38,39] or spontaneous data [40,41], such as emotions influenced by personal issues, become statistically diluted. Big data analysis emphasizes overall trends rather than isolated instances, effectively reducing the impact of random fluctuations or external emotional factors on final results. In analyzing emotional responses within educational settings, the patterns observed typically reflect general engagement levels, as the volume of data helps balance individual deviations. By examining multiple students across various sessions, the model can reliably identify patterns associated with comprehension, even if some students experience personal influences. Furthermore, correlating facial expressions with post-lecture comprehension assessments validates the accuracy of the emotional data collected. Discrepancies between a student's emotional responses and their comprehension level may indicate moments of mind wandering or temporary disengagement. Overall, leveraging big data enables a reliable and generalizable measure of student engagement, consistent despite occasional individual deviations.
Data pre-processing (augmentation): in this phase, the training data are pre-processed and augmented. Augmentation techniques are valuable for enhancing the performance and outcomes of deep learning models. They provide diverse variations of the training data, thereby making the model more robust against overfitting. By applying various transformations to the training images, the model learns to prioritize the essential features necessary for accurate classification.
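A sketch of such augmentation using MATLAB's imageDataAugmenter is shown below; the specific transformation ranges are assumptions rather than the settings used in this study, and imdsTrain refers to the hypothetical datastore from the loading sketch above.

```matlab
% On-the-fly augmentation: random flips, shifts, and rotations so the network
% sees varied versions of each training image in every epoch.
augmenter = imageDataAugmenter( ...
    'RandXReflection',  true, ...
    'RandRotation',     [-10 10], ...   % degrees
    'RandXTranslation', [-3 3], ...     % pixels
    'RandYTranslation', [-3 3]);
augTrain = augmentedImageDatastore([48 48 1], imdsTrain, 'DataAugmentation', augmenter);
```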
CNN model: the CNN model, combined with the Viola–Jones algorithm, identifies facial expressions by leveraging cascaded classifiers for efficient face detection and convolutional layers for accurate emotion recognition.
Evaluation: in this model, the CNN's performance is evaluated using four key metrics: accuracy, precision, recall, and F1-score. These metrics are derived from four counts: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The first term indicates whether the prediction agrees with the label (true or false), and the second term indicates the predicted value itself (positive or negative). Accuracy captures the proportion of correctly predicted instances, including both true positives and true negatives, out of the total instances, as represented in Equation (5):

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (5)

Precision measures the accuracy of the positive predictions, as represented in Equation (6):

Precision = TP / (TP + FP)   (6)

Recall evaluates the model's ability to identify all relevant instances, as represented in Equation (7):

Recall = TP / (TP + FN)   (7)

The F1-score balances precision and recall into a single metric, providing a comprehensive assessment of the model's overall effectiveness in detecting facial expressions, as represented in Equation (8):

F1-score = 2 × (Precision × Recall) / (Precision + Recall)   (8)
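These four metrics can be computed per emotion from a confusion matrix in a one-vs-rest fashion, as in the following sketch (trueLabels and predLabels are assumed to be categorical vectors of ground-truth and predicted emotions):

```matlab
% Per-class metrics from a confusion matrix (Equations (5)-(8)), one-vs-rest.
C = confusionmat(trueLabels, predLabels);   % rows: true classes, columns: predicted classes
for k = 1:size(C,1)
    TP = C(k,k);
    FP = sum(C(:,k)) - TP;                  % predicted as class k but actually another class
    FN = sum(C(k,:)) - TP;                  % class k instances predicted as something else
    TN = sum(C(:)) - TP - FP - FN;
    accuracy(k)  = (TP + TN) / (TP + TN + FP + FN);
    precision(k) = TP / (TP + FP);
    recall(k)    = TP / (TP + FN);
    f1(k)        = 2 * precision(k) * recall(k) / (precision(k) + recall(k));
end
```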
Prediction: the CNN predicts facial expressions by detecting facial features, analyzing their arrangement, and recognizing patterns associated with different emotions. It utilizes cascading classifiers for efficient face detection and subsequent analysis to accurately identify and classify expressions based on these features.
Load class recording and pre-processing: in this phase, a 72 min class recording is loaded, followed by the pre-processing of the recording. Samples are taken at 5 s intervals, focusing on a particular student. The images are resized to 48 × 48 × 1, suitable for the model. This process results in a collection of images of the student captured throughout the lesson.
Face recognition using the Viola–Jones algorithm: immediately after pre-processing, the Viola–Jones algorithm is employed to recognize the student’s face in each sampled image. The identified faces are saved, creating a dataset of the student’s images taken at 5 s intervals throughout the lesson.
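A minimal sketch of this sampling-and-detection pipeline is given below; the recording file name is a placeholder, and taking the first detected face per frame is a simplifying assumption.

```matlab
% Sample one frame every 5 s from the recording, detect the face with the
% Viola-Jones cascade, and crop/resize it to the 48 x 48 grayscale model input.
v        = VideoReader('lesson_recording.mp4');   % hypothetical class recording
detector = vision.CascadeObjectDetector();
faces    = {};
for t = 0:5:v.Duration - 1
    v.CurrentTime = t;
    frame = readFrame(v);
    bbox  = detector(frame);
    if ~isempty(bbox)
        crop = imcrop(frame, bbox(1,:));                  % first detected face only
        faces{end+1} = imresize(im2gray(crop), [48 48]);  %#ok<SAGROW>
    end
end
```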
Classification and conversion to level of understanding: it is well-established that positive emotions, such as engagement and interest, are indicators of active participation in the classroom. To further explore this, an experiment was conducted where students were asked to raise their hands whenever they did not understand the material, while simultaneously scanning their emotional responses in real time. After the lecture, a comprehension assessment was conducted during a post-lecture feedback session, where students answered questions to gauge their understanding of the content. By correlating the emotional data gathered during the lecture with the comprehension levels obtained from the post-lecture session, a more accurate model was developed to link emotional states with real-time understanding. To validate the mapping presented in
Table 1, a multi-layered approach was employed. Real-time emotional responses were continuously monitored throughout the lecture, and at the end of each session, comprehension assessments were used to verify the correlation between emotional states and comprehension levels. These findings were further reinforced by cross-referencing them with traditional statistical methods and comparing the results to existing statistical approaches, ensuring the robustness and accuracy of the model. By integrating emotional responses and comprehension metrics, this study provides a novel contribution to the field of emotion recognition, particularly in educational contexts, offering valuable insights into how emotional states can reliably reflect real-time understanding [
42,43,44].
In this phase, the students’ photos, sampled from the lesson recording, are classified into one of the seven emotions recognized by the model. After identifying the emotion using the model, the detected emotion is then mapped to the students’ level of understanding at that specific point in time during the lesson, as shown in
Table 2:
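For illustration only, such a mapping could be coded as a lookup table; the emotion-to-level pairs below are placeholders rather than the mapping of Table 2, net denotes a hypothetical trained network, and faces refers to the cropped face images from the sampling sketch above.

```matlab
% Placeholder mapping from detected emotion to a level of understanding.
% The actual mapping used in this study is given in Table 2.
levelOf = containers.Map( ...
    {'Happiness','Neutrality','Surprise','Sadness','Fear','Anger','Disgust'}, ...
    {'high',     'high',      'medium',  'low',    'low', 'low',  'low'});
detected      = classify(net, faces{1});     % CNN prediction for one sampled face
understanding = levelOf(char(detected));     % mapped level at that time point
```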
In the analysis of student engagement and understanding during a class session, several crucial steps are undertaken to ensure accurate and meaningful insights. This involves the classification of emotions, calculation of comprehension rates, and subsequent analysis. Here is a detailed breakdown of these steps:
Calculation of the Average Comprehension Rate: this phase involves averaging the level of comprehension among all sampled students throughout the lesson. Initially, the comprehension level of each student is calculated separately. These individual comprehension levels are then averaged to determine the overall class comprehension level during the lesson. In this project, tests and analyses were conducted on five different students.
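A minimal sketch of this averaging, assuming a placeholder numeric scale for the understanding levels and a hypothetical studentLevels cell array holding each student's sampled levels:

```matlab
% Average comprehension per student and across the class (placeholder scale).
score      = containers.Map({'high','medium','low'}, {1.0, 0.5, 0.0});  % illustrative values
% studentLevels: hypothetical cell array, one cell of level labels per student.
perStudent = cellfun(@(lv) mean(cellfun(@(s) score(s), lv)), studentLevels);
classAvg   = mean(perStudent);             % overall class comprehension for the lesson
```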
Plotting Results: at this stage, the results are presented, providing a visual or tabular representation of the data collected and analyzed.
Results Analysis: finally, a thorough analysis of the results is performed. This involves interpreting the data to draw meaningful conclusions regarding the students’ comprehension and engagement levels throughout the lesson. The detailed findings are presented subsequently.
Figure 5a,b illustrates the CNN model integrated with the Viola–Jones algorithm.
Figure 5a describes the size of the feature map of each convolutional layer, which is highlighted by the frame colors in
Figure 5b. As shown in
Figure 5b, the CNN model utilizes an input image size of 48 × 48 × 1 and employs the ReLU activation function for the two-dimensional convolutions. After each group of layers, a max-pooling layer is used to select the maximum value from the preceding layer, thereby preserving prominent image features. Additionally, the filter in each convolution is chosen to be 3 × 3, as recommended in several articles. The model consists of several distinct components. The first part is indicated in blue and consists of 3 two-dimensional convolutional layers with a size of 64 × 64 and a 3 × 3 filter. The second part is marked in purple and includes 3 two-dimensional convolutional layers with a size of 128 × 128 and a 3 × 3 filter. The third part is highlighted in red and contains 4 two-dimensional convolutional layers with a size of 256 × 256 and a 3 × 3 filter. The fourth part is in light blue and consists of 4 two-dimensional convolutional layers with a size of 512 × 512 and a 3 × 3 filter. The fifth part is in green and comprises 4 two-dimensional convolutional layers with a size of 1024 × 1024 and a 3 × 3 filter.
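A layer stack consistent with this block structure can be sketched as follows; the padding, pooling, and dropout settings are assumptions, and the snippet is not the exact trained model.

```matlab
% Sketch of a block-structured CNN matching the description of Figure 5:
% five blocks of 3x3 convolutions (64/128/256/512/1024 filters, 3/3/4/4/4 layers),
% each followed by max pooling and dropout, ending in a seven-class Softmax classifier.
filt = [64 128 256 512 1024];   % filters per block
reps = [ 3   3   4   4    4];   % conv layers per block
layers = imageInputLayer([48 48 1]);
for b = 1:numel(filt)
    for c = 1:reps(b)
        layers = [layers; convolution2dLayer(3, filt(b), 'Padding','same'); reluLayer]; %#ok<AGROW>
    end
    layers = [layers; maxPooling2dLayer(2, 'Stride', 2); dropoutLayer(0.25)]; %#ok<AGROW>
end
layers = [layers; fullyConnectedLayer(7); softmaxLayer; classificationLayer];
```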
3. Results
The learning model's evaluation consists of three main parts: training, validation, and testing. The training set, which constitutes 80% of the combined training-and-validation portion of the dataset, enables the model to learn how to identify seven emotions; it achieved a success rate of 92%.
Following training, the validation set, comprising the remaining 20% of that portion, evaluates the model's learning progress. Validation was conducted every 50 training cycles and achieved a success rate of 65%.
After several iterations, the accuracy levels stabilize. The training set stabilizes at 92% accuracy, while the validation set stabilizes at 65% accuracy. This pattern suggests that the model is learning effectively from the training data but struggles to generalize to new, unseen data, indicating a potential overfitting issue.
The discrepancy between the 65% validation accuracy and 83% testing accuracy can be attributed to factors such as hyperparameter adjustments, dataset characteristics, and model initialization.
For the test set, the CNN model achieved 83% accuracy.
Figure 6 demonstrates that the model’s success and accuracy improve significantly with a higher frequency of iterations. In the initial stages of running the model, there is a noticeable sharp increase in the accuracy levels for both the training and validation sets.
To determine the optimal number of convolutional layers that achieved the highest accuracy in our CNN model, an accuracy optimization study was conducted. This optimization is illustrated in
Figure 7. It can be observed from this figure that the highest accuracy of 83% is achieved with 17 convolutional layers, with the model requiring 60:34 h for training.
In
Table 3, the confusion metrics are presented, detailing the breakdown of true positives, false positives, true negatives, and false negatives for each emotion, representing the test segment.
The confusion metrics in
Table 3 indicate that the model performs well in detecting the Neutral emotion, as shown by the high number of true positives, reflecting strong accuracy in this category. For emotions like Happy, Surprise, and Angry, the model demonstrates a balanced performance between true positives and true negatives, indicating reliable classification for these emotions. However, improvement is needed in detecting Sad and Disgust, where higher false negative counts suggest difficulties in correctly identifying these emotions.
Table 4 shows the model evaluation segmented by emotions, including accuracy, precision, recall, and F1-score for each emotion in the test segment.
The table demonstrates strong performance across all metrics, with accuracy ranging from 0.7618 for Happy to 0.83 for several emotions, and recall spanning from 0.842 for Happy to 0.9167 for Fear. Precision values are consistently high, ranging from 0.874 to 0.877, and F1-scores show a robust range from 0.858 to 0.896. Although Happy and Surprise show slightly lower values in comparison to other emotions, the overall performance remains strong. Even in the lowest cases, such as the accuracy of 0.7618 for Happy, the model maintains a high level of reliability. These results indicate that the model is highly effective in detecting and classifying emotions, with consistent precision and recall across different emotional states.
The objective of this research is to employ facial recognition technology to analyze and process emotions, categorizing these emotions based on the students’ levels of understanding at specific time intervals.
The study was conducted over a 72 min lesson, during which the facial expressions of 45 students were recorded. Samples were taken every five seconds for each student. By applying the CNN model, both the emotions and comprehension levels of the students were detected. For example, the results for two students demonstrate real-time detection of comprehension during the lecture.
The computational costs of training the CNN model were a key consideration in the development of this system. The training process, conducted on an HP Z4 Rack G5 Workstation with an Intel Xeon W-2245 processor and NVIDIA Quadro P5000 GPU, initially required approximately 85 h to complete for 60 epochs. To optimize this process, an additional NVIDIA Tesla V100 GPU was introduced, which reduced the total training time to around 60 h. The model processes approximately 200 images per second, and compute unified device architecture (CUDA) was utilized to distribute the workload efficiently across both GPUs. This optimized setup ensures the model can handle real-time emotion detection and comprehension analysis in a classroom setting while maintaining computational efficiency.
Figure 8a,b illustrates the comprehension level over time, including the average comprehension levels of student A and student B as detected by the model. In this figure, the blue line depicts the changes in comprehension level, while the red line represents the students' average understanding of the material taught during the lesson.
Figure 9a,b shows a histogram summarizing the frequency of specific emotions observed throughout the lesson of student A and student B, categorized according to the levels of understanding.
Figure 10a,b shows two sample student images with the corresponding face detections, each cropped to a 48 × 48 gray-scale box using the Viola–Jones algorithm.
To evaluate the effectiveness of the CNN-based deep learning system, its results were compared with student feedback collected at the end of the 72 min lesson. The feedback asked students to rate their level of understanding.
Figure 11 shows that the average results from the CNN deep learning system (blue line) align closely with the student feedback results (red line), with an average accuracy of 91.7%. Notably, the error margin between individual student feedback (red circles) and the system’s predictions (blue circles) varies from 0% to 16%.