1. Introduction
The World Health Organization (WHO) has advised all nations to restrict face-to-face contact to prevent the transmission of COVID-19. This recommendation applies to formerly classroom-based education, which has transformed into online teaching and learning through a virtual learning environment (VLE) such as Moodle [1,2]. The abrupt transition from classroom to online teaching and learning created a sharp rise in the student failure rate caused by behavioral problems in the VLE; for instance, some students struggled to access learning material, submit assignments, and interact during discussions. Assisting students in engaging with the VLE system is one of the critical factors for graduating from online courses [3]. The lack of interaction with classmates and teachers is one of the obstacles in this learning transition [4]. Diminished engagement due to the quick transition to online learning during the pandemic was another challenge [5]. Moreover, student access to course materials during digital education was investigated in [6], and the effectiveness of a distance learning system was assessed using a comprehensive methodology in [7]. Therefore, early prediction of student learning performance and analysis of initial student behavior in the VLE are crucial to reduce the high failure rate in online courses during the pandemic.
The advancement of machine learning technology has received widespread attention for forecasting student performance in the early weeks of the teaching and learning process in the VLE. The VLE records student interaction data such as course views, assignment submissions, resource views, quizzes, and discussions. These interaction data have been employed to develop predictive models for various purposes, including predicting student performance. For instance, a multi-layer perceptron (MLP) architecture was proposed to predict student learning performance using a VLE database [8]. Such an early prediction scheme enables teachers to intervene with at-risk students who have a high probability of failing the online course. Furthermore, a machine learning analytics model can provide teachers with statistical insight to aid teaching and learning.
Nevertheless, existing research on student learning performance prediction focuses only on improving model prediction performance. This narrow goal suffers from three weaknesses: temporal feature challenges, imbalanced data set distribution, and limited model explainability, all of which degrade early prediction of student learning performance. Firstly, the temporal feature challenge refers to week-by-week student learning performance prediction; for instance, the model should predict student learning performance as early as the sixth week of a 16-week teaching and learning process. However, research on student performance prediction is ineffective without a student activity data set captured week by week [8,9]. Secondly, the VLE data set has a highly imbalanced class distribution between the positive and negative classes. An imbalanced class distribution means that some classes have a significantly larger sample size than others; for instance, the number of samples for students who failed the course was less than 3% of those who passed. A model developed on such an imbalanced label distribution may significantly degrade prediction performance [9,10]. Finally, as for model explainability, the model output is only a probability score of a student passing or failing. This probability score is difficult for a teacher to explain to his or her students. The model needs to explain its predictions based on the importance of student engagement features in the VLE, allowing the teacher to intervene based on the explainability results. Nevertheless, prior research on predicting student learning performance addressed only one of these issues rather than considering them in their entirety. For instance, previous studies proposed a hybrid sampling strategy to improve model performance on an imbalanced VLE data set [9,11,12], while others focused on machine learning performance by investigating various model architectures, such as the decision tree, naïve Bayes, and MLP, for student learning performance [8,13,14].
To address the issues mentioned above, an intelligent predictive framework for explainable student performance prediction (ESPP) is proposed in this paper. It is a novel approach that addresses these issues holistically. First, the proposed framework builds a week-wise student activity data set based on student interaction via clickstream activities in the VLE during the COVID-19 pandemic. After constructing the new week-wise student activity data set, a hybrid sampling method combining the synthetic minority over-sampling technique (SMOTE) and random under-sampling (RUS) was employed to balance the data set. Then, the prediction model was constructed using convolutional neural network (CNN) and long short-term memory (LSTM) layers. By integrating the convolutional and LSTM mechanisms, the proposed deep learning (DL) model can accurately extract the spatiotemporal features of student activity data. Finally, the explainability of the best prediction model is generated by visualizing the student activity feature maps and the feature importance contributing to the model. The important features are visualized using local interpretable model-agnostic explanations (LIME).
In summary, the main contributions of our study are four-fold. (1) Firstly, a new data set for student learning performance prediction was proposed. The data set comprises student interactions in the VLE collected during the COVID-19 pandemic and arranged in a week-wise manner. (2) Secondly, the effectiveness of the proposed models was evaluated and compared with baseline models such as logistic regression (LR), the support vector machine (SVM), and LSTM when dealing with spatiotemporal features in predicting student learning performance. (3) Thirdly, an insight into enhancing the school environment was introduced by supporting decision makers in implementing early intervention techniques to reduce the failure rate. For instance, the proposed method accurately forecasted student learning performance as early as the sixth week with an accuracy of 0.91, which could help a student improve his or her performance as early as possible. (4) Finally, an explainable DL model was provided by visualizing and analyzing individual predictions, the importance of features, and typical predictions. This explainable instrument enables human interaction with the DL model to intervene in the student failure rate. Furthermore, teachers and students can focus on the features that contribute to poor performance. Thus, these assessments could be used to enhance the user experience in the VLE.
After this brief introduction, the paper is structured as follows. Section 2 reviews the present literature on student learning performance prediction and explainable machine learning. The data set and proposed methodologies are explained in Section 3. Section 4 presents the findings. Finally, Section 5 and Section 6 discuss the results and summarize the current work, respectively.
2. Related Works
Various studies have treated the prediction of students' academic achievement as a regression problem [15] or a classification problem [16]. In the first category, the learners' prospective results were forecasted, while in the second, the learners' outcome was estimated as a pass, fail, or dropout. In addition to using various data analytic techniques, factors affecting student performance were also identified to predict student performance [17]. Moreover, the study areas of predicting student performance can be evaluated from several viewpoints, including prediction of early withdrawal from ongoing courses, analysis of inherent aspects influencing achievement, and the application of statistical methods to evaluate student achievement. Early prediction is a relatively new concept in this area. It encompasses techniques for assessing students in real time with the intention of retaining them by providing appropriate procedures and interventions and then monitoring and minimizing the failure rates.
Several studies have deployed machine learning methods to analyze student learning patterns and predict at-risk students [16,18]. Research in ESPP took a subsequent step by transforming the class period into a series of weekly structures and measuring learning achievement based on students' interaction with the VLE. Machine learning methods were used to predict students at risk based on attendance, quizzes, and assignments, with the addition of mid-term exams in the ninth week [19]. In another study, various data mining techniques, consisting of the decision tree, the naïve Bayes classifier, k-nearest neighbor (KNN), the SVM, and the MLP, were deployed to predict at-risk students. Several studies employed LR as the baseline and identified the best prediction model among those compared [20,21]. According to the results presented in [22], KNN was the most accurate algorithm for classifying successful or unsuccessful students and determining the performance metrics. Moreover, student engagement patterns were effective in capturing students' behavior and had a positive impact on their performance [23].
According to the existing literature, deep learning for learning analytics to predict student success is still in its early stages. Deep learning is a computational method composed of several processing layers that learn representations of data at multiple levels of abstraction [24]. Student interaction with the VLE was captured to predict learning performance using a recurrent neural network (RNN) model; this DL approach outperformed the machine learning baseline methods [16,25]. Additionally, students' learning success was predicted using their attendance and behavior from log data [26]. By employing RNN and LSTM models, that study was able to uncover student engagement and interaction patterns in the VLE database. These DL methods were more effective in the early prediction of grades than conventional regression methods. However, the more advanced the prediction model, the lower its interpretability, and this area of research has contributed little in terms of the explainability of results. Thus, this study provides an explainability approach in the ESPP framework using the LIME explainability model, as demonstrated in [27,28].
3. Methods
3.1. Implementing Teaching Design in the VLE System
The case study of this research was the Digital Transformation course at Gadjah Mada University in Indonesia. During the COVID-19 pandemic, this course was held online using a VLE system. The research was initiated by collecting teaching materials and questionnaires from students in the previous academic year. Then, the VLE system was designed and tested from September 2020 to January 2021. To achieve the aim of the study, an ESPP framework was proposed. The development of the ESPP framework was divided into three major phases, as illustrated in Figure 1.
First, phase 1 aimed to determine the main elements in the VLE system, focusing on four core areas: key subjects; life and career skills; learning and innovation skills; and information, media, and technology skills. A previous study suggested three main elements: learning activities, learning resources, and learning support [29]. Each element was carefully explained and managed using several learning objects, as displayed in Figure 2. These learning objects are valuable features that can be used as input to the prediction model. Then, the VLE system was developed and tested. After that, the VLE system could be repurposed by adding and revising content to improve the student experience during online learning in subsequent academic years. Meanwhile, phase 2 aimed to extract the data from the VLE database to create meaningful insights regarding student performance. This phase involved data extraction, cleaning, anonymization, and labeling.
Finally, phase 3 covered the performance evaluation of the baseline machine learning (ML) and DL models in classification tasks, especially early prediction of student performance. In addition, this work provides an interpretability analysis using the LIME method. The output of the prediction model is the student's final course grade category, i.e., pass or fail. The framework extracts the training data set from the VLE repository. A hybrid sampling method was applied to the data set to reduce the overfitting and misclassification caused by the imbalanced data distribution. Then, the proposed model was designed using several traditional and modern machine learning classifiers as baseline models, and performance was evaluated using their performance metrics. Finally, the important features of the results were visualized using the XAI method via LIME. The works mentioned above were developed using Python, TensorFlow, and the LIME library.
This proposed framework simultaneously predicts student performance and generates interpretability from both the student's and the VLE's point of view. For the student, the interpretability technique is useful for explaining what the results show and why a student is predicted to end the semester with a particular result. From the point of view of the VLE system, the interpretability output serves as general evaluation material for identifying features that do not contribute to the learning process, so that the VLE system can be revised and improved for the next academic year.
3.2. Designing the ESPP Data Set
The data set was recorded from a 16-week, fully online “Digital Transformation” course at Gadjah Mada University in Indonesia from September 2020 to January 2021. The course was conducted through Moodle, a learning management system, and contained eight learning objects. The first materials were released at the beginning, updated every week, and available to students until the end of the course. The data set consisted of two different types: data on student achievements during the course and interaction data on students accessing materials and doing activities in the VLE. The learning process was designed using three major learning elements and was implemented using eight learning features, as described in Figure 2: assignment (F1), file (F2), forum (F3), homepage (F4), label (F5), page (F6), quiz (F7), and URL (F8). During the learning process, students navigated through the website to read modules, discuss in forums, submit files for assignments, or complete quizzes. The clickstream data (i.e., logs of users navigating the VLE webpages) were recorded. These were valuable data that could be used to generate insights regarding student performance in the final week of the course. More than 202,000 logs of 977 students were obtained for this data set and used as inputs for the prediction model. Meanwhile, final course scores were originally maintained as letter grades; for this prediction study, they were converted into a binary categorical representation. Students were labeled “passed” (class “1”) if the final score was ≥ 50, while “failed” and “withdrawn” students (final score < 50) were combined into class “0”, indicating at-risk students.
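To illustrate this preprocessing step, the following is a minimal sketch of how such week-wise features and binary labels could be assembled with pandas and NumPy. The file names and column names (student_id, week, feature, final_score) are hypothetical placeholders; the actual Moodle log schema may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical inputs; the actual Moodle export schema may differ.
logs = pd.read_csv("vle_logs.csv")          # columns: student_id, week, feature
scores = pd.read_csv("final_scores.csv")    # columns: student_id, final_score

FEATURES = ["assignment", "file", "forum", "homepage",
            "label", "page", "quiz", "url"]  # F1..F8
N_WEEKS = 16

# Count clicks per student, per week, per feature.
counts = (logs.groupby(["student_id", "week", "feature"])
              .size().rename("clicks").reset_index())

# Build a [students, weeks, features] tensor of click counts.
students = scores["student_id"].to_numpy()
index = {s: i for i, s in enumerate(students)}
X = np.zeros((len(students), N_WEEKS, len(FEATURES)))
for row in counts.itertuples():
    if row.student_id in index and 1 <= row.week <= N_WEEKS:
        X[index[row.student_id], row.week - 1, FEATURES.index(row.feature)] = row.clicks

# Binary label: pass (1) if final score >= 50, else at-risk (0).
y = (scores["final_score"] >= 50).astype(int).to_numpy()
```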
3.3. Prediction Model Based on the LSTM Network
An LSTM-based model and its derivative models (i.e., CNN-LSTM and Conv-LSTM) were applied in this study. Specifically, the LSTM can be employed to extract temporal patterns in nonlinear time-series data [30]. The LSTM utilizes two gates (i.e., a forget gate and an input gate) to control the value of the cell state Y. The forget gate f determines how much of the previous cell state is retained in the current cell state, while the input gate i determines how much of the network input enters the current cell state. First, the forget and input gates are computed from the current input $x_t$, the previous hidden state $h_{t-1}$, the weights w, and the biases b using a ‘sigmoid’ activation function γ:

$$f_t = \gamma(w_f \cdot [h_{t-1}, x_t] + b_f) \tag{1}$$

$$i_t = \gamma(w_i \cdot [h_{t-1}, x_t] + b_i) \tag{2}$$

Next, Equation (3) formulates a new candidate value of the cell state $\tilde{Y}_t$, calculated using a ‘tanh’ activation function θ:

$$\tilde{Y}_t = \theta(w_Y \cdot [h_{t-1}, x_t] + b_Y) \tag{3}$$

Finally, the old cell state $Y_{t-1}$ is updated to the current cell state $Y_t$ using Equation (4):

$$Y_t = f_t \odot Y_{t-1} + i_t \odot \tilde{Y}_t \tag{4}$$
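To make Equations (1)–(4) concrete, the following is a minimal NumPy sketch of a single LSTM cell step in the paper's notation (Y for the cell state, γ for sigmoid, θ for tanh). The weight shapes are illustrative assumptions, and the output gate is the standard LSTM component included for completeness, since the paper's equations cover only the forget and input gates and the state update.

```python
import numpy as np

def sigmoid(z):  # gamma in the paper's notation
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, Y_prev, w_f, b_f, w_i, b_i, w_Y, b_Y, w_o, b_o):
    """One LSTM cell step following Equations (1)-(4).

    x_t: input at time t, shape (c,); h_prev, Y_prev: shape (u,);
    each w_*: shape (u, u + c); each b_*: shape (u,).
    """
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(w_f @ z + b_f)             # Eq. (1): forget gate
    i_t = sigmoid(w_i @ z + b_i)             # Eq. (2): input gate
    Y_cand = np.tanh(w_Y @ z + b_Y)          # Eq. (3): candidate cell state
    Y_t = f_t * Y_prev + i_t * Y_cand        # Eq. (4): cell state update
    o_t = sigmoid(w_o @ z + b_o)             # output gate (standard LSTM, assumed)
    h_t = o_t * np.tanh(Y_t)                 # new hidden state
    return h_t, Y_t
```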
Three models, namely LR, SVM, and LSTM, were selected as the baseline models for performance comparison because they perform well in time-series data prediction [31,32]. The goal was to evaluate which model could generate the highest accuracy in classifying at-risk students in the early weeks of the semester. The architectures of the LSTM, CNN-LSTM, and Conv-LSTM are depicted in Table 1. The models were constructed using several deep hidden layers (HL) with trainable parameters, integrating two main components: a feature extractor and a predictor. First, the baseline LSTM contained three layers, i.e., an LSTM layer, a dropout layer, and a fully connected network (FCN) or dense layer that propagated the inputs to the outputs. The number of neurons in each layer was equal to the input sample size, and the number of outputs was two, representing the categorical label (i.e., “Fail” and “Pass”). Second, the CNN-LSTM consisted of CNN layers combined with an LSTM as the feature extractor and an FCN layer as the classifier. The feature extraction module was constructed using two Conv1D layers, a pooling layer, and a flatten layer, producing a one-dimensional output; an LSTM layer (with 16 LSTM cells) was then utilized, and an FCN layer served as the classifier. Because of the relatively small sample size of the educational data (i.e., covering 16 weeks), a more sophisticated deep CNN architecture could not further improve the prediction performance, so a simplified design was adopted. Third, the Conv-LSTM model employed a ConvLSTM2D layer as the feature extractor, followed by a flatten layer and an FCN layer. This design provided a simple layer construction with a moderate number of parameters while maintaining good accuracy.
The inputs of the LSTM, CNN-LSTM, and Conv-LSTM models were denoted as x[N, T, c], x[N, s, t, c], and x[N, s, r, t, c], respectively, where c, N, T, t, r, and s represent the number of features (channels), the number of samples taken by the prediction model to capture the temporal patterns, the time-step, the subsequence length (columns), the number of rows, and the number of subsequences, respectively. In both the CNN-LSTM and Conv-LSTM architectures, the sequence length (i.e., time-step T) was split into subsequences of length t using a divider s; hence, T = s × t. The number of samples N was 6, 8, 10, 12, and 16, indicating the week-wise data sequences. The detailed settings can be seen in the layer settings in Table 1. A constant number of features, dropout rate d, filter o, kernel size k, pooling window w, learning rate α, epoch ε, and batch size b were used, and the padding parameter p was set to ‘same’. These three designs were compared in terms of performance metrics using early and full sets of data, the number of parameters, and model interpretability.
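As a concrete illustration, the following is a minimal Keras sketch of the three architectures under the shape conventions above, assuming T = 16 weeks, c = 8 features, and a split of s = 4 subsequences of t = 4 steps with r = 1 row; the exact layer sizes and filter counts of Table 1 may differ.

```python
from tensorflow.keras import layers, models

T, c = 16, 8          # time-steps (weeks) and features
s, t, r = 4, 4, 1     # subsequences, steps per subsequence, rows (assumed split)

# Baseline LSTM: input x[N, T, c]
lstm = models.Sequential([
    layers.LSTM(16, input_shape=(T, c)),
    layers.Dropout(0.2),
    layers.Dense(2, activation="softmax"),
])

# CNN-LSTM: input x[N, s, t, c]; Conv1D applied per subsequence
cnn_lstm = models.Sequential([
    layers.TimeDistributed(layers.Conv1D(32, 3, activation="relu", padding="same"),
                           input_shape=(s, t, c)),
    layers.TimeDistributed(layers.Conv1D(32, 3, activation="relu", padding="same")),
    layers.TimeDistributed(layers.MaxPooling1D(2)),
    layers.TimeDistributed(layers.Flatten()),
    layers.LSTM(16),
    layers.Dense(2, activation="softmax"),
])

# Conv-LSTM: input x[N, s, r, t, c]
conv_lstm = models.Sequential([
    layers.ConvLSTM2D(32, (1, 3), padding="same", input_shape=(s, r, t, c)),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),
])
```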
3.4. Explainable AI Model
Explainable AI (XAI) has been applied specifically to exploit the local explainability of time-series data [33]. Explainable AI collectively refers to methods that can exploit the interpretability of a given decision-making process, such as traditional and modern ML models. This branch of AI study shows enormous potential to unbox the modern ML ‘black-box’ model [34]. LIME is powerful because it provides accessibility and simplicity [35]. LIME inherits the basic idea of model agnosticism, explaining any given supervised learning model by treating it as a ‘black box’. LIME provides a local explanation by weighting adjacent observations; that is, LIME gives locally faithful explanations within the neighborhood of the sample being explained. A local linear model is trained on the adjacent weighted observations to produce these explanations. LIME minimizes the objective function ξ using Equation (5):

$$\xi(x) = \operatorname*{arg\,min}_{g \in G} \; \mathcal{L}(f, g, \pi_x) + \Omega(g) \tag{5}$$

where f is the prediction model, x is the specific observation, g is a local explanation drawn from the class of interpretable models G, $\pi_x$ is the proximity measure of the adjacent observations around x, $\mathcal{L}$ is the loss measuring how unfaithful g is to f in the locality defined by $\pi_x$, and Ω(g) is the complexity of g, which is kept low.
The LIME explainer was trained using the LIME tabular explainer function. For each observation, LIME output importance values from the data features x[N, T, c] for each week-wise data point. As week-wise individual observations were ineffectual, this study focused on global observations of the features c with LIME. Nevertheless, the study also presents the local explanations together with the corresponding input data, making the comparison more sensible.
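A minimal sketch of this step with the lime package is shown below, assuming the week-wise tensor is flattened to one row per student and wrapped around an already trained model; X_train, model, T, c, and FEATURES are placeholders carried over from the earlier sketches.

```python
from lime.lime_tabular import LimeTabularExplainer

# Flatten x[N, T, c] to x[N, T*c] so each student is one tabular row.
X_flat = X_train.reshape(len(X_train), -1)
feature_names = [f"w{w+1}_{f}" for w in range(T) for f in FEATURES]

explainer = LimeTabularExplainer(
    X_flat,
    feature_names=feature_names,
    class_names=["fail", "pass"],
    mode="classification",
)

# The black-box predictor must map flat rows back to [N, T, c].
predict_fn = lambda rows: model.predict(rows.reshape(-1, T, c))

# Explain one student's prediction with the top 8 features.
exp = explainer.explain_instance(X_flat[0], predict_fn, num_features=8)
print(exp.as_list())
```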
4. Experiments
This study evaluated the performance of baseline ML models and LSTM-based models in classification tasks, especially early prediction of student performance. In addition, this work performed an interpretability analysis using the LIME method. In the following sections, a brief description of the results is presented.
4.1. Performance Evaluation
For the first experiment, a real-world clickstream data set was collected containing records of 977 students who took an online course. There were eight features for each student in the data set and 16 monitoring weeks. The following eight features were used as data input for this experiment: assignment, file, forum, homepage, label, page, quiz, and URL. Meanwhile, the final student grade category (i.e., pass and fail) was employed as the classification label.
Table 2 presents the statistical analysis of the real-world data set in terms of the normalized standard deviation (NSD), normalized mean (NM), skewness, and kurtosis. The NSD measures how closely the data cluster around the mean value. Skewness measures data asymmetry around the mean value; an ideally symmetric normal distribution has zero skewness. Meanwhile, kurtosis indicates the distribution's susceptibility to outliers, and the ideal kurtosis of a normal distribution is close to 3. Based on the statistical analysis, the NSD and NM of the class label were the highest, at 0.332847 and 0.873503, respectively. These values show that the data set was imbalanced, with most decisions going to the majority class (i.e., grade: pass). On the other hand, similar data characteristics were observed for the eight features. The features had positive skewness, and their kurtosis deviated from 3 compared with the class label. Because the kurtosis was higher than 3, the distribution of the eight features had a heavier tail and a sharper peak than the normal distribution.
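These summary statistics can be reproduced with pandas and SciPy; the sketch below assumes each column of df holds one feature's click totals and that the normalization is by the min–max range (the paper does not state its normalization, so this is an assumption, though it is consistent with the reported label values).

```python
import pandas as pd
from scipy import stats

def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Per-feature NSD, NM, skewness, and kurtosis as reported in Table 2."""
    rng = df.max() - df.min()                 # assumed min-max normalizer
    return pd.DataFrame({
        "NSD": df.std() / rng,                # normalized standard deviation
        "NM": df.mean() / rng,                # normalized mean
        "skewness": df.apply(stats.skew),
        # Pearson kurtosis (fisher=False), so a normal distribution gives 3.
        "kurtosis": df.apply(lambda col: stats.kurtosis(col, fisher=False)),
    })
```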
Correspondingly, this imbalanced data set was investigated in terms of performance comparison, as depicted in Table 3. An imbalanced data set is a major problem: for example, with a 10% at-risk rate, a classification model that predicts “pass” for every student still achieves 0.9 accuracy while failing to identify any at-risk students. For a detailed investigation, the baseline approaches of LR, SVM, and LSTM were evaluated. Then, a hybrid sampling technique was performed on the data set. This technique synthesized new examples of the minority class using the SMOTE algorithm. SMOTE works by selecting minority examples that are close in the feature space, drawing a line between them, and then generating a new sample at a point along that line. After that, random under-sampling was performed on the majority class. This hybrid (i.e., SMOTE-RUS) sampling technique has been shown to achieve better results [36]. Table 3 shows that the hybrid sampling method improved the performance of the prediction model, raising the precision, recall, and F1-score metrics to more than 90%, compared with no sampling strategy. Meanwhile, accuracy was comparable with and without hybrid sampling.
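A minimal sketch of this hybrid sampling with the imbalanced-learn library follows; the sampling ratios are illustrative assumptions, as the paper does not state them, and X_train, y_train, T, and c are placeholders from the earlier sketches.

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Flatten x[N, T, c] to tabular rows for sampling.
X_flat = X_train.reshape(len(X_train), -1)

sampler = Pipeline(steps=[
    # Over-sample the minority (at-risk) class up to 50% of the majority...
    ("smote", SMOTE(sampling_strategy=0.5, random_state=42)),
    # ...then under-sample the majority class down to a 1:1 ratio.
    ("rus", RandomUnderSampler(sampling_strategy=1.0, random_state=42)),
])
X_res, y_res = sampler.fit_resample(X_flat, y_train)

# Restore the week-wise shape for the LSTM-based models.
X_res = X_res.reshape(-1, T, c)
```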
4.2. Characteristics of the Convolutional Neural Network Models
The second experiment explored whether the proposed convolutional deep neural network models (i.e., CNN-LSTM and Conv-LSTM) could achieve better weight convergence than the baseline neural network model (i.e., LSTM) in feedforward training. In the LSTM approach, the 128-length vectors with absolute values were constructed as one-dimensional data to identify at-risk students. On the other hand, in both the CNN-LSTM and Conv-LSTM approaches, the 128-length vectors were composed into two-dimensional 16 × 8 data to improve spatial feature recognition between adjacent data values.
An LSTM model architecture with an LSTM layer for feature extraction and a fully connected network (FCN) layer as the predictor enables the architecture to learn complex long short-term representations for classification tasks. Each layer propagates the output of the previous layer as its input, providing comprehensive learning by training the weights toward convergence. A dropout layer was used between the hidden layers to avoid interdependency between the LSTM and dense layers, reducing overfitting during training. In the training phase, the data were computed at each epoch using several scenarios, i.e., from the first week to the i-th week, where i = 6, 8, 10, 12, and 16. The week-wise sequences were converted to equal-length normalized input vectors, which were then padded into the input layer. The Adam optimizer was applied with a dropout rate of 0.2 and categorical cross-entropy as the loss function. Multiple evaluations were performed to discover the optimal epoch, batch size, and learning rate, ranging over 10 to 300, 10 to 100, and 0.01 to 0.0001, respectively. The hyperparameter setting was decided after several experiments, with the final epoch ε = 200, batch size b = 20, and learning rate α = 0.001.
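Continuing the architecture and sampling sketches above, the training configuration described in this paragraph corresponds roughly to the following Keras calls; the validation split and the one-hot encoding of labels are assumptions.

```python
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

model = conv_lstm                      # or lstm / cnn_lstm from the sketch above
y_onehot = to_categorical(y_res, 2)    # one-hot labels for the two classes

model.compile(
    optimizer=Adam(learning_rate=0.001),   # alpha = 0.001
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(
    X_res.reshape(-1, s, r, t, c),         # week-wise, resampled inputs
    y_onehot,
    epochs=200,                            # epsilon = 200
    batch_size=20,                         # b = 20
    validation_split=0.2,                  # assumed hold-out fraction
)
```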
The results in Figure 3 illustrate the efficacy of the proposed (a) LSTM, (b) CNN-LSTM, and (c) Conv-LSTM models by showing the training improvement over 200 epochs. Generally, on the left side of Figure 3, the loss progressively decreases from the early weeks to the late weeks, while on the right side, the accuracy progressively increases from the initial weeks to the final weeks. As a sufficient and conclusive data set could not be obtained in the initial weeks, especially in the first week, where no previous instance was available, the experiment reported the loss and accuracy from week 6 onward. Figure 3 exhibits the progress toward accurate earlier prediction using the testing subset. On average, for the three LSTM-based models, the accuracy ranged from 0.865 in the sixth week and 0.879 in the eighth week up to 0.942 in the final, 16th, week. A gradual improvement in accuracy was observed from the 10th week, indicating the efficiency of the deployed models in learning the distinct multivariate patterns of students' performance, with accuracy over 0.871, especially for both CNN-LSTM and Conv-LSTM. The figure also shows that the loss values continuously decrease, signifying a progressively smaller difference between the target class label and the predicted class label. In addition, overfitting was another challenge for the AI model: how well the proposed models perform on a new data set cannot be known unless they are tested on unseen data. To address this issue, a 10-fold cross-validation was performed on separate training and testing subsets, as illustrated in Figure 4. In general, the 10-fold cross-validation accuracy of CNN-LSTM and Conv-LSTM from week 6 to the final week was higher than that of the LSTM model. The longer the data sequence, the better the performance of the proposed models, with an accuracy of more than 0.950 when using the 16-week-wise data set.
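The 10-fold evaluation can be sketched as follows, assuming scikit-learn's StratifiedKFold and a build_model() factory (a placeholder) that returns a freshly compiled network for each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.utils import to_categorical

def cross_validate(X, y, build_model, n_splits=10):
    """10-fold stratified cross-validation; returns mean/std test accuracy."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(X.reshape(len(X), -1), y):
        model = build_model()                      # fresh weights each fold
        model.fit(X[train_idx], to_categorical(y[train_idx], 2),
                  epochs=200, batch_size=20, verbose=0)
        _, acc = model.evaluate(X[test_idx], to_categorical(y[test_idx], 2),
                                verbose=0)
        scores.append(acc)
    return float(np.mean(scores)), float(np.std(scores))
```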
4.3. Early Prediction of At-Risk Students
The third experiment aimed to investigate the average performance of the proposed models compared to the baseline models when making an early six-week-wise prediction.
Table 4 compares the performance of the models using the average values of the 10-fold cross-validation. The results reveal that the CNN-LSTM and Conv-LSTM performed better than the traditional machine learning models (i.e., LR and SVM) and the regular LSTM model on all measurement metrics. On average, the LR, SVM, LSTM, CNN-LSTM, and Conv-LSTM models had recall rates of 0.64, 0.77, 0.84, 0.88, and 0.91, respectively. These results indicate that LR and SVM encountered difficulties in dealing with many input variables without vectorization; the classic machine learning methods struggled to extract more complex patterns. On the other hand, the LSTM-based models achieved better results owing to their greater learning capacity. Specifically, the CNN-LSTM and Conv-LSTM models outperformed the LSTM model by 4.7% and 8.3% in overall F1-score, respectively. Moreover, the Conv-LSTM model achieved the highest performance of all tested models with a moderate number of parameters.
4.4. Interpretability Analysis
The last experiment showed that both LSTM-based approaches could provide explainability of the results comparable to that of the traditional methods in early warning prediction. Due to the complexity of both models, identifying their important features was difficult; however, several local features could be identified in the feature importance visualization in Figure 5. The visualized data set regions for the successful and at-risk students could be distinguished. Furthermore, the LIME method could indicate which activities students should pay attention to, as their importance could strongly affect the final results. Students could use this explainability to direct their efforts toward the highlighted activities and thus be successful at the end of the course.
6. Conclusions
The advancement of machine learning technologies has received great attention in the development of VLEs for predicting student performance in the early weeks of the teaching and learning process. In early prediction cases, traditional machine learning models fail to predict student performance due to insufficient data, imbalanced data sets, and a lack of understanding of how the models produce their results. This study provides an intelligent framework for explainable student performance prediction using two innovative prediction models, CNN-LSTM and Conv-LSTM, to simultaneously improve predictive performance and explainability. The prediction performance was evaluated by comparing the proposed Conv-LSTM and CNN-LSTM models with the three baseline LSTM, SVM, and LR models, resulting in F1-scores of 0.91, 0.88, 0.84, 0.79, and 0.64, respectively. The proposed prediction models offer many improvements, including a lower misclassification rate, a higher sensitivity rate, and explainability features that help instructors improve VLE activities. There are two potential extensions of this research in the VLE. One is to conduct experiments on hybrid learning strategies with deep convolutional prediction models. The other is to inspect student performance when instructor intervention is imposed during learning activities.