Article

Explainable AI in Education: Techniques and Qualitative Assessment

by Sachini Gunasekara * and Mirka Saarela
Faculty of Information Technology, University of Jyväskylä, P.O. Box 35, FI-40014 Jyväskylä, Finland
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(3), 1239; https://doi.org/10.3390/app15031239
Submission received: 28 December 2024 / Revised: 23 January 2025 / Accepted: 24 January 2025 / Published: 25 January 2025
(This article belongs to the Special Issue Advances in Neural Networks and Deep Learning)

Abstract

Many articles on AI in education compare the performance and fairness of different models, but few specifically focus on quantitatively analyzing their explainability. To bridge this gap, we analyzed key evaluation metrics for two machine learning models—an artificial neural network (ANN) and a decision tree (DT)—with a focus on their performance and explainability in predicting student outcomes using the Open University Learning Analytics Dataset (OULAD). The methodology involved evaluating the DT, an intrinsically explainable model, against the more complex ANN, which requires post hoc explainability techniques. The results show that, although the feature-based and structured decision-making process of the DT facilitates natural interpretability, it struggles to model complex data relationships, often leading to misclassification. In contrast, the ANN demonstrated higher accuracy and stability but lacked transparency. Crucially, the ANN showed high fidelity in its predictions when explained with the LIME and SHAP methods. The experiments confirm that the ANN consistently outperformed the DT in prediction accuracy and stability, especially with the LIME method. However, improving the interpretability of ANN models remains a challenge for future research.

1. Introduction

Machine learning (ML) technologies are being increasingly adopted across diverse industries such as education, law, healthcare, finance and banking, agriculture, and transportation. A well-known example is the system used by American courts to decide whether to order pretrial detention or release: the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) system estimates the likelihood that an individual will commit another crime, and judges base decisions on an offender’s release from jail or confinement on its output [1]. While ML models are increasingly adopted across various domains, significant challenges remain because many of them are black boxes that are difficult to interpret, especially in critical areas such as medical diagnostics, criminal justice, and high-stakes decision making.
The necessity for explainable AI (XAI) is highlighted by the fact that humans frequently wonder about the reasons behind particular decisions made by ML models. For example, consider an ML model that predicts the probability of passing or failing a course based on various variables, including assessment score, attendance, submission date, online discussion, and assignment submissions. If the model forecasts that a student is going to fail, both the student and their educators need to know which variables drove that forecast. To address this difficulty, the Defense Advanced Research Projects Agency (DARPA) in the United States defines XAI as systems that can explain their reasoning to users, highlight their strengths and limitations, and forecast their future behavior [2]. The majority of work in the field of ML explanation research therefore concentrates on developing new methods and approaches that improve explainability and ensure that forecasts are transparent, comprehensible, and reliable [3]. XAI techniques are important in every field since they aim to make the workings and results of AI systems understandable [4].
Initially, explainable approaches in the education domain were mainly focused on model-specific methods such as Decision Tree (DT) models [5,6,7]. It was simple for analysts and educators to comprehend the logic behind the forecasts since these models are naturally interpretable. Nevertheless, the need for more advanced models became apparent as the quantity and complexity of educational data expanded. To handle complex datasets and increase prediction accuracy, nonlinear black box models such as ensemble approaches and deep learning have emerged. For instance, Shahiri et al. [8] noted that neural network models had the best prediction accuracy, whereas DTs had the second highest. Although these models performed better, their underlying workings were significantly less transparent than those of earlier models, which presented interpretability issues.
Because of their high performance, black box models such as deep learning offer great promise for the educational field [9,10]. The widely used post hoc methods, SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME), improve the interpretability of such black box ML models. By exposing the inner workings of black box models, these techniques help both educators and learners gain a deeper understanding of how certain predictions are formed [11,12]. As deep learning models become increasingly popular in education, there is a clear need for evaluations that measure the effectiveness of explanatory methods and determine what explanations are best suited for specific instances.
Nevertheless, there is not yet a widely accepted set of metrics to assess the quality of explanation methods in ML, complicating the comparison of various approaches [3,11,13]. This challenge arises because explainability is inherently subjective; definitions of a “good” explanation vary based on context, analytical objectives, and target audiences. Such variability makes it difficult to objectively compare explanation techniques across diverse applications. Consequently, assessing model explainability in various contexts could benefit from a more scalable and consistent approach.
Despite the presence of numerous articles comparing the performance of AI models in education, there are only a few that focus on the quantitative assessment of explainability [13,14,15,16]. Using the Open University Learning Analytics Dataset (OULAD), this study attempts to fill that gap by adding quantitative explainability measures to ML methods.
The structure of this paper is as follows: Section 2 presents the conceptual framework for data-driven experiments, focusing on the explainability metrics currently available for assessing ML models’ interpretability, including stability and fidelity. The experimental setup utilized in our study is described in Section 3, detailing the steps involved in data preparation, model training, and the application of post hoc explainability techniques like LIME and SHAP. The results are presented in Section 4, where we compare and contrast the interpretability of deep learning models with conventional methods and discuss the relationship between explainability metrics and model performance. In Section 5, we provide a comprehensive discussion of the findings, examining how different features affect the models and how selected explainability metrics help us understand the predictions made by the models. Finally, Section 6 identifies the limitations of our research and suggests future work to enhance the explainability in AI in education, mainly through improved usage of post hoc interpretability techniques.

2. Theoretical Background

In this section, we provide the conceptual framework of the data-driven experiments. The two ML methods, the primary concepts behind feature importance, and the LIME and SHAP techniques are first explained. The explainability metrics that are currently available for use in assessing the explainability of ML models are then briefly reviewed.

2.1. Decision Tree

For applications involving regression and classification, the decision tree is a common ML method. By dividing data into subsets according to feature values, it creates a tree-like structure, with each internal node denoting a decision based on a feature and each leaf node representing an outcome or prediction. The aim is to learn decision rules from the input variables and build a model that predicts the target variable. DTs handle both categorical and numerical data, need minimal data preparation, and are simple to interpret.
Entropy, information gain, and the Gini index are statistical measures used to determine the best split at each node [17]. “Entropy” is a statistical measure of the degree of randomness or impurity in a dataset. Nodes that belong to a single class, or pure nodes, have low entropy, whereas nodes with mixed classes have high entropy. “Information gain”, which is derived from entropy, measures the reduction in entropy that results from dividing a dataset on a particular feature. The attribute with the greatest information gain is chosen as the best split, since it maximizes the purity of the resulting nodes. Comparably, the “Gini index” is another way to quantify impurity; it is easier to calculate and reflects the likelihood of incorrectly classifying a randomly selected element based on the distribution of classes inside a node.
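To make these criteria concrete, the following sketch (ours, not taken from the paper) computes entropy, Gini impurity, and information gain for a toy binary split; the feature values and the split rule are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label array (0 for a pure node)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini impurity of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(labels, split_mask):
    """Entropy reduction obtained by splitting `labels` with a boolean mask."""
    left, right = labels[split_mask], labels[~split_mask]
    weighted_child_entropy = (
        len(left) * entropy(left) + len(right) * entropy(right)
    ) / len(labels)
    return entropy(labels) - weighted_child_entropy

# Toy example: split students on a hypothetical "assessment_count > 4" rule.
y = np.array([0, 0, 0, 1, 1, 1, 1, 0])                 # 1 = Pass, 0 = Fail
assessment_count = np.array([2, 3, 4, 6, 7, 8, 5, 1])
print(gini(y), information_gain(y, assessment_count > 4))
```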

2.2. Artificial Neural Network

An Artificial Neural Network (ANN) is a computational model that processes information by mimicking the functioning of biological nervous systems, namely networks of brain cells [18]. The ANN model is made up of many linked nodes that can form intricate non-linear interactions. In a feedforward network, data flow in a single direction from input nodes to output nodes, with each node receiving inputs from preceding nodes and transmitting outputs to subsequent nodes. Backpropagation, in turn, is an ANN training technique in which neurons assess their error contributions after processing input; as a result, the network can better adapt and map inputs to the intended output.
As shown in Figure 1, the input, hidden, and output layers make up the architecture of the ANN model. In our case study with the OULA dataset, the input layer consisted of 14 variables, capturing a diverse range of features relevant to the analysis. As a link between the input and output layers, the hidden layer was made up of neurons, each using the ReLU (Rectified Linear Unit) activation function to perform computations that uncover hidden patterns in the data. The final result of these computations was produced in the output layer, which consisted of a single neuron with a sigmoid activation function, making it suitable for binary classification tasks.
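A minimal sketch of such an architecture is given below, assuming a Keras implementation; the hidden-layer width (32 units) and the training settings are illustrative assumptions rather than values reported in the paper.

```python
import tensorflow as tf

# 14 input features -> one ReLU hidden layer -> single sigmoid output (Pass/Fail).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(14,)),
    tf.keras.layers.Dense(32, activation="relu"),    # hidden-layer width is an assumption
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of the "Pass" class
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, validation_split=0.1)  # illustrative training call
```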

2.3. Local Interpretable Model-Agnostic Explanations

LIME is a method for interpreting ML models that offers insights into specific predictions. It works by fitting simpler, more comprehensible models (such as linear models) to approximate more complicated models (such as neural networks or random forests) around a particular instance of interest. By perturbing the input data and monitoring how the predictions change, LIME produces local explanations. This increases the transparency and interpretability of the “black box” behavior of complicated models by helping users understand which factors affected a certain prediction.
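As an illustration, the hedged sketch below applies LIME’s tabular explainer to a generic trained binary classifier; `model`, `X_train`, `X_test`, and `feature_names` are assumed to exist from earlier steps, and the snippet is not taken from the paper’s code.

```python
import numpy as np
from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    training_data=np.asarray(X_train),
    feature_names=feature_names,
    class_names=["Fail", "Pass"],
    mode="classification",
)

def predict_proba(x):
    """Wrap a single-output sigmoid model so LIME receives (n, 2) class probabilities."""
    p = np.asarray(model.predict(x)).ravel()
    return np.column_stack([1 - p, p])

# Explain one student (here, row 29 of the test set as an example).
exp = explainer.explain_instance(np.asarray(X_test)[29], predict_proba, num_features=10)
print(exp.as_list())  # (feature condition, weight) pairs behind this prediction
```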

2.4. SHapley Additive exPlanations

SHAP is an ML model interpretation technique that explains individual predictions. Based on cooperative game theory, SHAP uses the concept of Shapley values to determine how much each feature contributed to the outcome. To ensure an equitable distribution of contributions, it computes the impact of each feature by taking into account every possible combination of feature values. SHAP improves the interpretability and transparency of intricate models like ensemble techniques and neural networks by offering both local and global interpretations.
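The sketch below shows one possible way to obtain SHAP values with the model-agnostic `KernelExplainer`, reusing the hypothetical `predict_proba` wrapper from the LIME sketch; the background sample size and the chosen instances are illustrative assumptions.

```python
import numpy as np
import shap

background = shap.sample(np.asarray(X_train), 100)   # small background set keeps estimation tractable
explainer = shap.KernelExplainer(predict_proba, background)

# Shapley values for two example students (e.g., rows 501 and 801 of the test set).
shap_values = explainer.shap_values(np.asarray(X_test)[[501, 801]])
print(shap_values)  # per-feature contributions that can be rendered as force plots or heatmaps
```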

2.5. Feature Importance

Feature importance, a key concept in ML, represents the impact of each feature or variable on the predictive ability of the model. It helps determine which inputs are most pertinent to the prediction task by measuring the degree to which each feature affects the model’s decision-making process. This insight can simplify models by eliminating less important features or improve understanding of how complicated models behave on a dataset. Algorithms differ in how they determine feature importance; for example, DTs use Gini importance, random forests use permutation importance, and more complicated models like neural networks rely on LIME or SHAP values to explain their findings [19]. Understanding the significance of features is essential for analyzing model outputs and improving their interpretability or performance in practical applications.
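For instance, a fitted scikit-learn decision tree exposes impurity-based (Gini) importances directly; the snippet below is an illustrative sketch that assumes `X_train`, `y_train`, and `feature_names` are already defined.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
importances = pd.Series(dt.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head())  # largest contributors first
```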

2.6. Explainability Metrics

Although a number of authors have emphasized the significance of explainability, only a few of them have discussed how it is evaluated [13,14,15,16]. In order to appropriately assist decision-making processes, explanations should also include useful and actionable information [20]. Several authors have identified notions that give rise to explanations in a variety of forms, including mixtures of text, rules, numerical data, and visual information, and some of these notions can be turned into metrics for evaluating explanations conceptually. ML models often involve trade-offs between their explainability and performance [21]. Because of their intricacy, users may struggle to understand how highly accurate models—such as deep neural networks—make decisions. As a result, different metrics for assessing explainability have been introduced in recent years. Specifically, we used two quantitative evaluation metrics: fidelity and stability.

2.6.1. Fidelity Metrics

To measure an explanation’s “faithfulness” to the model’s underlying decision-making process, we first evaluated the fidelity metric, which quantifies how well an explanation follows the behavior of the original model. Since an intrinsically explainable, model-specific explanation depends directly on the predictions of the original model, its fidelity is 100% by nature. However, since model-agnostic explanation methods (like LIME) rely on local surrogate models, fidelity is a significant quality metric for assessing their effectiveness. If the explanation precisely corresponds to the model’s actual behavior, it is deemed to have high fidelity; if not, it may not sufficiently clarify the model’s decision-making process. In addition, there is a strong correlation between accuracy and fidelity: high-accuracy explanations result from high fidelity to the black box model [3]. Local fidelity refers to how well an explanation approximates the model’s behavior for a given instance or small subset of instances, while global fidelity refers to how well it does so across the whole set of data values.
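A simple way to operationalize fidelity is as the agreement rate between the original model and a surrogate or explanation model on a set of sampled instances; the sketch below is our illustration under that assumption (both models are assumed to return hard class labels), not the paper’s exact code.

```python
import numpy as np

def fidelity(original_model, surrogate_model, X_sample):
    """Fraction of instances on which the surrogate reproduces the original predictions."""
    y_original = np.asarray(original_model.predict(X_sample)).ravel()
    y_surrogate = np.asarray(surrogate_model.predict(X_sample)).ravel()
    return float(np.mean(y_original == y_surrogate))  # 1.0 means perfect agreement
```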

2.6.2. Stability Metrics

Secondly, we assessed the stability metric, which is critical for judging the reliability of ML models’ explanations, particularly for intricate models like ANNs. Stability evaluates how much the explanations change when slight perturbations are made to the input data. For instance, if a model’s forecast explanation for a given instance changes dramatically when the input features are minimally altered, the explanation may be unstable and thus less interpretable [3].
To measure the degree of stability, we simply compute the consistency of post hoc explanations under small perturbations of the input instance. For each iteration, an instance is selected (e.g., the first test instance), and its original explanation is generated using the post hoc method. The instance is then perturbed by adding random noise, resulting in multiple slightly altered versions. Post hoc explanations are produced for each perturbed instance, and the feature importance weights are extracted. Stability is assessed by calculating the cosine similarity between the original explanation’s feature weights and those of each perturbed explanation, ensuring consistency by aligning the feature sets. The average cosine similarity across all perturbations serves as the stability score for that iteration. This process is repeated for 50 iterations and 5 folds, with the final stability score being the mean stability score across all folds, indicating the robustness of the post hoc method’s explanations.
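The sketch below illustrates this procedure for LIME explanations of a single instance; the noise scale, the number of perturbations, and the use of `as_map()` to align feature weights are illustrative assumptions rather than the study’s exact implementation.

```python
import numpy as np

def lime_weights(explainer, x, predict_fn, n_features):
    """LIME feature weights for one instance, aligned to a fixed feature order."""
    exp = explainer.explain_instance(x, predict_fn, num_features=n_features)
    weights = dict(exp.as_map()[1])                    # {feature index: weight} for the explained class
    return np.array([weights.get(i, 0.0) for i in range(n_features)])

def stability(explainer, x, predict_fn, n_features, n_perturb=10, noise=0.01):
    """Average cosine similarity between the original and perturbed explanations."""
    base = lime_weights(explainer, x, predict_fn, n_features)
    sims = []
    for _ in range(n_perturb):
        x_pert = x + np.random.normal(0.0, noise, size=x.shape)   # small random perturbation
        w = lime_weights(explainer, x_pert, predict_fn, n_features)
        cos = np.dot(base, w) / (np.linalg.norm(base) * np.linalg.norm(w) + 1e-12)
        sims.append(cos)
    return float(np.mean(sims))

# Example (assuming the LIME explainer and predict_proba wrapper from Section 2.3):
# print(stability(explainer, np.asarray(X_test)[0], predict_proba, len(feature_names)))
```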

3. Experimental Setup

This section describes the experimental setup used in the analysis. The dataset is first presented, going into detail about the particular subset selected for this research, including key features and its size. To predict student performance, we relied on a binary classification task. Subsequently, we outlined the analytical process, which includes assessing the performance and explainability of the two ML models, ANN and DT. We employed feature importance for the DT classifiers and then LIME and SHAP to obtain insights into the individual forecasts using local explanations that explain why a specific student is predicted to succeed or fail. Python 3.10 was used for all experiments.

3.1. OULA Dataset

The OULAD [22] is a vast and well-known dataset in the fields of learning analytics and educational data mining. Released by the Open University, the leading distance learning institution in the United Kingdom, it contains an array of student data from several kinds of modules, including demographics, assessments, and interactions with the virtual learning environment (the data are organized into several CSV files such as studentInfo.csv, assessments.csv, studentRegistration.csv, etc.). The OULA dataset provides researchers with access to data from more than 32,000 students and supports research in a variety of areas, such as student performance (e.g., [15,23,24,25,26,27]), dropout and retention prediction (e.g., [13,28,29,30,31]), and personalized learning paths (e.g., [32]). Recent research has assessed ML models deployed in educational contexts based on fairness as well as performance metrics. Several studies evaluate these models’ fairness quantitatively to make sure that the predicted results do not unfairly favor or harm particular student groups based on factors like gender, socioeconomic position, or race. A few examples of fairness measures that have been studied to identify biases in educational data models include equalized odds and demographic parity. The studies that have used the OULAD and other notable datasets to assess student performance, retention, and dropout are compared using both performance and fairness metrics in Table 1. By examining a variety of contributing variables and parameters, a rising number of studies in the recent literature have used ML approaches to model student performance. These studies demonstrate how well ML works when analyzing educational data, providing insightful information about predicting student success and assessing academic performance.
As noted previously, while these studies assessed interpretability using linear or non-linear ML models, they did not address the use of potentially widely accepted metrics for evaluating explainability. Therefore, we used the same dataset to predict student performance in a binary classification task. We then describe the analysis process, which includes assessing the performance and explainability of two ML models, an ANN and a DT. We used local interpretations, such as LIME and SHAP, to provide a deeper understanding of the models’ decision-making processes, and we assessed explainability using two further measures to provide an understandable evaluation of each model’s accuracy and transparency.
Initially, we chose a subset of the entire OULAD for our work (https://github.com/gogoladzetedo/Open_University_Analytics (accessed on 2 September 2024)). Data from three modules—codes AAA, BBB, and CCC—were taken into consideration. Since the purpose of the data analysis was to estimate the passing rate of the students, we also concentrated on the student information. Table 2 summarizes the resulting student attributes, obtained by selecting a subset of attributes and constructing aggregated ones. The original classes “Pass” and “Distinction” were combined to create the “Pass” class. Similarly, the original dataset’s “Fail” and “Withdrawn” classes were combined to create the “Fail” class. To properly preprocess the OULAD, several steps were conducted to ensure data quality and analysis effectiveness. First, missing values were addressed by removing rows or columns with too much missing information. To enhance ML model performance and reduce bias caused by varying feature magnitudes, feature normalization was then carried out to scale numerical features to a standard range, usually between 0 and 1. To enable thorough model evaluation and prevent overfitting, the dataset was finally divided into training and test subsets, ensuring that the model performed well when applied to new data. These preprocessing steps were critical in preparing the OULAD for effective analysis and model development. As a result, only two of the four classes in the original dataset—Pass (5963 students) and Fail (7128 students)—were taken into consideration, and our dataset contains 14 features and 17,091 samples.
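A possible preprocessing pipeline along these lines is sketched below; the file name, the one-hot encoding of categorical attributes, and the 80/20 split ratio are assumptions for illustration rather than the exact pipeline used in the study.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("oulad_subset.csv")        # hypothetical pre-merged subset (modules AAA, BBB, CCC)
df = df.dropna()                            # drop rows with missing values

# Merge the four original outcome classes into a binary Pass/Fail label.
df["label"] = df["final_result"].map({"Pass": 1, "Distinction": 1, "Fail": 0, "Withdrawn": 0})

X = pd.get_dummies(df.drop(columns=["final_result", "label"]))   # one-hot encode categorical attributes
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

scaler = MinMaxScaler()                     # scale features to the [0, 1] range
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
feature_names = list(X.columns)
```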

3.2. Model Evaluation

The performance of the ML models was evaluated using five-fold cross-validation. The performance metrics were accuracy, precision, recall, and F-score, averaged over five folds. Every step, including feature importance analysis and performance evaluation, was performed 50 times to provide an accurate and trustworthy assessment. Figure 2 illustrates the entire analytical process. For explainability, LIME and SHAP explanations were used to generate local explanations for individual predictions, offering insights into the model’s decision-making process for specific instances.
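The sketch below shows how such an evaluation could be set up with scikit-learn for the DT baseline, using repeated stratified five-fold cross-validation to mirror the 50 repetitions; the exact configuration is an assumption for illustration.

```python
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=50, random_state=0)
scores = cross_validate(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv,
    scoring=["accuracy", "precision", "recall", "f1"],
)
# Mean and standard deviation over all folds and repetitions, as reported in Table 3.
for metric in ["accuracy", "precision", "recall", "f1"]:
    print(metric, scores[f"test_{metric}"].mean(), scores[f"test_{metric}"].std())
```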

4. Results

The architecture of a DT model that was created using the dataset is shown in Figure 3, which offers important insights into the variables affecting a student’s success or failure. The most important feature, “assessment_count”, which indicates how many assessments a student has finished, is highlighted at the root node of the tree. The student is predicted by the tree to fail if the assessment count is four or fewer, but passing is more likely when the assessment count is larger. This conclusion makes sense since students often do better the more assessments they complete. There are other features that influence the decision-making process besides the root node. The second most significant attribute was the “score”, which represents the average performance on assessments. Students have a higher chance of passing if they do well on assessments. It is interesting to note that other variables like “delay”, which can indicate how long it takes to complete assessments, and “sum_click”, which represents all online activity or interactions, also matter. These variables demonstrate that a student’s chances of success may be raised by making more efforts and by actively engaging.
After the ANN model was applied to the dataset, the LIME approach was used to make the model’s predictions interpretable. With LIME, we can deconstruct this model’s intricate decision-making process and gain an understanding of the separate roles that each feature plays in generating a forecast. By concentrating on specific students, such as Students 29–30, we can assess how various factors affect the model outputs and gain a better understanding of the reasoning behind the predictions. By outlining the relative significance of various features in predicting the class, LIME helps to make these opaque models clearer, as demonstrated by the accompanying visualizations. As seen in the visualization for Students 29–30 (see Figure 4), the LIME approach was used to explain the predictions produced by the ANN model. Green bars signify positive contributions that support the outcome (PASS), and red bars signify negative contributions that contradict the result (FAIL). The features are displayed along the y-axis of the plot. For example, the feature code_presentation_2014J ≤ 0.00 for Student 29 has a significant negative contribution, which contradicts the PASS result, whereas features such as code_module_CCC ≤ 0.00 and sum_click > 3.52 strongly support the PASS result.
The SHAP method was applied to explain the model’s predictions and to understand the individual role each feature plays in producing them. By analyzing specific students, such as Students 501 and 801, we evaluated how different factors influence the model’s outputs and gained deeper insights into the logic behind the predictions. SHAP highlights the relative importance of features in determining the predicted class, as demonstrated by the accompanying heatmap visualization for Students 501 and 801 (see Figure 5). Green shades show negative contributions that contradict the result, whereas orange shades show positive contributions that support the outcome. For instance, the feature code_presentation_2013B contributed significantly negatively (−0.28) to the result for Student 801. Conversely, attributes like score have a positive impact and support the forecast.
Table 3 reports the mean and standard deviation of the performance metrics, computed across 50 repetitions of five-fold cross-validation, for the two models. These metrics offer a thorough assessment of the reliability and accuracy of the models across various data subsets.
Table 4 summarizes the results of the explanation techniques for the two quantitative quality metrics on the OULA dataset. We developed a method to assess the stability and fidelity of the predictions produced by the DT model using the feature importance method. After training the DT model on the training set, as shown in Figure 2, ten randomly selected examples were taken from the test set for in-depth examination. The original model predicted each sampled instance, and the corresponding feature importance scores, which measure each feature’s contribution to the model’s decision-making process, were recorded. To assess fidelity, the model was retrained on the same dataset with different random states, simulating 50 runs and producing surrogate models for comparison. The fidelity score was computed as the proportion of instances in which the predictions made by these surrogate models matched those made by the original model. As for stability, the method uses a tolerance threshold to evaluate how consistent the feature importance values are across the retrained models. To provide a complete picture of the DT model’s performance, the average fidelity and stability scores were calculated over the sampled instances.
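The sketch below outlines one way to implement this procedure; the tolerance threshold (0.05) and the retraining scheme follow the description above, but the concrete values are illustrative assumptions rather than the study’s exact code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
sample_idx = rng.choice(len(np.asarray(X_test)), size=10, replace=False)   # 10 random test instances
X_sample = np.asarray(X_test)[sample_idx]

original = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
orig_pred = original.predict(X_sample)
orig_importance = original.feature_importances_

fidelities, stabilities = [], []
for state in range(1, 51):                              # 50 retrained trees with different random states
    retrained = DecisionTreeClassifier(random_state=state).fit(X_train, y_train)
    fidelities.append(np.mean(retrained.predict(X_sample) == orig_pred))
    # Stability: share of features whose importance stays within a tolerance of the original.
    stabilities.append(np.mean(np.abs(retrained.feature_importances_ - orig_importance) < 0.05))

print(f"fidelity: {np.mean(fidelities):.2%}, stability: {np.mean(stabilities):.2%}")
```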
To compute stability, we performed two rounds of LIME and SHAP explanations for each of 10 instances randomly selected from the test data. For the fidelity evaluation (see Section 2.6.1), we recorded the accuracy by comparing the predictions from both LIME and SHAP to those of the original ANN model for each instance. After calculating the fidelity and stability scores for each method, the scores were averaged over the 10 instances to provide a comprehensive evaluation.
Figure 6 and Figure 7 present the aggregated feature importance produced by LIME and SHAP for the different features in the dataset. Each bar depicts the average importance of a feature, while the error bars show the standard deviation, which reflects the variation in importance across dataset instances. Positive values show a positive contribution from the feature to the model’s predictions, whereas negative values imply a negative contribution. The spread of the error bars illustrates how consistently each feature is assigned a certain priority by the explanation techniques. Larger error bars indicate more variation, suggesting that a feature’s impact is more instance-specific or varies between model runs. Comparatively narrow error bars combined with large mean values show that certain features are clearly and consistently important across several runs. Features like “score”, “sum_click”, and “assessment_count”, for example, display a strong positive mean with a small standard deviation, exhibiting a consistent and significant contribution to the model’s forecasts. These visualizations help determine which features are most influential in the model predictions and also illustrate the positive and negative effects of each individual feature in the predictive model.

5. Discussion

In this experiment, two ML models—the DT and ANN model—were assessed according to their interpretability and performance. The performance indicators for the DT and ANN classifiers demonstrate clear variations in their efficacy, as seen in Table 3. While the ANN model showed a better accuracy of 92.09% and a slightly higher standard deviation of 0.0012, the DT attained an accuracy of 89% with a standard deviation of 0.0002. Overall, these metrics (accuracy, recall, F-score, and precision) show that the ANN classifier performs better than the DT on all assessed variables, indicating increased reliability, as well as effectiveness, in classification tasks. Nevertheless, performance by itself does not give a whole picture, especially when it comes to the explainability and stability of the model.
Table 4 presents the stability and fidelity of the different explanatory methods used with the ANN and DT classifiers. The DT achieved a 100% score for both fidelity and stability, which suggests that its explanations were completely consistent between instances and that the local surrogate models accurately approximated the original model’s behavior. However, for one particular instance (see Table 5), the original model’s prediction was incorrect, as seen from the disparity between the real class and the predicted class. Moreover, if both the original and surrogate models predict the same incorrect class, the fidelity score will still be 100%.
Due to several causes, such as decision boundary restrictions, model complexity, or insufficient representation of particular instances in the training data, the DT model could have misclassified specific instances and predicted Class 0 instead of Class 1. This model tends to suffer from overfitting, particularly when dealing with noisy data or many features that are weakly correlated with the target outcome. The DT splits the data sequentially on individual features, which can lead to highly specialized branches that do not generalize well to new, unseen data. For instance, strong predictors like “assessment_count” were heavily utilized in the top splits of the DT model created for the OULA dataset, as has been previously reported. However, such binary decisions can miss more complex relationships, such as how the combination of delayed assignment submissions (delay, with an importance of 0.044912) and total clicks on resources (sum_click, at 0.033845) influences overall performance. The simple splits in the tree struggle to capture these complex interactions between features, which results in overspecialized branches that do not transfer well to new student data. The model’s accuracy of 89% indicates that its overall performance is still good despite these weaker predictors. For the most significant predictors, the model is evidently interpretable; nevertheless, its understanding of the weaker features is rather limited.
However, LIME, when applied to the ANN model, showed marginally lower fidelity (80%) and stability (80%), but it still offered many benefits, especially when it comes to handling complex data and capturing intricate patterns. SHAP, when applied to the ANN model, demonstrated comparable fidelity (79.39%) but significantly outperformed LIME in stability (98.43%). Both SHAP and LIME provide local explanations, helping to interpret individual predictions by attributing importance to specific features. The lower fidelity scores suggest that the SHAP and LIME surrogate models do not fully replicate the ANN’s predictions; however, the ANN is particularly good at picking up on subtle relationships within the data that simpler models, like the DT, could miss. A few features, including the number of completed assessments (assessment_count) and the submission delay (delay), are evident in Figure 6 and played a crucial role in the prediction model. ANNs integrate these features in a more complex way, capturing interactions between multiple elements at once, whereas the DT depends on binary splits to make decisions about these aspects. For example, instead of only assessing whether a student was late in submitting assignments, an ANN may investigate how this delay interacts with the overall number of assessments completed or the specific module the student is enrolled in, as indicated by features like code_module_BBB or sum_click. Based on the analysis, it can be concluded that the ANN model delivers more consistent and accurate predictions compared to the DT model. Additionally, the stability scores indicate that, although the ANN model with LIME explanations is effective, the ANN model with SHAP explanations exhibits higher stability, resulting in more consistent explanations.
Finally, the findings from the OULAD dataset might be affected by its specific features, like the demographics, culture, and educational background of the Open University students. As a result, the outcomes are likely more relevant to online or distance learning settings and may not fully apply to the traditional classroom environment with different teaching methods. Additionally, the geographic and demographic diversity represented by the dataset enhances its applicability to other online education platforms globally. However, researchers should be mindful of potential limitations when applying the results to settings with different education systems or cultures.

6. Limitations and Future Research

Though employing LIME to evaluate the explainability of DT and ANN models has shown promising outcomes, this approach has several limitations. When used on high-dimensional datasets such as the OULAD, where feature interactions may affect the stability of explanations, LIME’s reliance on perturbation-based approaches might lead to inconsistent or noisy outcomes. Additionally, while LIME and SHAP offer explanations at the local level, they might not adequately convey the model’s global behavior. This can pose challenges, particularly for intricate models such as deep neural networks, where it is crucial to comprehend the overall decision-making process. We used LIME and SHAP to evaluate explainability, and the results showed that the abovementioned models produced significant fidelity and stability scores. To obtain a more comprehensive understanding of how different algorithms impact model interpretability, further research utilizing other explanation strategies might be necessary.
Future work might examine the relationships between these explanation techniques and various feature types, with a particular focus on essential explainability metrics like stability and fidelity, to ensure explainability in the decisions made by the models. Additionally, as noted by Al-kfairy et al. [41], in order to use AI systems in education, it is essential that ethical concerns be addressed, such as ensuring accountability and fairness in explainable AI. Similarly, Vetter et al. [42] highlighted the importance of locally interrogating AI ethics, especially when sensitive data are involved, where explainability is essential in fostering trust and supporting both educators and students in making decisions. Integrating these perspectives into future research could help bridge the gap between practical model performance and the ethical considerations of deploying AI in educational settings.

Author Contributions

Conceptualization, S.G. and M.S.; methodology, S.G.; validation, S.G. and M.S.; formal analysis, S.G. and M.S.; investigation, S.G. and M.S.; resources, S.G.; data curation, S.G. and M.S.; writing—original draft preparation, S.G.; writing—review and editing, S.G. and M.S.; supervision, M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Academy of Finland (project no. 356314).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset analyzed for this study can be found at https://github.com/gogoladzetedo/Open_University_Analytics (accessed on 2 September 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
AI: Artificial Intelligence
COMPAS: Correctional Offender Management Profiling for Alternative Sanctions
DARPA: Defense Advanced Research Projects Agency
DT: Decision Tree
LIME: Local Interpretable Model-Agnostic Explanations
ML: Machine Learning
NN: Neural Network
OULAD: Open University Learning Analytics Dataset
SHAP: SHapley Additive exPlanations
XAI: Explainable Artificial Intelligence
ReLU: Rectified Linear Unit

References

  1. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A survey on bias and fairness in machine learning. ACM Comput. Surv. (CSUR) 2021, 54, 1–35. [Google Scholar] [CrossRef]
  2. Gunning, D.; Aha, D. DARPA’s explainable artificial intelligence (XAI) program. AI Mag. 2019, 40, 44–58. [Google Scholar]
  3. Carvalho, D.V.; Pereira, E.M.; Cardoso, J.S. Machine learning interpretability: A survey on methods and metrics. Electronics 2019, 8, 832. [Google Scholar] [CrossRef]
  4. Saarela, M.; Podgorelec, V. Recent Applications of Explainable AI (XAI): A Systematic Literature Review. Appl. Sci. 2024, 14, 8884. [Google Scholar] [CrossRef]
  5. Gunasekara, S.; Saarela, M. Explainability in Educational Data Mining and Learning Analytics: An Umbrella Review. In Proceedings of the International Conference on Educational Data Mining, Atlanta, GA, USA, 15–19 July 2024; pp. 887–892. [Google Scholar]
  6. Li, C.; Li, M.; Huang, C.L.; Tseng, Y.T.; Kim, S.H.; Yeom, S. Educational Data Mining in Prediction of Students’ Learning Performance: A Scoping Review. In Proceedings of the IFIP World Conference on Computers in Education, Hiroshima, Japan, 21–25 August 2022; pp. 361–372. [Google Scholar]
  7. Masruroh, S.U.; Rosyada, D.; Vitalaya, N.A.R. Adaptive Recommendation System in Education Data Mining using Knowledge Discovery for Academic Predictive Analysis: Systematic Literature Review. In Proceedings of the 2021 9th International Conference on Cyber and IT Service Management (CITSM), Bengkulu, Indonesia, 22–23 September 2021; pp. 1–6. [Google Scholar]
  8. Shahiri, A.M.; Husain, W. A review on predicting student’s performance using data mining techniques. Procedia Comput. Sci. 2015, 72, 414–422. [Google Scholar] [CrossRef]
  9. Waheed, H.; Hassan, S.U.; Nawaz, R.; Aljohani, N.R.; Chen, G.; Gasevic, D. Early prediction of learners at risk in self-paced education: A neural network approach. Expert Syst. Appl. 2023, 213, 118868. [Google Scholar] [CrossRef]
  10. Matetic, M. Mining learning management system data using interpretable neural networks. In Proceedings of the 2019 42nd International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO), Opatija, Croatia, 20–24 May 2019; pp. 1282–1287. [Google Scholar]
  11. Swamy, V.; Radmehr, B.; Krco, N.; Marras, M.; Käser, T. Evaluating the explainers: Black-box explainable machine learning for student success prediction in MOOCs. Educ. Data Min. 2022, 2, 98–109. [Google Scholar]
  12. Rangone, G.N.; Montejano, G.A.; Garis, A.G.; Pizarro, C.A.; Molina, W.R. An Educational Data Mining Model based on Auto Machine Learning and Interpretable Machine Learning. In Proceedings of the 2022 IEEE Global Conference on Computing, Power and Communication Technologies (GlobConPT), New Delhi, India, 23–25 September 2022; pp. 1–6. [Google Scholar]
  13. Sohail, S.; Alvi, A.; Khanum, A. Interpretable and Adaptable Early Warning Learning Analytics Model. Comput. Mater. Contin. 2022, 71, 3211–3225. [Google Scholar] [CrossRef]
  14. Kondo, N.; Matsuda, T.; Hayashi, Y.; Matsukawa, H.; Tsubakimoto, M.; Watanabe, Y.; Tateishi, S.; Yamashita, H. Academic Success Prediction based on Important Student Data Selected via Multi-objective Evolutionary Computation. In Proceedings of the 2020 9th International Congress on Advanced Applied Informatics (IIAI-AAI), Kitakyushu, Japan, 1–15 September 2020; pp. 370–373. [Google Scholar]
  15. Capuano, N.; Rossi, D.; Ströele, V.; Caballé, S. Explainable Prediction of Student Performance in Online Courses. In Proceedings of the Learning Ideas Conference, New York, NY, USA, 14–16 June 2023; pp. 639–652. [Google Scholar]
  16. Alamri, R.; Alharbi, B. Explainable student performance prediction models: A systematic review. IEEE Access 2021, 9, 33132–33143. [Google Scholar] [CrossRef]
  17. Rokach, L.; Maimon, O. Decision trees. In Data Mining and Knowledge Discovery Handbook; Springer: New York, NY, USA, 2005; Volume 2, pp. 165–192. [Google Scholar]
  18. Yang, G.R.; Wang, X.J. Artificial neural networks for neuroscientists: A primer. Neuron 2020, 107, 1048–1070. [Google Scholar] [CrossRef]
  19. Malik, S.; Jothimani, K. Enhancing Student Success Prediction with FeatureX: A Fusion Voting Classifier Algorithm with Hybrid Feature Selection. Educ. Inf. Technol. 2024, 29, 8741–8791. [Google Scholar] [CrossRef]
  20. Vilone, G.; Longo, L. Notions of explainability and evaluation approaches for explainable artificial intelligence. Inf. Fusion 2021, 76, 89–106. [Google Scholar] [CrossRef]
  21. Saarela, M.; Heilala, V.; Jääskelä, P.; Rantakaulio, A.; Kärkkäinen, T. Explainable student agency analytics. IEEE Access 2021, 9, 137444–137459. [Google Scholar] [CrossRef]
  22. Kuzilek, J.; Hlosta, M.; Zdrahal, Z. Open university learning analytics dataset. Sci. Data 2017, 4, 170171. [Google Scholar] [CrossRef]
  23. Chavez, H.; Chavez-Arias, B.; Contreras-Rosas, S.; Alvarez-Rodríguez, J.M.; Raymundo, C. Artificial neural network model to predict student performance using nonpersonal information. Front. Educ. 2023, 8, 1106679. [Google Scholar] [CrossRef]
  24. Casalino, G.; Ducange, P.; Fazzolari, M.; Pecori, R. Fuzzy Hoeffding Decision Trees for Learning Analytics. In Proceedings of the OLUD@ WCCI, Parma, Italy, 18 July 2022; pp. 1–9. [Google Scholar]
  25. Alonso, J.M.; Casalino, G. Explainable artificial intelligence for human-centric data analysis in virtual learning environments. In Proceedings of the International Workshop on Higher Education Learning Methodologies and Technologies Online, Novedrate, Italy, 6–7 June 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 125–138. [Google Scholar]
  26. Gámez-Granados, J.C.; Esteban, A.; Rodriguez-Lozano, F.J.; Zafra, A. An algorithm based on fuzzy ordinal classification to predict students’ academic performance. Appl. Intell. 2023, 53, 27537–27559. [Google Scholar] [CrossRef]
  27. Gunasekara, S.; Saarela, M. Quantitative Assessment of Explainability in Machine Learning Models: A Study on the OULA Dataset. In Proceedings of the 40th ACM/SIGAPP Symposium on Applied Computing, Sicily, Italy, 31 March–4 April 2025. [Google Scholar]
  28. Gupta, A.; Garg, D.; Kumar, P. Mining sequential learning trajectories with hidden markov models for early prediction of at-risk students in e-learning environments. IEEE Trans. Learn. Technol. 2022, 15, 783–797. [Google Scholar] [CrossRef]
  29. Pei, B.; Xing, W. An interpretable pipeline for identifying at-risk students. J. Educ. Comput. Res. 2022, 60, 380–405. [Google Scholar] [CrossRef]
  30. Qin, A.; Boicu, M. EduBoost: An Interpretable Grey-Box Model Approach to Identify and Prevent Student Failure and Dropout. In Proceedings of the 2023 IEEE Frontiers in Education Conference (FIE), College Station, TX, USA, 18–21 October 2023; pp. 1–7. [Google Scholar]
  31. Lu, J.; Mou, J.; Li, P. Interpretive Analyses of Learner Dropout Prediction in Online STEM Courses. In Proceedings of the 2023 5th International Conference on Computer Science and Technologies in Education (CSTE), Xi’an, China, 21–23 April 2023; pp. 1–9. [Google Scholar]
  32. Casalino, G.; Ducange, P.; Fazzolari, M.; Pecori, R. Incremental and interpretable learning analytics through fuzzy hoeffding decision trees. In Proceedings of the International Workshop on Higher Education Learning Methodologies and Technologies Online, Palermo, Italy, 21–23 September 2022; pp. 674–690. [Google Scholar]
  33. Casalino, G.; Castellano, G.; Zaza, G. Neuro-fuzzy systems for learning analytics. In Proceedings of the International Conference on Intelligent Systems Design and Applications, Online, 13–15 December 2021; pp. 1341–1350. [Google Scholar]
  34. Heuer, H.; Breiter, A. Student Success Prediction and the Trade-Off between Big Data and Data Minimization. In Proceedings of the DeLFI 2018—Die 16. E-Learning Fachtagung Informatik, Bonn, Germany, 10–12 September 2018; pp. 219–230. [Google Scholar]
  35. Rizvi, S.; Rienties, B.; Khoja, S.A. The role of demographics in online learning; A decision tree based approach. Comput. Educ. 2019, 137, 32–47. [Google Scholar] [CrossRef]
  36. Azizah, E.N.; Pujianto, U.; Nugraha, E. Comparative performance between C4.5 and Naive Bayes classifiers in predicting student academic performance in a Virtual Learning Environment. In Proceedings of the 2018 4th International Conference on Education and Technology (ICET), Malang, Indonesia, 26–27 October 2018; pp. 18–22. [Google Scholar]
  37. Wasif, M.; Waheed, H.; Aljohani, N.R.; Hassan, S.U. Understanding student learning behavior and predicting their performance. In Cognitive Computing in Technology-Enhanced Learning; IGI Global: Hershey, PA, USA, 2019; pp. 1–28. [Google Scholar]
  38. Haiyang, L.; Wang, Z.; Benachour, P.; Tubman, P. A time series classification method for behaviour-based dropout prediction. In Proceedings of the 2018 IEEE 18th International Conference on Advanced Learning Technologies (ICALT), Mumbai, India, 9–13 July 2018; pp. 191–195. [Google Scholar]
  39. Queiroga, E.M.; Batista Machado, M.F.; Paragarino, V.R.; Primo, T.T.; Cechinel, C. Early prediction of at-risk students in secondary education: A countrywide k-12 learning analytics initiative in uruguay. Information 2022, 13, 401. [Google Scholar] [CrossRef]
  40. Saarela, M. On the relation of causality-versus correlation-based feature selection on model fairness. In Proceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, Ávila, Spain, 8–12 April 2024; pp. 56–64. [Google Scholar]
  41. Al-kfairy, M.; Mustafa, D.; Kshetri, N.; Insiew, M.; Alfandi, O. Ethical challenges and solutions of generative AI: An interdisciplinary perspective. Informatics 2024, 11, 58. [Google Scholar] [CrossRef]
  42. Vetter, M.A.; Lucia, B.; Jiang, J.; Othman, M. Towards a framework for local Interrogation of AI ethics: A case study on text generators, academic integrity, and composing with ChatGPT. Comput. Compos. 2024, 71, 102831. [Google Scholar] [CrossRef]
Figure 1. ANN architecture for OULAD prediction.
Figure 2. Flowchart of our applied analysis.
Figure 3. A visualization of the top portion of the DT model, focusing on the most significant features and splits due to space constraints. Based on this approach, “assessment_count” emerges as the most important feature in the model.
Figure 4. Visualized feature plot for Students 29–30 when using LIME on the ANN model. Green bars signify positive contributions that support the outcome (PASS), and red bars signify negative contributions that contradict the result (FAIL).
Figure 5. Visualized feature contributions heatmap for Students 501 and 801 using SHAP on the ANN model.
Figure 6. Aggregated feature importances from LIME explanations. Based on this approach, “num_of_attempts” emerged as the most important feature in the model.
Figure 7. Aggregated feature importances from SHAP explanations. Based on this approach, “assessment_count” and “score” emerged as the most important features in the model.
Table 1. A comparative review of the studies that utilized the OULAD.

Ref. | Objectives | Algorithms Utilized | XAI Technique(s) | Evaluation
[15] | Measure student performance using demographic, administrative, engagement, and intra-course outcome data | LR, RF, KNN, DT, SVM, MLP | SHAP | Both SVM and RF show good performance in most of the tasks; ensuring fairness and fostering trust in their adoption is equally crucial for predictions.
[33] | Forecast student performance by examining how they engage with platforms | Neuro-fuzzy network | Feature selection | Feature selection slightly reduced the neuro-fuzzy system performance.
[25] | Predict student performance | DT, FURIA | Decision rules, visualization tools | Automatically produced multimodal explanations using a combination of text and graphics.
[31] | Study how online behavior and background information impact STEM dropout rates separately and together | LR, SVM, MLP, GB | SHAP | Achieved 90% accuracy using GB.
[30] | Identify and prevent student failure and dropout | KNN, SVM, DT, MLP, RF, Voting classifier | Feature selection | Ensemble black-box model performs best over white-box models.
[29] | Identify at-risk students by examining learning activities more closely on a weekly basis, with an emphasis on output interpretability | SVM, DT, RF, Proposed model | LIME | Proposed model delivered the best performance compared with the baseline models when identifying at-risk students.
[13] | Introduce an interpretable rule-based genetic programming classifier using student data from multiple sources | RF | CN2 rule inducer algorithm | A hybrid statistical fuzzy system identified students likely to fail by analyzing their performance in initial assessments.
[34] | Measure student performance using daily activities | SVM, DT, RF, LR | Feature selection | Achieved 90.85% accuracy using SVM.
[35] | Determine how demographics affect academic success | DT | Decision rules | Achieved 83.14% accuracy.
[36] | Identify academic performance by the web pages visited | C4.5 Tree, NB | None | Achieved 63.8% accuracy using NB.
[37] | Determine which students are most likely to be unsuccessful | LR, SVM, RF, LR, NB | Feature selection | Achieved 63.8% accuracy using NB.
[38] | Use data to build day-by-day sequences to estimate dropout rates early | DT | Decision rules | Achieved 90% accuracy.
[39] | Identify and analyze biases by considering gender, participation in social welfare programs, and school zone location | LR, NB, DT, MLP, RF | Feature importance | Achieved high accuracy using RF; analyzed biases.
[40] | Evaluate the impact of different feature selection methods on the performance and fairness of ML models | LR, SVM, MLP, RF | Feature selection | Achieved the best performance using RF; causality-based FS generally resulted in fairer ML models, whereas correlation-based FS tended to yield models with higher performance.
Note: SVM = Support Vector Machine, DT = Decision Tree, RF = Random Forest, LR = Logistic Regression, NB = Naïve Bayes, ANN = Artificial Neural Network, MLP = Multilayer Perceptron, GB = Gradient Boosting, KNN = K-Nearest Neighbor, SHAP = SHapley Additive exPlanations, and FURIA = Fuzzy Unordered Rule Induction Algorithm.
Table 2. Description of attributes.

Attribute Name | Description
gender | Student’s gender (F/M).
region | The geographic region in which the student resided during the module presentation (East Anglian Region, East Midlands Region, West Midlands Region, South West Region, North Western Region, North Region, Yorkshire Region, South East Region, South Region, London Region, Scotland, Ireland, Wales).
highest_education | The highest level of education attained by the student prior to the module presentation (Lower Than A Level, A Level or Equivalent, HE Qualification, No Formal quals., Post Graduate Qualification).
imd_band | The Index of Multiple Deprivation (IMD) band of the place where the student resided during the module presentation. It is a metric used by the UK government to assess deprived regions within local authorities in England (e.g., 30–40%).
age_band | Band of the student’s age (0–35, 35–55, 55≤).
num_of_attempts | The total number of times the student has tried this module.
studied_credits | The total amount of credits earned by the student for each module they are presently enrolled in.
disability | Indicates whether the student has disclosed a disability (Y/N).
code_module | Identification code (AAA, BBB, or CCC) of the module that the assessment is associated with.
code_presentation | The presentation ID code, which is used to register a student for the module (2013B, 2013J, 2014B, 2014J).
assessment_count | The number of assessments the student submitted for the module.
sum_click | How many times the student engaged with the material.
delay | Average delay in assessment submissions (the average number of days between the fixed assessment date and the student’s submission date).
score | The assessment result for the student. Scores fall between 0 and 100; a score of less than 40 is considered a fail.
Table 3. Performance metrics for the OULAD test sets are presented, including the mean and standard deviation calculated over five-fold cross-validation with 50 repetitions. The highest average score for each metric is highlighted in bold.

Metric | Decision Tree (Mean / Std) | ANN Classifier (Mean / Std)
Accuracy | 0.8900 / 0.0002 | 0.9209 / 0.0012
Precision | 0.8899 / 0.0001 | 0.8987 / 0.0030
Recall | 0.8900 / 0.0001 | 0.9320 / 0.0044
F1 Score | 0.8900 / 0.0001 | 0.9149 / 0.0014
Table 4. Fidelity and stability for feature importance, LIME, and SHAP on the OULA dataset (over five-fold cross-validation and 50 repetitions).

Explanation Method (Model) | Fidelity % | Stability %
Feature importance (Decision Tree) | 100 | 100
LIME (ANN model) | 80 | 80
SHAP (ANN model) | 79.39 | 98.43
Table 5. Fidelity and stability scores for selected instances in a decision tree model.

Instance | True Class | Original Prediction | Fidelity (%) | Stability (%)
1529 | 1 | 0 | 100 | 100
684 | 0 | 0 | 100 | 100