1. Introduction
According to South Korea’s 2022 Basic Education Statistics [1], the school-age population is declining. Compared to 2021, the 2022 school-age population of South Korean universities decreased by 2.6% and that of colleges decreased by 6.4%. The number of students matriculating to universities has decreased significantly since the advent of the COVID-19 pandemic. The decline in the school-age population is a severe problem in South Korea.
When the school-age population declines, universities must be prepared both to recruit new students and to prevent current ones from dropping out. One of the main concerns for universities is the student dropout rate, which imposes costs not only on individuals but also on universities, local communities, and the nation. For instance, in 2020 the dropout rate increased by 1.1%, with rates of 1.9% in metropolitan areas and 3% in provinces. Students in provincial universities may be more susceptible to dropping out due to factors such as employment prospects or career goals [2]. Those who drop out often transfer to higher-ranked universities in metropolitan areas. This trend is expected to exacerbate the already significant issue of population influx to urban areas, leading to a structural waste of educational and financial resources. Furthermore, the problem has additional side effects, including missed educational opportunities for those who could have enrolled, disruptions to the academic atmosphere, and a waste of management resources.
Universities can prevent students from wasting time and losing interest by predicting which students are likely to drop out and providing personalized support. To do so, universities require a system that can respond to students who experience a change of heart. Several studies [3,4,5,6,7,8] have examined the prediction of student dropout rates in universities. Two metrics are crucial in predicting dropout: precision and recall. High dropout precision means that the model accurately predicts which students are likely to drop out, which is essential for directing counseling resources. Conversely, high dropout recall is important because it enables universities to identify all students who are at risk of dropping out. When only one of the two is high, problems arise. For instance, a model may correctly flag twenty of the forty students who are about to drop out (100% precision) but miss the other twenty (50% recall), resulting in missed counseling opportunities. Alternatively, a model may flag all students who will drop out (100% recall) but with low precision, wasting university resources on students who will not actually drop out.
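The trade-off in the example above can be made concrete with scikit-learn’s metrics. In this minimal illustration, the dropout counts follow the scenario in the text, and the 60 persisting students are hypothetical padding added for completeness:

```python
from sklearn.metrics import precision_score, recall_score

# Scenario from the text: 40 students actually drop out (label 1);
# the model flags 20 of them and raises no false alarms. The 60
# persisting students (label 0) are hypothetical padding.
y_true = [1] * 40 + [0] * 60
y_pred = [1] * 20 + [0] * 20 + [0] * 60

print(precision_score(y_true, y_pred))  # 1.0 -> every flagged student is a real dropout
print(recall_score(y_true, y_pred))     # 0.5 -> half of the actual dropouts are missed
```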
In this paper, we address the problem of predicting which students are about to drop out of university using data-driven algorithms. The advantages of data-driven algorithms are that they automatically learn from data, adapt to changing circumstances, and improve their performance over time. They uncover patterns and insights that may not be immediately apparent to human analysts and can process large amounts of data quickly and accurately.
Improving precision and recall depends on the quantity and quality of the data accumulated by the university. Since universities typically maintain extensive student records, data quality is generally not a significant concern. However, ensuring an adequate amount of data describing students who drop out is essential for a predictive model’s accuracy. This is challenging because the average student dropout rate is low: only 1.9% in metropolitan areas and 3% in provinces. Consequently, over 97% of the records do not describe students who drop out. To overcome this challenge, the feature set must be preprocessed to address the imbalance and prevent it from degrading the machine learning process.
Imbalanced data preprocessing methods can be categorized into an algorithmic approach, a data approach, and a cost-sensitive approach [9]. The algorithmic approach adjusts or tunes the model’s hyperparameters to increase the model’s performance. However, finding appropriate hyperparameter values takes a long time, and the tuning may only work with specific machine learning models. The data approach samples the available data during preprocessing, which may reduce the probability of overfitting the model to the given data. However, it may yield low accuracy when only a small portion of the data is used. Three data-approach methods are commonly used to address imbalanced data: oversampling [3,9], undersampling [3], and a combined approach [8]. Oversampling inflates the minority class, undersampling reduces the majority class, and the combined approach balances the benefits of both, as illustrated in the sketch below. The cost-sensitive approach re-learns the data by assigning different weights to misclassified samples using other algorithms. Although the weights can be learned automatically, they only work with some models. Note that both the algorithmic and cost-sensitive approaches depend on the specific supervised learning algorithm.
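As a concrete illustration of the data approach, the snippet below applies the three resampling families to a synthetic data set using the imbalanced-learn library. This is a sketch on toy data mimicking a roughly 3% dropout rate, not the data set used in this paper:

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN

# Synthetic data with a ~3% minority class (class 1 = dropout).
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.97, 0.03], random_state=42)
print("original:", Counter(y))

# Oversampling: synthesize new minority-class samples (SMOTE).
X_os, y_os = SMOTE(random_state=42).fit_resample(X, y)
print("oversampled:", Counter(y_os))

# Undersampling: discard majority-class samples.
X_us, y_us = RandomUnderSampler(random_state=42).fit_resample(X, y)
print("undersampled:", Counter(y_us))

# Combined: SMOTE oversampling followed by Edited Nearest Neighbours cleaning.
X_c, y_c = SMOTEENN(random_state=42).fit_resample(X, y)
print("combined:", Counter(y_c))
```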
In this paper, we introduce the Student Dropout Prediction (SDP) system, which aims to enhance the precision and recall of predicting student dropouts, providing valuable insights to academic administrators and counselors. The SDP system identifies significant features through permutation importance and SHAP analysis and addresses data imbalance using a data approach. It predicts potential dropouts with a hybrid model that combines an XGBoost model paired with the SMOTE oversampling method and a CatBoost model paired with the RandomOverSampler and SMOTEENN methods. To further assist academic administration, the data are analyzed with a clustering method to identify distinct groups of students who require different types of support, such as mentoring, dormitory assistance, or scholarships.
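The exact composition of the hybrid model is detailed in Section 4; as a rough sketch of the idea, two resampling-plus-booster pipelines can be trained and their predicted dropout probabilities combined. The combination rule (simple averaging) and the pipeline configurations below are illustrative assumptions, not the paper’s exact setup:

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# Pipeline 1: SMOTE oversampling feeding an XGBoost classifier.
xgb_pipe = Pipeline([("smote", SMOTE(random_state=42)),
                     ("clf", XGBClassifier(eval_metric="logloss"))])

# Pipeline 2: combined resampling feeding a CatBoost classifier.
cat_pipe = Pipeline([("resample", SMOTEENN(random_state=42)),
                     ("clf", CatBoostClassifier(verbose=0))])

def fit_hybrid(X, y):
    xgb_pipe.fit(X, y)
    cat_pipe.fit(X, y)

def predict_hybrid(X, threshold=0.5):
    # Average the two dropout probabilities (illustrative combination rule).
    p = (xgb_pipe.predict_proba(X)[:, 1] +
         cat_pipe.predict_proba(X)[:, 1]) / 2
    return (p >= threshold).astype(int)
```

Resampling inside an imbalanced-learn pipeline ensures that oversampling is applied only to training folds, never to evaluation data.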
Between 2015 and 2021, we collected 67,060 student records from Gyeongsang National University and identified 27 essential features from the 40 available. Additionally, by predicting the reasons for dropout and providing department-specific guidelines, we were able to offer personalized counseling to students. The contributions of this paper are as follows:
We offer guidelines for designing a model based on the most recent dropout data from a flagship national university in South Korea.
We propose the SDP system, a hybrid model that enhances dropout precision and recall while more accurately identifying the “high-risk” group and detecting a greater number of dropouts.
To provide customized counseling to students at risk of dropping out, we employ a clustering algorithm to identify the reasons behind this tendency, as sketched below. These reasons are subsequently shared with counselors and departments for effective intervention.
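A minimal sketch of such a clustering step, assuming k-means over a few support-related features: the feature names, toy values, and cluster count here are hypothetical illustrations, not the paper’s actual configuration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical features for students predicted to drop out.
at_risk = pd.DataFrame({
    "gpa":            [2.1, 3.4, 1.8, 2.9, 3.1, 1.5],
    "commute_km":     [45, 3, 60, 5, 50, 2],
    "scholarship":    [0, 0, 0, 1, 0, 0],   # 1 = currently funded
    "counseling_cnt": [0, 2, 1, 0, 3, 0],
})

# Standardize, then group at-risk students into candidate support profiles.
X = StandardScaler().fit_transform(at_risk)
at_risk["group"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Each cluster can then be mapped to an intervention, e.g., long commutes
# -> dormitory assistance; low GPA with no counseling -> mentoring.
print(at_risk.groupby("group").mean())
```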
Section 2 presents the related work on predicting university dropout. The characteristics and basic statistics of the data used in this paper are described in Section 3. The proposed prediction model, the SDP (Student Dropout Prediction) system, is described in Section 4. Section 5 presents the experimental results. Section 6 discusses the applicability of the presented results and offers suggestions to academic administrators. Finally, Section 7 concludes the paper.
2. Related Work
Yaacob et al. [5] conducted a study of 64 computer science students in their 1st and 2nd semesters in 2016, measuring their academic grades in 26 courses, including mathematics and IT courses. The authors experimented with several machine learning models, such as logistic regression, KNN, random forest, artificial neural networks, and decision trees, to predict the students’ performance. Although the data were imbalanced, no special imbalanced data processing was applied. Logistic regression exhibited the highest accuracy and AUC values. However, the authors did not measure dropout precision and dropout recall; instead, they evaluated their model’s performance using the AUC.
Shynarbek et al. [6] collected 366 student records from the department of computer science at Suleyman Demirel University, comprising grades in mathematics and computer-related courses from 2016 to 2017. The authors created a feature set using only mathematics and computer subjects and applied several machine learning models, such as naive Bayes, support vector machines, logistic regression, and artificial neural networks, to predict students’ academic performance. Unlike Yaacob et al. [5], they used four metrics (accuracy, recall, precision, and F1 score) to measure prediction performance. In the data preparation process, they replaced missing values with random values; data imbalance was not mentioned in the paper. Naive Bayes exhibited the highest accuracy (0.96), while the artificial neural network achieved the highest precision (0.94), recall (0.94), and F1 score (0.94).
Silva et al. [7] used the academic grades and personal information, comprising 23 features, of 331 undergraduate students from the department of computer engineering at the Universidade de Trás-os-Montes e Alto Douro (UTAD) from 2011 to 2019. Of the 331 students, 124 were dropouts and 207 graduated successfully. The authors applied several machine learning models, such as CatBoost, random forest, XGBoost, and artificial neural networks, to predict the students’ academic performance. In preprocessing, they scaled the data with MinMaxScaler and applied RandomOverSampling to handle the imbalanced data. The authors used three metrics (precision, recall, and F1 score) to evaluate the models’ performance. The training/test ratio was 8:2, and they performed 10-fold cross-validation. Artificial neural networks, XGBoost, and random forest exhibited the highest precision (0.85), recall (0.83), and F1 score (0.81), respectively.
Fernández et al. [8] collected data from 1418 undergraduate students, of whom 783 were dropouts and 635 were non-dropouts. The feature set comprised 19 enrollment-related fields, 14 qualification-related fields, and 4 scholarship-related fields, excluding student IDs and redundant data. The authors scaled numerical data with MinMaxScaler and encoded categorical data with one-hot encoding. To handle the imbalance in the data, they applied the SMOTETomek method, a combination of the SMOTE and Tomek links methods, during preprocessing. The authors applied several machine learning models, such as gradient boosting, random forest, support vector machines, and ensemble models, to predict the students’ dropout rate in each semester. They evaluated the models’ performance using the dropout recall and dropout precision metrics. In the enrollment model, gradient boosting achieved the highest dropout recall (72.340) and the support vector machine the highest dropout precision (65.854). In the 1st semester model, the ensemble model achieved the highest dropout recall (82.237) and gradient boosting the highest dropout precision (84.277). In the 2nd semester model, the ensemble model again had the highest dropout recall (82.237) and gradient boosting the highest dropout precision (79.245). In the 3rd semester model, the random forest model had both the highest dropout recall (88.462) and the highest dropout precision (86.792). In the 4th semester model, the support vector machine had both the highest dropout recall (91.549) and the highest dropout precision (89.041).
Barros et al. [3] gathered 7718 student records from the Federal Institute of Rio Grande do Norte, using 6 mathematics-course features and 19 demographic and socio-economic features as the feature set. To deal with the imbalanced data, they employed downsampling, SMOTE, ADASYN, and balanced bagging techniques in separate experiments. They tested artificial neural network and decision tree models with a 75%/25% training/test split. The highest precision, 0.991, was obtained by the artificial neural network with oversampling (SMOTE, ADASYN). For recall and F1 score, the decision tree without imbalance processing performed best, at 0.977 and 0.976, respectively.
Baranyi et al. [4] not only acquired university transcripts and personal information but also utilized high school grades to predict dropouts. They used a balanced set of 8319 student records from 2013 to 2019 at the Budapest University of Technology and Economics, comprising 30 features: 5 related to the university program, 21 to high school, and 4 to personal data. They tested various models, such as artificial neural networks, TabNet, XGBoost, random forest, and BaggingFCNN, and optimized the hyperparameters of the artificial neural networks using the hyperas package. The authors also used SHAP analysis to identify the most influential variables and found that “years elapsed” (the years since the matura examination) was the most influential variable, followed by grade-related features such as “university admission score”. The experiment showed that artificial neural networks had the highest precision (0.747) and recall (0.667).
Niyogisubizo et al. [10] predicted class dropout using data from Constantine the Philosopher University in Nitra from 2016 to 2020. The authors utilized primary data, including “tests”, “access”, and “project” features, which were highly correlated with the outcome. They stacked random forest, XGBoost, and gradient boosting models and used their outputs as the input to an artificial neural network. The stacking ensemble showed high performance, with overall precision, recall, and F1 score values of 0.93, 0.93, and 0.92, respectively (reported as midpoint and midpoint deviation).
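A comparable stacking setup can be expressed with scikit-learn’s StackingClassifier, with the boosted trees as base learners and a small neural network as the meta-learner. This is a generic reconstruction of the architecture described, not the authors’ code, and the hyperparameters are illustrative:

```python
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
                              GradientBoostingClassifier)
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

# Base learners whose out-of-fold predictions feed the meta-learner.
base = [("rf", RandomForestClassifier(random_state=0)),
        ("xgb", XGBClassifier(eval_metric="logloss")),
        ("gb", GradientBoostingClassifier(random_state=0))]

# An artificial neural network combines the base-model outputs.
stack = StackingClassifier(estimators=base,
                           final_estimator=MLPClassifier(max_iter=1000),
                           cv=5)
# Usage: stack.fit(X_train, y_train); stack.predict(X_test)
```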
This review of related work found that many previous studies used small data sets and did not address imbalanced data. Additionally, most considered academic grades the most important predictor of student dropout. However, the methods and data sets used in these studies vary, making it difficult to compare the models. Furthermore, previous works often reported high precision or high recall, but not both, and the reasons for dropout were not analyzed.
To address these gaps, this study uses a large data set spanning five years and includes student activities in addition to academic grades. We also compared the proposed approach with existing models on our data to ensure a fair comparison. We used a hybrid model to achieve both high precision and high recall in predicting student dropout. Finally, we analyzed the reasons for dropout to help counselors and administrators support students and make informed decisions. Overall, this study contributes to the field through a comprehensive approach that considers various factors to predict student dropout and analyzes the underlying reasons for it.