A Method for Prediction and Analysis of Student Performance That Combines Multi-Dimensional Features of Time and Space
Abstract
1. Introduction
- Development of an educational dataset incorporating multidimensional space–time attributes aimed at forecasting student performance.
- Successful application of a predictive model based on a dataset enriched with multidimensional space–time attributes for forecasting student performance.
- Examination of attribute significance in relation to prediction outcomes, pinpointing critical elements impacting student performance.
- Based on the prediction results and feature-importance analysis, this study proposes a dynamic optimization strategy for improving teaching and learning behaviors in real educational settings, giving participants in educational activities a practical way to act on the information generated during prediction and analysis.
2. Dataset and Data Preprocessing
2.1. Dataset Construction
2.1.1. Data Source
2.1.2. Basic Information of Students
2.1.3. Students’ Performance at All Stages of the Semester
2.1.4. Education Indicators of Student Origin
2.2. Data Processing
- To address the problem of limited and incomplete data, this study fills missing values with a KNN-based interpolation method and enlarges the dataset with an oversampling-based augmentation technique, making it easier to train the predictive models.
- The data processing stage must also account for, and correct, the class imbalance among student categories.
2.2.1. Normalization
2.2.2. Missing Value Completion
- Select the K value: determine the size of K, usually by choosing the best K through cross-validation.
- Calculate distances: compute the Euclidean distance between the target point and all known points: $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
- Find the K nearest neighbors: select the K known data points closest to the target point.
- Weighted averaging: average the values $x_i$ of the K neighbors with weights $w_i$ inversely proportional to distance ($w_i = 1/d_i$), so the K-nearest-neighbor weighted-average interpolation can be expressed as $\hat{x} = \left(\sum_{i=1}^{K} w_i x_i\right) \big/ \sum_{i=1}^{K} w_i$ (a minimal code sketch of the whole procedure follows this list).
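The procedure above can be sketched in Python as follows. This is a minimal illustration with assumed array inputs rather than the authors' exact implementation; in practice, scikit-learn's KNNImputer with weights='distance' provides an equivalent, production-ready version.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill missing values (NaNs) using inverse-distance-weighted averages
    over the k nearest rows that have no missing values (illustrative sketch)."""
    X = X.astype(float).copy()
    complete = X[~np.isnan(X).any(axis=1)]  # rows with no missing values
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Euclidean distance computed on the observed features only
        d = np.sqrt(((complete[:, ~miss] - row[~miss]) ** 2).sum(axis=1))
        nn = np.argsort(d)[:k]              # indices of the k nearest neighbors
        w = 1.0 / (d[nn] + 1e-9)            # weights inversely proportional to distance
        X[i, miss] = (w[:, None] * complete[nn][:, miss]).sum(axis=0) / w.sum()
    return X
```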
2.2.3. Handling Imbalanced Data
- Select a minority sample: choose a random sample $x$ from the minority class.
- Calculate neighbors: use a distance metric to find the k nearest neighbors of that sample.
- Generate a new sample: randomly select a neighbor $x_{nn}$ from these k neighbors and synthesize a new sample as $x_{new} = x + \lambda\,(x_{nn} - x)$, where $\lambda$ is drawn uniformly from $[0, 1]$ (a minimal sketch of this step follows the list).
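A minimal sketch of the synthesis step, under the same assumptions (a 2D NumPy array of minority-class samples), is shown below; the imbalanced-learn package wraps the full procedure as `SMOTE(k_neighbors=5).fit_resample(X, y)`.

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_sample(minority, k=5):
    """Synthesize one new minority-class sample (illustrative SMOTE sketch);
    `minority` is an assumed 2D array of minority-class feature vectors."""
    x = minority[rng.integers(len(minority))]   # step 1: pick a random minority sample
    d = np.linalg.norm(minority - x, axis=1)    # step 2: distances to all minority points
    neighbors = np.argsort(d)[1:k + 1]          # k nearest neighbors (index 0 is x itself)
    x_nn = minority[rng.choice(neighbors)]      # step 3: pick one neighbor at random
    lam = rng.random()                          # lambda drawn uniformly from [0, 1)
    return x + lam * (x_nn - x)                 # x_new = x + lambda * (x_nn - x)
```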
3. Research Methods
3.1. Related Technologies
3.1.1. Predictive Models
1. XGBoost
2. LightGBM
3. Random Forest
4. AdaBoost
5. Decision tree. A decision tree is built in four steps:
(1) Feature selection: the optimal splitting feature is chosen according to a criterion such as information gain or the Gini index.
(2) Node splitting: the data are split into different subsets according to the selected feature.
(3) Recursive construction: feature selection and splitting are repeated for each subset until a stopping condition is met.
(4) Pruning: after the tree is built, it may be pruned to reduce overfitting and improve the model's generalization ability.
6. SVM (all six models are instantiated in the sketch below)
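As an illustration, the six models can be instantiated with the hyperparameter settings reported in the experimental section. This is a sketch assuming the package versions listed later (scikit-learn 1.1.2, plus the standard xgboost and lightgbm packages); parameters are shown only where the paper's settings table makes them explicit and they differ from library defaults.

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

models = {
    "XGBoost": XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.3),
    "LightGBM": LGBMClassifier(boosting_type="gbdt", num_leaves=31,
                               learning_rate=0.1, n_estimators=100),
    "Random Forest": RandomForestClassifier(n_estimators=100, criterion="gini"),
    "AdaBoost": AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1),
                                   n_estimators=50, learning_rate=1.0),
    "Decision Tree": DecisionTreeClassifier(criterion="gini", splitter="best"),
    "SVM": SVC(C=1.0, kernel="rbf", gamma="scale"),
}
```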
3.1.2. SHAP Analysis
The overall workflow of the proposed method consists of six steps:
(1) Data collection and integration: collect three types of data, namely students' demographic information, scores at the various stages of the semester, and educational indicators from their places of origin, and integrate them.
(2) Data preprocessing: impute missing values with the KNN interpolation method and handle imbalanced data with SMOTE.
(3) Model training: obtain a multidimensional spatiotemporal dataset through steps (1) and (2) and use it to train six machine learning models: XGBoost, LightGBM, Random Forest, AdaBoost, Decision Tree, and SVM (see the sketch following this list).
(4) Optimal model selection: evaluate the models with four metrics, accuracy, recall, precision, and F1 score, and select the best predictive model.
(5) Feature importance analysis: perform SHAP analysis and weight analysis on the models to assess the importance of each feature.
(6) Data ablation: combine and divide the multidimensional spatiotemporal dataset into seven sub-datasets, train the machine learning models on each of them, and analyze the experimental results.
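Steps (3) to (5) can be sketched as follows, reusing the models dictionary from the sketch above. Here X and y are hypothetical names for the assembled feature matrix and labels, the 80/20 split is an assumption, and macro averaging is one reasonable choice for multi-class recall, precision, and F1.

```python
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# X, y: preprocessed multidimensional spatiotemporal features and labels (assumed to exist)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"{name}: acc={accuracy_score(y_test, pred):.2f}, "
          f"recall={recall_score(y_test, pred, average='macro'):.2f}, "
          f"precision={precision_score(y_test, pred, average='macro'):.2f}, "
          f"f1={f1_score(y_test, pred, average='macro'):.2f}")

# SHAP feature-importance analysis on a tree-based model (e.g., the best performer)
explainer = shap.TreeExplainer(models["XGBoost"])
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)  # global view of per-feature contributions
```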
4. Experimental Results
4.1. Forecast Results
1. Accuracy
2. Recall, where $TP$ (True Positive) represents the count of true positives and $FN$ (False Negative) indicates the number of false negatives
3. F1 Score (see the standard formulas below)
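These are the usual definitions ($TN$ and $FP$ denote true negatives and false positives; precision is included because the F1 score is its harmonic mean with recall):

```latex
\begin{aligned}
\text{Accuracy}  &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{Recall}    &= \frac{TP}{TP + FN},\\
\text{Precision} &= \frac{TP}{TP + FP}, &
\text{F1}        &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\end{aligned}
```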
4.2. Feature Analysis
4.3. Data Ablation
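One way to build the seven sub-datasets evaluated in this section is sketched below. The mapping of D1, D2, and D3 to the three feature categories of Section 2 is inferred from the dataset construction, and the column names are hypothetical, so treat this as an assumption rather than the authors' exact split.

```python
# Hypothetical column names for the three feature groups (see the feature table):
d1_cols = ["age", "gender", "class", "nation", "origin", "gaokao_score", "urban_rural"]
d2_cols = [f"exp_score_{i}" for i in range(1, 8)] + [f"test_score_{i}" for i in range(1, 8)]
d3_cols = ["phd_teachers", "master_teachers", "bachelor_teachers", "senior_teachers",
           "deputy_senior_teachers", "intermediate_teachers", "digital_terminals",
           "multimedia_classrooms", "edu_investment", "instruments_equipment", "books"]

subsets = {
    "D1": d1_cols, "D2": d2_cols, "D3": d3_cols,
    "D1+D2": d1_cols + d2_cols, "D1+D3": d1_cols + d3_cols,
    "D2+D3": d2_cols + d3_cols, "D1+D2+D3": d1_cols + d2_cols + d3_cols,
}
for name, cols in subsets.items():
    X_sub = X[cols]  # X: preprocessed DataFrame (assumed); retrain the six models on each subset
```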
5. Applicability and Feasibility
5.1. Guidance for Student Actions
5.2. Guidance for Teacher Interventions
6. Summary and Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
| Feature Category | Feature Name |
|---|---|
| Basic student data | Age |
| | Gender |
| | Class |
| | Nation (ethnicity) |
| | Origin of student |
| | College entrance examination score |
| | Student category (urban/rural) |
| Performance data of students at each stage | Experimental scores for stages 1 to 7 of the semester |
| | Test scores for stages 1 to 7 of the semester |
| Education indicators of the student's place of origin | Number of teachers with PhD degrees |
| | Number of teachers with master's degrees |
| | Number of teachers with undergraduate degrees |
| | Number of senior teachers |
| | Number of deputy senior teachers |
| | Number of intermediate teachers |
| | Number of digital terminals |
| | Number of multimedia classrooms |
| | Education investment assets |
| | Number of educational instruments and equipment |
| | Total number of books in collection |
| Label | Student course scores |
| Models | Hyperparameter Settings |
|---|---|
| XGBoost | n_estimators=100, max_depth=6, learning_rate=0.3, subsample=1.0, colsample_bytree=1.0, min_child_weight=1, gamma=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1 |
| LightGBM | boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100, subsample_for_bin=200000, min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20, subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0, class_weight=None, importance_type='split', n_jobs=-1, silent=True, random_state=None |
| Random Forest | n_estimators=100, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, warm_start=False |
| AdaBoost | base_estimator=DecisionTreeClassifier(max_depth=1), n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None |
| SVM | C=1.0, kernel='rbf', degree=3, gamma='scale', coef0=0.0, shrinking=True, probability=False, tol=0.001, cache_size=200, class_weight=None, verbose=False, max_iter=-1, decision_function_shape='ovr', break_ties=False, random_state=None |
| Decision Tree | criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None |
| Equipment and Software | Equipment Model and Software Version |
|---|---|
| CPU | 13th Gen Intel(R) Core(TM) i5-13500HX, 2.50 GHz |
| GPU | NVIDIA GeForce RTX 4060 |
| Operating system | CentOS 7.6 |
| Testing software versions | Python 3.9, NumPy 1.23.3, pandas 1.5.0, scikit-learn 1.1.2 |
| Model | XGBoost | LightGBM | RF | AdaBoost | DT | SVM |
|---|---|---|---|---|---|---|
| Accuracy | 0.95 | 0.90 | 0.92 | 0.89 | 0.87 | 0.89 |
| Recall | 0.96 | 0.85 | 0.93 | 0.92 | 0.85 | 0.88 |
| Precision | 0.93 | 0.93 | 0.91 | 0.87 | 0.92 | 0.93 |
| F1 score | 0.94 | 0.89 | 0.92 | 0.89 | 0.88 | 0.90 |
| Dataset | Model | ACC | RC | PC | F1 |
|---|---|---|---|---|---|
| D1 | XGB | 0.68 | 0.70 | 0.67 | 0.68 |
| | LGBM | 0.62 | 0.61 | 0.65 | 0.63 |
| | RF | 0.64 | 0.58 | 0.72 | 0.64 |
| | AB | 0.63 | 0.70 | 0.60 | 0.65 |
| | DT | 0.60 | 0.61 | 0.73 | 0.66 |
| | SVM | 0.61 | 0.62 | 0.65 | 0.63 |
| D2 | XGB | 0.83 | 0.80 | 0.85 | 0.82 |
| | LGBM | 0.79 | 0.75 | 0.83 | 0.79 |
| | RF | 0.75 | 0.82 | 0.70 | 0.76 |
| | AB | 0.81 | 0.80 | 0.83 | 0.81 |
| | DT | 0.71 | 0.75 | 0.69 | 0.72 |
| | SVM | 0.74 | 0.77 | 0.70 | 0.73 |
| D3 | XGB | 0.80 | 0.81 | 0.78 | 0.79 |
| | LGBM | 0.76 | 0.75 | 0.78 | 0.76 |
| | RF | 0.77 | 0.76 | 0.79 | 0.77 |
| | AB | 0.75 | 0.70 | 0.80 | 0.75 |
| | DT | 0.73 | 0.71 | 0.77 | 0.74 |
| | SVM | 0.72 | 0.68 | 0.75 | 0.71 |
| D1 + D2 | XGB | 0.87 | 0.86 | 0.90 | 0.88 |
| | LGBM | 0.85 | 0.81 | 0.88 | 0.84 |
| | RF | 0.82 | 0.77 | 0.83 | 0.80 |
| | AB | 0.86 | 0.85 | 0.89 | 0.87 |
| | DT | 0.80 | 0.81 | 0.79 | 0.80 |
| | SVM | 0.83 | 0.82 | 0.85 | 0.83 |
| D1 + D3 | XGB | 0.82 | 0.85 | 0.80 | 0.82 |
| | LGBM | 0.80 | 0.76 | 0.82 | 0.79 |
| | RF | 0.79 | 0.78 | 0.83 | 0.80 |
| | AB | 0.76 | 0.73 | 0.80 | 0.76 |
| | DT | 0.73 | 0.71 | 0.77 | 0.74 |
| | SVM | 0.75 | 0.73 | 0.81 | 0.77 |
| D2 + D3 | XGB | 0.88 | 0.86 | 0.90 | 0.88 |
| | LGBM | 0.85 | 0.83 | 0.88 | 0.85 |
| | RF | 0.82 | 0.77 | 0.86 | 0.81 |
| | AB | 0.85 | 0.81 | 0.90 | 0.85 |
| | DT | 0.81 | 0.85 | 0.77 | 0.81 |
| | SVM | 0.83 | 0.88 | 0.79 | 0.83 |
| D1 + D2 + D3 | XGB | 0.95 | 0.97 | 0.94 | 0.95 |
| | LGBM | 0.90 | 0.88 | 0.92 | 0.90 |
| | RF | 0.93 | 0.94 | 0.90 | 0.92 |
| | AB | 0.91 | 0.95 | 0.88 | 0.91 |
| | DT | 0.88 | 0.86 | 0.93 | 0.89 |
| | SVM | 0.89 | 0.87 | 0.92 | 0.89 |