Comparative Analysis of Machine Learning Models for Predicting Student Success in Online Programming Courses: A Study Based on LMS Data and External Factors
Abstract
1. Introduction
2. Materials and Methods
2.1. Data Collection
2.1.1. Academic and Demographic Data
- First Partial Grades: Grade 1, Grade 2, and Exam 1, which together carry a weight of 50 points.
- Demographic Characteristics: Current age, sex, nationality, ethnicity, presence of any disability, province and canton of birth, and province and canton of residence.
Inclusion criteria:
- Students enrolled in an Object-Oriented Programming course.
- Complete availability of academic and interaction data in Moodle.
Exclusion criteria:
- Students with incomplete or significantly missing data.
- Students who dropped the course before the first midterm.
2.1.2. Moodle Interaction Data
- Course Logins: Number of times students logged into the course.
- Reviewed Resources: Number of times students accessed the study materials.
- Assignment Submissions: Quantity and punctuality of submissions.
- Evaluation of Participation: Completion and review of quizzes.
- Grade Access and Review: Accessing and reviewing grades.
2.2. Definition of the Dependent Variable
- At Risk (Class 0): Students with grades below 35.
- Not at Risk (Class 1): Students with grades of 35 points or higher.
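As a minimal illustration (assuming the final grade is stored in a pandas column named final_note, as in the OLS analysis of Section 3.2), the binary target can be derived as follows:

```python
import pandas as pd

# Toy frame standing in for the merged dataset; final_note is the student's final grade.
df = pd.DataFrame({"final_note": [28.5, 60.0, 34.9, 75.0]})

# Class 0 = At Risk (below 35 points); Class 1 = Not at Risk (35 points or higher).
df["target"] = (df["final_note"] >= 35).astype(int)
print(df)
```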
2.3. Data Preprocessing
2.3.1. Cleaning and Preparation
- Data Type Conversion: Categorical variables were transformed into the ‘category’ type and numerical variables into ‘float’ or ‘int’ as appropriate.
- Categorical Variable Encoding: One-hot encoding was applied to nominal categorical variables (see the sketch below).
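A minimal sketch of these two steps with Pandas (the column names here are hypothetical):

```python
import pandas as pd

# Toy frame with one nominal and one numerical column (hypothetical names).
df = pd.DataFrame({"sex": ["F", "M", "F"], "current_age": ["21", "34", "27"]})

# Data type conversion: nominal -> 'category', numerical -> int/float.
df["sex"] = df["sex"].astype("category")
df["current_age"] = df["current_age"].astype("int")

# One-hot encoding for the nominal variable.
df_encoded = pd.get_dummies(df, columns=["sex"], prefix="sex")
print(df_encoded.dtypes)
```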
2.3.2. Feature Selection
- Academic Grades: Grade 1, Grade 2, Exam 1.
- Moodle Interactions: Course logins, submitted assignments, completed quizzes, reviewed resources.
- Demographic Data: Current age, sex, and presence of disability.
2.4. Machine Learning Models
- Logistic Regression: A linear model used for binary classification.
- Random Forest Classifier: An ensemble of decision trees that enhances accuracy and reduces overfitting.
- Support Vector Machine (SVM): An algorithm that seeks the hyperplane that best separates classes.
- Artificial Neural Network (MLP): A model capable of capturing complex nonlinear relationships in the data.
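The three scikit-learn models can be instantiated as in the sketch below; the hyperparameter values shown are the optimum settings later reported in Section 3.3, while max_iter, random_state, and probability=True are assumptions added for reproducibility and ROC analysis. The MLP was built with Keras via SciKeras (see the sketch in Section 2.7).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

models = {
    "Logistic Regression": LogisticRegression(C=0.1, max_iter=1000),
    "Random Forest": RandomForestClassifier(
        n_estimators=200, max_depth=None, min_samples_split=2, random_state=42),
    "SVM": SVC(C=10, kernel="rbf", probability=True),  # probability=True enables AUC-ROC
}
```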
2.5. Experimental Procedure
2.5.1. Dataset Division
- Training Set: 70% of the data were used to train the models.
- Test Set: The remaining 30% was used to evaluate the performance of the models.
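A sketch of the split (the random seed and stratification are assumptions; the text only specifies the 70/30 proportion):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(591, 10))    # placeholder for the 591-student feature matrix
y = rng.integers(0, 2, size=591)  # placeholder binary target (Section 2.2)

# stratify=y preserves the class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)
```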
2.5.2. Handling Class Imbalance
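Class imbalance was handled with SMOTE from Imbalanced-learn (see Section 2.7). A minimal sketch, continuing the split above; oversampling is applied to the training set only, so the test set remains untouched:

```python
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)  # synthesizes minority-class (At Risk) examples
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```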
2.5.3. Normalization and Scaling
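A sketch of the scaling step; StandardScaler is an assumption, as the specific scaler is not named in the text. The scaler is fit on the training data only and then applied to the test set to avoid information leakage:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_res)  # fit on (resampled) training data
X_test_scaled = scaler.transform(X_test)            # reuse the same transform
```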
2.5.4. Hyperparameter Optimization
- Logistic Regression: The regularization parameter C was tuned.
- Random Forest: Various tree depths, estimators, and splitting criteria were explored.
- SVM: Different kernels (linear and RBF) and values of the regularization parameter C were tested.
- Neural Network: The number of neurons, activation functions, learning rate, and number of epochs were adjusted.
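As a sketch of this step for the SVM (an exhaustive grid search with AUC scoring is an assumption; the text does not name the search strategy). Five folds match the five cross-validation scores reported in Section 3.4:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(probability=True), param_grid, scoring="roc_auc", cv=5)
search.fit(X_train_scaled, y_train_res)
print(search.best_params_)  # Section 3.3 reports {'C': 10, 'kernel': 'rbf'}
```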
2.5.5. Training and Validation
- Area Under the ROC Curve (AUC-ROC).
- Precision, Recall, and Specificity.
- Confusion Matrix.
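Cross-validated AUC scores such as those in Section 3.4 can be obtained as follows (continuing the sketches above):

```python
from sklearn.model_selection import cross_val_score

auc_scores = cross_val_score(search.best_estimator_, X_train_scaled, y_train_res,
                             scoring="roc_auc", cv=5)  # five AUC values, one per fold
print(auc_scores, auc_scores.mean())
```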
2.6. Model Evaluation
2.6.1. Performance Metrics
- ROC Curves: To visualize the trade-off between true-positive and false-positive rates.
- Classification reports: Precision, recall, and F1-score.
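Both outputs are available in scikit-learn; a sketch on the held-out test set, continuing the sketches above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import classification_report, roc_auc_score, RocCurveDisplay

best_model = search.best_estimator_
y_pred = best_model.predict(X_test_scaled)
y_prob = best_model.predict_proba(X_test_scaled)[:, 1]  # P(Class 1: Not at Risk)

print(classification_report(y_test, y_pred, target_names=["At Risk", "Not at Risk"]))
print("AUC-ROC:", roc_auc_score(y_test, y_prob))

RocCurveDisplay.from_predictions(y_test, y_prob)  # true- vs. false-positive trade-off
plt.show()
```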
2.6.2. Interpretation of Results
2.7. Tools and Technologies Used
- Programming Language: Python 3.11 (Python Software Foundation, Beaverton, OR, USA).
- Libraries:
- Pandas 2.2.3 (NumFOCUS, Austin, TX, USA) and NumPy 1.26.4 (NumFOCUS, Austin, TX, USA): data manipulation and processing.
- Scikit-learn 1.5.2 (Scikit-learn Developers, BSD license): Implementation of machine-learning models and preprocessing.
- Imbalanced-learn 0.12.3 (Imbalanced-learn Developers, Europe): Handling class imbalance with SMOTE.
- TensorFlow 2.17.0 (Google LLC, Mountain View, CA, USA) and Keras 3.5.0 (Google LLC, Mountain View, CA, USA) (via SciKeras): Building and training the neural network.
- Matplotlib 3.9.2 (Matplotlib Development Team) and Seaborn 0.13.2 (Michael L. Waskom): For data and result visualization.
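For the MLP, a minimal Keras sketch consistent with the optimum hyperparameters reported in Section 3.3 (32 'tanh' neurons, dropout rate 0.0, 50 epochs, batch size 32); the Adam optimizer and the single sigmoid output are assumptions, and in the study the model was wrapped with SciKeras for scikit-learn compatibility:

```python
from tensorflow import keras

def build_mlp(n_features: int) -> keras.Model:
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features,)),
        keras.layers.Dense(32, activation="tanh"),    # neurons=32, activation='tanh'
        keras.layers.Dropout(0.0),                    # dropout_rate=0.0
        keras.layers.Dense(1, activation="sigmoid"),  # P(Class 1: Not at Risk)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[keras.metrics.AUC(name="auc")])
    return model

# model = build_mlp(X_train_scaled.shape[1])
# model.fit(X_train_scaled, y_train_res, epochs=50, batch_size=32, verbose=0)
```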
2.8. Ethical Considerations
3. Results
3.1. Descriptive Analysis
3.2. Linear Regression Model Using Ordinary Least Squares (OLS)
- R-squared (R² = 0.794): This value indicates the proportion of variability in the dependent variable (final_note) that is explained by the model: approximately 79.4% of the variation in final_note is accounted for by the predictor variables (note1, note2, exam1, etc.), suggesting good explanatory power.
- Adjusted R-squared (0.790): Similar to R², but adjusted for the number of predictors in the model. This adjustment prevents R² from increasing merely because more predictors are added, without truly improving the model. Here, the value of 0.790 is very close to R², indicating that the number of variables was appropriate for explaining the outcome without overfitting.
- F-statistic (202.9) and Prob (F-statistic) (1.96 × 10⁻¹⁹⁰): The F-statistic measures the overall quality of the model fit by comparing it with a model without predictors (only the mean). A high F-statistic and a very low p-value (1.96 × 10⁻¹⁹⁰) indicate that the model with predictor variables is significantly better than one without them, suggesting a good overall fit.
- Omnibus, Prob (Omnibus), Jarque–Bera (JB), Prob (JB): These tests evaluate whether the model residuals are normally distributed. A low p-value (0.000) in both cases suggests that the residuals do not follow a normal distribution, which may indicate that the model does not fit well in all cases.
- Skew (−0.961) and kurtosis (5.084): A negative skew indicates a distribution with a longer left tail. A kurtosis greater than 3 indicates a distribution more peaked than normal.
- Durbin–Watson (1.800): This value is relatively close to 2, which indicates no strong evidence of autocorrelation in the residuals.
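The statistics above (Omnibus, Durbin–Watson, Jarque–Bera) match the summary produced by statsmodels' OLS; a sketch under that assumption, where df is the 591-student dataset with the columns shown in Section 3.1:

```python
import statsmodels.api as sm

# Hypothetical predictor subset for illustration; the full model uses 11 predictors
# (Df Model = 11 in the summary table).
predictors = df[["note1", "note2", "exam1", "course_accesses", "resources_reviewed"]]
X_ols = sm.add_constant(predictors)               # adds the intercept term
ols_model = sm.OLS(df["final_note"], X_ols).fit()
print(ols_model.summary())                        # R², F-statistic, Omnibus, DW, JB...
```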
3.3. Model Optimization and Hyperparameter Selection
3.4. Cross-Validation
3.5. Evaluation on the Test Set
3.6. Classification Reports
- Logistic Regression has high precision for Class 1 (96%) but moderate precision for Class 0 (57%), indicating that it is very effective at identifying students not at risk, but less precise in identifying those at risk.
- Random Forest shows balanced precision and recall for both classes, especially excelling in Class 1 with 93% across all metrics.
- SVM has high precision for Class 1 (95%), similar to Logistic Regression, but lower precision for Class 0 (54%) compared to Random Forest.
- The Neural Network (MLP) maintains high precision for Class 1 (96%) and reasonable performance for Class 0 (55% precision), similar to the SVM.
3.7. ROC Curves
- Logistic Regression: Although this model was initially used as a simple baseline, its performance on the test set was outstanding, especially in detecting at-risk students (high sensitivity). Its high AUC-ROC value (0.9354) indicates strong discriminative ability.
- Random Forest: This model showed the highest overall accuracy (89%) and a good balance between precision and recall in both classes. Its ability to handle non-linear relationships and capture complex interactions between variables may have contributed to this performance.
- SVM: Although it had an acceptable performance, its precision and AUC-ROC were lower than those of the other models. This suggests that, for this dataset, SVMs with an RBF kernel did not sufficiently capture the present complexities.
- Neural Network (MLP): Despite its excellent performance in cross-validation, its performance on the test set was slightly lower, indicating possible overfitting. However, it maintained a good overall predictive ability.
3.8. Confusion Matrices
3.9. Feature Importance
- note2: 20.72%
- exam1: 16.34%
- note1: 16.24%
- resources_reviewed: 8.22%
- assignments_reviewed: 5.01%
- First Partial Grades: The variables note1, note2, and exam1 were consistently the most important, highlighting the relevance of early academic performance in predicting final success. This is consistent with academic logic, in which initial grades reflect students’ understanding of and adaptation to the course.
- Interactions in Moodle: Variables related to online activities, such as resources_reviewed and assignments_reviewed, also had a significant influence. This finding indicates that greater engagement with online resources and assessments is associated with better academic outcomes.
- Demographic Variables: Features such as age, gender, ethnicity, disability, and nationality had less influence on the prediction, suggesting that, while they are relevant, their impact is less pronounced compared to grades and participation in Moodle.
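A sketch of how such percentages can be obtained, assuming they come from the Random Forest's impurity-based feature_importances_ (the text does not name the attribution method):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Continuing the earlier sketches; feature_names is a hypothetical list of column names.
rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train_res, y_train_res)
importances = pd.Series(rf.feature_importances_, index=feature_names)
print((importances.sort_values(ascending=False) * 100).round(2))  # note2, exam1, note1, ...
```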
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
Descriptive statistics of the numerical variables (n = 591):

| Feature | Count | Mean | Std Dev | Min | 25th Percentile | 50th Percentile (Median) | 75th Percentile | Max |
|---|---|---|---|---|---|---|---|---|
| note1 | 591 | 12.55 | 2.46 | 0 | 11.83 | 13.17 | 14.17 | 15.00 |
| note2 | 591 | 11.28 | 3.67 | 0 | 10.00 | 12.00 | 14.00 | 15.00 |
| exam1 | 591 | 16.18 | 4.29 | 0 | 14.00 | 18.00 | 19.50 | 20.00 |
| course_accesses | 591 | 119.34 | 93.65 | 0 | 59.50 | 99.00 | 151.00 | 1203.00 |
| grades_reviewed | 591 | 4.49 | 9.81 | 0 | 0.00 | 1.00 | 5.00 | 114.00 |
| quizzes_completed | 591 | 2.62 | 0.63 | 0 | 2.00 | 3.00 | 3.00 | 4.00 |
| quizzes_reviewed | 591 | 54.82 | 25.89 | 0 | 37.50 | 50.00 | 69.00 | 214.00 |
| resources_reviewed | 591 | 29.25 | 26.39 | 0 | 8.00 | 24.00 | 43.00 | 276.00 |
| updated_assignments_submitted | 591 | 1.19 | 2.14 | 0 | 0.00 | 0.00 | 2.00 | 18.00 |
| assignments_reviewed | 591 | 39.53 | 32.76 | 0 | 18.00 | 32.00 | 53.00 | 301.00 |
| current_age | 591 | 28.02 | 7.66 | 17 | 22.00 | 27.00 | 33.00 | 60.00 |
OLS regression summary for final_note:

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable | final_note | R-squared | 0.794 |
| Model | OLS | Adj. R-squared | 0.790 |
| Method | Least Squares | F-statistic | 202.9 |
| Prob (F-statistic) | 1.96 × 10⁻¹⁹⁰ | Log-Likelihood | −2027.3 |
| No. Observations | 591 | AIC | 4079 |
| Df Residuals | 579 | BIC | 4131 |
| Df Model | 11 | Covariance Type | non-robust |
| Omnibus | 100.335 | Durbin–Watson | 1.800 |
| Prob (Omnibus) | 0.000 | Jarque–Bera (JB) | 198.003 |
| Skew | −0.961 | Prob (JB) | 1.01 × 10⁻⁴³ |
| Kurtosis | 5.084 | Cond. No. | 1.27 × 10³ |
Optimum hyperparameters selected for each model:

| Model | Optimum Hyperparameters |
|---|---|
| Logistic Regression | C = 0.1 |
| Random Forest | max_depth: None, min_samples_split: 2, n_estimators: 200 |
| SVM | C: 10, kernel: 'rbf' |
| Artificial Neural Network (MLP) | activation: 'tanh', batch_size: 32, dropout_rate: 0.0, epochs: 50, neurons: 32 |
Cross-validation AUC scores (five folds) and mean AUC per model:

| Model | Cross-Validation AUC Scores | Mean AUC |
|---|---|---|
| Logistic Regression | [0.9458, 0.9603, 0.9588, 0.9705, 0.9569] | 0.9584 |
| Random Forest | [0.9911, 0.9899, 0.9892, 0.9861, 0.9921] | 0.9897 |
| SVM | [0.9802, 0.9928, 0.9747, 0.9792, 0.9880] | 0.9830 |
| Artificial Neural Network (MLP) | [0.9901, 0.9932, 0.9922, 0.9871, 0.9918] | 0.9909 |
Performance of each model on the test set:

| Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|
| Logistic Regression | 87% | 90% | 87% | 88% | 0.9354 |
| Random Forest | 89% | 89% | 89% | 89% | 0.9103 |
| SVM | 85% | 88% | 85% | 86% | 0.8558 |
| Artificial Neural Network (MLP) | 86% | 89% | 86% | 87% | 0.9016 |
Per-class classification reports on the test set:

| Model | Class | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Logistic Regression | Class 0 (At Risk) | 57% | 83% | 68% |
| | Class 1 (Not at Risk) | 96% | 88% | 92% |
| Random Forest | Class 0 (At Risk) | 66% | 66% | 66% |
| | Class 1 (Not at Risk) | 93% | 93% | 93% |
| SVM | Class 0 (At Risk) | 54% | 76% | 63% |
| | Class 1 (Not at Risk) | 95% | 87% | 91% |
| Artificial Neural Network (MLP) | Class 0 (At Risk) | 55% | 79% | 65% |
| | Class 1 (Not at Risk) | 96% | 87% | 91% |