In recent years, massive open online courses (MOOCs) have received widespread attention. With abundant high-quality resources from many well-known universities and the emergence of platforms such as edX, Coursera and Udacity, MOOCs have become increasingly popular worldwide. However, because MOOCs bind learners less tightly than the traditional classroom does, many students abandon their courses because of internal or external factors, which wastes educational resources. To reduce dropout, researchers have focused on predicting MOOC learners’ dropout behavior, aiming to identify at-risk learners accurately and intervene early so that they persist in learning, thereby improving course completion rates [1]. Predicting MOOC learners’ dropout tendency from their learning behavior has therefore become a hot topic in MOOC big data analytics [2] and educational data mining [3].
MOOC dropout prediction aims to estimate the probability that a learner will quit a course at some point in the future, based on the learner’s current learning behavior records [4]. Many scholars have studied the problem of MOOC learner dropout. Some researchers used traditional classification methods such as logistic regression (LR) [5, 6, 7], K-nearest neighbors (KNN) [6] and support vector machines (SVM) [7, 8, 9] to build prediction models. Lu Xiaohang et al. extracted 19 features from course data from three perspectives (click stream, homework and tests, and forum behavior), treated each learner’s whole learning cycle as a time series, and built a sliding window model combined with machine learning algorithms to predict learner dropout dynamically [8]. Kloft et al. extracted 19 student behavior features, each representable as a single real number, from click stream data, applied logistic regression and linear SVM to predict students’ dropout behavior in the coming weeks, and concluded that adding forum data from previous weeks to the prediction can effectively improve prediction accuracy [7]. Liang et al. used three groups of features, namely enrollment features (characterizing a learner’s behavior in a particular course), user features (characterizing a learner’s behavior on the platform) and course features (characterizing the course profile), to build a Gradient Boosting Decision Tree model that predicts the probability of a student dropping out in the next 10 days [9].
Other researchers used neural network methods such as convolutional neural networks (CNN) [3, 10, 11] and long short-term memory networks (LSTM) [11, 12] to predict dropout behavior. Reference [10] designed a simple time-series-based feature matrix that combines time information with students’ behavior features, and used a convolutional neural network to predict dropout. The proposed CNN model considers the local correlation of learning behaviors and improves dropout prediction accuracy. Reference [12] treated dropout as a time series prediction problem, constructed a time series of student behavior from click stream and forum data, and used an LSTM to predict whether students would drop out; the LSTM network outperformed the other models in terms of AUC score. Reference [11] combined the advantages of CNN and LSTM and proposed a hybrid network, CLMS Net, composed of CNN, LSTM and SVM. The proposed model can automatically extract features from student behavior data and improves dropout prediction performance.
Although the above methods achieve good prediction results, they do not analyze the factors that affect MOOC learners’ dropout behavior from a statistical perspective. Wang Mengmeng et al. [13] analyzed the factors affecting MOOC learners’ attrition from three aspects: learners’ own factors, curriculum-related factors and technical factors, and proposed specific strategies to stimulate and maintain MOOC learners’ learning motivation, drawing on the ten principles for stimulating online learning motivation proposed by other scholars [14]. Yang et al. [15] explored the impact of forum posting and social network behavior on dropout from the perspective of social network analysis. These works focused on theoretical analysis of the factors influencing MOOC dropout but lacked quantitative analysis of each factor. In general, although there are many studies on dropout rates, there is no standardized research method recognized by the academic community, and most studies are still exploratory [5].
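The time-series feature matrix and sliding window ideas surveyed above can be illustrated with a small sketch. It is not the implementation of any cited work: the event format `(learner_id, day, behavior)`, the three behavior types and the function names `build_feature_matrix` and `sliding_windows` are all assumptions made for illustration. Each learner gets a days-by-behaviors count matrix, which could then be cut into fixed-width windows and fed to a classifier.

```python
from collections import defaultdict

# Hypothetical behavior types; real platforms log different events.
BEHAVIORS = ["video", "problem", "forum"]


def build_feature_matrix(events, num_days):
    """Return {learner_id: matrix}, where matrix[d][b] counts how many
    times the learner performed behavior b on day d (0-indexed)."""
    matrices = defaultdict(
        lambda: [[0] * len(BEHAVIORS) for _ in range(num_days)]
    )
    for learner_id, day, behavior in events:
        # Ignore events outside the observation period or of unknown type.
        if 0 <= day < num_days and behavior in BEHAVIORS:
            matrices[learner_id][day][BEHAVIORS.index(behavior)] += 1
    return dict(matrices)


def sliding_windows(matrix, width):
    """Yield consecutive windows of `width` days from one learner's matrix."""
    for start in range(len(matrix) - width + 1):
        yield matrix[start:start + width]


# Toy log (invented data): (learner_id, day index, behavior type)
events = [
    ("u1", 0, "video"), ("u1", 0, "video"), ("u1", 1, "forum"),
    ("u2", 2, "problem"),
]
matrices = build_feature_matrix(events, num_days=3)
windows = list(sliding_windows(matrices["u1"], width=2))
```

Rows index days and columns index behavior types, so a convolution over neighboring rows sees the local, day-to-day correlation of behaviors, while a recurrent model would read the rows in order.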
In this research, we select the spring 2013 course data from the HarvardX platform, use statistical methods to quantitatively analyze the factors that may affect MOOC learners’ dropout behavior, and perform feature selection based on the results of that analysis. We then build dropout prediction models using three traditional machine learning models: logistic regression, the K-nearest neighbor algorithm and the random forest algorithm (a decision tree ensemble based on the bagging framework). The model with the best prediction performance in the experimental comparison is selected to predict dropout behavior. This research can help teaching staff and managers detect learners’ dropout tendencies in advance and intervene as early as possible, helping learners adapt better to the learning process, thereby reducing the dropout rate, improving teaching quality and achieving better teaching results [16].
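As a rough illustration of statistics-driven feature selection, the sketch below screens categorical factors with a chi-square test of independence against the dropout label. This paragraph does not say which statistical test is used, so the chi-square test, the 0.05 significance level, the factor names and the toy data are all assumptions for illustration, not the HarvardX data or this study’s procedure.

```python
from collections import Counter

# Chi-square critical values at significance level 0.05, keyed by
# degrees of freedom (standard table values).
CRITICAL_005 = {1: 3.841, 2: 5.991, 3: 7.815}


def chi_square(values, labels):
    """Return (statistic, degrees of freedom) for the chi-square test of
    independence between a categorical factor and the dropout label."""
    n = len(values)
    observed = Counter(zip(values, labels))
    row_totals = Counter(values)
    col_totals = Counter(labels)
    stat = 0.0
    for value in row_totals:
        for label in col_totals:
            expected = row_totals[value] * col_totals[label] / n
            stat += (observed[(value, label)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


def select_features(table, labels):
    """Keep the factors whose statistic exceeds the 0.05 critical value."""
    kept = []
    for name, values in table.items():
        stat, dof = chi_square(values, labels)
        if dof in CRITICAL_005 and stat > CRITICAL_005[dof]:
            kept.append(name)
    return kept


# Invented toy data: 10 dropouts (label 1) followed by 10 completers (label 0).
labels = [1] * 10 + [0] * 10
table = {
    # Hypothetical behavior factor, strongly associated with dropout.
    "posted_in_forum": ["yes"] * 1 + ["no"] * 9 + ["yes"] * 9 + ["no"] * 1,
    # Hypothetical learner characteristic, evenly split in both groups.
    "gender": ["f"] * 5 + ["m"] * 5 + ["f"] * 5 + ["m"] * 5,
}
selected = select_features(table, labels)
```

On this toy data only `posted_in_forum` passes the screen; the retained factors would then serve as inputs to the logistic regression, K-nearest neighbor and random forest models.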
The main contributions of this study are: (1) dividing the factors that affect learners’ dropout behavior into two categories for statistical analysis: learners’ own characteristics and learners’ learning behaviors; (2) conducting feature selection based on the statistical analysis of the factors affecting dropout behavior; (3) building learner dropout prediction models using logistic regression, random forest and K-nearest neighbor models, respectively; (4) carrying out extensive experiments on the HarvardX data set, training and testing the three prediction models, and concluding that the random forest model predicts dropout better than the other two models.
The remainder of this paper is organized as follows: the second section introduces the dataset and methodology, the third section presents the results of the study, the fourth section discusses them, and the last section concludes the paper.