1. Introduction
General education courses in undergraduate programs are assumed to ensure that all college students acquire the foundational interdisciplinary knowledge, as well as the analytical and communication skills that are necessary to address the demands of their selected majors and those of their chosen professions [1]. Failures in general education courses may ignite a cascade of undesirable effects, which can span from mild (the repetition of a course) to severe (e.g., academic dismissal, delayed degree attainment, and loss of financial aid eligibility). Thus, how students perform in such courses can be considered key in determining their academic success, including retention and graduation [2,3]. It is an accepted fact that the timing of the identification of at-risk students is a critical aspect of the effectiveness of the implementation of remedial interventions [4,5,6]. However, educators usually have very little information about students’ performance during the first half of the semester, which can make the identification of at-risk students both challenging and broadly consequential if erroneous conclusions are reached. To wit, unrecognized difficulties (an event classified in signal detection theory as a miss) are likely to lead to course failure. Notwithstanding the need for valid early predictions of students’ academic performance, which rely on limited information, most of the research on algorithms that are intended to assist educators’ performance forecasts has relied upon much greater amounts of information collected within a much larger timeframe and has often involved discipline-specific subject matters [7,8,9]. Examples are predictions of final course grades in a particular subject matter based on students’ grade point average (GPA), as well as grades in pre-requisite courses [10], or more simply, on students’ academic history, as exemplified by their performance in past courses [11]. Yet, the algorithms that yield optimal results tend to vary considerably [12,13], along with a myriad of innovative stand-alone or hybrid solutions that appear as a regular stream in the extant literature [14]. As a result, the selection and subsequent use of a suitable technique for predicting at-risk students may become so challenging and overwhelming for an educator whose expertise is other than computer science that ignoring potentially viable technical solutions is the most likely course of action [15]. For such educators, continued reliance on personal intuition and conscious reasoning may seem preferable to the ordeal of understanding the technically dense machine learning literature. However, this comes at the cost of an increased likelihood of biases affecting the processing of students’ information for assessment and decision-making [16,17], including personal preferences for the parameters to take into account and for the type and amount of data that are necessary to generate sensible predictions. Consider that, as the semester progresses, the amount of information about students that is available to an educator accumulates, but its utility decreases as remedial actions become harder to implement and their success becomes more uncertain [18,19]. Earlier predictions are unquestionably more valuable than later predictions, but at the beginning of a course, very little information is available to the educator, making predictions about a student’s difficulties even more uncertain (e.g., is initial poor performance symptomatic of a momentary hurdle, perhaps linked to the idiosyncrasies of an assignment, or a reliable indication of serious issues?).
The COVID-19 pandemic has complicated the prediction matter by suddenly relocating students, most of whom were exclusively accustomed to face-to-face instruction, to online instruction. Although the synchronous online mode that is adopted by many institutions of higher learning has replicated many aspects of the face-to-face mode (e.g., real-time interactions in a virtual classroom), physical distance, technical idiosyncrasies, competencies, and other issues (e.g., students’ degree and manner of adaptation to environmental changes) may have made learning in online courses different from that of face-to-face courses [20,21]. For instance, it has been proposed that the online mode has fostered the practice of a more continuous engagement in learning activities [20]. As evidence of change, studies have reported higher online performance (as measured by course grades) than pre-pandemic face-to-face performance [21,22,23,24,25]. However, other studies have reported declines or no change at all [4,25,26,27]. Thus, because uncertainty endures as to whether remote instruction during the pandemic has fostered relevant changes to students’ learning, it remains unclear whether performance predictions that are made online and face-to-face can be considered equivalent.
In the present study, we examined whether algorithms that are commonly used for the predictions of final grades could be of assistance to educators in both face-to-face and online courses when the only information available to the educators is students’ performance on the first test and assignment. Both assessment measures can be classified as formative assessment tools [28]. These are tools that are used by students to assess their learning in a course and by educators to determine the effectiveness of the instruction they deliver, thereby defining formative assessment as serving both diagnostic and feedback functions. In a course, formative assessment measures can be said to be particularly critical to students’ academic success, since the information they provide has the potential to foster change in the way that students approach the curriculum and understand its demands, as well as in the way that educators teach. Thus, in principle, the earlier the assessment, the higher may be its impact on both students and educators. Formative assessment differs from summative assessment (i.e., final tests), whose primary aim is to measure learning comprehensively across the entire semester as an evaluation of the extent to which it meets pre-set learning outcomes. A summative assessment indicator is the final course grade that is given to each student at the end of the semester. The effectiveness of early formative assessment measures, each of which covers a portion of the curriculum to be acquired in a course, resides in their ability to adequately predict final course grades, which reflect the student’s learning of the entire course curriculum.
It is customary for institutions of higher education to demand that students meet a minimum performance requirement to gain access to, and remain enrolled in, any degree program. At the institution that was selected for the present research, this requirement entails maintaining a GPA that is better than a C (greater than 79%). Thus, to ensure authenticity, we classified the final course grades into three performance categories: high (H—equal to or greater than 90%); medium (M—80–89.99%); and low (L—79% or below). This stringent classification scheme created categories of comparable size, while it minimized the impact of grade inflation and educators’ grading idiosyncrasies, as well as reflected the standards of academic success at the selected institution.
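For concreteness, the minimal sketch below (in Python) expresses this banding scheme as a function; the name grade_band is ours and is not part of the study's materials, and the gap between 79% and 80% in the verbal definition is resolved here by treating any score below 80% as L.

```python
def grade_band(final_percentage: float) -> str:
    """Map a final course percentage to the H/M/L scheme used in this study.

    H: >= 90, M: 80-89.99, L: below 80 (the at-risk category).
    """
    if final_percentage >= 90:
        return "H"
    if final_percentage >= 80:
        return "M"
    return "L"

# Example: a student finishing the course at 78.5% falls in the at-risk (L) category.
print(grade_band(78.5))  # -> "L"
```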
At the outset, we recognized that the predictive validity of a forecast may refer to a variety of key parameters, such as accuracy [(hits + correct rejections)/(hits + correct rejections + misses + false alarms)], precision [hits/(hits + false alarms)], and sensitivity [hits/(hits + misses)]. In the task of identifying at-risk students, however, correct rejections are not particularly relevant. Furthermore, false alarms are much less costly, or even less relevant, than misses. Namely, false alarms are likely to reflect cases of temporary difficulties experienced by individual students that are mistakenly identified as enduring and/or severe, thereby creating unnecessary but fleeting stress in such students. Thus, in the present study, we relied on sensitivity as a measure of the predictive validity of forecasts of at-risk students (i.e., learners receiving an L grade at the end of the semester). A sensitivity score for an L classification was conceptualized as a proportion: the number of correctly classified L grades divided by the total number of actual L grades, whether correctly classified as L or misclassified as H or M.
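Restated from the definitions above, with h hits, c correct rejections, m misses, and f false alarms:

```latex
\mathrm{accuracy} = \frac{h + c}{h + c + m + f}, \qquad
\mathrm{precision} = \frac{h}{h + f}, \qquad
\mathrm{sensitivity} = \frac{h}{h + m}
```

For the L category, a hit is an at-risk student correctly predicted to receive an L, whereas a miss is an at-risk student predicted to receive an H or an M.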
The study involved female undergraduate students of a society that is in transition from a patriarchal order to one that is akin to gender equity in education and employment [29,30,31,32]. In such a society, of which a prototypical example is the Kingdom of Saudi Arabia (KSA), female students of college age are the main target of top-down gender-equity interventions. Decrees and massive financial investments aim to re-set the country’s social structure to favor meritocracy for both sexes at the expense of tribal and patriarchal favoritism [33]. Thus, the academic success of female college students is a priority for the adequate development of the economic and social engine of KSA, making our research a window into the performance of this highly valued population, as well as into the utility of early performance assessment in the said population.
The current study tested several popular learning algorithms to answer two interrelated questions:
Can at-risk students (defined as those with an end-of-the-semester score of L in a general education course) be effectively identified by very early performance indicators (i.e., grades on the first test and first assignment) through one of these algorithms?
Do predictions of at-risk students vary between face-to-face instruction and synchronous online instruction, as well as with the specific subject matter taught in a course?
We selected a sample of courses that are representative of the general education curriculum of a Saudi higher education institution that follows a curriculum imported from the U.S. and a student-centered pedagogy. The courses had been taught by the same instructor both online (during the pandemic) and face-to-face (before the pandemic) for at least three semesters in each mode. The acceptable sensitivity threshold for the selected algorithms was determined by a sample of educators who taught similar courses. We predicted that if early performance indicators cannot be relied upon to identify at-risk students, early predictions would exhibit a sensitivity score at or below the identified subjective threshold. This outcome would be likely if instructors were more lenient at the start of the semester, thereby making the results of the first assignment and test less representative of the demands that are placed upon students in the courses they teach. Alternatively, the higher a sensitivity score is above the threshold, the more the first assignment and test can be said to represent students’ overall performance. The description of the specific algorithms that we selected and the rationale behind their selection are included in the Methods section.
3. Results
The results of the present study are organized into the following sections: a description of students’ performance, and a description of the performance of the chosen algorithms in predicting the final course grades.
3.1. Students’ Performance
To obtain a sample of grades that adequately reflected students’ key performance levels in the courses in which they were enrolled, and bypassed grade inflation and instructors’ grading idiosyncrasies, we classified the final course grades into three performance categories: High (equal to or greater than 90%); Medium (80–89.99%); and Low (79% or below). The latter category included at-risk students.
Table 1 displays the percentage of grades that were assigned to H, M, and L performance by course and instructional mode.
Overall, a greater percentage of students yielded L or M performance in face-to-face classes than in online classes, whereas a greater percentage of students yielded H performance online, χ² (2, n = 5158) = 285.34, p < 0.001. However, when we examined the frequency of grades H, M, and L in individual courses, a more nuanced pattern emerged about the relationship between performance level and instructional modality, χ² (2, n = 612–1390) ≥ 14.66, p ≤ 0.001. In online classes, H was the most frequent score. The only exception was online STA, for which M was the most frequent score. In face-to-face WCO and WED classes, there was a somewhat even distribution of L, M, and H scores. In face-to-face STA classes, L was the most frequent score, whereas in face-to-face ACS classes most scores were either H or M. These patterns of frequency distribution were reflected in students’ end-of-course feedback surveys, in which STA was judged as a difficult course, but less so online. Although the other classes were reported not to be as difficult as STA, they were also seen as easier when they were online. However, at the start of the semester, STA was judged as more difficult when it was online.
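For readers who wish to run this type of analysis on their own grade counts, a chi-square test of independence between instructional mode and performance level can be computed as sketched below; the contingency table shown is a hypothetical placeholder, not the study's data.

```python
from scipy.stats import chi2_contingency

# Hypothetical 2 (mode) x 3 (performance level) contingency table of grade counts.
# Rows: face-to-face, online; columns: H, M, L. These numbers are illustrative only.
observed = [
    [700, 900, 1000],   # face-to-face
    [1200, 800, 558],   # online
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2({dof}) = {chi2:.2f}, p = {p:.4f}")
```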
3.2. Algorithms’ Performance
The final course grades, labeled as H, M, or L, were used as the targets of the estimation. We applied the selected algorithms to assess their ability to assign students to each of the selected performance categories and, in particular, to identify at-risk students. We relied on sensitivity scores as a measure of the quality of the estimation that was made.
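As an illustration of how such sensitivity scores can be obtained, the sketch below trains one of the selected algorithms (KNN) on the two early indicators and computes the recall (sensitivity) of the L category; the synthetic data, variable names, and train/test split are illustrative only and do not reproduce the study's exact protocol.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import recall_score

# X: one row per student with [first_test_score, first_assignment_score];
# y: final-grade category "H", "M", or "L". Synthetic placeholders are used here.
rng = np.random.default_rng(0)
X = rng.uniform(40, 100, size=(300, 2))
y = np.where(X.mean(axis=1) >= 90, "H", np.where(X.mean(axis=1) >= 80, "M", "L"))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Sensitivity (recall) of the at-risk category L: hits / (hits + misses).
sensitivity_L = recall_score(y_test, y_pred, labels=["L"], average=None)[0]
print(f"Sensitivity for L: {sensitivity_L:.2f}")
```

The other algorithms that were examined (LR, NB, SVM, MLP, and RF) can be substituted by swapping in their scikit-learn counterparts (e.g., LogisticRegression, GaussianNB, SVC, MLPClassifier, and RandomForestClassifier).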
Table 2 displays the sensitivity scores of the L-performance category, which indexed at-risk students, as a function of instructional mode and type of course.
To determine the extent to which actual instructors in actual classrooms would tolerate misclassifications of at-risk students as likely to do well in their courses, we presented 10 faculty who had experience in at least one of the selected courses with the following scenario: “Imagine that after a couple of weeks into a semester, you are asked to take over a class from another instructor who was abruptly granted a leave of absence for health reasons. A colleague offers you an algorithm that can help you identify students who will not do well in this class. Imagine that at present, unbeknownst to you, 10 students will not do well in this class without some sort of early intervention. What is the maximum number of students out of 10 that the algorithm can fail to correctly identify as being at-risk before you would deem the algorithm unlikely to be useful? Alternatively, what is the minimum number of students out of 10 that the algorithm should correctly identify as being at-risk for you to deem the algorithm useful? Keep in mind that the early identification of at-risk students in real life is a complex task, which may lead educators to misclassify some students as likely to do well (misses or false negatives). Thus, provide a realistic number that would apply to you as representative of your teaching experience”. The answers included sensitivity rates ranging from 0.6 to 0.9, with an average of 0.7 (the value indicated by the arrow in the figures that follow). Thus, in our study, we considered a sensitivity score of 0.7 or above as acceptable, which was treated as the threshold of subjective effectiveness.
Was there an algorithm that could be described as superior in both face-to-face (FtF) and online instruction?
Figure 1, which plots sensitivity as a function of the type of algorithm and mode of instruction, shows that KNN, LR, and NB consistently performed more effectively face-to-face than online. SVM, MLP, and RF yielded poor performance both online and face-to-face. Interestingly, no algorithm performed adequately (i.e., above the threshold of subjective effectiveness) in online classes.
Across algorithms, did sensitivity change as a function of the type of course and mode of instruction?
Figure 2, which plots sensitivity scores as a function of course and mode of instruction, shows two dissimilar patterns: ACS, WCO, and WED yielded more effective predictions face-to-face than online, whereas STA yielded more effective predictions online. However, except for ACS, and less so for STA, the differences between the modes of instruction were minor.
To better understand the patterns that were yielded by ACS, WCO, WED, STA, and PSY, we examined whether specific algorithms contributed to them.
Figure 3, Figure 4, Figure 5, Figure 6 and Figure 7, which plot the sensitivity scores as a function of the algorithm and mode of instruction in each of these courses, illustrate that the effectiveness of algorithms in making predictions depended on both the type of course and the mode through which the instruction was delivered. Thus, educators would be well advised to consider both variables in selecting algorithms for predicting at-risk students in the classes they teach. For instance, although ACS, WCO, and WED showed greater effectiveness in the prediction task (as measured by the threshold of subjective effectiveness of 0.7) in the face-to-face mode, not all algorithms did so across all courses. To illustrate, ACS was an exception, as all algorithms made effective predictions in face-to-face classes and none in online classes. However, such a clear pattern was not obtained in WCO and WED. Specifically, only KNN and RF made predictions that were above the threshold of effectiveness in face-to-face WCO classes, whereas all algorithms, except SVM, made predictions that were above the threshold of subjective effectiveness in face-to-face WED classes. Furthermore, only LR and NB made effective predictions in online WED classes. In contrast to the checkered pattern of WCO and WED, STA exhibited a pattern that was largely the opposite of ACS. Namely, all algorithms yielded predictions that were above the threshold of subjective effectiveness online, whereas only MLP and RF made predictions that were above said threshold face-to-face. Irrespective of whether PSY was delivered face-to-face or online, the predictions of all the algorithms were poor, all well below the threshold of subjective effectiveness.
4. Discussion
The findings of the present research can be summarized in two points. First, if the overall predictive validity of machine learning algorithms is of interest, our evidence suggests that they tend to yield a higher predictive validity (as indexed by sensitivity scores) in face-to-face than in online classes. Is this higher predictive validity due to changes in the way that students approach the curriculum of a course? Is it due to changes in the way that educators deliver content and/or assess learning? We explicitly selected courses that were taught by the same experienced educators and followed the same curriculum requirements online and face-to-face. Educators’ self-reports did not indicate that the standards of early formative assessment (i.e., the first assignment and test) differed between face-to-face and online courses. Yet, the lower predictive validity of early formative assessment online indicated that these measures were less useful to both educators and students when embedded in online classes. Inquiries through focus groups and informal exchanges with both students and faculty did not clarify this puzzle, mostly leading to the acknowledgment by both parties that the adaptation to online courses was more challenging for students than the adaptation to face-to-face courses. The following themes were frequently mentioned by students and corroborated by faculty: more time devoted to understanding how to navigate materials posted online (Blackboard) and how to use them; feelings of isolation and perceived distance from the instructor; and fewer opportunities for informal interactions with the instructor and classmates. Instructors reported more initial inquiries regarding course contents and requirements online than face-to-face, often noting that students who were accustomed to on-campus classes required more time to navigate and feel comfortable with the online mode. Thus, qualitative evidence seemed to point to educators who, aware of students’ adaptation challenges, might have become tacitly more lenient when assessing performance, even though they purported not to have changed their standards of assessment. However, this pattern may offer a misleading picture since the variables course type and mode of instruction interacted.
Second, algorithms such as KNN and RF were consistently better predictors of at-risk students in face-to-face courses in the humanities and social sciences, such as ACS, WCO, and WED, whereas they were better predictors online when the course covered mathematical knowledge and skills (STA). The flexibility of KNN and RF may be particularly useful if the SARS-CoV-2 virus, which causes COVID-19, persists in affecting people’s lives, thereby forcing university administrators to continue relying on the online mode or adopt hybrid modes for courses that are offered at their institutions.
As for the differences in algorithms’ predictive validity between online and face-to-face instruction, the interaction of course type and mode of instruction was not entirely clarified by the self-reports that were produced by students and educators; however, a consistent theme emerged. At the very beginning of the semester, some classes were reported by students as likely to be more difficult online (e.g., STA), whereas others (e.g., ACS and WED) were seen as potentially easier online. These biases could be thought of as capable of shaping students’ behavior inside and outside the virtual classroom. For instance, consider that the anxiety that was experienced by female students towards math courses increased considerably when the courses were delivered online. Increased anxiety might have led students to pay more attention in class and devote more time to class activities across the entire semester, thereby potentially leading to three interrelated outcomes: (a) enhancing their performance across the entire semester; (b) rendering a view, at the end of the semester, of the online STA course as easier to manage than expected; and (c) making even initial formative assessment measures more likely to reflect overall course performance online than face-to-face. In contrast, the initial expectation of easier online courses, coupled with educators’ purported leniency that was driven by the opposite expectation, might have had a quite different impact. Namely, expectations might have unnecessarily lessened students’ effort towards class activities, and relaxed educators’ grading standards to ease students’ adaptation to online courses, thereby making initial formative assessment less likely to reflect overall course performance online.
Our research suggests that particular machine learning algorithms can be used to make informed predictions regarding students’ performance attainment, but the predictive validity of each algorithm has to be first assessed as a function of two important variables: course type and instructional mode. Our study adds to the growing body of grade prediction studies that rely on machine learning algorithms [8,10,43,44,45] by pointing to the relevance of such variables to interventions that are intended to foster academic success in an understudied student population. In our research, the latter is represented by young women of college age from a society that has only recently implemented and enforced gender equity guidelines. Our research also contributes to the extant literature by relying on a subjective criterion of effectiveness that is produced by faculty with direct experience in teaching the courses that are included in our sample. Too often, studies examining the predictive validity of different algorithms have focused on relative comparisons but have failed to give readers an idea of how to conceptualize a desirable outcome for the actual situations/conditions they face.
5. Conclusions
We believe that research in educational settings should be motivated by the intention to improve participants’ existing conditions [46,47,48]. As such, we subscribe to the main tenets of action research, according to which the aim of a research project is practical. Namely, it is to identify a problem, condition, or situation; propose and implement a solution that is intended to bring improvement to the very people who participate in the research; assess the effectiveness of the solution; and either (a) start from the beginning if the outcome is unsatisfactory, or (b) broaden the reach of the purported solution if the outcome is within the expected parameters [49]. Thus, our goal is to rely on machine learning algorithms as feedback tools for students to assess their learning, and for faculty to assess their teaching. If improvement is needed, such tools can also inform the nature of the changes to be implemented in a university’s curriculum and instruction.
To this end, we recognize that too often, algorithms and related data mining techniques are mainly accessible to educators who possess a background in computer science, and, more precisely, in artificial intelligence [15]. Educators with diverse backgrounds are frequently unable to access data mining techniques, thereby preventing their application to a much wider educational field. Our goal at the selected institution is to offer faculty with backgrounds outside computer science access to such techniques via workshops and mentorship efforts. Specifically, we plan to develop an easy-to-use early-warning system that relies on KNN to identify students at risk in particular courses, depending on the mode of instruction that is used to deliver their content. Currently, we have data supporting the effectiveness of an early-warning system using KNN in face-to-face ACS, WCO, and WED courses, as well as in online STA. The choice of KNN is based on its yielding the best relative performance across the different modes of instruction in the courses that were selected for our study. However, the dismal performance of KNN, along with that of all the other algorithms, in PSY suggests that the early formative assessment measures of this course need to be examined closely to determine whether they indeed fit the learning outcomes of the course. A similar examination of the early formative assessment measures of the other courses that were selected for the present examination may also be warranted to improve their predictive validity. Of course, the predictive validity of KNN for at-risk students in general education courses that were not included in the present research, especially those involving natural sciences and math, will also need to be examined.
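A minimal sketch of the kind of KNN-based early-warning step we have in mind is shown below; the function flag_at_risk, its interface, and the default number of neighbors are ours and would need to be validated per course and per mode of instruction before any real deployment.

```python
from sklearn.neighbors import KNeighborsClassifier

def flag_at_risk(X_history, y_history, X_current, n_neighbors=5):
    """Flag current students predicted to end the course with an L grade.

    X_history / y_history: first-test and first-assignment scores and final
    H/M/L labels from past offerings of the same course and mode of instruction.
    X_current: the same two early scores for the current cohort.
    Returns a boolean list, True where a student is predicted to be at risk.
    """
    model = KNeighborsClassifier(n_neighbors=n_neighbors).fit(X_history, y_history)
    predictions = model.predict(X_current)
    return [label == "L" for label in predictions]
```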
Existing early warning systems for the identification of at-risk students may be too general and thus become poor predictors of academic difficulties in particular courses. They also often rely on norm-referenced scores, which take into account how other students perform, instead of criterion-referenced scores, which consider how students perform relative to the learning outcomes that are set by the curriculum of different courses (as illustrated by summative assessment measures). Thus, an algorithm, such as KNN, that uses criterion-referenced scores to predict at-risk students may be seen as particularly helpful by educators. Indeed, it has the ability not only to effortlessly identify students who experience difficulties, but also to inform revisions of the curriculum and assessment protocol to ensure adequate coverage of the learning outcomes of a course. The attainment of such learning outcomes is particularly critical to general education courses, which lay the foundations for academic success in major-specific courses [50,51,52,53].
Two important lessons that are learned from our investigation and the pertinent extant literature are reminders of the limitations of our study. First, machine learning solutions for grade prediction, albeit most useful early in a course, may require adaptation to the particular student population and the academic environment that an educator or administrator has selected for assessment and intervention. Namely, each educational setting may have features that are common to other educational settings (e.g., reliance on the synchronous online mode), thereby allowing a technical solution to be transferred, and features that may be unique to it. Unique features introduce uncertainties by questioning the transfer of the solution to other settings. For instance, the students selected for the present investigation are exposed to undergraduate courses guided by the principles of student-centered instruction, which is intended to promote deep learning at the expense of rote learning. It is unclear whether our findings generalize to students who are exposed to a different type of instructional principles. Thus, uncertainties, defined by their statistical properties (e.g., parametric or non-parametric factors) and origin (e.g., internal to the learner or environmental), are likely to depend on the student populations that educators select for assessment and intervention, and on the specific factors they deem relevant. Second, a lesson that is learned from the extant literature is that, in a vast array of problem domains, computational models are relentlessly evolving and are often so complex that educators without knowledge of computer science are left out. Innovative algorithmic solutions may be applied to the grade prediction needs of an institution and its faculty [54,55] if the unique properties of the grade-prediction conundrum in any given setting (including students and academic environment) are integrated and computing resources for training and inference purposes are made available (e.g., [56]). However, such models also need to become more transparent and user-friendly for non-experts to ensure broad and reliable adoption [15]. Our paper is a modest call to action, encouraging non-expert educators and administrators to approach the field of machine learning for its potential benefits to the quality of learning and teaching.