1. Introduction
Formative assessment has been proposed as a way to make education more accessible and more effective [1,2,3,4]. The distinction between summative and formative roles of assessment was first proposed by Scriven [5] and then applied to students by Bloom [6,7]. Formative assessment is specifically intended to generate feedback on performance in order to improve and accelerate competency acquisition, as opposed to summarizing the achievement status of a student [8,9]. Any learning activity, from oral discourse to conventional quizzes, has potential value as formative assessment [10]. Three core principles form the basis of formative assessment [11]. Firstly, formative assessment should be viewed as an integral part of instruction and should be used in real time to guide the learning process. The material provided to students should depend on their current state of knowledge and understanding. Secondly, formative assessment fosters student involvement. Students are not punished for their mistakes, since the assessment does not affect the final grade; instead, they can use assessments for self-guidance and to monitor their progress towards learning objectives. Thirdly, formative assessment requires constructive feedback. Feedback is the key element of formative assessment [12]; it has been claimed to be the single most powerful factor in enhancing achievement [13]. Studies indicate that feedback may improve learning by about 0.4 to 0.8 Standard Deviations (SD) [14], although a critique of these results exists [2]. One of the goals of the educational process is identifying the gaps between what is known and what is aimed to be known [15]. Feedback that is separated from the demonstration of competency by days or weeks, as happens with paper-based tests that have to be manually graded, has diminished value [10]. Feedback in the form of a mark, as provided by traditional tests, is of limited use to a student, as it is not congruent with good feedback practices [16]. Simply put, a mark does not specify the deficiencies in the student's knowledge which they should address in the future [17]. Another strength of formative assessment is that it can aid in using students' strengths and weaknesses to frame learning goals and to monitor progress towards them [18].
In addition to traditional linear tests with fixed test items, adaptive tests based on Item Response Theory (IRT) have been proposed [19] and implemented [20]. In IRT computer adaptive tests, each test item is chosen dynamically based on the student's answers. This can improve the precision of testing by adapting to the knowledge demonstrated by each student and can shorten test times [21]. These advantages come at the cost of several new issues, the most problematic of which are the greater test creation complexity, caused by the need to calibrate the question pool, and the premise that different questions measure one common trait [22]. This premise is not always useful when assessing separate competencies, even in the context of Multidimensional Item Response Theory (MIRT) [23], which can further exacerbate the challenge of item calibration. In practice, the main drawbacks remain the complexity and time required to create a test, which may dissuade teachers from using it as a learning tool.
Another alternative to traditional methods is the Elo rating-based approach [24]. The Elo rating system was originally developed for chess skill estimation and ranking, but has since been adapted for use in education [25,26]. Within the system, each participant is assigned a numerical rating and, in the case of educational applications, so is each test item. The expected probability that a player wins, or that a student successfully completes a task, is then given by the logistic function and depends on the ratings involved. In this formulation, the Elo rating system estimates the probability of a correct answer in the same way as the one-parameter IRT model (Rasch model) [27]; what differs is the parameter estimation procedure. Analogously to MIRT, multivariate Elo extensions have been tested [28]. Elo rating-based systems, like IRT systems, depend on knowledge of item difficulties to estimate students' proficiency.
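As an illustration of this equivalence, a minimal Python sketch of the Elo-style probability estimate and rating update is given below. The function names and the update step k are illustrative assumptions, not values taken from the cited systems.

```python
import math

def expected_correct(student_rating: float, item_rating: float) -> float:
    """Logistic (Rasch-like) probability that the student answers the item correctly."""
    return 1.0 / (1.0 + math.exp(-(student_rating - item_rating)))

def elo_update(student_rating: float, item_rating: float, correct: int, k: float = 0.4):
    """Shift the student and item ratings towards the observed outcome (0 or 1)."""
    p = expected_correct(student_rating, item_rating)
    student_rating += k * (correct - p)
    item_rating -= k * (correct - p)
    return student_rating, item_rating

# Example: a student rated slightly above the item's difficulty answers correctly.
print(expected_correct(0.5, 0.0))        # ~0.62
print(elo_update(0.5, 0.0, correct=1))   # ratings move slightly apart
```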
Apart from IRT and Elo rating, there are other approaches to modelling learning [29], such as performance factors analysis [30] and Bayesian knowledge tracing [31]. However, these models can be equally hard to implement and use due to non-trivial calibration and set-up procedures.
In this paper, an alternative testing method based on the Upper-Confidence Bound [32,33] Multi-Armed Bandit algorithm is proposed and tested. The Multi-Armed Bandit (MAB) family of algorithms is named after the problem of a gambler who has to decide which arm of a K-slot machine to pull to maximize the total reward over a series of trials [34]. These algorithms, capable of negotiating exploration–exploitation trade-offs, are applied in several modern applications such as advertisement placement, website optimization, and packet routing [35]. There are emerging applications of MAB algorithms in education for optimal learning material selection [36,37,38,39].
The problem of choosing the next question during a formative quiz can be modeled as a MAB problem. To the best of our knowledge, no applications of MAB algorithms for formative assessment item selection have been proposed. Research supports that the key elements of formative assessment are feedback, awareness of the gaps between students' current knowledge and their aims, and knowing where to go next to alleviate those deficiencies. The assessment method proposed in this paper addresses this need by quickly identifying the lacking areas of knowledge and thoroughly exploring them in order to assist further learning. The process can be viewed through the lens of J. Hattie's three-question feedback model [15]. To utilize the algorithm, the teacher must first form topics or competences (Where am I going?). The algorithm quickly identifies the lacking areas of knowledge (How am I going?) and explores the topic in detail, helping further instruction (Where to next?). This approach, in combination with presently widespread mobile devices, has the potential to mitigate the aforementioned issues, such as test creation complexity and long test times, while providing accurate formative assessment data compatible with Hattie's three-question feedback model as well as Competency Based Learning and Assessment methodologies.
2. Materials and Methods
2.1. Modelling Assessment as UCB Problem
When tutoring, a teacher will often engage in a dialogue with a student. The teacher may ask the student a series of formative questions in order to diagnose the gaps in the student's knowledge. Assume the material consists of two topics and the teacher has asked 5 questions on each topic. The knowledge of the first topic appears to be in worse shape (two incorrect answers) than the knowledge of the second topic (one incorrect answer) (see Table 1).
If there is time for five more questions before proceeding with didactic instruction, the teacher faces a dilemma: which topic should they explore with further questions? If the two incorrect answers to questions on the first topic are attributable to bad luck due to the small sample size, should the teacher explore the first or the second topic? What if there are more than two topics?
The family of bandit algorithms is designed to cope with uncertainty by balancing exploration and exploitation [40]. However, when applied to formative assessment, the exploitation component is non-obvious, as ultimately the goal is to explore the knowledge of the student. The algorithm should probe and explore the different topics and engage in focused questioning, exploiting those which are possibly in most need of instruction. This presents an opaque bandit problem, where a single answer (reward) is observed at each round, in contrast with the transparent one, where all rewards are observed [34]. Thus, in the context of assessment, a sequential allocation problem is obtained: the assessor has to choose among questions from multiple topics (bandits) and repeatedly decide which topic to explore, i.e., which bandit arm to pull. When choosing the next question to ask, the decision should depend on the history of already known answers. A policy is then the mapping from the individual history of the student to actions (questions to be asked of the student).
Suppose a student's knowledge is assessed on a number of topics T = {1, 2, …, k}. In the case of a multiple-choice quiz with either a correct or an incorrect answer to each question, the reward Xr ∈ {0, 1} at round r is binary valued. Each topic corresponds to an unknown probability distribution: there exists a vector µ ∈ [0, 1]^k such that the probability of a correct answer, Xr = 1, given that the algorithm chose topic Tr = t, is µt. This kind of environment is called a stochastic Bernoulli bandit. If the mean vector associated with the environment were known, the optimal policy would be to always choose a question on the topic

t* = argmin_{t ∈ T} µt.

This results in the exploration of the weakest area of the student's knowledge, so as to aid further instruction. The regret over n questions is

R_n = E[ Σ_{r=1}^{n} Xr ] − n · µ_{t*},    (1)

where the expectation E is taken with respect to the stochastic environment and the policy. However, in a practical setting, the number of questions on one topic is usually rather limited by the scope of the curriculum and the question pool, as is the length, or horizon, of the quiz. Thus, the value of the regret calculated this way is of little practical use.
The main challenge of the task is finding the weakest topic of a student. To do so, the algorithm must explore the different topics and exploit a particular topic to obtain a more accurate estimate of the student's level of knowledge on that subject. This basic exploration–exploitation dilemma is the key to obtaining a good strategy. The heuristic principle chosen in this paper for dealing with this issue is optimism in the face of uncertainty, and the algorithm which operates on this principle is the Upper-Confidence Bound (UCB) algorithm, one of the simplest algorithms that offers sub-linear regret. The algorithm suggests choosing the action with the largest upper confidence bound or, in the case of our model, the topic with the smallest lower confidence bound. The topic chosen for question number n is then

T_n = argmin_{t ∈ T} ( µ̂t − C · √( ln n / Nt ) ),    (2)

where µ̂t is the mean of the answers observed so far on topic t, C is a constant that can be chosen to regulate the impact of the second, exploration component on the choice of the topic, and Nt is the number of questions on the topic asked so far. As the number of questions on the topic increases, the uncertainty and the exploration term of the formula decrease [40]. Thus the algorithm will seek out the weakest topics of a student's knowledge and, once they are identified, will thoroughly question the student on those topics. This is pedagogically valuable because, once identified, the lacking topic knowledge can be corrected. In addition, by "exploiting" the weakest topic the algorithm gathers more fine-grained information about it, which is useful for post-assessment knowledge correction. In the case when the item pool of a topic is exhausted, the algorithm chooses the topic with the second smallest value as estimated by Formula (2).
2.2. Simulated Students’ Experiments
A number of experiments with simulated students were carried out before proceeding with testing using empirical data. The number of simulated students in each experiment was 1000, unless stated otherwise. Each simulated quiz had a number of questions on two or more topics. Each topic was then represented as a vector of weights, one for each question in the quiz, with each weight representing the relevancy of the question to the topic. In this paper, only experiments with binary weights (0 or 1) were carried out. Moreover, each question was assumed to belong to only one topic. The numbers of question items on each topic were set to be equal. Each simulated student had a vector equal in length to the number of questions in the quiz, where each element represented the knowledge of one question, either yes or no. For all simulation experiments, unless stated otherwise, the inter-topic correlation of answers was random: the probability pt that the student would know the answer to a question on a topic had a uniformly distributed random bias between 0 and 1.
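As an illustration, a minimal sketch of such a simulated student population is given below. The exact sampling scheme used in the paper is not fully specified, so the uniform per-topic bias, the array layout, and the default pool sizes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_students(n_students=1000, n_topics=4, items_per_topic=15):
    """Generate binary knowledge vectors: one row per student, one column per question."""
    # Per-student, per-topic probability of knowing an answer, drawn uniformly from [0, 1].
    p = rng.uniform(0.0, 1.0, size=(n_students, n_topics))
    # Expand to question level: questions are grouped by topic, with equal pool sizes per topic.
    p_per_question = np.repeat(p, items_per_topic, axis=1)
    return (rng.uniform(size=p_per_question.shape) < p_per_question).astype(int)

answers = simulate_students()
print(answers.shape)    # (1000, 60)
print(answers[:3, :5])  # first three students, first five questions
```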
2.3. Real Students’ Assessment Methodology
In this study, the feasibility of applying the UCB algorithm to formative assessment is explored using data collected from a 60-question quiz covering 15 subtopics (e.g., network topologies, networking devices, Internet Protocol version 4, Ethernet, cloud computing services). The assessment was held at Vilnius Gediminas Technical University, Lithuania, on 25 April 2019. The test length and item pool size were set to 60 questions to keep the quiz length close to an hour. The number of topics was set to 15 because it is the number of lectures in the course. The quiz was designed to assess students' knowledge of basic computer networking and cloud computing technologies. In total, 104 undergraduate students, sophomores and juniors (third year), from 7 different groups were tested. All of the students took the test at the same place and time (Saulėtekio al. 11, Vilnius, from 12:30 to 13:30). The question pool contained questions of varying difficulty (the hardest question was answered correctly by 16 students, the easiest by 103). This was done to test the robustness of the algorithm, as it is meant as an alternative to methods that require item calibration; therefore, the algorithm must be able to work with items of varying and a priori unknown difficulty.
All questions in the 60-question quiz were multiple-choice questions with four options, only one of which was correct. Questions on the same topic were arranged so that they never appeared in sequence, so as to lessen the impact of deductive reasoning over knowledge of the subject. Students were not allowed to assist each other during the assessment. All students were also informed that, if they so desired, the test would have no summative impact on their grade and would serve an exclusively formative function. Every student gave written permission allowing their anonymized data to be used for scientific research. The answers to the questions were aggregated in a comma-separated values (csv) file, anonymized, and later processed using Python software written for the purpose of this experiment.
The true knowledge of a student on each topic, the ground truth, was calculated by dividing the number of correct answers within a topic by the total number of questions within that topic. In the experiment with real students, the number of correct answers was known because all students were required to answer all questions in the item pool. After complete knowledge of the test material was known for each student, the algorithms would query the database. The accuracy was then measured as the relationship between the student knowledge estimated from the incomplete information accessed by the algorithm and the complete information in the database. The formulas used for accuracy calculation are provided in the following statistical analysis section.
2.4. Statistical Analysis
The accuracy (performance) of the test was established based on the Positive Predictive Value (PPV), which defines the probability of supplying the correct learning material to a student after the formative assessment; for one student it was evaluated according to the formula PPV = TP/(TP + FP), where TP (True Positive) is the number of correctly identified weakest topics for the student. The number of weakest topics is not always one, because topic proficiency is assumed to be equal to the expected value of an answer to a topic question, a Bernoulli variable with E(T) = pt. This value can be the same for several topics; in that case, it is assumed that the student would benefit equally from instruction on any of those topics. FP (False Positive) is the number of incorrectly identified topics, for which the topic mean µt is larger than the smallest mean µm. For a group of students, the average of the individual accuracies was taken.
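A small sketch of this per-student PPV computation is given below; the function name and the exact tie handling are illustrative assumptions.

```python
def student_ppv(true_topic_means, identified_topics):
    """PPV = TP / (TP + FP) for one student.

    true_topic_means  -- ground-truth proportion of correct answers per topic
    identified_topics -- topic indices flagged as weakest by the assessment
    """
    smallest = min(true_topic_means)
    tp = sum(1 for t in identified_topics if true_topic_means[t] == smallest)
    fp = len(identified_topics) - tp
    return tp / (tp + fp) if (tp + fp) else 0.0

# Example: topics 0 and 2 are tied for weakest; the algorithm flagged topics 0 and 1.
print(student_ppv([0.4, 0.8, 0.4, 0.9], identified_topics=[0, 1]))  # 0.5
```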
Where applicable, experimental results were expressed as mean ± Standard Error of the Mean (SEM). The correlation matrix of questions for the heat-map visualization was computed using the Pearson correlation coefficient and the Pandas Python data analysis library.
The variance of answers to questions on one topic in the simulation experiments was calculated using the Bernoulli distribution formula Var[T] = pt(1 − pt). The probability of a correct answer on the topic, pt, was known and controlled for each topic T in order to observe its effect on assessment accuracy. In the experiments with real students, the variance was computed using the same formula, with pt estimated as

pt = (1 / (ns · nst)) · Σ_{s=1}^{ns} Σ_{q=1}^{nst} a_sq,

where s is a student, q is a question within a topic, a_sq is the answer of a particular student to a particular question within the topic, and ns and nst are the numbers of students and of questions within the topic, respectively.
4. Discussion
The UCB algorithm has the potential to significantly reduce assessment length without loss of accuracy. Even for very short tests with few topics, the algorithm offers a significant reduction of test length (Table 2). However, there is no advantage for quizzes in which each topic contains only one question, at which point the notion of a topic loses its pedagogical meaning. There is evidence that time allocated to study positively affects student performance [41]; thus, reducing the time spent on assessment is desirable. The experiments with real students support the conclusions drawn from the simulations (Figure 3). Such a strong reduction in quiz length has the potential to change the dynamics in the classroom, because the instructor would not need to spend an entire lesson just on formative assessment. This, in turn, may increase opportunities to provide personalized feedback to students, which is linked to better performance [14,42]. This advantage comes at no cost in quiz creation complexity, unlike IRT and Elo rating based systems, where quiz creation can be prohibitively complex in some situations due to the item calibration problem [22,27]. Because of this fundamental difference, we do not compare the performance of UCB to that of IRT and Elo based algorithms. Such a comparison would not discredit either approach: even if UCB performs worse (as it is safe to assume it would), its simplicity of use and independence from question item calibration make it an interesting alternative to traditional assessment.
It is clear that the exploration component of the algorithm becomes more important as the number of items within a topic increases. This can be observed in the poor performance of the algorithm in the 4-topic quiz when C was set to 0 (Figure 2): it took 45 questions to reach >95% assessment accuracy. Also, counterintuitively, it takes fewer questions to identify the students' weakest topics when there are more topics, provided the size of the item pool is kept constant.
Compared to IRT and Elo algorithms, UCB obtains less information about the student in a mathematical sense [27,43], and this can be seen as a disadvantage. However, not all information is equally pedagogically valuable [12,16]. For example, assume we are assessing a student's knowledge of two topics. We are nearing the end of the quiz and have determined that the knowledge of the first topic is adequate, but that of the second topic is lacking. Because of the nature of the UCB algorithm, topic two is explored more than topic one; therefore, we expect to gain less information by exploring the second topic further. However, from a pedagogical point of view, the information on topic two can be more valuable for addressing the assessment needs according to the three-question feedback model [15]. According to the assessment, topic two is in need of instruction. If we are going to proceed to teach it, we can use the extra diagnostic data to save time and effort by not teaching what the student already knows. Meanwhile, we have no immediate use for more precise data about the first topic.
The simulation data presented in Figure 4 and the empirical results (Figure 3 and Figure 5) suggest that the UCB assessment approach can offer a significant reduction in quiz length for any practical item pool size and inter-topic variance of answers. This is an important result because it indicates the suitability of the method for any grouping of questions, regardless of how correlated the answers are within one topic. The effectiveness of the method stays almost constant for any observed answer variance, which makes it easy to predict quiz length. This allows a teacher to group questions on each topic as they see fit, in accordance with the syllabus and the learning material at hand, regardless of the existence or lack of a common latent trait underlying the items within a topic. This separates the UCB method from IRT and Elo based alternatives, which depend on the assumption of a common latent trait [22,28].
The fact that, at quiz lengths between 10 and 32 questions, a substantial portion (4–38%) of students were assessed very poorly implies that using shorter assessments is morally questionable, as the majority of the students will receive very accurate guidance while the rest will be tutored on topics which they already know. This presents a problem for more important formative assessments (e.g., an entire-semester assessment). This property of the algorithm can be offset by a small increase in quiz length, as seen in Figure 6.