1. Introduction
Item response theory models [1,2,3] are central tools for analyzing the dichotomous random variables that arise in testing data from educational or psychological applications. This class of statistical models can be regarded as a factor-analytic multivariate technique that summarizes a high-dimensional contingency table by a few latent factor variables of interest. Of particular relevance is the application of item response models in educational large-scale assessment [4], such as the Programme for International Student Assessment (PISA; [5]) or the Progress in International Reading Literacy Study (PIRLS; [6]).
Educational tests often use multiple-choice items [7,8] to assess the ability of test takers in a well-defined domain of interest. In multiple-choice items, test takers have to choose the correct response alternative from a set of response alternatives (e.g., one out of four response alternatives is the correct solution to the item). If test takers do not know the correct answer, they can guess one of the alternatives. In the case of random guessing, the probability of providing the correct answer by a random guess is 0.25 for a multiple-choice item with four response alternatives.
Typically, the occurrence of random guessing should be taken into account in statistical modeling [9,10] (see also [11,12,13]). The three-parameter logistic item response model [14] is frequently used for handling guessing effects in multiple-choice items [6]. However, this model has been criticized for its implausible assumptions because it does not correctly reflect the process of random guessing [15,16]. An alternative, more plausible item response model has been proposed that circumvents the drawbacks of the three-parameter logistic model: the four-parameter guessing model [15,17], which can potentially model the guessing process adequately. However, neither a simulation study nor an empirical application exists that compares the four-parameter guessing model with competing item response models. This article fills this gap in the literature.
The rest of the article is structured as follows. An overview of different item response models for handling guessing effects is given in Section 2. In Section 3, the statistical properties of the four-parameter guessing model are assessed in a simulation study. In Section 4, the four-parameter guessing model is compared with alternative item response models for handling guessing effects in an application to an educational large-scale assessment study. Finally, the paper closes with a discussion in Section 5.
2. Item Response Models
In this section, we present an overview of different item response models that are used for analyzing educational testing data to obtain a unidimensional summary score [18]. In the rest of the article, we restrict ourselves to the treatment of dichotomous items.
Let $\mathbf{X} = (X_1, \ldots, X_I)$ be the vector of $I$ dichotomous random variables $X_i$ (also referred to as items). A unidimensional item response model [1,18] is a statistical model for the probability distribution $P(\mathbf{X} = \mathbf{x})$ for $\mathbf{x} \in \{0,1\}^I$, where

$$P(\mathbf{X} = \mathbf{x}) = \int \prod_{i=1}^{I} P_i(\theta)^{x_i} \, \bigl[ 1 - P_i(\theta) \bigr]^{1 - x_i} \, \phi(\theta) \, \mathrm{d}\theta , \qquad (1)$$

where $\phi$ is the density of the standard normal distribution. The vector $\boldsymbol{\xi} = (\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_I)$ contains all estimated item parameters of the item response functions $P_i(\theta) = P(X_i = 1 \mid \theta; \boldsymbol{\xi}_i)$.
In Equation (1), the latent variable $\theta$ can be interpreted as a unidimensional summary of the test items $X_1, \ldots, X_I$. The distribution of $\theta$ is modeled as a standard normal distribution with density function $\phi$, although this assumption can be weakened [19,20,21,22]. The item response functions (IRFs) $P_i$ model the relationship of the dichotomous item with the latent ability $\theta$. Moreover, the multivariate dependency in $\mathbf{X}$ is entirely captured by the unidimensional variable $\theta$. This means that in (1), item responses $X_i$ are conditionally independent given $\theta$; that is, after controlling for the latent ability $\theta$, pairs of items $X_i$ and $X_j$ are conditionally uncorrelated. This property is also known as the local independence assumption, which can be statistically tested [18,23].
The item parameters $\boldsymbol{\xi}_i$ of the item response functions in Equation (1) can be estimated by (marginal) maximum likelihood (ML) using an expectation-maximization algorithm [24,25,26]. The likelihood function corresponding to the multivariate distribution defined in (1) can also be applied to test designs in which each test taker receives only a subset of items [27,28]. In this case, non-administered items are skipped in the computation of the likelihood function.
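To make the marginal likelihood in Equation (1) concrete, the following R sketch (illustrative only; not the estimation code used in this study, which relies on the sirt package) evaluates the marginal log-likelihood on a fixed quadrature grid and skips non-administered items exactly as described above.

```r
# Minimal sketch: marginal log-likelihood of a unidimensional IRT model,
# approximating the integral in Equation (1) on a discrete theta grid.
marginal_loglik <- function(resp, irf, theta = seq(-6, 6, length.out = 61)) {
  # resp: N x I matrix of 0/1 responses (NA = item not administered)
  # irf:  function(theta) returning a length(theta) x I matrix of P_i(theta)
  w <- dnorm(theta)
  w <- w / sum(w)                            # discretized standard normal weights
  P <- irf(theta)                            # T x I matrix of item response probabilities
  ll <- 0
  for (n in seq_len(nrow(resp))) {
    x   <- resp[n, ]
    obs <- which(!is.na(x))                  # skip non-administered items
    Ln  <- apply(P[, obs, drop = FALSE], 1,
                 function(p) prod(p^x[obs] * (1 - p)^(1 - x[obs])))
    ll  <- ll + log(sum(w * Ln))             # integrate over the ability distribution
  }
  ll
}
```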
In the remainder of this section, different item response models (i.e., specifications of the item response functions $P_i$) are discussed that can handle guessing effects in testing data.
2.1. Two-Parameter Model (2PL)
The two-parameter logistic (2PL) model [29] parametrizes the item response function $P_i$ as a function of an item discrimination $a_i$ and an item intercept $b_i$:

$$P_i(\theta) = \Psi( a_i \theta + b_i ) , \qquad (2)$$

where $\Psi(x) = [1 + \exp(-x)]^{-1}$ denotes the logistic link function. The Rasch model can be considered a special case of the 2PL model (2) (see [30,31]) that constrains all item discriminations $a_i$ to be equal to a common discrimination parameter $a$. The 2PL model does not handle guessing effects, and its item response function has a lower asymptote of 0 and an upper asymptote of 1.
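As a minimal illustration (with hypothetical parameter values, not taken from the paper), the 2PL item response function of Equation (2) can be written in R with the logistic link plogis():

```r
# 2PL item response function (Equation (2)); plogis() is the logistic link Psi
irf_2pl <- function(theta, a, b) plogis(a * theta + b)

# Lower asymptote 0 and upper asymptote 1: no room for guessing effects
irf_2pl(c(-Inf, Inf), a = 1.5, b = -0.5)   # returns 0 and 1
```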
2.2. Three-Parameter Model (3PL)
The three-parameter logistic (3PL) model [14] introduces an additional pseudo-guessing parameter $c_i$ into the 2PL model that models a lower asymptote different from 0:

$$P_i(\theta) = c_i + (1 - c_i) \, \Psi( a_i \theta + b_i ) . \qquad (3)$$

Guessing effects are intended to be captured by the pseudo-guessing parameter $c_i$. In particular, the 3PL model is used for multiple-choice items in educational and psychological assessment data. Large sample sizes or (weakly) informative prior distributions are required for stable estimation of the 3PL model [18,32]. Variants of the 3PL model (3) that constrain parameters have also been proposed to address estimation issues [33,34,35]. Some researchers question the identifiability of the 3PL model [36,37], while others argue that the 3PL model can be identified by relying on a normal distribution assumption of the latent trait $\theta$ [3].
2.3. Four-Parameter Model (4PL)
In educational and psychological testing data, incorrect item responses can occur even if a test taker has sufficient ability to solve the item correctly. Such a situation can be described as a slipping effect. The four-parameter logistic (4PL) item response model [38] is a generalization of the 3PL model that includes an additional parameter $d_i$ that accommodates slipping effects. The item response function is given by

$$P_i(\theta) = c_i + (1 - c_i - d_i) \, \Psi( a_i \theta + b_i ) . \qquad (4)$$

Contrary to the 1PL, 2PL, or 3PL model, the 4PL model is not yet widely applied in the operational practice of educational studies. However, there are case studies in which the 4PL model is applied to educational testing data [39,40,41].

Like the 3PL model, the 4PL model might also suffer from empirical nonidentifiability [38,42,43,44]. This is why prior distributions for pseudo-guessing (3PL and 4PL) and slipping (4PL) parameters prove helpful for stabilizing model estimation. Alternatively, regularized estimation using a ridge-type penalty function for all pairwise differences of pseudo-guessing and slipping parameters can ensure feasible model estimation [45].
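Continuing the sketch above (again with made-up parameter values), the 3PL and 4PL item response functions of Equations (3) and (4) differ from the 2PL only in their asymptotes:

```r
# 3PL (Equation (3)): lower asymptote c_i; 4PL (Equation (4)): asymptotes c_i and 1 - d_i
irf_3pl <- function(theta, a, b, c)    c + (1 - c) * plogis(a * theta + b)
irf_4pl <- function(theta, a, b, c, d) c + (1 - c - d) * plogis(a * theta + b)

irf_4pl(c(-Inf, Inf), a = 1.2, b = 0, c = 0.20, d = 0.05)   # asymptotes 0.20 and 0.95
```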
2.4. Four-Parameter Guessing Model (4PGL)
It has been pointed out that the 3PL model is not a plausible statistical model for handling guessing effects in testing data. The reason is that it presupposes that all test takers who fail to solve the item resort to guessing [15,16,17]. This implausible assumption motivated Aitkin and Aitkin [15] to propose the four-parameter guessing (4PGL) model:

$$P_i(\theta) = \pi_i g_i + (1 - \pi_i) \, \Psi( a_i \theta + b_i ) . \qquad (5)$$

The item parameter $\pi_i$ is the probability of guessers; that is, the proportion of test takers that guess item $i$. The parameter $g_i$ quantifies the probability of a correct guess of item $i$ for test takers that are in the class of guessers for this item. Hence, the product $\pi_i g_i$ is the marginal probability that a test taker provides a correct item response by a random guess. It is advised to fix the guessing probability $g_i$ to a plausible fixed value [15]. For a multiple-choice item with $K$ response alternatives, it is plausible to fix the guessing probability $g_i$ to $1/K$.

The 4PGL model defined in Equation (5) is motivated by a sequential process of responding to the item. In the first stage, students decide whether they try to solve the item (with probability $1 - \pi_i$) or whether they guess the item (with probability $\pi_i$). In the second stage, students that guess the item provide a correct item response with probability $g_i$ (i.e., by random guessing). Students that try to solve the item get the item correct with probability $\Psi(a_i \theta + b_i)$. The multiplication in both terms on the right-hand side of (5) reflects this sequential psychological process. The item response probability $P_i(\theta)$ of getting the item correct results as the total probability.
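The following R sketch (hypothetical parameter values) encodes the 4PGL item response function of Equation (5) and mimics the two-stage response process; averaging many simulated responses reproduces the closed-form probability, which illustrates that Equation (5) is the total probability of a correct response.

```r
# 4PGL item response function (Equation (5)): pi_i = probability of being a guesser,
# g = probability of a correct guess (e.g., fixed to 0.25 for four response alternatives)
irf_4pgl <- function(theta, a, b, pi_i, g) pi_i * g + (1 - pi_i) * plogis(a * theta + b)

# Two-stage response process for a single test taker with ability theta
sim_4pgl <- function(theta, a, b, pi_i, g) {
  guesser <- rbinom(1, 1, pi_i)                      # stage 1: guess or try to solve?
  if (guesser == 1) rbinom(1, 1, g)                  # stage 2a: random guess
  else rbinom(1, 1, plogis(a * theta + b))           # stage 2b: solution attempt
}

set.seed(42)
mean(replicate(1e5, sim_4pgl(0, a = 1.2, b = -0.3, pi_i = 0.2, g = 0.25)))
irf_4pgl(0, a = 1.2, b = -0.3, pi_i = 0.2, g = 0.25)   # both values should be close
```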
2.5. Reparametrized Four-Parameter Model (R4PL)
Obviously, the 4PL and the 4PGL models both include four item parameters. Interestingly, one can define a reparametrized four-parameter logistic (R4PL) model that reparametrizes the 4PL model (4) into the parameterization of the 4PGL model (5). The only difference is that the guessing probabilities $g_i$ are estimated from the data. The reparametrized item parameters are given by

$$\pi_i = c_i + d_i \quad \text{and} \quad g_i = \frac{c_i}{c_i + d_i} . \qquad (6)$$

In applications (in particular with smaller sample sizes), it might be advantageous to estimate the 4PL instead of the R4PL model. The computation of $g_i$ in (6) might be unstable if both the pseudo-guessing $c_i$ and slipping $d_i$ parameters are close to zero.

Note that the parameters $\pi_i$ and $g_i$ in (6) correspond to the same parameters in the 4PGL model (see (5)). However, the crucial difference is that $g_i$ is typically fixed to $1/K$ (e.g., 0.25) in the 4PGL model, while it is estimated in the R4PL model.
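A small R helper (a sketch based on the correspondence stated above) maps estimated 4PL parameters into the R4PL parameterization of Equation (6) and makes the mentioned instability visible: when both $c_i$ and $d_i$ are close to zero, $g_i$ becomes the ratio of two tiny numbers.

```r
# Reparametrize 4PL parameters (c_i, d_i) into R4PL/4PGL parameters (Equation (6))
repar_r4pl <- function(c, d) {
  list(pi = c + d,        # probability of guessers
       g  = c / (c + d))  # guessing probability; unstable if c and d are both near zero
}
repar_r4pl(c = 0.20, d = 0.05)     # pi = 0.25, g = 0.80
repar_r4pl(c = 0.002, d = 0.001)   # pi = 0.003, but g is driven by estimation noise
```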
2.6. Three-Parameter Model with Residual Heterogeneity (3PLRH)
As an alternative to the 2PL model, item response functions with skew link functions have been proposed [41,46,47,48,49]. The three-parameter model with residual heterogeneity (3PLRH) extends the 2PL model by including an additional asymmetry parameter in the item response function [50,51]. The 3PLRH model has been successfully applied to large-scale assessment (LSA) data and often resulted in superior model fit compared to the 2PL or 3PL model [41,52,53]. Importantly, it has been argued that the 3PLRH model would also be able to handle guessing effects [54,55].
2.7. Summary
As pointed out by an anonymous reviewer, it should be emphasized that (pseudo-)guessing parameters in the 3PL, 4PL, or 4PGL model are not an actual empirical quantification of guessing. The item parameters can only be interpreted as quantities obtained by fitting a (misspecified) parametric item response model to the dataset of item responses.
This anonymous reviewer also suggested that one can interpret the 4PGL model as quantifying the proportion of respondents that engage in a guessing process, while the 3PL or 4PL model quantifies the probability of a correct response by guessing. The 3PL and the 4PGL models differ in whether respondents choose to either guess or solve the problem at the outset. According to the 3PL model, students first try to solve the item and only resort to guessing if they fail to solve it. In contrast, according to the 4PGL model, students decide at the outset whether they try to solve or guess the item [15]. Hence, the meaning of the item parameters in the 3PL and 4PGL models is quite different.
Overall, we think that the criteria of psychological plausibility or usefulness may sometimes, if not frequently, outweigh considerations of model fit. The criterion of usefulness might be particularly relevant if differences between the alternative item response models in terms of model fit can be considered small.
3. Simulation Study
In this simulation study, we investigate the performance of the 4PGL model. Item response data are simulated from the 4PGL model. We compare the estimated item parameters of the 4PGL model with those of the alternative item response models described in Section 2 and contrast the results in terms of parameter recovery and item fit.
3.1. Method
The simulated datasets consisted of 30 items. The first 15 items, C1 to C15, were constructed response items. The data-generating model for the constructed response items was the 2PL model because no guessing effects could be expected for this item format. The remaining 15 items, M1 to M15, were multiple-choice items that were simulated according to the 4PGL model. The guessing probability $g_i$ was held constant at a fixed value of 0.25. This situation corresponds to a multiple-choice test with four response alternatives per item. The data-generating item parameters are presented in Table 1. The item parameters were chosen to mimic parameter values obtained in the empirical example in Section 4.
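For readers who want to reproduce the flavor of this design, the following R sketch simulates one dataset with the described structure; the item parameter values are placeholders and do not reproduce Table 1.

```r
# Sketch of the data-generating design: 15 constructed response items (2PL) and
# 15 multiple-choice items (4PGL with fixed guessing probability g = 0.25)
set.seed(1)
N  <- 2000
I  <- 30
a  <- rep(1.2, I)                                 # placeholder discriminations
b  <- seq(-1, 1, length.out = I)                  # placeholder intercepts
pi_i <- c(rep(0, 15), rep(0.15, 15))              # probability of guessers (0 for C1-C15)
g  <- 0.25
theta <- rnorm(N)
P  <- sapply(1:I, function(i) pi_i[i] * g + (1 - pi_i[i]) * plogis(a[i] * theta + b[i]))
dat <- matrix(rbinom(N * I, 1, P), nrow = N, ncol = I)
colnames(dat) <- c(paste0("C", 1:15), paste0("M", 1:15))
```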
We varied the sample size $N$ of the item response datasets across four levels (the three largest being $N$ = 2000, 5000, and 10,000) to reflect different but typical situations in educational test data applications. We did not consider smaller sample sizes because less stable estimation would be expected. For this reason, we also refrained in this simulation study from applying Bayesian or regularization methods that would mainly be required in low sample size situations.
After simulating a dataset according to the 4PGL model, the dataset was analyzed with five item response models: 2PL, 3PL, R4PL, 4PGL, and 3PLRH. No prior distributions for item parameters were utilized in model estimation. In all analysis models, item response functions of the 2PL were specified for the constructed response items; the more complex item response functions were only utilized for the multiple-choice items. Note that the analysis of the item responses involved all 30 items.
Parameter recovery was assessed by bias and root mean square error (RMSE). Because the item parameters of all 30 items were of interest, we computed the average absolute bias and the average RMSE for groups of item parameters (i.e., the average absolute bias of a given parameter type across all multiple-choice items).
Model fit was assessed by the root integrated squared error (RISE) between the estimated item response function $\hat{P}_i$ and the true item response function $P_i$ that was used to simulate the item responses [56,57]. The estimated item response function depends on the estimated item parameters $\hat{\boldsymbol{\xi}}_i$. The functions are evaluated on an equidistant discrete grid of $T$ points $\theta_1, \ldots, \theta_T$. The RISE statistic is given by

$$\mathrm{RISE}_i = \sqrt{ C \sum_{t=1}^{T} w_t \bigl[ \hat{P}_i(\theta_t) - P_i(\theta_t) \bigr]^2 } ,$$

where $w_t$ are the weights of the discretized standard normal distribution [58], and $C$ is a scaling constant that ensures that the rescaled weights $C w_t$ sum to one.
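A direct R translation of the RISE statistic (a sketch; the grid and weighting choices are illustrative) compares an estimated item response function against the true one on a discrete grid:

```r
# RISE: weighted distance between estimated and true item response functions
rise <- function(irf_est, irf_true, theta = seq(-6, 6, length.out = 121)) {
  w <- dnorm(theta)
  w <- w / sum(w)    # discretized standard normal weights, rescaled to sum to one
  sqrt(sum(w * (irf_est(theta) - irf_true(theta))^2))
}

# Example: a fitted 2PL curve versus a true 4PGL curve (made-up parameters)
irf_true <- function(th) 0.15 * 0.25 + (1 - 0.15) * plogis(1.2 * th - 0.5)
irf_est  <- function(th) plogis(1.0 * th - 0.3)
rise(irf_est, irf_true)
```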
In real data, the true item response function $P_i$ is typically unknown. Hence, the adequacy of the functional form of the item response function can be assessed by means of item fit statistics [59]. The root mean square deviation (RMSD; [60,61,62]) statistic assesses the difference between an observed item response function $\tilde{P}_i$ and the model-implied item response function $P_i(\theta; \hat{\boldsymbol{\xi}}_i)$:

$$\mathrm{RMSD}_i = \sqrt{ \sum_{t=1}^{T} w_t \bigl[ \tilde{P}_i(\theta_t) - P_i(\theta_t; \hat{\boldsymbol{\xi}}_i) \bigr]^2 } ,$$

where the observed item response function $\tilde{P}_i$ is reconstructed from the individual posterior distributions $P(\theta \mid \mathbf{x}_n)$, and $\mathbf{x}_n$ denotes the vector of item responses of person $n$ [61,63].
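The RMSD statistic can be sketched analogously in R; here, the observed item response function is reconstructed from a matrix of individual posterior distributions over the grid (how these posteriors are obtained, and the exact weighting, differ across operational implementations):

```r
# RMSD sketch: x_i = observed responses to item i (NA = not administered),
# post = N x T matrix of individual posteriors over the theta grid (rows sum to 1)
rmsd <- function(x_i, post, irf_model, theta = seq(-6, 6, length.out = 61)) {
  keep  <- !is.na(x_i)
  num   <- colSums(post[keep, , drop = FALSE] * x_i[keep])  # expected correct responses per grid point
  den   <- colSums(post[keep, , drop = FALSE])
  P_obs <- num / den                                        # pseudo-observed item response function
  w <- dnorm(theta)
  w <- w / sum(w)
  sqrt(sum(w * (P_obs - irf_model(theta))^2))
}
```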
In practice, a researcher does not know which item response model has generated the data. Hence, model selection based on information criteria is frequently applied [5,41,64,65,66]. We assessed the percentage rates of correctly choosing the data-generating 4PGL model employing the Akaike information criterion (AIC) and the Bayesian information criterion (BIC).
The entire simulation study was carried out in the statistical software R [67]. The item response models were specified using the xxirt() function in the R package sirt [68]. In each of the four cells of the simulation (i.e., the four factor levels of the sample size $N$), 1500 replications were conducted.
3.2. Results
We now present the findings on choosing the correct data-generating 4PGL model utilizing the information criteria AIC and BIC. Model selection based on AIC was satisfactory, with accuracy rates of 96.8% for the smallest sample size, 99.7% for $N$ = 2000, and 100.0% for $N$ = 5000 and $N$ = 10,000. In contrast, model selection based on BIC showed issues in correctly choosing the 4PGL model for the lower sample sizes (4.4% for the smallest sample size and 52.8% for $N$ = 2000), while it had accuracy rates of 100.0% for the large sample sizes of $N$ = 5000 and $N$ = 10,000. In situations where the 4PGL model was not selected, the simpler 2PL model was chosen.
The average absolute bias (ABias) and average RMSE of the estimated item parameters in the 4PGL and R4PL models for constructed response and multiple-choice items are shown in Table 2. Note again that the 2PL model was specified for constructed response items. The average absolute bias of the item discriminations $a_i$ and item intercepts $b_i$ was quite satisfactory for constructed response items. However, more interesting findings appeared for multiple-choice items. ABias turned out to be substantially large for moderate sample sizes, in particular for item discriminations in the R4PL model. However, for (very) large sample sizes (i.e., $N$ = 10,000), the true 4PGL model and the overparametrized R4PL model provided unbiased estimates. Note that ABias and RMSE decreased with increasing sample sizes.
Critically, the RMSE of the estimated guessing parameters was very large in the 4PGL model. Most likely, these issues can be traced back to boundary estimates of the probability of guessers $\pi_i$. The situation changes when bias and RMSE are assessed for the pseudo-guessing parameters $c_i$ and slipping parameters $d_i$ in the 4PL model, which can be accurately estimated in sufficiently large sample sizes.
Overall, the simulation study demonstrated that the 4PGL model can be successfully applied in typical educational testing applications. We would also like to emphasize that the 3PL model practically estimates the pseudo-guessing parameters $c_i$ as zero and is, therefore, inadequate in situations in which the 4PGL model is the data-generating model.
We now turn to the assessment of model fit. Because the five item response models involve different item parameters, the RISE statistic provides an effective summary of the discrepancy between the estimated and true item response functions. The item fit statistics RISE and RMSD are shown in Table 3. Overall, RISE was always larger than RMSD. The reason is that the RMSD statistic replaces the unknown true item response function $P_i$ with the observed item response function $\tilde{P}_i$. Both the RISE and the RMSD statistic decreased with increasing sample sizes.
For constructed response items, there was no practical difference in terms of model fit. This observation seems plausible because the constructed response items were correctly specified according to the data-generating 2PL model. Hence, the misfit in multiple-choice items does not impact the fit in constructed response items.
For multiple-choice items, the data-generating 4PGL model fitted best in terms of the RISE and RMSD statistics. The R4PL model includes the true 4PGL model as a special case but introduces additional variability in terms of RISE due to one additional estimated item parameter per item. Notably, the misspecified 3PLRH model outperformed the misspecified 2PL and 3PL models for multiple-choice items in terms of RISE and RMSD. Although there is a clear item misfit regarding the functional form, the RMSD values of the 2PL and the 3PL model were still relatively small compared to the usually employed cutoff values of 0.05 or 0.08 [61]. Hence, using the 2PL model as the analysis model would not be considered a significant model deviation in applied research. Therefore, the true data-generating 4PGL model would not be detected if only the 2PL or 3PL models had been fitted and RMSD statistics were computed.
To summarize our findings, the adequacy of fitted item response models should be compared based on the average RMSD value (or some other aggregated RMSD statistic), and the best-fitting model should be chosen based on this aggregated statistic.
4. Empirical Example: PIRLS 2016 Reading
In this empirical example, we use a dataset from the PIRLS 2016 reading study [6].
4.1. Method
We selected 41 countries with moderate to high performance in the PIRLS reading study. The chosen countries are listed in Appendix A. A random sample of 1000 students per country was drawn for each of the 41 countries. In this example, the pooled sample comprising all 41,000 students was used. We did not focus on country comparisons because our motivation was to investigate the performance of different item response models (see [41]). No student weights were used in the analysis models for the pooled item response dataset.
In total, 141 items were used in the PIRLS 2016 reading study. There were 70 multiple-choice items and 71 constructed response items. Note that only a small subset of items (e.g., 20 to 30 items) was administered to each student because of limited testing time. Omitted and not-reached item responses were scored as incorrect. Some constructed response items were polytomously scored. These items were dichotomously recoded as correct if the maximum score of the original polytomous item was attained.
We analyzed the pooled item response dataset with five analysis models: 2PL, 3PL, 4PGL, 4PL, and 3PLRH. We did not include prior distributions for item parameters in the models because empirical identifiability issues were not expected in the large sample size of N = 41,000 students. We also computed the resulting reparametrized item parameters of the R4PL model based on the 4PL model estimation. The item fit was assessed using the RMSD statistic. In addition, we used the information criteria AIC and BIC as criteria for model selection. If a parameter was estimated at the boundary of the admissible parameter space (e.g., a pseudo-guessing parameter was estimated as zero), such a parameter was not counted as an estimated parameter in the computation of information criteria.
Moreover, we used the Gilula–Haberman penalty (GHP; [69,70,71]) as a normalized variant of the AIC statistic that is relatively independent of the sample size and the number of items. The GHP is defined as $\mathrm{GHP} = \mathrm{AIC} / (2 \sum_{p} I_p)$, where $I_p$ is the number of items administered to person $p$. The GHP can thus be seen as a normalized variant of the AIC. A difference in GHP values (i.e., $\Delta\mathrm{GHP}$) larger than 0.001 is a notable difference regarding global model fit [41,71,72,73].
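Computing the GHP from a fitted model requires only the log-likelihood, the number of estimated parameters, and the number of item responses per person; the following R sketch assumes these quantities are available from the fitting routine.

```r
# Gilula-Haberman penalty as a normalized AIC (sketch)
ghp <- function(loglik, npars, items_per_person) {
  aic <- -2 * loglik + 2 * npars
  aic / (2 * sum(items_per_person))   # normalize by twice the total number of item responses
}
# Rule of thumb used in the text: a GHP difference larger than 0.001 is notable
```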
4.2. Results
We now present the results for the PIRLS 2016 reading dataset. Table 4 contains the information criteria AIC and BIC and the results for the GHP statistic. It can be seen that the 4PL model (which is statistically equivalent to the R4PL model) had the best fit in terms of AIC. However, the 3PL model would be preferred in terms of BIC. Note that the model comparisons in terms of differences in GHP (i.e., $\Delta\mathrm{GHP}$) turned out to be very small or even negligible according to the cutoff values discussed in the literature.
The average RMSD item fit statistics are displayed in Table 5. The RMSD values were very similar for constructed response items. For multiple-choice items, the R4PL model had the best fit, followed by the 3PL and the 3PLRH models. Notably, the 4PGL model fitted worse in terms of RMSD values. At least, the 4PGL model outperformed the 2PL model based on the average RMSD values.
The item response functions of the 2PL model were utilized for the constructed response items in all five analysis models. It turned out that the correlations of the item parameters $a_i$ and $b_i$ for the constructed response items between the different analysis models were practically equal to 1 (i.e., larger than 0.999).
For multiple-choice items, substantial differences occurred. Out of the 70 multiple-choice items, 43 items had an estimate of zero for the probability of guessers $\pi_i$ in the 4PGL model, 13 items had a zero estimate of the pseudo-guessing parameter $c_i$ in the 3PL model, 8 items had a zero estimate of the pseudo-guessing parameter $c_i$ in the 4PL model, and 18 items had a zero estimate of the slipping parameter $d_i$ in the 4PL model. In Figure 1, the estimated probabilities of guessers $\pi_i$ are displayed. It can be seen that only three items had probabilities larger than 0.20.
The estimated pseudo-guessing and slipping parameters in the 4PL model are presented in Figure 2. It can be seen that the pseudo-guessing parameters $c_i$ scatter around 0.20 and often range between 0.10 and 0.30, while the slipping parameters $d_i$ typically do not exceed 0.10.
The correlations and means of the estimated item parameters for multiple-choice items are displayed in Table 6. The correlations between item intercepts $b_i$ were high, but significant deviations between the different scaling models were observed for the item discriminations $a_i$. Furthermore, the pseudo-guessing parameters of the 3PL and the 4PL model were highly correlated. However, the pseudo-guessing parameter $c_i$ of the 3PL model correlated only moderately with the probability of guessers $\pi_i$ from the 4PGL model. Interestingly, the $\pi_i$ parameters from the 4PGL model had high correlations with the slipping parameter $d_i$ in the 4PL model. These findings underline that quantifications of guessing behavior in testing datasets depend on the chosen item parameter and the item response model.
In our study, it turned out that the correlation of the $c_i$ and $d_i$ parameters in the 4PL model was zero. Interestingly, ref. [53] reported moderate positive correlations ranging between 0.26 and 0.43 in their empirical application, which involved mathematics test data from a standardized statewide US assessment across multiple grades.
5. Discussion
In this article, the 4PGL model was compared with alternative item response models for handling guessing effects in educational testing data. It was shown in a simulation study that the item parameters of the 4PGL model can be successfully recovered. It turned out that, for model selection, AIC should be preferred over BIC. Moreover, the findings from the simulation study also demonstrated that the RMSD item fit statistic is ineffective in detecting this kind of model misfit: based on conventional RMSD cutoffs, the much simpler 2PL model would be retained even though the correctly specified data-generating model is the 4PGL model.
In the empirical example involving PIRLS 2016 reading data, the 4PL model was the frontrunner in terms of the AIC and RMSD criteria, followed by the 3PL model. The 4PGL model was clearly inferior to the 3PL and 4PL models and only slightly inferior to the 2PL model. However, we have argued elsewhere that the criterion of statistical model fit should not be used for selecting a model for operational use in an educational large-scale assessment study [41,74]. Different choices of item response models imply a different weighting of items in the unidimensional ability variable $\theta$ utilized for official reporting in the above-mentioned educational studies [75]. In this sense, statistics (or psychometrics) should not change the quantity of interest [76,77]. The fitted item response models in empirical applications are typically intentionally misspecified, and the consequences of the misspecification for standard errors of model parameters and the reliability of the ability variable $\theta$ have to be considered [74].
In the simulation study and the empirical example, we only considered large sample sizes. In the case of smaller sample sizes, estimation issues of the 4PGL model will likely occur. Regularized estimation could prove helpful in avoiding such estimation issues [32,45].
An anonymous reviewer was concerned about identification issues in the 4PL model. The reviewer argued that when both the upper and the lower asymptotes are present, different combinations of the guessing and the slipping parameters may lead to the same likelihood, and was unsure how the R4PL model addresses this issue. As the R4PL model is equivalent to the 4PL model (via the reparametrization in (6)), any identification issues would apply to both models; hence, the concern is really about general identification issues in the 4PL model. There are several simulation studies showing that the 4PL model can be empirically identified in sufficiently large samples [38]. We think that the correct distributional assumption about $\theta$ might be crucial for obtaining empirical identifiability. It is probably difficult to substantially weaken the normal distribution assumption of $\theta$ with a finite number of items [78]. In our simulation study and the empirical example, the test consisted of constructed response items and multiple-choice items. As the 2PL model instead of the 4PL model is applied for constructed response items, we expect that the ability distribution of $\theta$ can primarily be identified based on this item type. This, in turn, enables the identifiability of the guessing and slipping parameters in the 4PL model because they could be identified if the ability $\theta$ were known for each student.
Furthermore, as suggested by an anonymous reviewer, the 4PGL model could also be advantageous in applications of linking [79] and differential item functioning [80]. For example, investigating differential item functioning in the guessing parameters of the 4PGL model might be an interesting topic for future research (see, for example, [81] for related work).
One might acknowledge that all utilized item response models are likely misspecified to some extent. This observation might lead to the conclusion that the ability parameters $\theta$ would be biased. However, this reasoning depends on how a true $\theta$ value is defined. One could assume that a unidimensional item response model with monotone item response functions has generated the data. Under this assumption, one can quantify the bias in estimated ability parameters (see [41]). However, why should one believe that the more complex item response model better reflects the truth? We would argue the other way around. A purposely chosen (and useful) item response model defines a scoring rule for a particular ability parameter estimate [74]. Hence, the true ability value can be defined by applying the intentionally misspecified item response model. There are good reasons not to rely on the best-fitting item response model because it could imply (local) scoring rules that do not align with the test blueprint (i.e., the intended weighting of items in the reported score $\theta$; see [74,75,82]).
Finally, we assumed that guessing effects were item-specific but constant across test takers. This assumption can likely be violated in practice. In particular, guessing can be related to the ability variable $\theta$, which is modeled in ability-based guessing models [83,84]. Moreover, guessing (and slipping) effects might also be a statistical property of test takers. Hence, guessing (and slipping) parameters can be modeled as person-specific random variables [85,86,87]. Alternatively, the statistical model can include random variables for test takers to characterize misfitting test takers [88,89].