1. Introduction
Multidimensional item response theory (IRT) models are the method of choice for analyzing data from cognitive tests assessing multiple competencies (e.g., science, mathematical literacy, and reading). The vast majority of statistical software for estimating (multidimensional) IRT models draws on marginal maximum likelihood (MML) estimation (see e.g., [1,2,3,4]), but Bayesian IRT modeling has gained increasing attention in recent years (see [5,6] for introductions) due to its great flexibility in model specification and its capability of handling high-dimensional models well.
To estimate the parameters of Bayesian IRT models, researchers can use customized Markov chain Monte Carlo (MCMC) samplers that were developed for specific IRT models, such as the Gibbs sampler by [7] for the Rasch model, or they can use general-purpose software for Bayesian estimation such as JAGS [8] or Stan [9]. In our experience, applied researchers typically employ the latter software packages because they provide high flexibility in model specification. This allows them to take the specifics of their data into account without the need to develop their own customized samplers (for which they may not have the time). Furthermore, general-purpose software packages quickly adopt new state-of-the-art Bayesian techniques, allowing applied researchers to easily consider these new techniques in their work. However, the last point may also be a disadvantage, at least when there is little research on the performance of these new techniques as implemented in the general-purpose software.
A case in point is the implementation of Variational Bayes (VB) in Stan. VB is a deterministic method for approximating the posterior distribution [10] that trades off the performance guarantees of MCMC-based Bayesian estimation against speed. VB may therefore be a real alternative for researchers who aim to draw on Bayesian techniques for estimating complex models in large data sets, as it could alleviate the potentially infeasible computation times of general-purpose MCMC samplers that are not customized to the specified models.
We believe that VB will receive increasing attention as an alternative estimation approach for multidimensional IRT models. First, research showed that the method performed well in estimating simpler IRT models [11]. Second, many applications of multidimensional IRT models involve large data sets. A prominent example is large-scale assessments (LSAs) such as the Programme for the International Assessment of Adult Competencies (PIAAC; [12]) or the Programme for International Student Assessment (PISA; [13]), which provide publicly available, rich data sets that support investigating multiple competencies in large, representative samples from multiple countries. For the analysis of such data, VB may pose a computationally more feasible alternative to general-purpose MCMC samplers. Third, and most importantly for the present context, general-purpose VB algorithms are now available in the statistical programming language Stan. This makes VB methods available to a wide audience while also maintaining flexibility in model specification. However, although Stan's VB algorithms have been shown to perform well in simple linear and hierarchical logistic regression analyses [14,15], it is currently unclear whether the results of Stan's general-purpose VB algorithms are also trustworthy for complex IRT models.
Focusing on multidimensional IRT models with between-item dimensionality, the aim of the present study is to investigate not only how well Stan's general-purpose VB approach works for estimating complex IRT models, but also whether applied researchers can use the current implementation in their work. In doing so, we (a) contribute to an improved understanding of the performance of Stan's general-purpose VB function for estimating complex models with high-dimensional parameter spaces, (b) further investigate the utility of VB for psychometric applications, and (c) provide psychometric researchers with an evaluation as to whether drawing on Stan's VB algorithms may be a promising direction for implementing their newly developed IRT models. In the following, we first briefly review MML and MCMC-based Bayesian estimation of multidimensional IRT models. Thereafter, we give a short introduction to the fundamentals of VB methods and their implementation in Stan. We then report the results of a simulation study in which we examine the performance of VB as implemented in Stan under different conditions of model complexity and sample size. We continue with an illustration of the approach using a real-world example data set.
2. Conventional Estimation of IRT Parameters
The focus of the present study is a $D$-dimensional two-parameter logistic (2PL) model with between-item dimensionality (i.e., each item loads on one dimension only). We focus on multidimensional IRT models with between-item dimensionality because we believe (1) that they are more relevant for, and more widely employed by, applied researchers, as they correspond to the design of typical cognitive assessments and allow the factors to be interpreted more clearly. In typical LSA designs, for example, items are assumed to be reflective of one proficiency only (e.g., science, mathematical literacy, or reading). As we see LSA data as a potential use case for employing VB methods, we aim at investigating a model that is applied in this context. Furthermore, we (2) deem it promising to start with a relatively simple model to investigate the utility of Stan's VB algorithm for estimating multidimensional IRT models and better understand its performance. In the case that VB does not yield trustworthy results even for a simple multidimensional model, it may be worthwhile to first improve performance for this simpler model before moving to more complex models such as models with within-item dimensionality (i.e., cross-loadings) or non-compensatory models with multiplicative terms.
In multidimensional IRT models with between-item dimensionality, the dichotomously scored item responses $y_{ij}$, containing examinee $i$'s, $i = 1, \ldots, N$, score on item $j$, $j = 1, \ldots, J$, measuring dimension $d$, $d = 1, \ldots, D$, are modeled as

$$P(y_{ij} = 1 \mid \theta_{id(j)}, a_j, b_j) = \frac{\exp\big(a_j(\theta_{id(j)} - b_j)\big)}{1 + \exp\big(a_j(\theta_{id(j)} - b_j)\big)} \tag{1}$$

where $b_j$ and $a_j$ denote item $j$'s difficulty and discrimination parameter, respectively. Person $i$'s proficiency on the $d$th dimension measured by item $j$ is given by $\theta_{id(j)}$. The function $d(j)$ maps the item's index $j$ to the index $d$ of the latent proficiency variable it is assumed to measure. A joint multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and variance–covariance matrix $\boldsymbol{\Sigma}$ is assumed for the person proficiencies $\boldsymbol{\theta}_i = (\theta_{i1}, \ldots, \theta_{iD})'$:

$$\boldsymbol{\theta}_i \sim \mathrm{MVN}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \tag{2}$$

This yields the following likelihood

$$L(\mathbf{y}; \boldsymbol{\xi}) = \prod_{i=1}^{N} \phi(\boldsymbol{\theta}_i; \boldsymbol{\mu}, \boldsymbol{\Sigma}) \prod_{j=1}^{J} P_{ij}^{\,y_{ij}} \big(1 - P_{ij}\big)^{1 - y_{ij}} \tag{3}$$

where $P_{ij} = P(y_{ij} = 1 \mid \theta_{id(j)}, a_j, b_j)$, $\boldsymbol{\xi}$ is a vector of the parameters to be estimated, and $\phi$ denotes a multivariate normal density. For model identification, all elements in $\boldsymbol{\mu}$ can be set to 0 and variances (i.e., diagonal elements of $\boldsymbol{\Sigma}$) can be set to 1. Depending on the research questions, both incidental (i.e., item difficulties and discriminations) and structural parameters (e.g., correlations among the dimensions) may be of interest.
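To make the model specification concrete, the following Stan program sketches one way to implement the model in Equations (1)–(3). This is an illustrative sketch under our own naming conventions and prior choices, not the program used in the present study. Responses are stored in long format (one row per observed person–item pair), which accommodates the large proportions of missing responses typical of LSA designs.

```stan
// Sketch of a D-dimensional 2PL model with between-item dimensionality.
// Illustrative only; variable names and priors are our own choices.
data {
  int<lower=1> N;                           // number of examinees
  int<lower=1> J;                           // number of items
  int<lower=1> D;                           // number of dimensions
  int<lower=1> M;                           // number of observed responses
  array[M] int<lower=1, upper=N> ii;        // examinee index of response m
  array[M] int<lower=1, upper=J> jj;        // item index of response m
  array[J] int<lower=1, upper=D> item_dim;  // d(j): dimension measured by item j
  array[M] int<lower=0, upper=1> y;         // scored item responses
}
parameters {
  vector[J] b;                              // item difficulties
  vector<lower=0>[J] a;                     // item discriminations
  cholesky_factor_corr[D] L;                // Cholesky factor of latent correlations
  matrix[N, D] z;                           // standardized person proficiencies
}
transformed parameters {
  // Non-centered parameterization: rows of theta are MVN(0, L L'),
  // i.e., means fixed at 0 and variances fixed at 1 for identification.
  matrix[N, D] theta = z * L';
}
model {
  // Weakly informative priors (an assumption of this sketch).
  b ~ normal(0, 5);
  a ~ lognormal(0, 1);
  L ~ lkj_corr_cholesky(1);
  to_vector(z) ~ std_normal();
  // 2PL likelihood with between-item dimensionality, Equation (1).
  for (m in 1:M)
    y[m] ~ bernoulli_logit(a[jj[m]] * (theta[ii[m], item_dim[jj[m]]] - b[jj[m]]));
}
generated quantities {
  corr_matrix[D] Omega = multiply_lower_tri_self_transpose(L);  // latent correlations
}
```

A program of this kind can be estimated with NUTS-based MCMC or with VB from the same model file, e.g., via rstan's sampling() and vb() functions or cmdstanr's $sample() and $variational() methods.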
2.1. Marginal Maximum Likelihood Estimation
Standard statistical software for estimating (multidimensional) IRT models such as TAM [1] or mirt [2], but also software used in LSA reporting for item parameter calibration such as mdltm [4], currently employed in PIAAC and PISA, draws on MML estimation via the expectation–maximization (EM) algorithm. However, using MML is limited by the number of dimensions $D$, as maximizing the marginal log-likelihood involves solving high-dimensional integrals that are usually analytically intractable. Hence, numerical approximation methods for the integrals have to be used [16]. These methods are known to be computationally demanding under conditions with three or more dimensions, and may come with a computational burden that jeopardizes the applicability of higher-dimensional IRT models in practical settings.
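To illustrate the source of this computational burden (a standard argument, using the notation of Equation (3); the packages cited above differ in their specific integration schemes), consider approximating examinee $i$'s contribution to the marginal likelihood by a product quadrature rule with $Q$ nodes per dimension:

$$\int_{\mathbb{R}^D} \prod_{j=1}^{J} P_{ij}^{\,y_{ij}} \big(1 - P_{ij}\big)^{1 - y_{ij}}\, \phi(\boldsymbol{\theta}_i; \boldsymbol{\mu}, \boldsymbol{\Sigma})\, d\boldsymbol{\theta}_i \;\approx\; \sum_{q_1=1}^{Q} \cdots \sum_{q_D=1}^{Q} w_{q_1} \cdots w_{q_D} \prod_{j=1}^{J} P_{ij}(\mathbf{x}_{q_1 \ldots q_D})^{\,y_{ij}} \big(1 - P_{ij}(\mathbf{x}_{q_1 \ldots q_D})\big)^{1 - y_{ij}}$$

The number of node evaluations thus grows as $Q^D$: with $Q = 15$ nodes in $D = 5$ dimensions, a single E-step already requires $15^5 = 759{,}375$ evaluations per examinee. Adaptive, Monte Carlo, and quasi-Monte Carlo integration schemes mitigate, but do not remove, this exponential growth.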
2.2. Bayesian Estimation via Markov Chain Monte Carlo
In contrast to the MML approach, Bayesian estimation combines the likelihood function of the data with prior distributions $p(\boldsymbol{\xi})$ for the to-be-estimated model parameters $\boldsymbol{\xi}$. This allows the posterior distribution $p(\boldsymbol{\xi} \mid \mathbf{y})$ of the parameters to be obtained, given the data $\mathbf{y}$:

$$p(\boldsymbol{\xi} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \boldsymbol{\xi})\, p(\boldsymbol{\xi})}{p(\mathbf{y})} \tag{4}$$

The posterior distribution is used to construct point and interval estimators of the model parameters by computing certain summary measures and quantiles, respectively, of the posterior distribution itself or the marginal distribution of a model parameter. However, for complex models with many parameters it is difficult to analytically derive these measures. Therefore, MCMC algorithms (see [17] for an introduction), such as the Gibbs sampler or the no-U-turn sampler [18], are commonly used to approximate the posterior distribution by iteratively drawing random numbers from it. The sampled values constitute a Markov chain and, once the chain has stabilized, one uses the random draws to compute summary statistics such as the mean or quantiles.
Bayesian estimation of (multidimensional) IRT models bears various advantages compared with MML. First, state-of-the-art MCMC sampling techniques (e.g., [18]) can deal well with high-dimensional, highly correlated parameter spaces, rendering them particularly useful for complex and high-dimensional IRT models. Thus, restrictions on the dimensionality under MML estimation do not apply to MCMC-based estimation techniques commonly employed for Bayesian estimation (e.g., [19]). Second, the Bayesian approach can be more easily adapted to new data and modeling situations and hence “supports in a natural way extensions of common item response models” (see [5], p. 2). These extensions are important to address challenges such as complex response behavior (e.g., guessing or switches in response processes), missingness and nonresponse, or complex sampling designs.
MCMC algorithms are central to Bayesian statistics and have been instrumental in the (still) increasing adoption of the Bayesian approach in applied research. However, a problem of MCMC algorithms is that they can be very computationally demanding, especially when researchers draw on general-purpose samplers instead of samplers tailored to the specific model employed. Hence, while general-purpose samplers make Bayesian estimation techniques easily accessible and provide researchers with great flexibility in model specification, their computation times can become prohibitive when sample sizes are large, rendering Bayesian estimation in such contexts infeasible. Tailoring samplers to a specific model can be a cumbersome endeavor, and developing well-scalable samplers is not equally straightforward for all classes of models. Therefore, alternative approximation methods for the posterior distribution have been investigated, which may also be used in the context of multidimensional IRT models.
3. Variational Bayesian Methods
In contrast to MCMC algorithms that approximate the posterior distribution by sampling from it, VB is a deterministic technique in which approximating the posterior is transformed into an optimization problem [20,21]. VB methods were originally developed in the context of machine learning, but are now used in various fields with large and complex data sets, including computational biology, computational neuroscience, and natural language processing and speech recognition [21]. VB methods only recently gained attention in the psychometric literature, where they have been applied to estimate the parameters of customary IRT models [11,22], cognitive diagnostic models [23,24,25], latent class analysis [26], and models from cognitive and mathematical psychology [27,28]. We note that variational methods can not only be used to approximate the posterior distribution, but also to approximate any (complex) probability density. For this reason, variational methods are also suitable for MML estimation, where one is dealing with complex likelihood functions. Research on likelihood-based variational inference has covered complex IRT models such as models with random item effects [29] or multidimensional models [30], as well as the general case of generalized linear latent variable models [31,32].
The basic idea behind VB is to approximate the posterior distribution $p(\boldsymbol{\xi} \mid \mathbf{y})$ with a second distribution $q(\boldsymbol{\xi}; \boldsymbol{\lambda})$ over the model parameters [10]. The second distribution is controlled by the variational parameters $\boldsymbol{\lambda}$ that determine how far or close $q$ is to the posterior distribution. In fact, the main goal of VB is to determine those parameter values $\boldsymbol{\lambda}$ for which the distance between $q$ and the posterior distribution is minimal. Once this optimization problem is solved, one can use $q$ instead of the posterior distribution to compute summary statistics such as the mean or quantiles.
In order to determine the variational parameters for which $q$ is as close as possible to the posterior distribution, one must define a measure that maps the distance between the two probability densities. The most commonly used measure is the Kullback–Leibler (KL) divergence that, applied to the present context, is given by:

$$\mathrm{KL}\big(q(\boldsymbol{\xi}; \boldsymbol{\lambda}) \,\|\, p(\boldsymbol{\xi} \mid \mathbf{y})\big) = \mathbb{E}_q\!\left[\log \frac{q(\boldsymbol{\xi}; \boldsymbol{\lambda})}{p(\boldsymbol{\xi} \mid \mathbf{y})}\right] \tag{5}$$

The formula for the KL divergence can be further transformed because it is an expectation with respect to $q$:

$$\begin{aligned} \mathrm{KL}\big(q \,\|\, p\big) &= \mathbb{E}_q\big[\log q(\boldsymbol{\xi}; \boldsymbol{\lambda})\big] - \mathbb{E}_q\big[\log p(\boldsymbol{\xi} \mid \mathbf{y})\big] \\ &= \mathbb{E}_q\big[\log q(\boldsymbol{\xi}; \boldsymbol{\lambda})\big] - \mathbb{E}_q\big[\log p(\boldsymbol{\xi}, \mathbf{y})\big] + \log p(\mathbf{y}) \end{aligned} \tag{6}$$

where the last line follows from inserting Equation (4). This shows that it is actually not possible to minimize the KL divergence, because the marginalizing constant $p(\mathbf{y})$ is unknown. However, as $\log p(\mathbf{y})$ does not depend on $q$ and hence acts as a constant in the optimization, one can instead maximize [10,21,27]:

$$\mathrm{ELBO}(q) = \mathbb{E}_q\big[\log p(\boldsymbol{\xi}, \mathbf{y})\big] - \mathbb{E}_q\big[\log q(\boldsymbol{\xi}; \boldsymbol{\lambda})\big] \tag{7}$$

with respect to $\boldsymbol{\lambda}$ to obtain a density $q$ that is as close as possible to the posterior distribution. This function is also known as the evidence lower bound (ELBO), because it is a lower bound for the marginalizing constant, or ‘evidence’, $p(\mathbf{y})$: rearranging Equation (6) yields $\log p(\mathbf{y}) = \mathrm{ELBO}(q) + \mathrm{KL}(q \,\|\, p) \geq \mathrm{ELBO}(q)$, as the KL divergence is non-negative.
In addition to specifying the distance measure, it is also necessary to specify which distribution $q$ is to be used as the approximating distribution. Of course, $q$ should be chosen in a way that the function is easy to handle but also flexible enough to achieve a good approximation. One way to find such a distribution is to assume that $q$ can be factorized as

$$q(\boldsymbol{\xi}; \boldsymbol{\lambda}) = \prod_{k=1}^{K} q_k(\xi_k; \lambda_k) \tag{8}$$

Equation (8) assumes complete independence of all individual parameters $\xi_k$. This is sometimes referred to as the ‘naive’ mean-field assumption. One can also assume that the elements in $\boldsymbol{\xi}$ can be divided into $M$ parameter groups and that $q$ factorizes with respect to these groups [10].
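Written out (in our notation, not the original's), this weaker, blockwise mean-field assumption reads

$$q(\boldsymbol{\xi}; \boldsymbol{\lambda}) = \prod_{m=1}^{M} q_m(\boldsymbol{\xi}_m; \boldsymbol{\lambda}_m),$$

where each block $\boldsymbol{\xi}_m$ (e.g., all item parameters, or all proficiencies of one examinee) retains its joint dependence within, but not across, blocks.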
Note that Equation (8) makes no assumption about the form or type of distribution (e.g., that it is a normal distribution). Nevertheless, the assumption is sufficient to derive an optimization algorithm (called coordinate ascent variational inference; CAVI) that can be used to find the approximating density for a parameter that maximizes the ELBO. In fact, the optimal solution is given by

$$q_k^{*}(\xi_k) \propto \exp\big\{\mathbb{E}_{-k}\big[\log p(\xi_k \mid \boldsymbol{\xi}_{-k}, \mathbf{y})\big]\big\} \tag{9}$$

where the expectation is taken with respect to the parameters that are currently not considered. Equation (9) shows that VB methods and Gibbs sampling are closely related because both use the conditional densities of the parameters. Furthermore, even if the form of $q$ has not been specified, the use of Equation (9) leads to variational densities that are standard distributions and whose parameters are optimized during CAVI. For example, if one uses VB to estimate the expected value of a normally distributed variable, then $q$ is a normal distribution whose mean and variance are updated during CAVI. In fact, the distribution of $q$ is implicitly defined by the shape of the likelihood and the prior distributions, again showing the close connection between VB and Gibbs sampling.
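As a concrete illustration (the classic conjugate textbook example, not taken from the present study): suppose $y_1, \ldots, y_n \sim N(\mu, \tau^{-1})$ with priors $\mu \sim N(\mu_0, (\lambda_0 \tau)^{-1})$ and $\tau \sim \mathrm{Gamma}(a_0, b_0)$, and assume the factorization $q(\mu, \tau) = q(\mu)\, q(\tau)$. Applying Equation (9) to each factor yields

$$q^{*}(\mu) = N\!\left(\frac{\lambda_0 \mu_0 + n \bar{y}}{\lambda_0 + n},\; \big[(\lambda_0 + n)\, \mathbb{E}_q[\tau]\big]^{-1}\right), \qquad q^{*}(\tau) = \mathrm{Gamma}\!\left(a_0 + \frac{n+1}{2},\; b_0 + \frac{1}{2}\, \mathbb{E}_q\!\left[\lambda_0 (\mu - \mu_0)^2 + \sum_{i=1}^{n} (y_i - \mu)^2\right]\right).$$

Neither factor was assumed to be normal or Gamma; these forms follow from the conditional densities, and CAVI iterates between the two updates, each depending on expectations under the other, until the ELBO converges.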
Equations (8) and (9) can be used to derive a VB approach for a specific statistical model. However, since these derivations are often technically very complex and are tied to the specific model, an alternative general-purpose approach would be (a) to assume a specific distribution per se for all model parameters and all models and (b) to maximize Equation (7) to determine the parameters of this distribution. For instance, one can assume that the approximating distributions are independent normal distributions:

$$q(\boldsymbol{\xi}; \boldsymbol{\lambda}) = \prod_{k=1}^{K} N(\xi_k; \mu_k, \sigma_k^2) \tag{10}$$

In this case, Equation (7) is

$$\mathrm{ELBO}(\boldsymbol{\lambda}) = \mathbb{E}_q\big[\log p(\mathbf{y}, \boldsymbol{\xi})\big] + H\big(q(\boldsymbol{\xi}; \boldsymbol{\lambda})\big) \tag{11}$$

where $H$ is the entropy of the normal distribution. This approach is called mean-field Gaussian and, in fact, is the approach taken in Stan's built-in VB function under default settings (see [33]). It is also possible to specify a multivariate normal as the approximating distribution in Equation (10), and this is called full-rank Gaussian in Stan. The normal distribution assumption can also be exploited to approximate the expectation in Equation (11) and the gradient of Equation (11) via Monte Carlo integration by sampling from $q$. In Stan, the gradient is computed using automatic differentiation (hence the name automatic differentiation variational inference; ADVI; see [15]), and together with the normal distribution assumption, this is another reason why Stan's approach can be used for a wide variety of statistical models—such as multidimensional 2PL models—for which the derivation of a specific variational inference algorithm would be too complicated.
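Schematically, and simplifying the estimator described in [15] by omitting the transformation to the unconstrained space: writing draws from the mean-field Gaussian as $\boldsymbol{\xi}^{(s)} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}^{(s)}$ with $\boldsymbol{\varepsilon}^{(s)} \sim N(\mathbf{0}, \mathbf{I})$, the gradient of Equation (11) with respect to $\boldsymbol{\mu}$ becomes

$$\nabla_{\boldsymbol{\mu}}\, \mathrm{ELBO}(\boldsymbol{\lambda}) = \mathbb{E}_{\boldsymbol{\varepsilon}}\!\left[\nabla_{\boldsymbol{\xi}} \log p(\mathbf{y}, \boldsymbol{\xi})\big|_{\boldsymbol{\xi} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\varepsilon}}\right] \approx \frac{1}{S} \sum_{s=1}^{S} \nabla_{\boldsymbol{\xi}} \log p\big(\mathbf{y}, \boldsymbol{\xi}^{(s)}\big),$$

where the entropy term drops out because $H(q)$ does not depend on $\boldsymbol{\mu}$. The inner gradient is exactly what Stan's automatic differentiation supplies for any model with a differentiable log density, so stochastic gradient ascent on the ELBO requires no model-specific derivations.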
The price for Stan's default general-purpose VB algorithm is the mean-field Gaussian assumption, which may be too restrictive in the context of complex models with high-dimensional, highly correlated parameter spaces. That is, the parameters may neither be independent of each other, nor may they be normally distributed. The full-rank algorithm implemented in Stan, in contrast, overcomes the assumption of parameters being independent of each other; it may, however, also be more challenged by complex models. It is therefore worthwhile to investigate Stan's built-in VB algorithms' ability to handle complex models, whether the default yields trustworthy results even under potential violations of the mean-field Gaussian assumption, and to what extent Stan's algorithms are challenged by the complexity and high-dimensional parameter spaces that multidimensional IRT models entail.
6. Illustrative Data Example
We also examined differences between VB, MCMC, and MML using data from PISA 2018 [13]. We focused on the Spanish sample, as this, with 35,922 examinees, comprised the largest country-level sample size. Examinees were administered a total of 445 items, measuring three domains. For the major domain, reading, 260 items were administered. Items measuring reading fluency were excluded, as these were scaled separately from the other reading items in PISA 2018. For the minor domains, science and mathematical literacy, 115 and 70 items were administered, respectively. Note that due to the incomplete block designs employed for the minor domains and the multistage adaptive test design employed for the major domain, each examinee was administered only a fraction of the items. The data set comprised a total of 1,984,724 item responses; item-level sample sizes ranged from 2055 to 17,815. For estimation, we employed the same set-up as in the simulation study.
As expected, we observed vast differences in both run time and parameter estimates. VB yielded by far the shortest run time of 16 min, followed by MML with 5 h and MCMC with 21 h. In line with results from the simulation study, there were pronounced differences in parameter estimates for correlations among the three domains (see Table 4) retrieved with VB as compared to MML and MCMC. While there was strong agreement between estimates retrieved from MML and MCMC, VB yielded markedly lower correlations. For instance, while MML and MCMC yielded correlations between science and mathematical literacy of about 0.81, the correlation obtained with VB was 0.42. Hence, in the present application, VB would result in conclusions that are drastically different from those drawn from MML and MCMC results. While researchers investigating the relationship between science and mathematical literacy using MCMC or MML would conclude that these competencies are highly correlated, researchers drawing on VB would conclude that these competencies are related only to a moderate degree. MML and MCMC yielded comparable item parameter estimates, with correlations between estimates of discriminations and difficulties, respectively, above 0.99. As shown in Figure 1, VB yielded estimates comparable to those from MCMC for difficulties, but somewhat different results for discriminations. The correlation between estimates was 0.95 for the difficulties and 0.86 for the discriminations. Nevertheless, in line with the simulation study, differences in discriminations were much less pronounced than differences in correlations.
7. Discussion
The aim of the present study was to investigate Stan's VB function for estimating multidimensional 2PL models with between-item dimensionality under realistic research conditions. Stan provides general-purpose VB algorithms that strongly facilitate the implementation of variational inference for a wide array of models, rendering it a potentially powerful tool for fitting complex IRT models to large data sets. Stan provides algorithms either relying on the mean-field assumption, i.e., assuming the approximate distribution to factorize into independent normal distributions, or assuming a multivariate normal as the approximating distribution. In the given context, however, only Stan's default, implementing the rather restrictive mean-field Gaussian assumption, yielded results at all (see [39] for comparable results in the context of Poisson regression with random effects). A potential reason for the failure of the full-rank VB algorithm in the context of multidimensional 2PL models may be their (potentially highly correlated) high-dimensional parameter space and, as such, the high dimensionality of the approximating multivariate normal distribution as well as the associated increased computational costs.
Based on the results of the simulation study, we conclude that in its current form, Stan's built-in VB algorithm does not (yet) pose a viable alternative to MML and MCMC-based Bayesian estimation of multidimensional IRT models in large data sets. In general, no reliable mean-field VB approximations could be obtained, as indicated by consistently large Pareto $\hat{k}$ diagnostic values, the accompanying warnings issued by Stan, and biased parameter estimates. Concerning parameters of the measurement model, mean-field VB yielded trustworthy results only for item difficulties, while estimates of item discriminations were downward biased and also slightly more variable compared with estimates of the MCMC and MML benchmarks. These results mirror those from previous studies. For a multilevel random intercept logistic regression model, for instance, [40] reported that mean-field VB yielded good results for fixed effects regression parameters, but poor performance for variance components. [41] reported comparable results for a Poisson regression model with random effects. Note that we fixed variances of the fitted multidimensional 2PL models to one for identification, such that issues related to the estimation of variances may ‘leak’ into the estimation of discrimination parameters. Furthermore, we found discrimination parameters to be especially prone to bias and high variability under conditions with a high number of dimensions, while under conditions with only two dimensions, bias remained below 10%. One may argue that researchers might be willing to trade off this amount of loss in accuracy of parameter estimates against speed of estimation. However, while mean-field VB yielded a marked speed-up in comparison to MCMC-based Bayesian estimation, it did not generally outperform MML in terms of run time, even with as many as 500 nodes employed for quasi-Monte Carlo integration, rendering MML the better choice in terms of both run time and trustworthiness of parameter estimates.
Mean-field VB estimates of the considered structural parameters (i.e., correlations between dimensions) were unbiased and displayed low variability when the true population correlations were zero. Given typical competence assessment data, however, zero correlations pose an unrealistic scenario. Under more realistic conditions with highly correlated dimensions, challenging the mean-field assumption, estimates of correlations were severely downward biased, even more so when a high number of dimensions was measured with relatively few items. The poor performance of mean-field VB under highly correlated dimensions may indicate that the mean-field assumption is too restrictive in these settings. We found comparable differences in estimates of correlations between mean-field VB and the MML and MCMC benchmarks in the illustrative real-data example, indicating that performance differences between the estimation approaches are of real relevance under real-data conditions.
Limitations and Future Directions
Stan's VB algorithm is applicable to a wide array of models and has not been tailored to the context of multidimensional IRT models. We found that in its current form, Stan's built-in VB algorithm cannot handle multidimensional IRT models well, especially when the number of dimensions is high and/or when the latent dimensions are highly correlated. While the default algorithm relying on the mean-field assumption was prone to bias, we were not able to retrieve results under Stan's full-rank algorithm, which assumes a multivariate normal as the approximating distribution. Future research may investigate to what extent model reparameterizations may aid in retrieving results when drawing on Stan's full-rank algorithm. It should, however, be kept in mind that Stan's full-rank algorithm has been reported to provide worse approximations to the MCMC posterior means than the mean-field version for a variety of model classes [42], rendering it unlikely that—even when results can be obtained—Stan's full-rank algorithm outperforms its mean-field algorithm in terms of the accuracy of parameter estimates of multidimensional 2PL models.
We would like to note that VB has only recently been implemented in Stan and is, at present, still at a beta version stage. Given the rapid development of both VB algorithms and Stan's functionalities, we are optimistic that the applicability of general-purpose VB algorithms implemented in accessible software for estimating complex multidimensional IRT models will improve in the near future. Nevertheless, our research also suggests that the performance of these improved algorithms must be examined before they are used in applied contexts.
In the meantime, to draw on VB algorithms that are both scalable and trustworthy in the context of IRT analyses, we think that future research should examine the performance of VB approaches customized to this specific context. Such developments may be especially useful in contexts where customized MCMC samplers may be challenging to develop, e.g., when no sufficient statistics can be assumed (see [7,43]). For instance, partially factorized VB methods—avoiding a fully factorized approximation, and instead assuming a factorization only for a specific subset of parameters (see [44] for developments and evaluations in the context of probit models)—weaken the mean-field assumption and may therefore pose a better alternative for settings with highly correlated latent dimensions, while at the same time being less challenged by complex models than full-rank Gaussian algorithms. Another alternative may be to build on VB approaches that have been developed in the context of binary outcomes and that circumvent the assumption of independent normal approximate distributions. The authors of [45], for instance, achieved good performance by employing the Jaakkola–Jordan variational lower bound [46] in the context of semiparametric logistic regression. The authors of [47] further developed the approach suggested by [48] and achieved accurate results for logistic regression. Furthermore, ref. [22] recently proposed the variational item response theory lower bound (VIBO) as an optimization objective for VB tailored to the context of multidimensional IRT models. In their evaluations of VIBO for multidimensional IRT models, however, the authors considered a single data set and focused only on the recovery of proficiency estimates. While their results are promising, we think that a more thorough investigation of their approach is needed.
Due to computational constraints, we focused on a small set of conditions and considered only 100 replications per condition in the simulation study. To corroborate the conclusions of the present study, future research may aim at replicating results for some selected conditions with an increased number of replications. Furthermore, concerning correlations between latent dimensions, we only considered two rather extreme cases. While correlations of zero were well recovered, correlations among highly correlated dimensions were severely biased. Future research may aim at establishing boundary conditions of maximum latent correlations that still support retrieving trustworthy parameter estimates.
It should also be noted that conditions encountered in LSA data, a major potential use case for VB, are commonly much more complex than those investigated in our simulation study. It may well be that Stan's built-in VB algorithm is even more challenged by these more complex conditions, and we note that this also challenges the development of customized VB algorithms. For instance, LSA data commonly stem from complex sampling designs with multilevel structures that researchers may want to consider in their analyses. Further, LSA data commonly have high proportions of missing responses—both planned (e.g., due to incomplete block or multistage adaptive test designs) and unplanned (e.g., due to item omissions, time restrictions, or quitting behavior). This can result in rather small item-level sample sizes, even when the overall sample size is large—in the illustrative real-data example, for instance, even for an overall sample size of 35,922, the smallest item-level sample size was as low as 2055. While MML and MCMC can handle such conditions well (see [49,50]), VB may yield less accurate estimates for items with small item-level sample sizes. Investigating and developing VB algorithms for more complex models and data conditions is an important and challenging task for future research.
In the present study, we investigated models with between-item dimensionality that represent a rather simple special case of multidimensional IRT models and that mirror typical LSA designs. Stan's built-in VB algorithm may be even further challenged under multidimensional IRT models with more complex structures, e.g., due to complex loading patterns under within-item dimensionality or multiplicative structures in non-compensatory multidimensional IRT models. At the same time, developing customized VB algorithms for more complex multidimensional IRT models poses a further pertinent topic for future research, as customary estimation of such models with MML or MCMC can be very computationally demanding, limiting their applicability [51].