Linking Error Estimation in Haberman Linking

Robitzsch, Alexander

doi:10.3390/appliedmath5010007

by Alexander Robitzsch ^{1,2}

¹

IPN—Leibniz Institute for Science and Mathematics Education (Leibniz-Institut für die Pädagogik der Naturwissenschaften und Mathematik), Olshausenstraße 62, 24118 Kiel, Germany

²

Centre for International Student Assessment (ZIB; Zentrum für Internationale Bildungsvergleichsstudien), Olshausenstraße 62, 24118 Kiel, Germany

AppliedMath 2025, 5(1), 7; https://doi.org/10.3390/appliedmath5010007

Submission received: 19 November 2024 / Revised: 31 December 2024 / Accepted: 9 January 2025 / Published: 13 January 2025

Abstract

Haberman linking is a widely used method for comparing groups using the two-parameter logistic item response model. However, the traditional Haberman linking approach relies on joint item parameter estimation, which prevents the application of standard M-estimation theory for linking error calculation in the presence of differential item functioning. To address this limitation, a novel pairwise Haberman linking method is introduced. Pairwise Haberman linking aligns with Haberman linking when no items are missing but eliminates the need for joint item parameters, allowing for the use of M-estimation theory in linking error computation. Theoretical derivations and simulation studies show that pairwise Haberman linking delivers reliable statistical inferences for items and persons, particularly in terms of coverage rates. Furthermore, using a bias-corrected linking error is recommended to reduce the influence of sample size on error estimates.

Keywords:

linking; item response model; 2PL model; Haberman linking; pairwise Haberman linking; differential item functioning

MSC:

62F35; 62H10; 62H17; 62H25; 62P25

1. Introduction

Item response theory (IRT) models [1,2] are statistical models to analyze multivariate discrete random variables. In this article, we focus on dichotomous (i.e., binary) random variables and the comparison of multiple groups by means of linking procedures. Let

X = (X_{1}, \dots, X_{I})

denote a vector of I random variables

X_{i} \in \{0, 1\}

, in the literature referred to as items or item responses. The set of items is also referred to as a (e.g., cognitive or personality) test in the social sciences. A unidimensional IRT model [3] is a statistical model for the probability distribution

P (X = x)

for

x = (x_{1}, \dots, x_{I}) \in \{0, 1\}^{I}

P (X = x; δ, γ) = \int \prod_{i = 1}^{I} [{P_{i} (θ; γ_{i})}^{x_{i}} {(1 - P_{i} (θ; γ_{i}))}^{1 - x_{i}}] ϕ (θ; μ, σ) d θ,

(1)

where

ϕ

is the density of the normal distribution, parameterized by the mean

μ

and the standard deviation (SD)

σ

. The distribution parameters of the latent ability variable are contained in

δ = (μ, σ)

. Note that the normal distribution of the

θ

variable is not required [4,5], but is often convenient in estimating the IRT model. Since the normal distribution serves as a prior distribution for the individual posterior distributions

P (θ | X = x)

, its influence diminishes as the number of items increases. The vector

γ = (γ_{1}, \dots, γ_{I})

includes the item parameters associated with the item response functions (IRFs)

P_{i} (θ; γ_{i}) = P (X_{i} = 1 | θ)

for

i = 1, \dots, I

. The IRF of the two-parameter logistic (2PL) model [6] is expressed as

P_{i} (θ; γ_{i}) = Ψ (a_{i} (θ - b_{i})),

(2)

where

a_{i}

and

b_{i}

denote the item discrimination and the item difficulty, respectively, and

Ψ (x) = {(1 + exp (- x))}^{- 1}

is the logistic distribution function. Here, the item parameter vector is given by

γ_{i} = (a_{i}, b_{i})

.

It is important to note that the item parameters

(a_{i}, b_{i})

for

i = 1, \dots, I

and distribution parameters

(μ, σ)

cannot be disentangled from each other without additional identification constraints [7]. If the parameters

(a_{i}, b_{i})

and

(μ, σ)

represent the multivariate distribution

P (X = x)

in the IRT model (1), the model could alternatively be represented by item parameters

({\tilde{a}}_{i}, {\tilde{b}}_{i})

and

μ = 0

and

σ = 1

, where

{\tilde{a}}_{i} = a_{i} σ and {\tilde{b}}_{i} = σ^{- 1} (b_{i} - μ) .

(3)
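The identity in (3) can be checked numerically. The following minimal Python sketch evaluates the 2PL IRF in (2) under both parameterizations; the function names and the numerical values are illustrative assumptions and are not taken from the article.

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function: logistic function of a * (theta - b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def standardize_item(a, b, mu, sigma):
    """Map (a, b) under theta ~ N(mu, sigma^2) to the equivalent
    parameters under a standard normal ability, as in Equation (3)."""
    return a * sigma, (b - mu) / sigma

a, b = 1.2, 0.5        # illustrative item parameters (assumed values)
mu, sigma = 0.3, 1.2   # illustrative group mean and SD (assumed values)
a_tilde, b_tilde = standardize_item(a, b, mu, sigma)

# For a standardized ability z, theta = mu + sigma * z must give the same
# response probability under both parameterizations.
max_gap = max(
    abs(irf_2pl(mu + sigma * z, a, b) - irf_2pl(z, a_tilde, b_tilde))
    for z in (-2.0, -0.5, 0.0, 1.0, 2.5)
)
```

Because the response probabilities coincide for every ability value, the two parameter sets cannot be distinguished from data, which is why identification constraints are needed.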

For a sample of N individuals with independently and identically distributed observations

x_{1}, \dots, x_{N}

drawn from the distribution of the random variable

X

, the unknown parameters of the IRT model in (1) can be consistently estimated using marginal maximum likelihood estimation (MML; [8,9]).

IRT models are frequently used to assess and compare the performance of two groups on a test by analyzing differences in the latent variable

θ

, as defined in the IRT model in Equation (1). This paper focuses on linking methods [10] based on the 2PL model. In the initial step of the linking process, the 2PL model is estimated separately for each group, allowing for the possibility of differential item functioning (DIF), where items may exhibit varying behaviors across groups [11,12,13,14]. The absence of DIF is labeled as measurement invariance [15]. In the second step, differences in item parameters are utilized to determine group differences in

θ

through a linking method [10,16,17,18].

The presence of DIF introduces additional variability in the estimated group means and group SDs [19,20,21,22,23]. As a result, the estimated distribution parameters are influenced by the selection of items, even when the sample size of persons is infinite. This variability is captured in the linking error [24,25,26,27,28,29].

The total uncertainty consists of both the ordinary standard error and the linking error, collectively referred to as the total error. Previous work has derived and evaluated estimates for linking errors in methods based on two groups [30,31] or fixed item parameter calibration [32,33]. For comparisons across multiple groups, Haberman linking (HL; [34]) is widely used (e.g., [35,36,37,38,39]). An efficient statistical method for calculating the linking error and total error for HL is still lacking in the literature. An exception is the method proposed in [24], which estimates total error using the computationally intensive double jackknife approach for subjects and items. This article addresses that gap by applying M-estimation theory, as outlined in [30]. The M-estimation theory for the linking method has been described in more detail and refined in [31], but the performance of the linking error and total error estimates was only evaluated for linking methods involving two groups. This paper introduces an adapted version of HL, called pairwise HL (PHL), which avoids the need to estimate joint item parameters. When all items are observed in every group, the HL and PHL methods yield equivalent parameter estimates. This article derives linking errors and total errors for the PHL method based on M-estimation theory and evaluates its performance through a simulation study.

The structure of the article is as follows: Section 2 introduces the proposed PHL approach. Section 3 outlines the estimation of the linking error and total error associated with the PHL method. Section 4 reports the results of a simulation study evaluating the accuracy of the linking error and total error estimates for the PHL approach. Finally, the article concludes with a discussion in Section 5.

2. Pairwise Haberman Linking Approach

In this section, we revisit the estimation of Haberman linking (HL) as originally proposed by [34]. This method estimates group means (

μ

) and standard deviations (

σ

), along with joint item discriminations (

a = (a_{1}, \dots, a_{I})

) and item difficulties (

b = (b_{1}, \dots, b_{I})

). However, this formulation has a notable limitation: M-estimation theory cannot be directly applied because the number of estimated parameters increases with the number of items. In the context of statistical inference based on subject sampling, this phenomenon is referred to as the incidental parameter problem in statistics [40,41].

To address this issue, we propose a pairwise Haberman linking (PHL) approach. This method estimates only the distribution parameters (

μ

and

σ

) and defines an optimization function based on differences in item parameters. When all items are available across all groups, PHL is equivalent to HL. However, discrepancies arise when certain items are missing in some groups.

The HL and PHL methods use estimated item parameters

{\hat{a}}_{i g}

and

{\hat{b}}_{i g}

(

i = 1, \dots, I

;

g = 1, \dots, G

) as inputs, which are obtained from the separate estimation of the 2PL model in each group with a fixed mean

μ = 0

and a fixed SD

σ = 1

. Note that these parameters also depend on the unknown group means

μ_{g}

and group SDs

σ_{g}

(see (3)).

The traditional HL method typically involves two steps [34,42]. First, the log-transformed group standard deviations,

s_{g} = log σ_{g}

, are estimated (

g = 1, \dots, G

). In the second step, group means

μ_{g}

are estimated. For identification purposes, we set

s_{1} = 0

(implying

σ_{1} = 1

) and

μ_{1} = 0

.

Let

s = (s_{2}, \dots, s_{G})

and

μ = (μ_{2}, \dots, μ_{G})

. For each item i and group g, define dummy variables

d_{i g}

, which take the value 1 if item i is observed in group g and 0 if the item is missing. In a fully crossed design where all items are observed in all groups, all

d_{i g}

values are equal to 1.

Log-transformed SDs

s

and log-transformed joint item discriminations

α = (α_{1}, \dots, α_{I})

, where

a_{i} = exp (α_{i})

for

i = 1, \dots, I

, are obtained by minimizing the following optimization function:

H_{1} (s, α) = \sum_{i = 1}^{I} \sum_{g = 1}^{G} d_{i g} ρ (log {\hat{a}}_{i g} - α_{i} - s_{g}),

(4)

where

ρ

denotes the square loss function

ρ (x) = x^{2} / 2

, also referred to as the

L_{2}

loss function. Once the estimates

\hat{s} = ({\hat{s}}_{2}, \dots, {\hat{s}}_{G})

have been obtained by minimizing

H_{1}

in (4), the group SD estimates are calculated as

{\hat{σ}}_{g} = exp ({\hat{s}}_{g})

. Subsequently, group means

μ

and joint item difficulties

b

are determined by minimizing

H_{2} (μ, b) = \sum_{i = 1}^{I} \sum_{g = 1}^{G} d_{i g} ρ (exp ({\hat{s}}_{g}) {\hat{b}}_{i g} - b_{i} + μ_{g}),

(5)

where, as before,

ρ

denotes the square loss function. The distribution parameter estimates

\hat{s}

and

\hat{μ}

, along with joint item parameter estimates

\hat{α}

and

\hat{b}

, fulfill the following estimating equations, derived by taking the required partial derivatives in (4) and (5):

\sum_{g = 1}^{G} d_{i g} (log {\hat{a}}_{i g} - α_{i} - s_{g}) = 0 and \sum_{g = 1}^{G} d_{i g} (exp (s_{g}) {\hat{b}}_{i g} - b_{i} + μ_{g}) = 0 for i = 1, \dots, I and

(6)

\sum_{i = 1}^{I} d_{i g} (log {\hat{a}}_{i g} - α_{i} - s_{g}) = 0 and \sum_{i = 1}^{I} d_{i g} (exp (s_{g}) {\hat{b}}_{i g} - b_{i} + μ_{g}) = 0 for g = 2, \dots, G .

(7)

Haberman discussed the conditions under which a unique solution is obtained in the minimization of

H_{1}

and

H_{2}

[34].
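The two-step HL estimation can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation in equateIRT or sirt: both H_1 and H_2 are treated as two-way additive least-squares problems and minimized by alternating between item effects and group effects for a fixed number of iterations. The function names are hypothetical, missing items are marked by None, and every item and group is assumed to have at least one observation.

```python
import math

def two_way_fit(y, n_iter=200):
    """Minimize the sum over observed (i, g) of (y[i][g] - c[i] - t[g])**2
    with t[0] = 0 by alternating updates of item effects c and group effects t."""
    n_items, n_groups = len(y), len(y[0])
    c = [0.0] * n_items
    t = [0.0] * n_groups
    for _ in range(n_iter):
        for i in range(n_items):
            vals = [y[i][g] - t[g] for g in range(n_groups) if y[i][g] is not None]
            c[i] = sum(vals) / len(vals)
        for g in range(1, n_groups):  # the first group is the reference group
            vals = [y[i][g] - c[i] for i in range(n_items) if y[i][g] is not None]
            t[g] = sum(vals) / len(vals)
    return c, t

def haberman_linking(a_hat, b_hat):
    """Return group means and SDs from group-wise 2PL estimates a_hat[i][g], b_hat[i][g]."""
    # Step 1, Equation (4): log a_hat[i][g] is modeled as alpha[i] + s[g].
    log_a = [[None if a is None else math.log(a) for a in row] for row in a_hat]
    _, s = two_way_fit(log_a)
    sigma = [math.exp(v) for v in s]
    # Step 2, Equation (5): sigma[g] * b_hat[i][g] is modeled as b[i] - mu[g].
    scaled_b = [[None if b is None else sigma[g] * b for g, b in enumerate(row)]
                for row in b_hat]
    _, neg_mu = two_way_fit(scaled_b)
    return [-v for v in neg_mu], sigma

# DIF-free illustration built from Equation (3) with assumed distribution parameters.
true_mu, true_sigma = [0.0, 0.3, 0.6], [1.0, 1.2, 0.8]
base = [(0.89, 0.72), (0.67, 1.49), (1.44, -0.88), (1.28, 0.94)]
a_hat = [[a * sg for sg in true_sigma] for a, _ in base]
b_hat = [[(b - m) / sg for m, sg in zip(true_mu, true_sigma)] for _, b in base]
mu_hl, sigma_hl = haberman_linking(a_hat, b_hat)
```

In this DIF-free example, the group-wise item parameters are exact transformations of common item parameters, so the sketch recovers the assumed group means and SDs up to numerical precision.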

To motivate the optimization function of the proposed PHL method, a complete crossed design is assumed, where all items are available in all groups, i.e.,

d_{i g} = 1

for all

i = 1, \dots, I

and

g = 1, \dots, G

. From (7), the following identity holds for all

g \neq h

:

\sum_{i = 1}^{I} (log {\hat{a}}_{i g} - α_{i} - s_{g}) - \sum_{i = 1}^{I} (log {\hat{a}}_{i h} - α_{i} - s_{h}) = \sum_{i = 1}^{I} (log {\hat{a}}_{i g} - log {\hat{a}}_{i h} - s_{g} + s_{h}) = 0 .

(8)

Consequently, for any group g, it holds that

\sum_{h \neq g} \sum_{i = 1}^{I} (log {\hat{a}}_{i g} - log {\hat{a}}_{i h} - s_{g} + s_{h}) = 0 for g = 2, \dots, G .

(9)

Hence, the log-transformed standard deviations

s

can also be obtained by minimizing the pairwise differences

H_{1} (s) = \sum_{i = 1}^{I} ω_{i} \sum_{g \neq h} d_{i g} d_{i h} ρ (log {\hat{a}}_{i g} - log {\hat{a}}_{i h} - s_{g} + s_{h}) .

(10)

Note that item-specific weights

ω_{i}

are introduced in (10), where they equal 1 in complete crossed designs. In designs with missing items, however, the weights will be chosen differently, as will be clarified later.

Similarly, the identity for group means

μ

can be derived as

\sum_{i = 1}^{I} (exp (s_{g}) {\hat{b}}_{i g} - b_{i} + μ_{g}) - \sum_{i = 1}^{I} (exp (s_{h}) {\hat{b}}_{i h} - b_{i} + μ_{h}) = \sum_{i = 1}^{I} (exp (s_{g}) {\hat{b}}_{i g} - exp (s_{h}) {\hat{b}}_{i h} + μ_{g} - μ_{h}) = 0 .

(11)

Therefore, we arrive at

\sum_{h \neq g} \sum_{i = 1}^{I} (exp (s_{g}) {\hat{b}}_{i g} - exp (s_{h}) {\hat{b}}_{i h} + μ_{g} - μ_{h}) = 0 for g = 2, \dots, G .

(12)

The vector of group means

μ

can be estimated in PHL by minimizing

H_{2} (μ) = \sum_{i = 1}^{I} ω_{i} \sum_{g \neq h} d_{i g} d_{i h} ρ (exp ({\hat{s}}_{g}) {\hat{b}}_{i g} - exp ({\hat{s}}_{h}) {\hat{b}}_{i h} + μ_{g} - μ_{h}) .

(13)

The linking method based on the optimization functions

H_{1}

in (10) and

H_{2}

in (13) forms the PHL method, which closely resembles the optimization functions used in invariance alignment (IA; [43,44]). Note that PHL will differ from HL in designs with missing items. In the remainder of the article, we consider two versions of PHL: either

ω_{i} = 1

(method PHL1) or

ω_{i} = I / G_{i}

(method PHL2), where

G_{i} = \sum_{g = 1}^{G} d_{i g}

. The latter method is motivated by the fact that if item i is observed in

G_{i}

groups, there are

G_{i} (G_{i} - 1) / 2

pairwise differences in the PHL optimization function, whereas HL involves only

G_{i}

terms for item i. To make PHL more similar to HL, the weights in PHL2 are chosen accordingly.
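The pairwise optimization functions in (10) and (13) are quadratic in the unknowns, so their minimizers solve small linear systems. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, missing items are marked by None, the labels "pw1" and "pw2" merely mirror PHL1 and PHL2, and the PHL2 weights follow the expression for ω_i given in the text. It is not the code used in the sirt package.

```python
import math

def solve(M, r):
    """Solve M x = r by Gaussian elimination with partial pivoting."""
    n = len(r)
    A = [row[:] + [r[k]] for k, row in enumerate(M)]
    for col in range(n):
        piv = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[piv] = A[piv], A[col]
        for k in range(col + 1, n):
            f = A[k][col] / A[col][col]
            for j in range(col, n + 1):
                A[k][j] -= f * A[col][j]
    x = [0.0] * n
    for k in reversed(range(n)):
        x[k] = (A[k][n] - sum(A[k][j] * x[j] for j in range(k + 1, n))) / A[k][k]
    return x

def pairwise_fit(y, omega):
    """Minimize sum_i omega[i] * sum_{g != h} (y[i][g] - y[i][h] - t[g] + t[h])**2 / 2
    over observed pairs, with t[0] = 0, by solving the normal equations."""
    n_groups = len(y[0])
    M = [[0.0] * (n_groups - 1) for _ in range(n_groups - 1)]
    r = [0.0] * (n_groups - 1)
    for i, row in enumerate(y):
        groups = [g for g in range(n_groups) if row[g] is not None]
        for g in groups:
            if g == 0:
                continue  # reference group: t[0] is fixed at zero
            for h in groups:
                if h == g:
                    continue
                M[g - 1][g - 1] += omega[i]
                r[g - 1] += omega[i] * (row[g] - row[h])
                if h > 0:
                    M[g - 1][h - 1] -= omega[i]
    return [0.0] + solve(M, r)

def pairwise_haberman_linking(a_hat, b_hat, version="pw2"):
    n_items = len(a_hat)
    n_obs = [sum(a is not None for a in row) for row in a_hat]  # G_i
    if version == "pw1":
        omega = [1.0] * n_items
    else:
        omega = [n_items / g_i for g_i in n_obs]  # weights as stated for PHL2
    log_a = [[None if a is None else math.log(a) for a in row] for row in a_hat]
    s = pairwise_fit(log_a, omega)                      # Equation (10)
    sigma = [math.exp(v) for v in s]
    scaled_b = [[None if b is None else sigma[g] * b for g, b in enumerate(row)]
                for row in b_hat]
    t = pairwise_fit(scaled_b, omega)                   # Equation (13), with t = -mu
    return [-v for v in t], sigma

# DIF-free illustration with two missing item-by-group cells (assumed values).
true_mu, true_sigma = [0.0, 0.3, 0.6], [1.0, 1.2, 0.8]
base = [(0.89, 0.72), (0.67, 1.49), (1.44, -0.88), (1.28, 0.94)]
a_hat = [[a * sg for sg in true_sigma] for a, _ in base]
b_hat = [[(b - m) / sg for m, sg in zip(true_mu, true_sigma)] for _, b in base]
a_hat[1][2] = b_hat[1][2] = None
a_hat[3][1] = b_hat[3][1] = None
mu_pw2, sigma_pw2 = pairwise_haberman_linking(a_hat, b_hat, version="pw2")
mu_pw1, sigma_pw1 = pairwise_haberman_linking(a_hat, b_hat, version="pw1")
```

Without DIF, all pairwise residuals vanish at the true distribution parameters, so both weighting schemes return the same solution here; differences between PHL1 and PHL2 can only arise when DIF or sampling error is present and items are missing.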

The original HL method is implemented in the function equateIRT::multiec() in the R (Version 4.4.1) [45] package equateIRT [46,47], as well as in the functions sirt::linking.haberman() and sirt::linking.haberman.lq() (with the argument method="joint") in the R package sirt [65]. Both PHL methods are available in the function sirt::linking.haberman.lq() (with method="pw1" or method="pw2" for PHL1 and PHL2, respectively) in the same package. SAS code for HL is also available [48].

3. Linking Error Estimation in the Pairwise Haberman Linking Approach

This section presents the derivation of standard errors, linking errors, and total errors for the PHL method introduced in Section 2. The theory for deriving the estimates was proposed in earlier work [31] and elaborated in more detail for the two-group case in [30].

This article focuses on estimating the uncertainty associated with the group means

\hat{μ}

and log-transformed group standard deviations

\hat{s}

within the vector

δ = (μ, s)

. The uncertainty in the linking parameter estimate

\hat{δ}

arises from two sources: the sampling (or selection) of individuals, which corresponds to the scheme

N \to \infty

, and the selection (or sampling) of items, as well as the modeling of variability in group comparisons, which corresponds to the scheme

I \to \infty

.

Much of the linking literature addresses the issue of uncertainty in item parameter estimates

\hat{γ}

by calculating the standard error (SE) of

\hat{δ}

for a fixed number of items [46,49,50,51,52]. In this context, variability in population-level item parameters

\hat{γ}

(i.e., for

N = \infty

) arises due to sampling variability in the estimated item parameters

{\hat{γ}}_{i}

, which results from the sampling of individuals. If DIF occurs, the estimated linking parameter

\hat{δ}

will depend on the selected set of items, even with an infinite sample size. This variability, known as linking error (LE) [22,25,27], is handled under the scheme

I \to \infty

for variance estimation.

The total error (TE) encompasses both sources of uncertainty: the standard error arising from randomness in the sampling of persons and the linking error resulting from variability in item selection [22,24,27,28,53]. However, it has been suggested that the traditional LE estimate may be influenced by sampling error [30,31]. To address this, a bias-corrected version of the LE is additionally proposed as an alternative to the traditional LE estimate.

We now present the theoretical framework for a general linking function based on M-estimation [54,55,56,57]. The vector of input parameters for the linking method is given by

{\hat{γ}}_{i} = ({\hat{a}}_{i 1}, {\hat{b}}_{i 1}, d_{i 1}, \dots, {\hat{a}}_{i G}, {\hat{b}}_{i G}, d_{i G})

As the sample size N increases (assuming that the sample sizes per group also increase),

{\hat{γ}}_{i}

stochastically converges to the true parameter vector

γ_{i} = (a_{i 1}, b_{i 1}, d_{i 1}, \dots, a_{i G}, b_{i G}, d_{i G})

. Let

\hat{γ} = ({\hat{γ}}_{1}, \dots, {\hat{γ}}_{I})

be the vector of all item parameters, with an estimated variance matrix

V_{\hat{γ}}

, typically obtained from software that estimates IRT models. The aim of a linking method is to estimate the vector of distribution parameters

δ

using the set of all estimated item parameters

{\hat{γ}}_{i}

for

i = 1, \dots, I

.

The linking parameter estimate

\hat{δ}

is obtained by solving a multivariate estimating equation with respect to

δ

H_{δ} (δ) = \sum_{i = 1}^{I} h_{δ} (δ; {\hat{γ}}_{i}) = 0,

(14)

where

H_{δ}

and

h_{δ}

denote the partial derivatives of univariate or multivariate functions H and h with respect to the parameter

δ

. In HL and PHL, the vector

H_{δ}

consists of the derivatives of the optimization functions

H_{1}

and

H_{2}

.

In the PHL method, the item-wise contributions

h_{δ} (δ; {\hat{γ}}_{i})

in (14) are given by (see Section 2)

h_{δ} (δ; {\hat{γ}}_{i}) = ω_{i} (\begin{matrix} \sum_{h \neq 2} d_{i 2} d_{i h} (exp (s_{2}) {\hat{b}}_{i 2} - exp (s_{h}) {\hat{b}}_{i h} + μ_{2} - μ_{h}) \\ ⋮ \\ \sum_{h \neq G} d_{i G} d_{i h} (exp (s_{G}) {\hat{b}}_{i G} - exp (s_{h}) {\hat{b}}_{i h} + μ_{G} - μ_{h}) \\ \sum_{h \neq 2} d_{i 2} d_{i h} (log {\hat{a}}_{i 2} - log {\hat{a}}_{i h} - s_{2} + s_{h}) \\ ⋮ \\ \sum_{h \neq G} d_{i G} d_{i h} (log {\hat{a}}_{i G} - log {\hat{a}}_{i h} - s_{G} + s_{h}) \end{matrix}) .

(15)

Note that the dummy variables

d_{i g}

in (15) indicate which terms in the estimating equations are used when an item is observed in a particular combination of groups g and h.

3.1. Standard Error

We now compute the standard error of

\hat{δ}

due to sampling of persons (see [51,58,59]). A Taylor approximation of

h_{δ}

around

(δ, γ_{i})

is performed, yielding

h_{δ} (\hat{δ}; {\hat{γ}}_{i}) ≃ h_{δ} (δ, γ_{i}) + h_{δ γ} (δ, γ_{i}) ({\hat{γ}}_{i} - γ_{i}) + h_{δ δ} (δ, γ_{i}) (\hat{δ} - δ) .

(16)

We can now use

\sum_{i = 1}^{I} h_{δ} (\hat{δ}; {\hat{γ}}_{i}) = 0 and \sum_{i = 1}^{I} h_{δ} (δ, γ_{i}) = 0, and obtain

(17)

\hat{δ} - δ ≃ - {(\sum_{i = 1}^{I} h_{δ δ} (δ, γ_{i}))}^{- 1} \sum_{i = 1}^{I} h_{δ γ} (δ, γ_{i}) ({\hat{γ}}_{i} - γ_{i}) = - A^{- 1} C (\hat{γ} - γ), where

(18)

A = \sum_{i = 1}^{I} h_{δ δ} (δ, γ_{i}) and C = (\begin{matrix} h_{δ γ} (δ; γ_{1}) & \dots & h_{δ γ} (δ; γ_{I}) \end{matrix}) .

(19)

This allows us to compute the variance matrix of

\hat{δ}

due to sampling error as

V_{SE} = A^{- 1} D A^{- ⊤} with D = C V_{\hat{γ}} C^{⊤} .

(20)

The unknown quantities in (20) can be estimated using

\hat{C} = (\begin{matrix} h_{δ γ} (\hat{δ}, {\hat{γ}}_{1}) & \dots & h_{δ γ} (\hat{δ}, {\hat{γ}}_{I}) \end{matrix}) and \hat{A} = \sum_{i = 1}^{I} h_{δ δ} (\hat{δ}, {\hat{γ}}_{i}) .

(21)

Standard errors for entries in

\hat{δ}

are obtained by taking the square root of the entries in the diagonal of the estimate of

V_{SE}

.

The entries in

h_{δ δ}

and

h_{δ γ}

for the PHL method can be computed in a straightforward manner. For example, the entries in

h_{δ δ}

are given by

h_{s_{g} s_{g}} = - \sum_{i = 1}^{I} ω_{i} d_{i g} (G_{i} - 1), h_{s_{g} s_{h}} = \sum_{i = 1}^{I} ω_{i} d_{i g} d_{i h} for g \neq h,

(22)

h_{μ_{g} μ_{g}} = \sum_{i = 1}^{I} ω_{i} d_{i g} (G_{i} - 1), h_{μ_{g} μ_{h}} = - \sum_{i = 1}^{I} ω_{i} d_{i g} d_{i h} for g \neq h,

(23)

h_{μ_{g} s_{g}} = exp (s_{g}) \sum_{i = 1}^{I} ω_{i} d_{i g} (G_{i} - 1) {\hat{b}}_{i g}, h_{μ_{g} s_{h}} = - exp (s_{h}) \sum_{i = 1}^{I} ω_{i} d_{i g} d_{i h} {\hat{b}}_{i h} for g \neq h .

(24)

Moreover, we have

h_{s_{g} μ_{h}} = 0

for all

g, h = 2, \dots, G

.

3.2. Linking Error

The linear Taylor approximation (18) can be used to derive the variance due to item selection. The variance matrix based on M-estimation is given by

V_{LE} = Var (\hat{δ}) = A^{- 1} B A^{- ⊤},

(25)

where

A

is given in (19) and

B

is obtained by

B = Var (H_{δ} (\hat{δ})) = \sum_{i = 1}^{I} Var (h_{δ} (δ; {\hat{γ}}_{i})) .

(26)

In (26), the independence assumption of items is applied. In M-estimation, the matrix

A

is referred to as the bread matrix, and

B

as the meat matrix. The unknown quantities in (25) can be estimated by

\hat{A}

(see (21)) and

\hat{B} = \sum_{i = 1}^{I} h_{δ} (\hat{δ}; {\hat{γ}}_{i}) h_{δ} {(\hat{δ}; {\hat{γ}}_{i})}^{⊤} .

(27)

Thus, an estimate of the variance matrix

V_{LE}

is given by

{\hat{V}}_{LE} = \frac{I}{I - 1} \cdot {\hat{A}}^{- 1} \hat{B} {\hat{A}}^{- ⊤} .

(28)

The factor

I / (I - 1)

in (28) is included to improve the statistical properties of the linking error estimate for a small number of items [30,60,61,62].
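For intuition, the sandwich formula in (28) can be worked out in the simplest special case. The Python sketch below assumes two groups, no missing items, unit weights, and group SDs fixed at 1, so that the only linking parameter is μ_2 and the item-wise contribution from (15) reduces to h_i = b̂_{i2} − b̂_{i1} + μ_2. The function name and the DIF effects are illustrative assumptions.

```python
import math
import statistics

def linking_error_mean_two_groups(b1, b2):
    """Linking error of mu_2 for two groups with sigma_1 = sigma_2 = 1 (scalar case of (28))."""
    n_items = len(b1)
    diff = [y - x for x, y in zip(b1, b2)]
    mu2 = -sum(diff) / n_items              # solves sum_i h_i(mu2) = 0
    h = [d + mu2 for d in diff]             # item-wise contributions at the estimate
    bread = float(n_items)                  # A: derivative of h_i w.r.t. mu2 is 1 per item
    meat = sum(v * v for v in h)            # B-hat as in (27)
    v_le = n_items / (n_items - 1) * meat / (bread * bread)
    return mu2, math.sqrt(v_le)

# Group 2 has mean 0.3; the DIF effects below are assumed values that sum to zero.
b1 = [0.72, 1.49, -0.88, 0.94, 0.95]
dif = [0.10, -0.20, 0.05, 0.15, -0.10]
b2 = [b - 0.3 + e for b, e in zip(b1, dif)]
mu2_hat, le_mu2 = linking_error_mean_two_groups(b1, b2)

# In this special case, the estimate equals the sample variance of the
# difficulty differences divided by the number of items.
diff = [y - x for x, y in zip(b1, b2)]
le_classic = math.sqrt(statistics.variance(diff) / len(diff))
```

The factor I / (I − 1) is what turns the average squared contribution into the usual unbiased sample variance in this scalar setting.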

The corresponding linking errors are computed by taking the square roots of the diagonal elements in

{\hat{V}}_{LE}

. In a practical implementation of the linking error in PHL, the partial derivatives

h_{δ γ_{i}}

must be computed. These can also be straightforwardly obtained, similar to the derivations in (22)–(24).

3.3. Bias-Corrected Linking Error

The bias-corrected estimate of the linking error variance matrix

V_{LE}

is derived as follows. The estimated meat matrix

\hat{B}

is expressed as

\hat{B} = \sum_{i = 1}^{I} h_{δ} (\hat{δ}; {\hat{γ}}_{i}) h_{δ} {(\hat{δ}; {\hat{γ}}_{i})}^{⊤} .

(29)

However, the population-level linking error should ideally be computed using the true item parameters

γ_{i}

rather than the estimated item parameters

{\hat{γ}}_{i}

that appear in (29). A linear Taylor approximation provides

h_{δ} (\hat{δ}; {\hat{γ}}_{i}) ≃ h_{δ} (\hat{δ}; γ_{i}) + h_{δ γ} (\hat{δ}, γ_{i}) ({\hat{γ}}_{i} - γ_{i}) .

(30)

Thus, the inflated variance contribution in

\hat{B}

caused by sampling error is given by

Var (\hat{B}) = Var (\sum_{i = 1}^{I} h_{δ γ} (\hat{δ}, γ_{i}) ({\hat{γ}}_{i} - γ_{i})) = \sum_{i = 1}^{I} h_{δ γ} (\hat{δ}, γ_{i}) V_{{\hat{γ}}_{i}} h_{δ γ} {(\hat{δ}, γ_{i})}^{⊤},

(31)

where the approximate independence of item parameters across items was assumed. Based on this, a bias-corrected meat matrix is computed as

{\hat{B}}_{bc} = \hat{B} - \tilde{D} with \tilde{D} = \sum_{i = 1}^{I} h_{δ γ} (\hat{δ}, {\hat{γ}}_{i}) V_{{\hat{γ}}_{i}} h_{δ γ} {(\hat{δ}, {\hat{γ}}_{i})}^{⊤} .

(32)

Finally, the bias-corrected variance matrix for the linking error is given as

{\hat{V}}_{LE, bc} = \frac{I}{I - 1} \cdot {\hat{A}}^{- 1} {\hat{B}}_{bc} {\hat{A}}^{- ⊤} .

(33)

3.4. Total Error

The total uncertainty in

\hat{δ}

(i.e., the total error) can now be quantified. The variance of the total error, defined as the sum of the variances due to sampling error and linking error, is expressed as follows (see [22,27,30]):

V_{TE} = V_{SE} + V_{LE}, estimated by {\hat{V}}_{TE} = {\hat{V}}_{SE} + {\hat{V}}_{LE} .

(34)

The bias-corrected variance matrix for the total error is given by

{\hat{V}}_{TE, bc} = {\hat{V}}_{SE} + {\hat{V}}_{LE, bc} .

(35)

If negative variances are encountered for the bias-corrected linking error or total error estimates, the corresponding error estimate is set to zero.
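For a single parameter, the combination of the variance components in (34) and (35), including the truncation at zero, can be sketched as follows. The function name and the numerical inputs are illustrative assumptions; the inputs stand for the corresponding diagonal entries of the estimated variance matrices.

```python
import math

def total_errors(v_se, v_le, v_le_bc):
    """Return (SE, LE, LE_bc, TE, TE_bc) from variance components of one parameter."""
    se = math.sqrt(v_se)
    le = math.sqrt(v_le)
    le_bc = math.sqrt(max(v_le_bc, 0.0))            # negative variance is set to zero
    te = math.sqrt(v_se + v_le)                     # Equation (34)
    te_bc = math.sqrt(max(v_se + v_le_bc, 0.0))     # Equation (35), truncated at zero
    return se, le, le_bc, te, te_bc

# Assumed variance components; the bias-corrected linking error variance is negative here.
se, le, le_bc, te, te_bc = total_errors(v_se=0.004, v_le=0.001, v_le_bc=-0.0005)
```

In this example, the bias-corrected linking error is truncated to zero, while the bias-corrected total error remains positive because the sampling variance dominates.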

3.5. Nonlinear Transformation

The error estimates for

\hat{δ}

presented in the previous subsections are computed for group means

\hat{μ}

and log-transformed group SDs

\hat{s}

. To obtain statistical inference for the SDs

\hat{σ}

, the nonlinear transformation

σ_{g} = exp (s_{g})

must be applied in the delta method. This allows the derivation of an error estimate (i.e., standard error, linking error, or total error) for

{\hat{σ}}_{g}

based on the error estimate

Err ({\hat{s}}_{g})

for

{\hat{s}}_{g}

. The delta formula provides the corresponding estimate for the SD

σ_{g}

as

Err ({\hat{σ}}_{g}) = exp ({\hat{s}}_{g}) Err ({\hat{s}}_{g}) .

(36)
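The transformation in (36) amounts to a single multiplication. A minimal Python sketch, with a hypothetical function name and assumed input values, is:

```python
import math

def sd_error_from_log_sd(s_hat, err_s):
    """Delta method for sigma = exp(s): Err(sigma_hat) = exp(s_hat) * Err(s_hat)."""
    sigma_hat = math.exp(s_hat)
    return sigma_hat, sigma_hat * err_s

# Assumed values: an estimated log SD of log(1.2) with an error estimate of 0.05.
sigma_hat, err_sigma = sd_error_from_log_sd(math.log(1.2), 0.05)
```

The same function applies to standard errors, linking errors, and total errors, because each of them enters (36) in the same way.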

The error estimates for

{\hat{μ}}_{g}

remain unaffected by the transformation of

\hat{s}

.

4. Simulation Study

4.1. Method

The 2PL model for

G = 4

groups was used as the IRF in the data-generating model. The

θ

variable was assumed to be normally distributed in the four groups with group means

μ_{1} = 0

,

μ_{2} = 0.3

,

μ_{3} = 0.6

, and

μ_{4} = - 0.3

, respectively. The SDs were chosen as

σ_{1} = 1

,

σ_{2} = 1.2

,

σ_{3} = 0.8

, and

σ_{4} = 1

.

The simulation study was conducted for

I = 20

and

I = 40

items. Group-specific parameters

a_{i g}

and

b_{i g}

for each item for groups

g = 1, \dots, 4

were based on fixed base item parameters and DIF effects that were newly simulated in each replication of the simulation. The base item discriminations

a_{i}

for the first 10 items were chosen as 0.89, 0.67, 1.44, 1.28, 1.16, 0.84, 1.76, 0.49, 0.80, and 1.19, which resulted in a mean item discrimination of

M = 1.052

and

S D = 0.385

. The base item difficulties

b_{i}

for the first 10 items were chosen as 0.72, 1.49, −0.88, 0.94, 0.95, −0.66, −1.23, 0.85, 0.53, and 0.29, yielding a mean item difficulty of

M = 0.300

and

S D = 0.909

. For the

I = 20

condition, the set of 10 base item parameters was duplicated, while for

I = 40

, it was replicated four times. The item parameters are also available at https://osf.io/bp6em (accessed on 19 November 2024). The group-specific item difficulties

b_{i g}

(

i = 1, \dots, I; g = 1, \dots, 4

) were simulated as

b_{i g} = b_{i} + e_{i g} for g = 1, \dots, 4,

(37)

where

e_{i g}

is a random uniform DIF effect. Group-specific item discriminations

a_{i g} (g = 1, \dots, 4)

were simulated as

a_{i g} = a_{i} exp (f_{i g}) for g = 1, \dots, 4,

(38)

where

f_{i g}

is a random nonuniform DIF effect. In the simulation, we assumed that the random DIF effects

e_{i g}

and

f_{i g}

were uncorrelated. Both DIF effects were normally distributed with zero means and with standard deviation

τ

for

e_{i g}

and

0.3 \times τ

for

f_{i g}

. The DIF standard deviation for the DIF effects

e_{i g}

in item difficulties was chosen as 0, 0.2, and 0.4, corresponding to no DIF, small DIF, and large DIF. The corresponding DIF SDs for the DIF effects

f_{i g}

in log-transformed item discriminations were 0, 0.06, and 0.12, respectively.
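The generation of group-specific item parameters can be sketched in Python as follows. The sketch assumes that the DIF effects are drawn independently for every item and group, which is what produces differences between groups; the function name and the use of Python's random module are illustrative assumptions, and the replication material at the OSF link remains the authoritative source.

```python
import math
import random
import statistics

BASE_A = [0.89, 0.67, 1.44, 1.28, 1.16, 0.84, 1.76, 0.49, 0.80, 1.19]
BASE_B = [0.72, 1.49, -0.88, 0.94, 0.95, -0.66, -1.23, 0.85, 0.53, 0.29]

def simulate_item_parameters(n_items, tau, n_groups=4, rng=None):
    """Draw group-specific discriminations and difficulties following (37) and (38)."""
    rng = rng or random.Random()
    reps = n_items // len(BASE_A)          # 2 for I = 20, 4 for I = 40
    base_a, base_b = BASE_A * reps, BASE_B * reps
    # Nonuniform DIF: normal effects on the log discriminations with SD 0.3 * tau.
    a = [[a_i * math.exp(rng.gauss(0.0, 0.3 * tau)) for _ in range(n_groups)]
         for a_i in base_a]
    # Uniform DIF: normal effects on the difficulties with SD tau.
    b = [[b_i + rng.gauss(0.0, tau) for _ in range(n_groups)] for b_i in base_b]
    return a, b

# Summary statistics of the base parameters reported in the text.
summary_a = (round(statistics.mean(BASE_A), 3), round(statistics.stdev(BASE_A), 3))
summary_b = (round(statistics.mean(BASE_B), 3), round(statistics.stdev(BASE_B), 3))

a_nodif, b_nodif = simulate_item_parameters(20, tau=0.0, rng=random.Random(1))
a_dif, b_dif = simulate_item_parameters(40, tau=0.4, rng=random.Random(1))
```

With τ = 0, all groups share the base item parameters, so the groups differ only through their ability distributions.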

We also varied the proportion of missing items per group to represent situations in which certain items were not administered to some groups or were deleted for technical reasons (e.g., translation errors). Missing item rates were set at 0%, 10%, and 30% per group, with missing items randomly selected in each group for each replication.

Sample sizes per group were chosen as

N = 250

, 500, 1000, and 2000 to reflect typical sample sizes in small- to large-scale testing applications of the 2PL model [63,64].

In each of the 4 (sample size N) × 3 (DIF SD

τ

) × 2 (number of items I) × 3 (proportion of missing items) = 72 cells of the simulation, 3000 replications were conducted. The 2PL model was first estimated separately for each of the four groups. Linking was performed with Haberman linking (HL) and the two pairwise Haberman linking methods, PHL1 and PHL2, based on the item parameter estimates from the 2PL model. For identification reasons, the distribution parameters in the first group were fixed at

μ_{1} = 0

and

σ_{1} = 1

.

We evaluated the bias and the root mean square error (RMSE) for the

{\hat{μ}}_{g}

and

{\hat{σ}}_{g}

estimates. Additionally, a relative RMSE was calculated with HL serving as the reference method (set to a relative RMSE of 100). For the PHL1 and PHL2 methods, standard errors (SE), linking errors (LE), bias-corrected linking errors (

{LE}_{bc}

), total errors (TE), and bias-corrected total errors (

{TE}_{bc}

) were computed. Finally, the coverage rates for the

{\hat{μ}}_{g}

and

{\hat{σ}}_{g}

estimates at the 95% confidence level were assessed, defined as the percentage of replications in which the estimated confidence intervals contained the true values of

μ_{g}

or

σ_{g}

(for

g = 2, 3, 4

).
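The evaluation criteria can be written compactly in Python. The sketch below assumes that confidence intervals are formed as the estimate plus or minus 1.96 times the respective error estimate, which is one common way to obtain a 95% interval; the function names and example values are hypothetical.

```python
import math

def bias(estimates, true_value):
    """Average estimate across replications minus the true value."""
    return sum(estimates) / len(estimates) - true_value

def rmse(estimates, true_value):
    """Root mean square error across replications."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))

def relative_rmse(rmse_method, rmse_reference):
    """RMSE relative to a reference method (HL in the text), scaled to 100."""
    return 100.0 * rmse_method / rmse_reference

def coverage_rate(estimates, errors, true_value, z=1.96):
    """Percentage of replications whose interval estimate +/- z * error covers the truth."""
    hits = sum(abs(e - true_value) <= z * err for e, err in zip(estimates, errors))
    return 100.0 * hits / len(estimates)

# Four hypothetical replications for a group mean with true value 0.3.
mu2_hat = [0.28, 0.35, 0.31, 0.45]
err_mu2 = [0.05, 0.05, 0.05, 0.05]
cov = coverage_rate(mu2_hat, err_mu2, true_value=0.3)
```

Passing standard errors, total errors, or bias-corrected total errors as the error argument yields the three kinds of coverage rates compared in the simulation.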

All analyses for this simulation study were performed using the statistical software R (Version 4.4.1; [45]). The 2PL model, along with the corresponding standard errors of item parameters, was estimated with the sirt::xxirt() function from the R package sirt (Version 4.2-89; [65]). The HL, PHL1, and PHL2 methods were estimated using the sirt::linking.haberman.lq() function from the same R package, which also provides standard errors, linking errors, and total errors for the PHL1 and PHL2 methods in the output. Replication material for this simulation study is available at https://osf.io/bp6em (accessed on 19 November 2024).

4.2. Results

To assess the similarity of the PHL1 and PHL2 methods to HL, we computed the average absolute differences for all distribution parameters (i.e.,

{\hat{μ}}_{g}

and

{\hat{σ}}_{g}

for

g = 2, 3, 4

) in each replication. Figure 1 illustrates the average absolute differences in the estimates of

\hat{μ}

and

\hat{σ}

between Haberman linking (HL) and the pairwise Haberman linking methods (PHL1 and PHL2) across selected subsets of simulation conditions. In the 0% missing items condition, the PHL methods coincided with HL by definition of the linking methods. Across all conditions, the absolute differences between PHL1 and HL, and between PHL2 and HL, were very small, indicating that both PHL1 and PHL2 produced results highly comparable to HL. The average absolute differences tended to be slightly smaller as the number of items increased from

I = 20

to

I = 40

. Additionally, the absolute differences between the PHL methods and HL increased as the percentage of missing items rose. Overall, the absolute differences for PHL2 relative to HL were generally slightly smaller than those for PHL1, suggesting that PHL2 aligned marginally better with HL, though the differences were minimal.

Figure 2 illustrates the bias in the estimates of the group mean

{\hat{μ}}_{g}

and SD

{\hat{σ}}_{g}

(

g = 2, 3, 4

) for the three linking methods HL, PHL1, and PHL2, pooling all bias values across subsets of simulation conditions. Across all methods, the bias in the

{\hat{μ}}_{g}

and

{\hat{σ}}_{g}

estimates was small. There was no significant difference between HL, PHL1, and PHL2 in terms of bias in

\hat{μ}

, indicating that all three linking methods performed comparably in estimating the group means and group SDs. A higher number of items (i.e.,

I = 40

) slightly decreased the absolute values of the bias.

Figure 3 presents the relative RMSE of

{\hat{μ}}_{g}

and

{\hat{σ}}_{g}

estimates across subsets of simulation conditions. Overall, the RMSE values for the PHL1 method were lower than those for PHL2, though both were generally close to 100, suggesting that the PHL methods did not cause a significant efficiency loss compared with the traditional HL method. A few RMSE values deviated notably from 100; these occurrences were more frequent when the number of items was small (i.e.,

I = 20

).

Below, we report only the results for the PHL2 method, as the PHL1 method showed larger mean absolute differences compared to the HL method and led to more variable linking parameter estimates.

Figure 4 illustrates the median values of the linking error estimate (LE) and the bias-corrected linking error (${LE}_{bc}$) for the estimated group mean (${\hat{μ}}_{2}$) in the second group, across selected subsets of simulation conditions for the PHL2 linking method. Overall, LE decreased with larger sample sizes, while ${LE}_{bc}$ showed the opposite trend. Notably, the bias-corrected linking error exhibited less dependency on sample size than LE. In the absence of DIF (i.e., $τ = 0$), the population-level linking error is zero. The true value of 0 was obtained for ${LE}_{bc}$ at the median, while LE was substantially larger and only decreased with larger sample sizes. In the presence of DIF (i.e., $τ > 0$), LE and ${LE}_{bc}$ converged as sample size increased.
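Why LE depends on sample size even without DIF, while a bias-corrected version depends on it much less, can be seen in a deliberately simplified setting. The sketch below does not implement the M-estimation-based formulas for PHL. It assumes a mean-type linking of two groups in which each estimated item difference is the sum of a DIF effect and an independent sampling error with a known standard error. Under this assumption, the squared linking error is the variance of the item differences divided by the number of items, and the bias correction subtracts the part of that variance expected from sampling error alone, truncating at zero. All names and numbers are hypothetical.

```python
import math

def linking_errors(diffs, ses):
    """Return (LE, bias-corrected LE) for item-level differences.

    diffs: estimated item parameter differences between two groups
    ses:   sampling standard errors of these differences (assumed known)
    """
    n_items = len(diffs)
    mean_d = sum(diffs) / n_items
    # The observed variance mixes DIF variance and sampling variance.
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n_items - 1)
    le_sq = var_d / n_items
    # Share of le_sq that is expected from sampling error alone.
    noise = sum(s ** 2 for s in ses) / n_items / n_items
    le = math.sqrt(le_sq)
    le_bc = math.sqrt(max(0.0, le_sq - noise))  # truncate at zero
    return le, le_bc

diffs = [0.10, -0.10, 0.20, -0.20]   # hypothetical item differences
small_n = linking_errors(diffs, [0.20] * 4)   # large standard errors (small sample)
large_n = linking_errors(diffs, [0.05] * 4)   # small standard errors (large sample)
print("small N:", small_n)
print("large N:", large_n)
```

In this toy setting, the correction term shrinks as the standard errors shrink, so the two quantities approach each other in large samples. When the observed spread is no larger than what sampling error alone would produce, the truncated estimate is exactly zero, which is consistent with the median of 0 reported for $τ = 0$.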

Table 1 provides an overview of the coverage rates for the estimated group mean ${\hat{μ}}_{2}$ for PHL2. Across all conditions, higher percentages of missing items generally led to slightly lower coverage rates. For 0% missing items, ${TE}_{bc}$ consistently provided coverage rates within the desired range (close to 95%), outperforming both SE and TE in terms of accuracy. At 30% missing items, ${TE}_{bc}$ still maintained coverage near the nominal level of 95%, while SE often fell below this level and TE tended to exceed it. Overall, statistical inference based on SE resulted in only 33% of the cells meeting the acceptable coverage range (93.0% to 97.0%), with a systematic undercoverage that produced an average rate of 82.6%. TE performed satisfactorily in 68% of the cells, exhibiting overcoverage with an average coverage rate of 96.4%. In contrast, ${TE}_{bc}$ achieved acceptable coverage in 96% of the cells, with an average coverage rate of 94.4%. Coverage rates based on the bias-corrected total error ${TE}_{bc}$ remained robust across all levels of the DIF SD $τ$, consistently yielding better coverage rates than TE and SE. When $I = 40$, ${TE}_{bc}$ achieved better coverage rates than when $I = 20$ under most conditions, especially for higher levels of missing items and a larger DIF SD $τ$.
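The summary figures in this paragraph can be recomputed from Table 1. The Python sketch below uses hypothetical variable and function names. It stores the 24 rows of Table 1 (SE, TE, and ${TE}_{bc}$ for 0%, 10%, and 30% missing items), treats coverage rates between 93.0 and 97.0 as acceptable, in line with the table note, and reports the share of acceptable cells and the average coverage rate per error type.

```python
# Rows of Table 1 in table order; each row holds SE, TE, TE_bc
# for 0%, 10%, and 30% missing items (nine values per row).
TABLE1 = """
95.5 97.8 95.8 95.6 98.1 96.1 95.8 98.4 96.4
94.6 97.4 94.9 95.6 98.0 95.9 95.8 98.7 96.1
95.2 97.8 95.6 94.6 97.6 94.9 95.0 98.3 95.4
94.6 97.3 94.9 94.3 97.4 94.8 93.5 98.1 95.5
94.5 96.8 94.8 94.6 96.8 94.9 95.2 97.3 95.7
94.3 96.3 94.5 94.9 96.7 95.0 94.9 97.4 95.2
95.3 96.9 95.3 95.8 97.6 96.0 94.9 97.2 95.1
94.6 96.6 94.6 94.2 96.5 94.3 95.4 97.5 95.6
91.6 97.3 93.9 91.7 97.6 94.2 92.2 98.0 94.6
89.1 96.9 93.8 89.0 96.9 94.0 87.1 97.1 93.6
82.1 96.0 93.6 82.4 96.9 94.4 78.7 95.9 92.3
73.6 95.9 94.3 72.6 95.8 94.3 65.1 94.2 91.6
92.8 96.5 94.5 92.9 97.3 94.7 92.0 97.2 94.5
90.8 95.9 93.8 90.6 96.5 94.2 89.7 96.8 94.2
86.7 96.4 94.8 85.4 95.3 93.6 82.6 96.4 94.1
80.4 96.0 94.8 78.8 95.4 94.0 75.7 95.8 94.4
83.5 96.5 94.0 83.2 96.9 94.0 81.9 95.8 92.9
75.0 95.9 94.2 72.9 95.4 93.3 71.9 95.8 93.3
61.5 94.9 93.4 59.8 95.1 94.0 57.1 94.8 93.2
48.4 94.5 93.4 45.1 93.8 93.1 41.6 93.9 93.0
85.7 95.2 93.8 86.5 96.3 95.0 84.4 96.3 94.3
80.5 95.4 93.8 79.3 95.8 94.7 74.5 95.1 93.6
70.4 95.0 94.3 68.3 95.7 94.8 64.3 94.3 93.2
57.7 94.5 93.9 54.1 94.9 94.4 50.5 94.3 93.7
"""

def summarize(column_offset):
    """Return (acceptable cells, total cells, mean coverage) for one error type."""
    values = []
    for line in TABLE1.strip().splitlines():
        cells = [float(x) for x in line.split()]
        # Pick the same error type from all three missing-item blocks.
        values.extend(cells[column_offset::3])
    acceptable = sum(1 for v in values if 93.0 <= v <= 97.0)
    return acceptable, len(values), sum(values) / len(values)

for name, offset in [("SE", 0), ("TE", 1), ("TE_bc", 2)]:
    ok, n, mean = summarize(offset)
    print(f"{name}: {100 * ok / n:.0f}% acceptable cells, mean coverage {mean:.1f}%")
```

Under these assumptions, the output should match the values reported above: 33%, 68%, and 96% of acceptable cells, with average coverage rates of 82.6%, 96.4%, and 94.4% for SE, TE, and ${TE}_{bc}$, respectively.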

Table 2 presents the coverage rates for the estimated standard deviation (${\hat{σ}}_{2}$) using the pairwise Haberman linking method (PHL2). Coverage rates based on TE tended to rise slightly as the percentage of missing items increased, moving further above the nominal level. In contrast, coverage rates for ${TE}_{bc}$ remained resilient to missing items, staying close to 95% even with 30% missing items. As $τ$ increased from 0 to 0.4, coverage rates for TE decreased, particularly in scenarios with smaller sample sizes or fewer items, but TE still showed overcoverage overall, with an average coverage rate of 97.6%. Meanwhile, ${TE}_{bc}$ maintained coverage rates near the nominal 95% level across all conditions, demonstrating its reliability even under substantial DIF effects, with an average coverage rate of 94.5%. As for ${\hat{μ}}_{2}$, coverage rates based on SE were inadequate, showing substantial undercoverage, especially in large samples and with a large DIF SD ($τ$).

5. Discussion

HL is a widely used linking method in educational assessment, particularly in complex designs involving multiple groups. When items exhibit DIF, linking errors can be computed to address the uncertainty in linking parameter estimates resulting from item selection. However, until now, there has been no straightforward solution for computing linking errors for HL. Since HL estimates joint item parameters, standard M-estimation theory for linking error estimation is not applicable. To address this, an alternative pairwise HL approach (denoted PHL) is proposed in this article, which aligns with the traditional HL method when there are no missing items across groups. When some items are missing, PHL provides linking parameter estimates close to those from the original HL approach. A key advantage of the PHL method is that it does not rely on joint item parameters, allowing for the application of M-estimation theory to compute linking errors. This article demonstrates, through theory and a simulation study, that the PHL method can yield adequate statistical inferences for persons and items. Furthermore, it is shown that a bias-corrected linking error should replace the ordinary linking error to prevent the estimate from being influenced by sample size.

As with any simulation study, our research has limitations. First, we focused solely on dichotomous items. Future studies could explore linking error estimation for the PHL method applied to polytomous items [66]. Second, it was assumed that DIF effects were independent across items. In cases where items are grouped within subsets that share a common stimulus (e.g., in reading comprehension tests with a common text stimulus; [67]), the dependence structure of DIF effects should be incorporated into the computation of linking errors. Third, linking error estimates could be extended to more complex data-generating IRT models beyond the 2PL analysis model (e.g., [68]), which may lead to misspecification of the IRT model. Finally, future research could compare the linking errors from PHL using M-estimation theory with those obtained from jackknife methods [24].

As noted by an anonymous reviewer, the assumption of independent DIF effects across items is a strong assumption. This assumption is certainly violated when items are arranged into testlets (i.e., groups of items) that refer to a common item stimulus. In an example of a reading comprehension test, several items could be administered that refer to the same stimulus (i.e., reading text). In the case of cross-country comparisons in an educational assessment that involve the translation of reading texts into several languages, it is likely that DIF is not solely an item but also a testlet (i.e., a reading text) property [69,70].

Because the independence of DIF effects across items is such a strong assumption, future work should identify situations or realistic applications in which (approximate) independence is a reasonable condition.

Ultimately, researchers must assess whether DIF is relevant to the construct being measured or not [71,72,73]. When the original HL method or the PHL method is used with a square loss function, all items are included in the linking process, meaning that group differences are based on all items. This approach aligns with the notion that no item should be excluded from group comparisons if DIF is considered construct-relevant. In contrast, when a robust loss function is applied in Haberman linking [42], items with large DIF effects within a group are effectively excluded from comparisons [74]. Such an approach is only appropriate if DIF is deemed construct-irrelevant. Future research could explore linking error estimation for robust HL methods.
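The contrast between a square loss and a robust loss can be illustrated in a deliberately reduced setting. The sketch below is not Haberman linking. It only assumes that a group difference is summarized from item-level differences, where the mean minimizes the squared loss and the median minimizes the absolute loss. The item differences are hypothetical, with one item carrying a large DIF effect.

```python
import statistics

# Hypothetical item-level differences between two groups; the last item has large DIF.
item_diffs = [0.0, 0.1, -0.1, 0.05, -0.05, 1.5]

# Square loss: the minimizer is the mean, so every item contributes,
# including the item with large DIF.
square_loss_estimate = statistics.mean(item_diffs)

# Absolute loss (one robust choice): the minimizer is the median,
# so the item with large DIF has almost no influence.
robust_estimate = statistics.median(item_diffs)

print("square loss:", square_loss_estimate)
print("robust loss:", robust_estimate)
```

In this toy example, the square-loss estimate is pulled toward the item with large DIF, whereas the robust estimate stays close to the bulk of the items. Which behavior is preferable depends on whether the DIF is regarded as construct-relevant, as discussed above.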

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This article only uses simulated datasets. Replication material for creating the simulated datasets in the simulation study (Section 4) can be found at https://osf.io/bp6em (accessed on 19 November 2024).

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

2PL	two-parameter logistic
DIF	differential item functioning
HL	Haberman linking
IA	invariance alignment
IRF	item response function
IRT	item response theory
MML	marginal maximum likelihood
LE	linking error
RMSE	root mean square error
SD	standard deviation
TE	total error
${TE}_{bc}$	bias-corrected total error

References

1. Bock, R.D.; Gibbons, R.D. Item Response Theory; Wiley: Hoboken, NJ, USA, 2021.
2. Chen, Y.; Li, X.; Liu, J.; Ying, Z. Item response theory—A statistical framework for educational and psychological measurement. Stat. Sci. 2024; epub ahead of print. Available online: https://rb.gy/1yic0e (accessed on 19 November 2024).
3. van der Linden, W.J. Unidimensional logistic response models. In Handbook of Item Response Theory, Volume 1: Models; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 11–30.
4. Xu, X.; von Davier, M. Fitting the structured general diagnostic model to NAEP data. ETS Res. Rep. Ser. 2008, 2008, i–18.
5. Casabianca, J.M.; Lewis, C. IRT item parameter recovery with marginal maximum likelihood estimation using loglinear smoothing models. J. Educ. Behav. Stat. 2015, 40, 547–578.
6. Birnbaum, A. Some latent trait models and their use in inferring an examinee’s ability. In Statistical Theories of Mental Test Scores; Lord, F.M., Novick, M.R., Eds.; Addison-Wesley: Reading, MA, USA, 1968; pp. 397–479.
7. San Martin, E. Identification of item response theory models. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 127–150.
8. Baker, F.B.; Kim, S.H. Item Response Theory: Parameter Estimation Techniques; CRC Press: Boca Raton, FL, USA, 2004.
9. Glas, C.A.W. Maximum-likelihood estimation. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 197–216.
10. Kolen, M.J.; Brennan, R.L. Test Equating, Scaling, and Linking; Springer: New York, NY, USA, 2014.
11. Holland, P.W.; Wainer, H. (Eds.) Differential Item Functioning: Theory and Practice; Lawrence Erlbaum: Hillsdale, NJ, USA, 1993.
12. Meredith, W. Measurement invariance, factor analysis and factorial invariance. Psychometrika 1993, 58, 525–543.
13. Penfield, R.D.; Camilli, G. Differential item functioning and item bias. In Handbook of Statistics, Vol. 26: Psychometrics; Rao, C.R., Sinharay, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2007; pp. 125–167.
14. Wells, C.S. Assessing Measurement Invariance for Applied Research; Cambridge University Press: Cambridge, UK, 2021.
15. Millsap, R.E. Statistical Approaches to Measurement Invariance; Routledge: New York, NY, USA, 2011.
16. González, J.; Wiberg, M. Applying Test Equating Methods: Using R; Springer: New York, NY, USA, 2017.
17. Lee, W.C.; Lee, G. IRT linking and equating. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test; Irwing, P., Booth, T., Hughes, D.J., Eds.; Wiley: New York, NY, USA, 2018; pp. 639–673.
18. Sansivieri, V.; Wiberg, M.; Matteucci, M. A review of test equating methods with a special focus on IRT-based approaches. Statistica 2017, 77, 329–352.
19. Brennan, R.L. Generalizability Theory; Springer: New York, NY, USA, 2001.
20. Michaelides, M.P. A review of the effects on IRT item parameter estimates with a focus on misbehaving common items in test equating. Front. Psychol. 2010, 1, 167.
21. Michaelides, M.P.; Haertel, E.H. Selection of common items as an unrecognized source of variability in test equating: A bootstrap approximation assuming random sampling of common items. Appl. Meas. Educ. 2014, 27, 46–57.
22. Sachse, K.A.; Roppelt, A.; Haag, N. A comparison of linking methods for estimating national trends in international comparative large-scale assessments in the presence of cross-national DIF. J. Educ. Meas. 2016, 53, 152–171.
23. Sachse, K.A.; Haag, N. Standard errors for national trends in international large-scale assessments in the case of cross-national differential item functioning. Appl. Meas. Educ. 2017, 30, 102–116.
24. Battauz, M. Multiple equating of separate IRT calibrations. Psychometrika 2017, 82, 610–636.
25. Monseur, C.; Berezner, A. The computation of equating errors in international surveys in education. J. Appl. Meas. 2007, 8, 323–335.
26. OECD. PISA 2012 Technical Report; OECD: Paris, France, 2014; Available online: https://bit.ly/2YLG24g (accessed on 5 November 2024).
27. Robitzsch, A.; Lüdtke, O. Linking errors in international large-scale assessments: Calculation of standard errors for trend estimation. Assess. Educ. 2019, 26, 444–465.
28. Robitzsch, A. Robust and nonrobust linking of two groups for the Rasch model with balanced and unbalanced random DIF: A comparative simulation study and the simultaneous assessment of standard errors and linking errors with resampling techniques. Symmetry 2021, 13, 2198.
29. Wu, M. Measurement, sampling, and equating errors in large-scale assessments. Educ. Meas. 2010, 29, 15–27.
30. Robitzsch, A. Linking error in the 2PL model. J 2023, 6, 58–84.
31. Robitzsch, A. Estimation of standard error, linking error, and total error for robust and nonrobust linking methods in the two-parameter logistic model. Stats 2024, 7, 592–612.
32. Robitzsch, A. Analytical approximation of the jackknife linking error in item response models utilizing a Taylor expansion of the log-likelihood function. AppliedMath 2023, 3, 49–59.
33. Robitzsch, A. Bias and linking error in fixed item parameter calibration. AppliedMath 2024, 4, 1181–1191.
34. Haberman, S.J. Linking parameter estimates derived from an item response model through separate calibrations. ETS Res. Rep. Ser. 2009, 2009, i–9.
35. Höft, L.; Bernholt, S. Domain-specific and activity-related interests of secondary school students. Longitudinal trajectories and their relations to achievement. Learn. Individ. Differ. 2021, 92, 102089.
36. Moehring, A.; Schroeders, U.; Wilhelm, O. Knowledge is power for medical assistants: Crystallized and fluid intelligence as predictors of vocational knowledge. Front. Psychol. 2018, 9, 28.
37. Neuenschwander, M.P.; Mayland, C.; Niederbacher, E.; Garrote, A. Modifying biased teacher expectations in mathematics and German: A teacher intervention study. Learn. Individ. Differ. 2021, 87, 101995.
38. Olaru, G.; Robitzsch, A.; Hildebrandt, A.; Schroeders, U. Examining moderators of vocabulary acquisition from kindergarten through elementary school using local structural equation modeling. Learn. Individ. Differ. 2022, 95, 102136.
39. Trendtel, M.; Pham, H.G.; Yanagida, T. Skalierung und Linking [Scaling and linking]. In Large-Scale Assessment mit R: Methodische Grundlagen der österreichischen Bildungsstandards-Überprüfung; Breit, S., Schreiner, C., Eds.; Facultas: Wien, Austria, 2016; pp. 185–224. Available online: https://tinyurl.com/4y9vxysh (accessed on 19 November 2024).
40. Haberman, S.J. Models with nuisance and incidental parameters. In Handbook of Item Response Theory, Volume 2: Statistical Tools; van der Linden, W.J., Ed.; CRC Press: Boca Raton, FL, USA, 2016; pp. 151–170.
41. Lancaster, T. The incidental parameter problem since 1948. J. Econom. 2000, 95, 391–413.
42. Robitzsch, A. L_p loss functions in invariance alignment and Haberman linking with few or many groups. Stats 2020, 3, 246–283.
43. Asparouhov, T.; Muthén, B. Multiple-group factor analysis alignment. Struct. Equ. Model. 2014, 21, 495–508.
44. Muthén, B.; Asparouhov, T. IRT studies of many groups: The alignment method. Front. Psychol. 2014, 5, 978.
45. R Core Team. R: A Language and Environment for Statistical Computing; R Core Team: Vienna, Austria, 2024; Available online: https://www.R-project.org (accessed on 15 June 2024).
46. Battauz, M. equateIRT: An R package for IRT test equating. J. Stat. Softw. 2015, 68, 1–22.
47. Battauz, M. equateMultiple: Equating of Multiple Forms; R Package Version 1.0.0; The Comprehensive R Archive Network: Vienna, Austria, 2024.
48. Yao, L.; Haberman, S.J.; Xu, J. Using SAS to implement simultaneous linking in item response theory. In Proceedings of the SAS Forum 2016, Las Vegas, NV, USA, 18–21 April 2016; Technical Report. Available online: https://tinyurl.com/yzmuxpx4 (accessed on 19 November 2024).
49. Andersson, B. Asymptotic variance of linking coefficient estimators for polytomous IRT models. Appl. Psychol. Meas. 2018, 42, 192–205.
50. Jewsbury, P.A. Error variance in common population linking bridge studies. ETS Res. Rep. Ser. 2019, 2019, 1–31.
51. Ogasawara, H. Standard errors of item response theory equating/linking by response function methods. Appl. Psychol. Meas. 2001, 25, 53–67.
52. Zhang, Z. Asymptotic standard errors of generalized partial credit model true score equating using characteristic curve methods. Appl. Psychol. Meas. 2021, 45, 331–345.
53. Haberman, S.J.; Lee, Y.H.; Qian, J. Jackknifing techniques for evaluation of equating accuracy. ETS Res. Rep. Ser. 2009, 2009, i–37.
54. Boos, D.D.; Stefanski, L.A. Essential Statistical Inference; Springer: New York, NY, USA, 2013.
55. Stefanski, L.A.; Boos, D.D. The calculus of M-estimation. Am. Stat. 2002, 56, 29–38.
56. Huber, P.J. Robust estimation of a location parameter. Ann. Math. Stat. 1964, 35, 73–101.
57. Zeileis, A. Object-oriented computation of sandwich estimators. J. Stat. Softw. 2006, 16, 1–16.
58. Ogasawara, H. Item response theory true score equatings and their standard errors. J. Educ. Behav. Stat. 2001, 26, 31–50.
59. Battauz, M. IRT test equating in complex linkage plans. Psychometrika 2013, 78, 464–480.
60. Fay, M.P.; Graubard, B.I. Small-sample adjustments for Wald-type tests using sandwich estimators. Biometrics 2001, 57, 1198–1206.
61. Li, P.; Redden, D.T. Small sample performance of bias-corrected sandwich estimators for cluster-randomized trials with binary outcomes. Stat. Med. 2015, 34, 281–296.
62. Zeileis, A.; Köll, S.; Graham, N. Various versatile variances: An object-oriented implementation of clustered covariances in R. J. Stat. Softw. 2020, 95, 1–36.
63. Lietz, P.; Cresswell, J.C.; Rust, K.F.; Adams, R.J. (Eds.) Implementation of Large-Scale Education Assessments; Wiley: New York, NY, USA, 2017.
64. Yen, W.M.; Fitzpatrick, A.R. Item response theory. In Educational Measurement; Brennan, R.L., Ed.; Praeger Publishers: Westport, CT, USA, 2006; pp. 111–154.
65. Robitzsch, A. sirt: Supplementary Item Response Theory Models. 2024. R Package Version 4.2-89. Available online: https://github.com/alexanderrobitzsch/sirt (accessed on 13 November 2024).
66. Muraki, E. A generalized partial credit model: Application of an EM algorithm. Appl. Psychol. Meas. 1992, 16, 159–176.
67. Bradlow, E.T.; Wainer, H.; Wang, X. A Bayesian random effects model for testlets. Psychometrika 1999, 64, 153–168.
68. Shim, H.; Bonifay, W.; Wiedermann, W. Parsimonious item response theory modeling with the negative log-log link: The role of inflection point shift. Behav. Res. Methods 2024, 56, 4385–4402.
69. Monseur, C.; Sibberns, H.; Hastedt, D. Linking errors in trend estimation for international surveys in education. IERI Monogr. Ser. 2008, 1, 113–122.
70. Robitzsch, A.; Lüdtke, O. Comparing different trend estimation approaches in country means and standard deviations in international large-scale assessment studies. Large-Scale Assess. Educ. 2023, 11, 26.
71. Camilli, G. The case against item bias detection techniques based on internal criteria: Do item bias procedures obscure test fairness issues? In Differential Item Functioning: Theory and Practice; Holland, P.W., Wainer, H., Eds.; Erlbaum: Hillsdale, NJ, USA, 1993; pp. 397–417.
72. Shealy, R.; Stout, W. A model-based standardization approach that separates true bias/DIF from group ability differences and detects test bias/DTF as well as item bias/DIF. Psychometrika 1993, 58, 159–194.
73. Robitzsch, A.; Lüdtke, O. Some thoughts on analytical choices in the scaling model for test scores in international large-scale assessment studies. Meas. Instrum. Soc. Sci. 2022, 4, 9.
74. Robitzsch, A.; Lüdtke, O. A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychol. Test Assess. Model. 2020, 62, 233–279.

Figure 1. Simulation study: average absolute differences in estimates of $\hat{μ}$ and $\hat{σ}$ between the pairwise Haberman linking methods (PHL1 and PHL2) and Haberman linking (HL).

Figure 2. Simulation study: bias in $\hat{μ}$ and $\hat{σ}$ estimates for Haberman linking (HL) and pairwise Haberman linking methods PHL1 and PHL2 across all simulation conditions.

Figure 3. Simulation study: relative root mean square error (RMSE) in $\hat{μ}$ and $\hat{σ}$ estimates for Haberman linking (HL) and pairwise Haberman linking methods PHL1 and PHL2 across all simulation conditions. HL served as the reference method for calculating the relative RMSE.

Figure 4. Simulation study: median of linking error LE and bias-corrected linking error ${LE}_{bc}$ across all simulation conditions for estimated group mean ${\hat{μ}}_{2}$ for pairwise Haberman linking method PHL2.

Table 1. Simulation study: coverage rates of the estimated mean ${\hat{μ}}_{2}$ for the pairwise Haberman linking method PHL2 as a function of the DIF SD $τ$, number of items $I$, and sample size $N$ for different percentages of missing items.

			0% Missing Items			10% Missing Items			30% Missing Items
$τ$	$I$	$N$	SE	TE	${TE}_{bc}$	SE	TE	${TE}_{bc}$	SE	TE	${TE}_{bc}$
0	20	250	95.5	97.8	95.8	95.6	98.1	96.1	95.8	98.4	96.4
		500	94.6	97.4	94.9	95.6	98.0	95.9	95.8	98.7	96.1
		1000	95.2	97.8	95.6	94.6	97.6	94.9	95.0	98.3	95.4
		2000	94.6	97.3	94.9	94.3	97.4	94.8	93.5	98.1	95.5
	40	250	94.5	96.8	94.8	94.6	96.8	94.9	95.2	97.3	95.7
		500	94.3	96.3	94.5	94.9	96.7	95.0	94.9	97.4	95.2
		1000	95.3	96.9	95.3	95.8	97.6	96.0	94.9	97.2	95.1
		2000	94.6	96.6	94.6	94.2	96.5	94.3	95.4	97.5	95.6
0.2	20	250	91.6	97.3	93.9	91.7	97.6	94.2	92.2	98.0	94.6
		500	89.1	96.9	93.8	89.0	96.9	94.0	87.1	97.1	93.6
		1000	82.1	96.0	93.6	82.4	96.9	94.4	78.7	95.9	92.3
		2000	73.6	95.9	94.3	72.6	95.8	94.3	65.1	94.2	91.6
	40	250	92.8	96.5	94.5	92.9	97.3	94.7	92.0	97.2	94.5
		500	90.8	95.9	93.8	90.6	96.5	94.2	89.7	96.8	94.2
		1000	86.7	96.4	94.8	85.4	95.3	93.6	82.6	96.4	94.1
		2000	80.4	96.0	94.8	78.8	95.4	94.0	75.7	95.8	94.4
0.4	20	250	83.5	96.5	94.0	83.2	96.9	94.0	81.9	95.8	92.9
		500	75.0	95.9	94.2	72.9	95.4	93.3	71.9	95.8	93.3
		1000	61.5	94.9	93.4	59.8	95.1	94.0	57.1	94.8	93.2
		2000	48.4	94.5	93.4	45.1	93.8	93.1	41.6	93.9	93.0
	40	250	85.7	95.2	93.8	86.5	96.3	95.0	84.4	96.3	94.3
		500	80.5	95.4	93.8	79.3	95.8	94.7	74.5	95.1	93.6
		1000	70.4	95.0	94.3	68.3	95.7	94.8	64.3	94.3	93.2
		2000	57.7	94.5	93.9	54.1	94.9	94.4	50.5	94.3	93.7

Note. SE = standard error; TE = total error; ${TE}_{bc}$ = bias-corrected total error. Coverage rates smaller than 93.0 or larger than 97.0 are printed in bold font.

Table 2. Simulation study: coverage rates of the estimated standard deviation ${\hat{σ}}_{2}$ for the pairwise Haberman linking method PHL2 as a function of the DIF SD $τ$, number of items $I$, and sample size $N$ for different percentages of missing items.

			0% Missing Items			10% Missing Items			30% Missing Items
$τ$	$I$	$N$	SE	TE	${TE}_{bc}$	SE	TE	${TE}_{bc}$	SE	TE	${TE}_{bc}$
0	20	250	94.8	98.2	95.4	94.8	98.6	95.5	94.6	98.4	95.7
		500	95.1	98.4	95.7	94.9	98.4	95.3	94.9	99.0	95.5
		1000	95.2	98.3	95.6	94.2	98.4	94.8	94.5	98.7	95.0
		2000	94.7	98.1	95.3	95.4	98.9	96.1	94.3	98.9	95.8
	40	250	94.4	97.6	94.9	94.7	97.7	95.1	94.3	97.9	95.1
		500	94.3	97.3	94.5	95.4	97.7	95.6	94.2	98.0	94.6
		1000	95.1	97.3	95.2	94.4	97.4	94.6	95.1	98.4	95.4
		2000	95.0	97.3	95.0	94.8	97.8	95.0	95.4	98.5	95.8
0.2	20	250	94.4	98.4	95.3	93.9	98.2	95.1	94.2	98.3	95.4
		500	94.1	98.1	95.2	93.9	98.7	94.9	93.6	98.8	95.2
		1000	92.8	98.3	94.9	91.9	98.2	94.3	91.4	98.2	93.6
		2000	88.9	97.4	93.8	88.4	97.8	93.6	86.9	97.5	93.2
	40	250	93.5	96.8	94.3	93.7	97.2	94.7	93.5	97.5	94.6
		500	93.3	97.1	94.1	93.1	97.3	94.1	93.3	97.7	94.0
		1000	93.0	96.8	94.7	91.9	97.0	93.5	91.9	97.8	94.0
		2000	91.1	96.9	94.3	90.0	96.7	94.2	90.0	97.8	94.6
0.4	20	250	92.1	97.7	94.7	90.5	97.5	93.9	91.3	98.1	94.7
		500	88.8	97.8	93.7	89.5	97.3	93.5	89.2	98.1	93.5
		1000	83.5	97.5	93.5	83.7	96.9	93.5	83.0	96.7	92.2
		2000	74.6	96.2	93.3	74.3	96.1	93.3	71.9	95.4	91.7
	40	250	90.5	96.6	93.4	91.0	96.2	93.7	91.4	97.3	94.2
		500	92.1	97.4	95.1	91.0	97.1	94.6	89.1	97.0	93.4
		1000	87.0	96.4	94.2	86.5	97.2	94.8	85.0	96.8	93.8
		2000	79.8	96.6	94.3	78.9	96.0	94.2	76.0	96.0	93.6

Note. SE = standard error; TE = total error; ${TE}_{bc}$ = bias-corrected total error. Coverage rates smaller than 93.0 or larger than 97.0 are printed in bold font.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
