2.3. Analysis
Baseline estimation:
Subtest raw score data for the three samples were used for all analyses as any invariance found in the raw scores will also apply to any transformed scores, such as scaled or index scores (
Widaman and Reise 1997). The use of raw score data instead of scaled score data may result in higher factor correlations due to the extended range of scores (
Bowden et al. 2007). Data were cleaned, and 22 cases from the French sample, seven cases from the Spanish sample, and three cases from the US sample were removed because of missing data on one or more subtests. Confirmatory factor analysis (CFA) using Mplus 8.5 (
Muthén and Muthén 2020) was first undertaken to establish the best-fitting model in the French, Spanish, and US normative samples independently, to serve as a baseline model for further tests of measurement invariance. CFA was used in preference to exploratory methods as CFA has been shown to provide less biased estimates of the factor correlations (
Little et al. 1999). Maximum likelihood estimation was used as it is robust to minor departures from normality (
Brown 2015). All CFA models were identified by fixing the loading of the first subtest (indicator) on each factor to unity (the Mplus default), making that subtest the marker indicator.
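In Mplus syntax, each factor is specified with a BY statement; the fragment below (MODEL command only, using hypothetical subtest variable abbreviations) illustrates this default identification choice:

   verbal BY sim voc inf com;      ! loading of sim is fixed to 1 by default (marker indicator)
   ! equivalent alternative: free all loadings and fix the factor variance instead
   ! verbal BY sim* voc inf com;   verbal@1;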
A one-factor model, where all 15 subtests load onto a single factor (analogous to Spearman’s
g), was investigated first in each sample to serve as a simple baseline comparison for further, more complex models (
Agelink van Rentergem et al. 2020;
Kline 2016). A previously published four-factor model based on the WISC-IV was examined next for each sample, and its fit was compared to that of the simple one-factor model. The four-factor model comprised (i) a verbal comprehension (Gc) factor defined by similarities, vocabulary, information, and comprehension; (ii) a perceptual organization (Gv and Gf composite) factor defined by block design, visual puzzles, matrix reasoning, and figure weights; (iii) a working memory (Gwm) factor defined by arithmetic, digit span, picture span, and letter–number sequencing; and (iv) a processing speed (Gs) factor defined by coding, symbol search, and cancellation (
Schneider and McGrew 2018;
Sudarshan et al. 2016;
Weiss et al. 2013).
Next, the five-factor scoring model of the WISC-V, as published in the respective manuals for France and Spain, was investigated in the French, Spanish, and US samples; it comprises verbal comprehension (VC or Gc), visual spatial (VS or Gv), fluid reasoning (FR or Gf), working memory (WM or Gwm), and processing speed (PS or Gs) factors (see
Table 1 for the correspondence between CHC broad abilities and WISC-V factors).
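For illustration, the five-factor baseline model can be specified in Mplus in the following general form; the data file name and subtest variable abbreviations are hypothetical placeholders, and the placement of arithmetic on the working memory factor mirrors the four-factor model above and is shown for illustration only:

   TITLE:    WISC-V five-factor baseline model (illustrative);
   DATA:     FILE = wisc5_raw.dat;               ! hypothetical raw subtest score file for one sample
   VARIABLE: NAMES = sim voc inf com bd vp mr fw ari ds picsp lns cd ss ca;
   ANALYSIS: ESTIMATOR = ML;                     ! maximum likelihood estimation
   MODEL:
     verbal  BY sim voc inf com;                 ! verbal comprehension (Gc)
     visspat BY bd vp;                           ! visual spatial (Gv)
     fluid   BY mr fw;                           ! fluid reasoning (Gf)
     workmem BY ari ds picsp lns;                ! working memory (Gwm)
     speed   BY cd ss ca;                        ! processing speed (Gs)
   OUTPUT:   STANDARDIZED;                       ! standardized loadings and factor correlations

The one-factor model is obtained by loading all 15 subtests on a single factor, and the four-factor model by merging the visual spatial and fluid reasoning factors into a single perceptual organization factor.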
To determine the best-fitting baseline model for further tests of measurement invariance, the chi-square test was reported; however, because this test has been shown to be overly sensitive in large samples, the emphasis was placed on alternative fit indices, namely, the root mean square error of approximation (RMSEA), comparative fit index (CFI), Tucker–Lewis index (TLI), standardized root mean square residual (SRMR), gamma hat, Akaike information criterion (AIC), and Bayesian information criterion (BIC), in line with current recommendations (
Brown 2015;
Cheung and Rensvold 2002;
Marsh et al. 2004;
Meade et al. 2008). Like the chi-square, the AIC, BIC, SRMR, and gamma hat (a modified version of the goodness-of-fit index) are absolute fit indices that do not use an alternative model as a basis for comparison (
Hu and Bentler 1999). Good fit of the baseline model was supported by an SRMR value below 0.080, an RMSEA below 0.060, and CFI, TLI, and gamma hat values greater than 0.950 (
Brown 2015;
Hu and Bentler 1999). With regard to the AIC and BIC, the model with the lowest value was considered to fit the data better than the competing models (
Brown 2015). The chi-square difference test was reported to assess whether the more complex factor solution fit significantly better than the less complex nested model (
Gorsuch 2003).
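For nested models, the chi-square difference statistic is simply the difference between the chi-square values of the two models, evaluated against a chi-square distribution with degrees of freedom equal to the difference in the models' degrees of freedom. Gamma hat is not part of the standard Mplus output and is commonly computed from the model chi-square as p / [p + 2(χ² − df)/N], where p is the number of observed variables, df the model degrees of freedom, and N the sample size.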
When determining the baseline model for further tests of measurement invariance, second-order models were not reported due to the statistical limitations of second-order model identification. A model must be identified for the unknown parameters in the model to be estimated. Underidentification, where there are more freely estimated parameters than unique elements in the observed covariance matrix, can lead to improper solutions such as standardized values over 1.0 or negative error variances, sometimes called Heywood cases (
Bentler and Chou 1988;
Chen et al. 2001). Negative error variances are impossible values in the population and are a symptom of structural misspecification (
Kolenikov and Bollen 2012). While model identification is a common problem in factor analysis, no general sufficient conditions for model identification are known (
Bollen and Davis 2009). However, a distinction exists between algebraic underidentification and empirical underidentification. Empirical underidentification occurs when the system of equations is, in principle, algebraically identified (i.e., there are positive degrees of freedom) but, in practice, no solution exists for one or more parameters because the covariance information is insufficient (
Kenny 1979;
Kenny and Milan 2012;
Rindskopf 1984). Empirical underidentification in factor analysis can occur if a factor loading approaches zero, if the correlation between two factors is high (e.g., above 0.9), or if a factor with only two indicators is specified within a larger model, such as a higher order model (
Bentler and Chou 1988;
Kenny 1979;
Rindskopf 1984). While empirical underidentification can be difficult to identify (
Bentler and Chou 1988), it is suggested that when the analytic software “declares a model unidentified that is algebraically identified, the most likely cause is empirical under-identification” (
Rindskopf 1984, p. 117). Further, the software may produce statistically impossible population estimates such as negative variances or correlations greater than one (
Kenny and Milan 2012). Errors in software output, such as negative variances, should not be ignored; they reflect an improper solution and are usually a consequence of poor model identification (
Newsom et al. 2023). When negative estimates of error variances occur, researchers are encouraged to screen for empirical underidentification (
Chen et al. 2001;
Kenny 1979;
Rindskopf 1984).
In the current study, an inspection of the higher order model outputs in all three samples revealed a negative residual variance for the fluid reasoning factor and a correlation greater than one between the second-order ‘g’ factor and the fluid reasoning factor. Further, the output produced a warning describing the covariance matrix as not positive definite and the fluid reasoning factor as ‘undefined’. Inspection of the factor correlations found several correlations above 0.9, approaching 1, suggesting multicollinearity and, in turn, unstable parameter estimates when a higher order model is estimated. As such, we concluded that the higher order model was empirically underidentified, and it was not investigated further in this study.
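A second-order specification of this kind takes the following general form in Mplus (an illustrative sketch showing the MODEL command only, with the DATA and VARIABLE commands and hypothetical variable abbreviations as in the earlier sketch):

   MODEL:
     verbal  BY sim voc inf com;
     visspat BY bd vp;
     fluid   BY mr fw;
     workmem BY ari ds picsp lns;
     speed   BY cd ss ca;
     g BY verbal visspat fluid workmem speed;    ! second-order general factor;
                                                 ! first-order residual (disturbance) variances are estimated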
However, statistical underidentification does not invalidate any theoretical higher order ‘g’ models or use of the FSIQ (
Bowden et al. 2008a;
Rindskopf and Rose 1988;
Wilson et al. 2023c); instead, it only illustrates that the data conditions do not provide an optimal opportunity to estimate and evaluate higher order models. Importantly, if measurement invariance is established in any first-order model, then measurement invariance is implied to hold for any second-order factor or summary score that is based on the same first-order factor pattern (
Widaman and Reise 1997).
Further, bifactor models, where all subtests additionally load onto a general ‘g’ factor that is uncorrelated (orthogonal) with the group factors, were not explored, as bifactor models in these data also have the issue of empirical underidentification, leading to statistical estimation problems. Such estimation problems typically require the arbitrary fixing of parameter values to obtain admissible solutions (
Canivez et al. 2017,
2020,
2021;
Decker 2020;
Fenollar-Cortés and Watkins 2019;
Markon 2019;
Wilson et al. 2023c). Ideally, the identification of higher order models requires at least three indicators per first-order factor. However, additional indicators would necessitate the development of additional subtests loading onto the relevant factors, and testing time constraints will often make such higher order model specification impractical (
Bowden et al. 2008b;
Rindskopf and Rose 1988). Further, both higher order and bifactor models have been shown to have the same pattern of relations between subtests and factors as the first-order models in the WISC-V US standardization sample (
Reynolds and Keith 2017).
Despite these statistical limitations, researchers have reported second-order and bifactor models across some of the different versions of the WISC-V. However, in one example, the researchers failed to describe how five-factor bifactor models with only two indicators loading onto a first-order factor were able to achieve convergence, making replication difficult (
Lecerf and Canivez 2018). Alternatively, researchers have reported arbitrarily constraining to equality the parameter estimates of factors with only two indicators in order to achieve identification of five-factor bifactor models (
Canivez et al. 2017,
2020,
2021;
Fenollar-Cortés and Watkins 2019). For example, when investigating the construct validity of the WISC-V US, the researchers imposed equality constraints to achieve convergence with five-factor
bifactor models where “Some first-order factors were underidentified because they were measured by only two subtests. In those CFA, the two subtests were constrained to equality before estimating bifactor models to ensure identification” (
Canivez et al. 2017, p. 461). Paradoxically, on the next page in the same article, an equality constraint to facilitate the convergence of five-factor
higher order models was described as follows: “this ‘only masks the underlying problem’ (Hair, Anderson, Tatham, and Black, 1998, p. 610) indicating that these models ‘should not be trusted’ (
Kline 2016, p. 237). Accordingly, neither fit indices nor loadings for these models are reported” (
Canivez et al. 2017, p. 462). In other words, the very authors reporting bifactor models of WISC-V data acknowledge that the estimation problems produce models that should not be trusted as good solutions. In addition, the above studies all failed to test whether simpler, identified, first-order models without post hoc ‘fixes’ provided good fit to the data.
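For concreteness, the equality constraint described in these studies corresponds to an Mplus fragment of the following general form for a two-indicator first-order factor (an illustrative sketch, not a reproduction of those authors’ syntax; variable abbreviations as above):

   MODEL:
     ! ... general factor and remaining group factors omitted for brevity ...
     fluid BY mr* fw (1);   ! both loadings freely estimated but constrained to be equal
     fluid@1;               ! group-factor variance fixed at 1 to identify the factor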
Further, exploratory factor analysis (EFA) was not undertaken in the current research. First, CFA provides many advantages over EFA, such as the ability to undertake significance testing between competing models (
Brown 2015;
Gorsuch 2003). Additionally, CFA uses previous research and theory to apply theory-based solutions (
Gorsuch 2003). Further, CFA offers more flexibility over EFA and facilitates the investigation of a much greater variety of models (
Widaman 2012). Other advantages of CFA over EFA include the ability to test more parsimonious solutions and, importantly for the current study, the ability to evaluate the equivalence of measurement models across groups (
Brown 2015). Lastly, with respect to the replication crisis in psychological research, CFA allows for the direct comparison of different models, whereas EFA does not.
Factorial invariance analysis:
Once the first-order baseline model was established across the French, Spanish, and US samples, a multigroup CFA was then used to test for factorial invariance. First, the French and US samples were compared, followed by the Spanish and US samples, and, lastly, the French and Spanish samples. We used the increasingly restrictive hierarchical approach to factorial invariance, starting with a model that was unconstrained except for holding the pattern of factor loadings identical across samples, as a test of configural invariance (
Bontempo and Hofer 2007;
Meredith and Teresi 2006;
Widaman and Reise 1997). If configural invariance was established, we added the constraint of equal factor loadings as a test of weak invariance. If weak invariance was found, we added the constraint of equal intercepts as a test of strong invariance. Lastly, if strong invariance was concluded, we tested for strict invariance by additionally holding the indicator residual variances equal across samples. Configural invariance was supported by fit indices showing CFI, TLI, and gamma hat values greater than 0.950 and an SRMR of less than 0.080 (
Hu and Bentler 1999;
Marsh et al. 2004). Weak, strong, and strict invariance would be supported by changes in CFI or TLI of no more than 0.010, a change in RMSEA of no more than 0.015, or a change in SRMR of no more than 0.030 (
Chen 2007;
Cheung and Rensvold 2002;
French and Finch 2006). However, poorer measurement quality (for example, lower factor loadings) has been shown to lead to worse data-model fit; as such, measurement quality was also considered when testing invariance across groups (
Kang et al. 2016). Further, strict factorial invariance, which assumes equivalent residual variances across groups, may be overly restrictive and is unnecessary for construct generalization (
Horn and McArdle 1992). Next, structural invariance was investigated across the three pair-wise comparisons. Additional constraints were placed on the strict invariance model: first, equality of factor variances; second, equality of factor variances and factor covariances; and, last, equality of latent means (
Widaman and Reise 1997). Loss of fit relative to the strict invariance model was evaluated using the same criteria as for the assessment of factorial invariance.
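For illustration, the multigroup input for one pair-wise comparison takes the following general form in Mplus; the data file name, variable abbreviations, and group codes are hypothetical placeholders, and the convenience option requests the configural, weak (metric), and strong (scalar) invariance models in a single run:

   TITLE:    Multigroup invariance, France vs. US (illustrative);
   DATA:     FILE = wisc5_fr_us.dat;              ! hypothetical pooled raw-score file
   VARIABLE: NAMES = country sim voc inf com bd vp mr fw ari ds picsp lns cd ss ca;
             USEVARIABLES = sim-ca;               ! the 15 subtests
             GROUPING = country (1 = france  2 = us);
   ANALYSIS: ESTIMATOR = ML;
             MODEL = CONFIGURAL METRIC SCALAR;    ! configural, weak, and strong invariance models
   MODEL:
     verbal  BY sim voc inf com;
     visspat BY bd vp;
     fluid   BY mr fw;
     workmem BY ari ds picsp lns;
     speed   BY cd ss ca;
   OUTPUT:   STANDARDIZED;
   ! Strict invariance (equal residual variances) and the subsequent structural models (equal factor
   ! variances, covariances, and latent means) require additional equality constraints specified in
   ! separate runs.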