Evaluating Model Fit in Two-Level Mokken Scale Analysis
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Review of the Manuscript “Estimating cluster-level fit statistics in two-level Mokken scale analysis”
The present manuscript deals with multilevel Mokken scale analysis and specifically with the question of how model fit can be evaluated by checking its assumptions. To this end, the authors applied the usual checks from single-level Mokken scale analysis to the higher level in two-level Mokken scale analysis---a potentially useful screening tool for evaluating model fit at Level 2. Indeed, the authors found that violations of monotonicity could generally be detected, although the significance tests were inaccurate at Level 2 (i.e., the Type II error rate was too large). They also found that violations of invariant item ordering could be accurately identified.
The manuscript is interesting and fits the special issue, and I read it with great interest. I was especially happy to read that mokken, an R package for nonparametric IRT, exists and that this type of analysis is becoming increasingly interesting to many researchers. Overall, the manuscript reads well, the math is clear, and there is not much to comment on. However, I think some minor revisions should be made before publication.
- The simulation design is not optimal (e.g., the overall number of persons was held constant across all conditions, so that the number of clusters and the number of persons within clusters were dependent on one another). Also, it is not fully crossed. Together, this makes conclusions regarding the impact of specific factors and their interactions difficult, if not impossible, to draw. I understand that the simulation was not meant to be a rigorous investigation of the potential and limitations of the proposed approach for assessing model fit. However, on the basis of this rather weak design, clear implications were formulated, such as that the approach “can safely be used” to detect violations of invariant item ordering. Given the design, I suggest that such conclusions and implications should either be further qualified or weakened.
- In two-level CFA, latent aggregation is often performed by default (e.g., in Mplus; see Lüdtke et al., 2008; see also Zitzmann, 2023, in this special issue), meaning that the variable at Level 2 is conceptualized as a latent variable (i.e., corrected for the fact that only a limited number of persons per cluster is sampled). I wonder whether this approach is used in two-level Mokken scale analysis as well. If this is the case, the sum score at Level 2 should carry error (i.e., sampling error), especially when the number of persons within clusters is small (e.g., n = 5). Besides the small sample size at Level 2, couldn't a large sampling error explain the suboptimal results for the significance tests at Level 2? (A small sketch illustrating this reliability argument follows after these points.)
- The authors investigated two-level Mokken scale analysis. Do they believe their proposed approach would also be applicable to multilevel Mokken scale analysis with more than two levels? Would their findings generalize to such scenarios as well?
- There are some typos in the manuscript (e.g., “we advise is…”). Please correct!
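To make the sampling-error point above concrete, here is a minimal R sketch of the reliability of an observed cluster mean in the sense of Lüdtke et al. (2008); the variance values are purely illustrative and are not taken from the manuscript.

```r
# Reliability of an observed cluster mean as a proxy for the latent
# cluster-level score (Ludtke et al., 2008): lambda = tau2 / (tau2 + sigma2 / n).
# tau2 and sigma2 are assumed values for illustration only.
tau2   <- 0.10                 # between-cluster variance (assumed)
sigma2 <- 0.90                 # within-cluster variance (assumed)
n      <- c(5, 10, 25, 50)     # cluster sizes considered for illustration

lambda <- tau2 / (tau2 + sigma2 / n)
round(data.frame(cluster.size = n, reliability = lambda), 2)
# With only 5 persons per cluster the observed cluster mean is a rather noisy
# estimate of the latent cluster score, which may contribute to the
# suboptimal performance of the Level-2 significance tests.
```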
Author Response
Please, see the attachment
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Dear Editor, and Authors,
Thank you for giving me the opportunity to review the article "Estimating Cluster-Level Fit Statistics in Two-Level Mokken Scale Analysis" for possible publication in the special issue “Computational Aspects and Software in Psychometrics II” in the MDPI Journal Psych.
I reviewed the paper carefully and to the best of my knowledge.
Overview
The paper under review explores the implementation of model-fit procedures for two-level Mokken scale analysis. It investigates whether copying procedures from single-level analysis is effective for two important assumptions in nonparametric item response theory, namely monotonicity and invariant item ordering. In a data simulation, violations of monotonicity were found at Level 2, but the test lacked power, whereas violations of invariant item ordering were detected accurately, so that this check can safely be used at Level 2.
Review
In general, I welcome the further development of nonparametric scaling techniques such as Mokken scale analysis (MSA). However, my impression of the article as a whole is ambivalent. On the one hand, the intended breadth of the presentation of MSA is quite impressive; in particular, the principles of model testing are presented quite extensively. On the other hand, I am still missing a few things that, from my perspective, would be essential for better comprehensibility.
I do not see myself as an absolute expert on Mokken scale analysis, but a direct comparison with the presentations of MSA in other contributions (some of which are also cited) leaves me with the impression that the clarity and comprehensibility of some of the explanations in the manuscript, in its present form, could still be improved.
In particular, I miss three basic things in the paper in its current form.
1) From my point of view, it is not yet shown convincingly -- perhaps by means of a (still missing) empirical example -- why and under which (data) conditions a separate test of model fit at Level 2 is absolutely necessary. Specifically, it might be helpful to give an example of conditions under which the model fits at Level 1 while it is (possibly) rejected at Level 2. I would have liked to read more about this.
2) Basically, it seems clear that this article concerns a possible (or already implemented?) extension of the functionality of the R package mokken. To this end, explicit references to individual functions of the mokken package are made in several places. It is therefore all the more irritating for the reader that, for details on certain of these functions, the manuscript refers to the manuals/documentation of a different software package (see, e.g., my detailed remarks Nos. 8 and 10).
3) In connection with my point 2), it would be nice if the accompanying R code, which is unfortunately only promised in lines 461-462, were actually available. I, at least, have not been able to access it, especially since the data examples in the theoretical descriptions (e.g., pages 8 and 9) apparently refer to example data available within the R package mokken (the Adjective Checklist data, 'acl').
In my opinion, a closer link between the theoretical explanations in the second section and the empirical part (the data simulation), supported by available R code, could contribute considerably to the clarity and comprehensibility of the presentation; a minimal sketch of what such code might look like follows below.
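As a rough indication of what such accompanying code could look like, a minimal single-level sketch is given below. It assumes the acl example data and the single-level check functions documented in the mokken package; it is not the authors' two-level procedure, whose exact interface is precisely what remains to be documented.

```r
# Minimal single-level sketch using the mokken package and its Adjective
# Checklist example data (acl). The two-level counterparts discussed in the
# manuscript would ideally be demonstrated analogously by the authors.
library(mokken)
data(acl)
Communality <- acl[, 1:10]                 # one ACL scale, as in the package examples

summary(check.monotonicity(Communality))   # manifest monotonicity checks
summary(check.iio(Communality))            # invariant item ordering checks
```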
Thus, my general impression is that the text of the present manuscript reflects some time pressure at submission. For example, there are some unclearly formulated sentences and definitions, and some sources/citations are missing or do not seem quite appropriate for the statements made.
I would suggest taking your time to review the new version of your manuscript carefully.
In order to provide a quick and targeted opportunity for improvement (at least for some examples), I would like to go directly into single points for possible improvement that I noticed during my study of the manuscript.
More detailed feedback listing all the places in need of editing is beyond the scope of this peer-review given the available time, but I hope these few comments will be of some use.
Single points for possible improvement
1) Page 1, line 17:
I was wondering whether the phrase “proposed by Mokken (1970)” might need a respective citation.
2) Page 2, lines 86-87:
Consider revising the sentence to provide more clarity:
“It is an attractive because it fits recent results on model fit in two-level MSA [28], and it would leave the structure of the R package mokken unchanged.”
3) Page 3, lines 125 to 126:
Consider reformulating the sentence:
" Monotonicity means that respondent p has a higher Θ value than respondent r, then respondent r has an equal or higher probability to have at least a score x on item i."
Furthermore, I think that the statement intended by this sentence contains a fundamental error, or that the presumably intended core statement is formulated too imprecisely. According to Sijtsma and Molenaar (2002, p. 120), the MHM requirement for polytomous items does not refer to the comparison of different persons but to the relationship between the ISRFs within an item. Specifically, it is required that the ISRFs of a single item may not overlap -- but the ISRFs of different items may overlap. Furthermore, I have the impression that the term monotonicity is used here in the wrong context.
Sijtsma, K., & Molenaar, I. W. (2002). Introduction to nonparametric item response theory. Sage Publications.
4) Page 3, Line 127:
“All popular IRT models imply monotonicity”
Well, perhaps this general statement could be formulated a bit more precisely, because if one considers the entire range of IRT models, which also includes unfolding models, for example, the statement is not strictly correct. I would therefore suggest making the statement more specific, along the lines of:
All models for dominance relations between items and persons imply...
5) Page 3, Line 130:
“… Without loss generality …”
→ Is there something missing?
6) Page 5-6, lines 181-191:
Well, the statement about the variance of population estimators based on sample data is not really new -- at least with respect to other (standard) parametric test procedures. I think the core point, still somewhat hidden behind this section (and the following text up to about line 191), is that for an intrinsically nonparametric scaling procedure such as Mokken analysis, a fit test with its own strict assumptions has to be applied, following the logic of a parametric significance test. I could imagine treating this circumstance in a bit more detail …
7) Page 7, lines 229 … onward
Typo: “rest score” vs. “rest-score”. I think you should choose one spelling -- by the way, Junker and Sijtsma (2000) chose “rest score” ...
Furthermore, I think the concept of the rest score could be introduced somewhat more clearly than the authors have done so far in this manuscript. For example, in Junker and Sijtsma (2000), which the authors already cite, the concept is described very clearly (see also the small sketch after this point).
Junker, B. W., & Sijtsma, K. (2000). Latent and Manifest Monotonicity in Item Response Models. Applied Psychological Measurement, 24(1), 65–81. https://doi.org/10.1177/01466216000241004
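For illustration only, a minimal base-R sketch of the rest-score concept with made-up item scores (this is not the authors' code):

```r
# Rest score: the total score minus the score on the item under study
# (Junker & Sijtsma, 2000). X is a small invented 0/1 item-score matrix.
set.seed(1)
X <- matrix(rbinom(5 * 4, size = 1, prob = 0.6), nrow = 5,
            dimnames = list(NULL, paste0("item", 1:4)))

rest_score <- function(X, i) rowSums(X[, -i, drop = FALSE])
cbind(X, rest.wrt.item2 = rest_score(X, i = 2))   # rest score with item 2 removed
```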
8) Page 7, line 241:
“In the software”
→ I assume this refers to the mokken package? Or does it refer to the software MSP5 for Windows, cited a few lines above in the manuscript? If the latter is true, this should be made clearer.
9) Page 7, lines 251-253:
“For the rationale for using minvi and not correcting for multiple testing, we refer to Ligtvoet et al. [42].”
→ I think it would be useful for readers who are not already deeply involved in the topic of nonparametric scaling models for dominance response processes, and the corresponding literature in its entirety, to know (at least in the form of a short paraphrase) what the core argument of Ligtvoet et al. (2010) is.
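In addition, a brief sketch of where minvi enters the workflow might help such readers; the example below uses the single-level check.monotonicity function of the mokken package and its acl example data, and it does not reproduce the authors' two-level analysis.

```r
# minvi sets the minimum size a sample violation must have before it is
# counted and tested at all; smaller deviations are treated as negligible.
library(mokken)
data(acl)
Communality <- acl[, 1:10]

vm.lenient <- check.monotonicity(Communality, minvi = .03)    # ignore tiny violations
vm.strict  <- check.monotonicity(Communality, minvi = .001)   # count nearly all of them
summary(vm.lenient)
summary(vm.strict)
```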
10) Page 7, line 254:
The sentence “Method check.monotonicity provides numerous output [see, 41, for details]” clearly seems to refer to the R package mokken, but reference number 41 refers to a different software package? …
11) Page 8, line 304:
“Note that due to the aggregation, item and sum scores on Level 2 are no longer integers. ”
→ I think that this is a significant consequence of the aggregation at Level 2 -- it would be interesting to learn more about its implications; after all, it seems relevant enough to deserve more than the unfortunately brief note.
12) Page 8, line 305
“sum scores”? Should this not be termed “mean scores” (with regard to the argument given in the previous sentence)? A toy illustration follows below.
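A toy illustration (invented data, base R only) of why the aggregated Level-2 values are means, i.e., proportions, rather than integer sum scores:

```r
# Aggregating dichotomous item scores to Level 2: the cluster-level values
# become proportions (means) rather than integer sum scores.
X <- data.frame(cluster = rep(1:2, each = 3),
                item1   = c(1, 0, 1, 0, 0, 1),
                item2   = c(1, 1, 0, 1, 0, 0))
aggregate(cbind(item1, item2) ~ cluster, data = X, FUN = mean)
# cluster 1: 0.67, 0.67; cluster 2: 0.33, 0.33 (approximately) -- no longer integers
```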
13) Page 9, lines 306; 312:
→ I wonder whether the abbreviations “MHM-1” and “DMM-1” have already been introduced elsewhere -- although it is reasonably obvious that the Level-1 models are meant here. The same holds for “TL-NIRT” in line 312, which presumably denotes two-level NIRT.
I would advise to also check the list of abbreviations at the end of the manuscript.
14) Page 9, line 319:
“… are not actually dichotomous …”
→ I assume this refers to the comment above about the consequence of aggregation (values are no longer integers but rational numbers), right? If my assumption is correct, I would not use the term dichotomous here but rather (non-)integer scoring or something similar.
15) Page 10, line 368 (beginning of section on data simulation):
As a reader, I wonder why some effort was put into presenting various aspects of Mokken analysis not only for dichotomous but also for polytomous response scales in the preceding theoretical explanations (instead of possibly restricting them to the dichotomous case), when the subsequent data simulation in the empirical part of the paper is limited to the dichotomous case only.
16) Page 13, line 459:
“Using function ICC we estimated the …”
Perhaps it would be helpful for readers to know which R package this function comes from -- specifically, is it psych::ICC or mokken::ICC? (See the brief sketch below.)
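A quick, generic way to resolve such an ambiguity is sketched below; psych::ICC certainly exists, whereas whether mokken also exports an ICC function is exactly the question the authors should answer.

```r
# Show which attached package the call ICC(...) resolves to, and make the
# intended function explicit via the :: operator in the syntax files.
library(psych)
find("ICC")          # lists every attached package that exports an object named ICC
environment(ICC)     # namespace of the ICC that is found first on the search path
# Writing psych::ICC(...) -- or the mokken equivalent, if one exists -- in the
# syntax files would remove the ambiguity for readers.
```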
17) Page 13, line 461-462:
“Syntax files are available to download from the Open Science Framework via https://osf.io/jq69u”
→ Unfortunately, this link did not work for me
These were a few examples of urgent specific issues in the manuscript.
I would advise checking the entire manuscript again from the perspective of potential readers, who are interested in following the given theoretical and practical aspects in the manuscript.
Overall I would recommend a revision before publication.
Best regards,
Reviewer
Author Response
"Please, see the attachment"
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This manuscript presents an in-depth investigation of Single-Level and Two-Level Nonparametric Item Response Theory (NIRT) models. The research is ambitious, comprehensive, and addresses an area of psychometric analysis that could use further investigation. However, there are aspects of the manuscript that would benefit from further refinement and clarification. Therefore, I recommend a substantial revision before this paper can be considered for publication.
Here are my suggestions for each section:
# Introduction
+ Please provide more context on the research gap in current literature regarding two-level MSA. Clarify what is lacking in existing research.
+ Could you expand on your methodological approach, particularly how you intend to adapt traditional MSA model-fit procedures to two-level MSA?
+ Consider structuring the introduction in a more reader-friendly manner, helping to guide less-experienced readers through the complexity of MSA and its components.
# Model
## 2.1 and 2.2 - Single-Level and Two-Level NIRT Models
+ Please supplement the mathematical notations with more reader-friendly explanations to make the content more accessible.
+ Discuss potential practical complications that could arise due to the assumptions made about respondent values (δsr).
+ Include a critical analysis of the validity and potential limitations of the models' assumptions.
## 2.3 - Model-fit Investigations in Single-Level NIRT
+ Make an effort to simplify the section and explain complex concepts in simpler terms.
+ Discuss potential solutions or guidelines for researchers to address the dependence on the researcher's discretion.
+ Provide guidance or recommendations on choosing between the liberal or conservative approach.
+ Discuss the limitations of the tests used and potential issues that could arise during their application.
## 2.4 - Model-fit Investigations in Two-Level NIRT
+ Please provide further development and explanation for the approach to evaluating local independence.
+ Discuss the potentially conservative nature of the method check.monotonicity and how it might affect the model testing.
+ Address how the aggregated nature of Level 2 scores might impact the evaluation of invariant item ordering.
# Method
+ Expand on the explanation of the data generation strategy using the adapted two-parameter logistic model.
+ Clarify why certain dependent variables were chosen for analysis.
+ Improve the hypotheses section by explicitly outlining expectations for each manipulated condition.
+ Provide more explanation and justification for the choice of statistical models and their appropriateness for the research question.
# Results
+ Include a comparison of your results with previous similar studies, highlighting the novelty or importance of your research.
+ Discuss any unexpected findings or anomalies in the data and speculate on their potential causes.
+ Discuss the practical implications of the research and how these findings could influence future practices.
+ Consider including visual aids such as graphs or charts to complement the tables and enhance the presentation of results.
# Discussion and Conclusion Sections
+ Please elaborate on the implications of your findings, particularly on the power of the Z-test and the T-test at Level 2.
+ Benchmark your results with existing research in the field of MSA and discuss where your findings align or diverge from the established literature.
+ Discuss potential solutions to overcome the identified limitations in future studies. Also, expand on the suggested future research directions by raising more questions surfaced from the current study.
+ Offer more explicit practical implications and recommendations based on your findings.
The manuscript is well-written, but the aforementioned recommendations could enhance its accessibility and depth. I look forward to reading a revised version of this promising research.
Comments on the Quality of English Language
Overall, the quality of English language used in the manuscript is commendable. The authors have managed to communicate complex concepts and techniques effectively. However, there are areas where improvements can be made to enhance readability and clarity. Here are my suggestions:
There are instances where sentences could be shortened or split for improved readability. Sometimes, the use of lengthy sentences makes it difficult to follow the argument.
The authors could also strive to use less jargon when possible, or provide concise definitions when technical terms are introduced. This would make the manuscript more accessible to a wider audience, including those less familiar with Mokken Scale Analysis and Nonparametric Item Response Theory models.
Author Response
Please, see attachment
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
First, I would like to express my appreciation for the efforts taken to improve the manuscript, and for taking my previous comments into consideration. I have also had the opportunity to read the other reviewers' comments and the authors' responses to them. In my opinion, the authors have done a commendable job in addressing the issues raised.
However, upon testing the provided script, I unfortunately encountered some difficulties reproducing the results. Specifically, I was unable to reproduce the results presented in Tables 1, 2, 3, and 4.
In addition to this, I was wondering if it might be possible for the authors to provide scripts for generating Figures 1, 2, 3, 4, and 5. This would be invaluable in allowing me to fully verify the results presented in this paper.
I look forward to seeing these improvements in the next version of the manuscript. The work done so far shows significant potential and promise, and I believe these further adjustments will greatly enhance the transparency and reproducibility of the study.
Sincerely,
Author Response
We apologize for the fact that the files on OSF did not reproduce the results in the tables, and we thank the reviewer for noting this. The problem has been fixed. On OSF (https://osf.io/jq69u/) there is now annotated R code that offers two possibilities: either (1) rerun the entire simulation study (which takes approx. 4-6 hours) and produce the tables and figures based on these results, or (2) download the results (.Rds files) from OSF, read them in, and produce the tables and figures.
We removed the obsolete R code that pertained to a part of the simulation study that is not (and never has been) part of the paper. As our R code has a single 'seed' at the beginning, removing the obsolete code affected the results of the 'random' sampling and generated slightly different results for Tables 1, 2, 3, and 4 (usually in the 3rd or 4th decimal, occasionally in the 2nd decimal). To make sure that the R code on OSF produces the exact values reported in the tables, we redid the simulation study using the R code without the obsolete part and report these results. This had no effect on the outcome of the study.
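As a minimal sketch of option (2): after downloading a results file from OSF, it can be read into R and inspected as shown below. The file name used here is only a placeholder; the actual file names are documented in the annotated code on OSF.

```r
# Option (2): read previously saved simulation results and rebuild the tables
# and figures from them. "results_table1.Rds" is a placeholder name, not
# necessarily the actual file name used on OSF (https://osf.io/jq69u/).
res <- readRDS("results_table1.Rds")   # hypothetical file name
str(res)                               # inspect the stored simulation output
```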