Topic Editors

Dr. Qiwei He
Data Science and Analytics Program, Graduate School of Arts and Science, Georgetown University, 3520 Prospect Street NW, Car Barn 207, Washington, DC 20057, USA
Dr. Yunxiao Chen
Department of Statistics, London School of Economics and Political Science, London WC2A 2AE, UK
Prof. Dr. Carolyn Jane Anderson
College of Education, University of Illinois, Champaign, IL 61820, USA

Psychometric Methods: Theory and Practice

Abstract submission deadline: closed (28 February 2023)
Manuscript submission deadline: closed (30 September 2023)
Viewed by: 25,440

Topic Information

Dear Colleagues,

Measurement and quantification are ubiquitous in modern society. Psychometrics has its historical foundation in the need to measure human abilities through suitable tests, and the discipline then underwent rapid conceptual growth through the incorporation of advanced mathematical and statistical methods. Today, psychometrics not only covers a wide range of statistical methods but also incorporates advanced techniques from machine learning and data mining that are useful for the behavioral and social sciences, including the handling of missing data, the combination of multiple-source information with measured data, measurement obtained from specialized experiments, the visualization of statistical outcomes, and measurement that discloses underlying problem-solving strategies. Psychometric methods are now applied across many disciplines, including education, psychology, the social sciences, behavioral genetics, neuropsychology, clinical psychology, medicine, and even the visual arts and music.

The rapid development of psychometric methods and the rigorous integration of psychometrics, data science, and artificial intelligence techniques in interdisciplinary fields have attracted significant attention and led to pressing discussions about the future of measurement.

The aim of this Special Topic is to gather studies on the latest developments in psychometric methods, covering a broad range of approaches from traditional statistical methods to advanced data-driven techniques, and to highlight discussions of different approaches (e.g., theory-driven vs. data-driven) to addressing challenges in psychometric theory and practice.

This Special Topic consists of two subtopics: (1) theory-driven psychometric methods that showcase advances in psychometric and statistical modeling for measurement and contribute to the development of psychological theory; and (2) data-driven computational methods that leverage new data sources and machine learning/data mining/artificial intelligence techniques to address new psychometric challenges.

For this Special Topic, we seek original empirical or methodological studies, thematic/conceptual review articles, and discussion and comment papers highlighting pressing topics related to psychometrics.

Interested authors should submit a letter of intent including (1) a working title for the manuscript, (2) names, affiliations, and contact information for all authors, and (3) an abstract of no more than 500 words detailing the content of the proposed manuscript to the topic editors.

There is a two-stage submission process. Initially, interested authors are requested to submit only abstracts of their proposed papers. Authors of the selected abstracts will then be invited to submit full papers. Please note that the invitation to submit does not guarantee acceptance/publication in the Special Topic. Invited manuscripts will be subject to the usual review standards of the participating journals, including a rigorous peer review process.

Dr. Qiwei He
Dr. Yunxiao Chen
Prof. Dr. Carolyn Jane Anderson
Topic Editors

Participating Journals

Journal Name (abbreviation) | Impact Factor | CiteScore | Launched Year | First Decision (median) | APC
Behavioral Sciences (behavsci) | 2.5 | 2.6 | 2011 | 27 Days | CHF 2200
Education Sciences (education) | 2.5 | 4.8 | 2011 | 26.8 Days | CHF 1800
Journal of Intelligence (jintelligence) | 2.8 | 2.8 | 2013 | 36.5 Days | CHF 2600

Preprints.org is a multidisciplinary platform providing a preprint service dedicated to sharing your research from the start and empowering your research journey.

MDPI Topics is cooperating with Preprints.org and has built a direct connection between MDPI journals and Preprints.org. Authors are encouraged to take advantage of these benefits by posting a preprint on Preprints.org prior to publication:

  1. Immediately share your ideas ahead of publication and establish your research priority;
  2. Protect your ideas with a time-stamped preprint record;
  3. Enhance the exposure and impact of your research;
  4. Receive feedback from your peers in advance;
  5. Have it indexed in Web of Science (Preprint Citation Index), Google Scholar, Crossref, SHARE, PrePubMed, Scilit and Europe PMC.

Published Papers (10 papers)

19 pages, 1958 KiB  
Article
Psychometric Modeling to Identify Examinees’ Strategy Differences during Testing
by Clifford E. Hauenstein, Susan E. Embretson and Eunbee Kim
J. Intell. 2024, 12(4), 40; https://doi.org/10.3390/jintelligence12040040 - 29 Mar 2024
Viewed by 1568
Abstract
Aptitude test scores are typically interpreted similarly for examinees with the same overall score. However, research has found evidence of examinee differences in strategies, as well as in the continued application of appropriate procedures during testing. Such differences can impact the correlates of test scores, making similar interpretations for equivalent scores questionable. This study presents some item response theory (IRT) models that are relevant to identifying examinee differences in strategies and understanding of test-taking procedures. First, mixture IRT models that identify latent classes of examinees with different patterns of item responses are considered; these models have long been available but unfortunately are not routinely applied. Strategy differences between the classes can then be studied separately by modeling the response patterns with cognitive complexity variables within each class. Secondly, novel psychometric approaches that leverage response time information (in particular, response time residuals) in order to identify both inter and intraindividual variability in response processes are considered. In doing so, a general method for evaluating threats to validity is proposed. The utility of the approach, in terms of providing more interpretable performance estimates and improving the administration of psychological measurement instruments, is then demonstrated with an empirical example. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
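For orientation, a mixture IRT model of the general kind referenced above can be written as follows; this is a generic illustration (a mixture 2PL with class-specific item parameters), not the authors' exact specification.

    % Examinee i falls in latent class g with probability \pi_g; item parameters are class-specific.
    P(X_{ij} = 1 \mid \theta_i) \;=\; \sum_{g=1}^{G} \pi_g\,
      \frac{\exp\!\big[a_{jg}(\theta_i - b_{jg})\big]}{1 + \exp\!\big[a_{jg}(\theta_i - b_{jg})\big]},
    \qquad \sum_{g=1}^{G} \pi_g = 1.

Strategy differences between classes can then be studied by modeling the class-specific item parameters with cognitive complexity variables, as the abstract describes.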

17 pages, 1604 KiB  
Article
Explanatory Cognitive Diagnosis Models Incorporating Item Features
by Manqian Liao, Hong Jiao and Qiwei He
J. Intell. 2024, 12(3), 32; https://doi.org/10.3390/jintelligence12030032 - 11 Mar 2024
Viewed by 1754
Abstract
Item quality is crucial to psychometric analyses for cognitive diagnosis. In cognitive diagnosis models (CDMs), item quality is often quantified in terms of item parameters (e.g., guessing and slipping parameters). Calibrating the item parameters with only item response data, as a common practice, could result in challenges in identifying the cause of low-quality items (e.g., the correct answer is easy to guess) or devising an effective plan to improve the item quality. To resolve these challenges, we propose the item explanatory CDMs, where the CDM item parameters are explained with item features such that item features can serve as an additional source of information for item parameters. The utility of the proposed models is demonstrated with the Trends in International Mathematics and Science Study (TIMSS)-released items and response data: around 20 item linguistic features were extracted from the item stem with natural language processing techniques, and the item feature engineering process is elaborated in the paper. The proposed models are used to examine the relationships between the guessing/slipping item parameters of the higher-order DINA model and eight of the item features. The findings from a follow-up simulation study are presented, which corroborate the validity of the inferences drawn from the empirical data analysis. Finally, future research directions are discussed. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
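As a rough sketch of the modeling idea, the DINA response function and one possible way of regressing its guessing parameter on item features are shown below; the logit link and feature notation are illustrative assumptions rather than the authors' exact parameterization.

    % DINA model: \eta_{ij} indicates whether examinee i masters all attributes required by item j
    % (Q-matrix entries q_{jk}); g_j and s_j are the guessing and slipping parameters.
    P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) \;=\; g_j^{\,1-\eta_{ij}}\,(1 - s_j)^{\eta_{ij}},
    \qquad \eta_{ij} \;=\; \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}}.

    % Illustrative explanatory extension: item features z_{jm} (e.g., linguistic features of the stem)
    % explain the guessing parameter; an analogous equation could be written for slipping.
    \operatorname{logit}(g_j) \;=\; \beta_0 + \sum_{m=1}^{M} \beta_m z_{jm}.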

17 pages, 2377 KiB  
Article
A Comparative Study of Item Response Theory Models for Mixed Discrete-Continuous Responses
by Cengiz Zopluoglu and J. R. Lockwood
J. Intell. 2024, 12(3), 26; https://doi.org/10.3390/jintelligence12030026 - 25 Feb 2024
Viewed by 1994
Abstract
Language proficiency assessments are pivotal in educational and professional decision-making. With the integration of AI-driven technologies, these assessments can more frequently use item types, such as dictation tasks, producing response features with a mixture of discrete and continuous distributions. This study evaluates novel measurement models tailored to these unique response features. Specifically, we evaluated the performance of the zero-and-one-inflated extensions of the Beta, Simplex, and Samejima’s Continuous item response models and incorporated collateral information into the estimation using latent regression. Our findings highlight that while all models provided highly correlated results regarding item and person parameters, the Beta item response model showcased superior out-of-sample predictive accuracy. However, a significant challenge was the absence of established benchmarks for evaluating model and item fit for these novel item response models. There is a need for further research to establish benchmarks for evaluating the fit of these innovative models to ensure their reliability and validity in real-world applications. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
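A zero-and-one-inflated Beta item response model of the general type compared in this study can be sketched as follows; the mean/precision parameterization and the 2PL-style link are illustrative assumptions.

    % Continuous response Y_{ij} in [0, 1] with point masses at the endpoints:
    Y_{ij} =
    \begin{cases}
      0 & \text{with probability } \pi_{0j},\\
      1 & \text{with probability } \pi_{1j},\\
      \mathrm{Beta}\big(\mu_{ij}\phi_j,\ (1-\mu_{ij})\phi_j\big) & \text{otherwise},
    \end{cases}
    \qquad \operatorname{logit}(\mu_{ij}) = a_j(\theta_i - b_j),

    % where \phi_j is an item precision parameter; latent regression incorporates collateral
    % information through the distribution of \theta_i, e.g. \theta_i \sim N(\mathbf{x}_i^{\top}\boldsymbol{\gamma}, \sigma^2).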

24 pages, 4874 KiB  
Article
Conditional Dependence across Slow and Fast Item Responses: With a Latent Space Item Response Modeling Approach
by Nana Kim, Minjeong Jeon and Ivailo Partchev
J. Intell. 2024, 12(2), 23; https://doi.org/10.3390/jintelligence12020023 - 16 Feb 2024
Viewed by 1682
Abstract
There recently have been many studies examining conditional dependence between response accuracy and response times in cognitive tests. While most previous research has focused on revealing a general pattern of conditional dependence for all respondents and items, it is plausible that the pattern may vary across respondents and items. In this paper, we attend to its potential heterogeneity and examine the item and person specificities involved in the conditional dependence between item responses and response times. To this end, we use a latent space item response theory (LSIRT) approach with an interaction map that visualizes conditional dependence in response data in the form of item–respondent interactions. We incorporate response time information into the interaction map by applying LSIRT models to slow and fast item responses. Through empirical illustrations with three cognitive test datasets, we confirm the presence and patterns of conditional dependence between item responses and response times, a result consistent with previous studies. Our results further illustrate the heterogeneity in the conditional dependence across respondents, which provides insights into understanding individuals’ underlying item-solving processes in cognitive tests. Some practical implications of the results and the use of interaction maps in cognitive tests are discussed. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
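The latent space item response model behind the interaction map can be sketched, in a simplified Rasch-type form, as follows; the distance term is what captures item-respondent interactions beyond the main person and item effects.

    % Respondents and items receive positions z_i and w_j in a low-dimensional interaction map;
    % a larger distance lowers the probability of a correct response.
    \operatorname{logit}\, P(X_{ij} = 1 \mid \theta_i, b_j, z_i, w_j)
      \;=\; \theta_i - b_j - \gamma\,\lVert z_i - w_j \rVert.

Fitting such a model separately to slow and fast item responses, as the abstract describes, yields two interaction maps whose comparison reveals respondent- and item-specific conditional dependence.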

32 pages, 24122 KiB  
Article
Biclustering of Log Data: Insights from a Computer-Based Complex Problem Solving Assessment
by Xin Xu, Susu Zhang, Jinxin Guo and Tao Xin
J. Intell. 2024, 12(1), 10; https://doi.org/10.3390/jintelligence12010010 - 17 Jan 2024
Viewed by 2016
Abstract
Computer-based assessments provide the opportunity to collect a new source of behavioral data related to the problem-solving process, known as log file data. To understand the behavioral patterns that can be uncovered from these process data, many studies have employed clustering methods. In contrast to one-mode clustering algorithms, this study utilized biclustering methods, enabling simultaneous classification of test takers and features extracted from log files. By applying the biclustering algorithms to the “Ticket” task in the PISA 2012 CPS assessment, we evaluated the potential of biclustering algorithms in identifying and interpreting homogeneous biclusters from the process data. Compared with one-mode clustering algorithms, the biclustering methods could uncover clusters of individuals who are homogeneous on a subset of feature variables, holding promise for gaining fine-grained insights into students’ problem-solving behavior patterns. Empirical results revealed that specific subsets of features played a crucial role in identifying biclusters. Additionally, the study explored the utilization of biclustering on both the action sequence data and timing data, and the inclusion of time-based features enhanced the understanding of students’ action sequences and scores in the context of the analysis. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
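To make the two-mode clustering idea concrete, here is a minimal, self-contained sketch using scikit-learn's spectral co-clustering on a synthetic test-taker-by-feature matrix; the paper does not specify this particular algorithm or these data, so treat everything below as illustrative.

    # Illustrative biclustering of a (test-taker x log-file-feature) matrix.
    # Synthetic data only; the PISA 2012 "Ticket" features are not reproduced here.
    import numpy as np
    from sklearn.cluster import SpectralCoclustering

    rng = np.random.default_rng(0)
    n_takers, n_features = 200, 12
    X = rng.poisson(lam=2.0, size=(n_takers, n_features)).astype(float)

    # Plant one block of elevated counts so a bicluster exists to be found.
    X[:50, :4] += rng.poisson(lam=6.0, size=(50, 4))

    model = SpectralCoclustering(n_clusters=3, random_state=0)
    model.fit(X)

    # Each test taker and each feature is assigned to exactly one bicluster,
    # so clusters are homogeneous only on a subset of the feature variables.
    print("row (test-taker) cluster sizes:", np.bincount(model.row_labels_))
    print("column (feature) labels:", model.column_labels_)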

22 pages, 548 KiB  
Article
Modeling Sequential Dependencies in Progressive Matrices: An Auto-Regressive Item Response Theory (AR-IRT) Approach
by Nils Myszkowski and Martin Storme
J. Intell. 2024, 12(1), 7; https://doi.org/10.3390/jintelligence12010007 - 15 Jan 2024
Viewed by 1968
Abstract
Measurement models traditionally make the assumption that item responses are independent from one another, conditional upon the common factor. They typically explore for violations of this assumption using various methods, but rarely do they account for the possibility that an item predicts the next. Extending the development of auto-regressive models in the context of personality and judgment tests, we propose to extend binary item response models—using, as an example, the 2-parameter logistic (2PL) model—to include auto-regressive sequential dependencies. We motivate such models and illustrate them in the context of a publicly available progressive matrices dataset. We find an auto-regressive lag-1 2PL model to outperform a traditional 2PL model in fit as well as to provide more conservative discrimination parameters and standard errors. We conclude that sequential effects are likely overlooked in the context of cognitive ability testing in general and progressive matrices tests in particular. We discuss extensions, notably models with multiple lag effects and variable lag effects. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
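One way to write the lag-1 auto-regressive extension of the 2PL described above is sketched below; whether the lag coefficient is constant or varies across items is a modeling choice, and this form is only illustrative.

    % Standard 2PL plus a lag-1 term: the previous response x_{i,j-1} shifts the log-odds on item j.
    \operatorname{logit}\, P(X_{ij} = 1 \mid \theta_i, x_{i,j-1})
      \;=\; a_j(\theta_i - b_j) + \lambda\, x_{i,j-1}.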

23 pages, 5897 KiB  
Article
Using IRTree Models to Promote Selection Validity in the Presence of Extreme Response Styles
by Victoria L. Quirk and Justin L. Kern
J. Intell. 2023, 11(11), 216; https://doi.org/10.3390/jintelligence11110216 - 17 Nov 2023
Cited by 1 | Viewed by 1833
Abstract
The measurement of psychological constructs is frequently based on self-report tests, which often have Likert-type items rated from “Strongly Disagree” to “Strongly Agree”. Recently, a family of item response theory (IRT) models called IRTree models has emerged that can parse out content traits (e.g., personality traits) from noise traits (e.g., response styles). In this study, we compare the selection validity and adverse impact consequences of noise traits on selection when scores are estimated using a generalized partial credit model (GPCM) or an IRTree model. First, we present a simulation which demonstrates that when noise traits do exist, the selection decisions based on the IRTree model estimated scores have higher accuracy rates and fewer instances of adverse impact based on extreme response style group membership when compared to the GPCM. Both models performed similarly when there was no influence of noise traits on the responses. Second, we present an application using data collected from the Open-Source Psychometrics Project Fisher Temperament Inventory dataset. We found that the IRTree model had a better fit, but a high agreement rate between the model decisions resulted in virtually identical impact ratios between the models. We offer considerations for applications of the IRTree model and future directions for research. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
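A common three-node IRTree decomposition of a five-point Likert response, in which separate traits govern midpoint selection, direction of agreement, and extremity, is sketched below; the tree used by the authors may differ, so this is only an orientation.

    % Pseudo-items: M (midpoint vs. not), A (agree vs. disagree), E (extreme vs. moderate).
    % Category mapping: 1 = (M=0, A=0, E=1), 2 = (M=0, A=0, E=0), 3 = (M=1),
    %                   4 = (M=0, A=1, E=0), 5 = (M=0, A=1, E=1).
    % Each node gets its own trait and a 2PL-type model, with \sigma the logistic function:
    P(M_{ij}=1) = \sigma\!\big(a_j^{M}(\theta_i^{M} - b_j^{M})\big), \quad
    P(A_{ij}=1) = \sigma\!\big(a_j^{A}(\theta_i^{A} - b_j^{A})\big), \quad
    P(E_{ij}=1) = \sigma\!\big(a_j^{E}(\theta_i^{E} - b_j^{E})\big),

    % and the probability of an observed category is the product of its node probabilities.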

17 pages, 660 KiB  
Article
Estimating the Multidimensional Generalized Graded Unfolding Model with Covariates Using a Bayesian Approach
by Naidan Tu, Bo Zhang, Lawrence Angrave, Tianjun Sun and Mathew Neuman
J. Intell. 2023, 11(8), 163; https://doi.org/10.3390/jintelligence11080163 - 14 Aug 2023
Cited by 5 | Viewed by 1533
Abstract
Noncognitive constructs are commonly assessed in educational and organizational research. They are often measured by summing scores across items, which implicitly assumes a dominance item response process. However, research has shown that the unfolding response process may better characterize how people respond to noncognitive items. The Generalized Graded Unfolding Model (GGUM) representing the unfolding response process has therefore become increasingly popular. However, the current implementation of the GGUM is limited to unidimensional cases, while most noncognitive constructs are multidimensional. Fitting a unidimensional GGUM separately for each dimension and ignoring the multidimensional nature of noncognitive data may result in suboptimal parameter estimation. Recently, an R package, bmggum, was developed that enables the estimation of the Multidimensional Generalized Graded Unfolding Model (MGGUM) with covariates using a Bayesian algorithm. However, no simulation evidence is available to support the accuracy of the Bayesian algorithm implemented in bmggum. In this research, two simulation studies were conducted to examine the performance of bmggum. Results showed that bmggum can estimate MGGUM parameters accurately, and that multidimensional estimation and incorporating relevant covariates into the estimation process improved estimation accuracy. The effectiveness of two Bayesian model selection indices, WAIC and LOO, was also investigated and found to be satisfactory for model selection. Empirical data were used to demonstrate the use of bmggum, and its performance was compared with three other GGUM software programs: GGUM2004, GGUM, and mirt. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
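For reference, the unidimensional GGUM that bmggum generalizes is commonly written as below (e.g., Roberts et al., 2000); the multidimensional, covariate-augmented version estimated by bmggum builds on this single-peaked response function.

    % Person j, item i; z = 0, ..., C observable categories; M = 2C + 1; \tau_{i0} = 0.
    % The response probability peaks where the person location \theta_j is near the item location \delta_i.
    P(Z_i = z \mid \theta_j) =
      \frac{\exp\!\big[\alpha_i\big(z(\theta_j-\delta_i)-\sum_{k=0}^{z}\tau_{ik}\big)\big]
            + \exp\!\big[\alpha_i\big((M-z)(\theta_j-\delta_i)-\sum_{k=0}^{z}\tau_{ik}\big)\big]}
           {\sum_{w=0}^{C}\Big(\exp\!\big[\alpha_i\big(w(\theta_j-\delta_i)-\sum_{k=0}^{w}\tau_{ik}\big)\big]
            + \exp\!\big[\alpha_i\big((M-w)(\theta_j-\delta_i)-\sum_{k=0}^{w}\tau_{ik}\big)\big]\Big)}.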

21 pages, 5191 KiB  
Article
Deterministic Input, Noisy Mixed Modeling for Identifying Coexisting Condensation Rules in Cognitive Diagnostic Assessments
by Peida Zhan
J. Intell. 2023, 11(3), 55; https://doi.org/10.3390/jintelligence11030055 - 16 Mar 2023
Cited by 1 | Viewed by 1701
Abstract
In cognitive diagnosis models, the condensation rule describes the logical relationship between the required attributes and the item response, reflecting an explicit assumption about respondents’ cognitive processes to solve problems. Multiple condensation rules may apply to an item simultaneously, indicating that respondents should use multiple cognitive processes with different weights to identify the correct response. Coexisting condensation rules reflect the complexity of cognitive processes utilized in problem solving and the fact that respondents’ cognitive processes in determining item responses may be inconsistent with the expert-designed condensation rule. This study evaluated the proposed deterministic input, noisy mixed (DINMix) model to identify coexisting condensation rules and provide feedback for item revision to increase the validity of the measurement of cognitive processes. Two simulation studies were conducted to evaluate the psychometric properties of the proposed model. The simulation results indicate that the DINMix model can adaptively and accurately identify coexisting condensation rules, existing either simultaneously in an item or separately in multiple items. An empirical example was also analyzed to illustrate the applicability and advantages of the proposed model. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
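The two classical condensation rules that can coexist on an item are the conjunctive (DINA-type) and disjunctive (DINO-type) rules below; the weighted combination in the last line is only an illustration of the "coexisting rules with different weights" idea, not the DINMix model's actual formulation.

    % Conjunctive rule (all required attributes needed) vs. disjunctive rule (any one suffices):
    \eta_{ij}^{\mathrm{AND}} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}}, \qquad
    \eta_{ij}^{\mathrm{OR}} = 1 - \prod_{k=1}^{K} (1 - \alpha_{ik})^{\,q_{jk}}.

    % Illustrative weighted combination with item-specific weight w_j \in [0, 1]:
    \tilde{\eta}_{ij} = w_j\,\eta_{ij}^{\mathrm{AND}} + (1 - w_j)\,\eta_{ij}^{\mathrm{OR}},
    \qquad P(X_{ij} = 1 \mid \boldsymbol{\alpha}_i) = g_j^{\,1-\tilde{\eta}_{ij}}\,(1 - s_j)^{\tilde{\eta}_{ij}}.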

19 pages, 501 KiB  
Article
Is Distributed Leadership Universal? A Cross-Cultural, Comparative Approach across 40 Countries: An Alignment Optimisation Approach
by Nurullah Eryilmaz and Andres Sandoval-Hernandez
Educ. Sci. 2023, 13(2), 218; https://doi.org/10.3390/educsci13020218 - 20 Feb 2023
Cited by 4 | Viewed by 3633
Abstract
Distributed leadership (DL) is defined as the degree of contact and involvement of various people in making choices or carrying out responsibilities, and is an increasingly used concept among researchers, policymakers, and educationalists worldwide. However, few studies have investigated the cross-cultural comparability of the distributed leadership scale for school principals, and few have ranked countries according to their levels of distributed leadership. This study employs an innovative alignment optimisation approach to compare the latent means of distributed leadership, as perceived by school principals, across 40 countries, using data from the OECD Teaching and Learning International Survey (TALIS, 2018). We found that South Korea, Colombia, Shanghai (China), and Lithuania had the highest levels of distributed leadership in school decisions, from the perspective of school principals. In contrast, the Netherlands, Belgium, Argentina, and Japan had the lowest levels. Our findings may serve as guidance for education stakeholders on which nations they could learn from to enhance distributed leadership among school principals. Full article
(This article belongs to the Topic Psychometric Methods: Theory and Practice)
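Alignment optimisation, as usually described (e.g., Asparouhov and Muthén's approach), frees the group factor means and variances and chooses them to minimize a total non-invariance loss roughly of the form below; the exact weights and component loss used in the paper may differ.

    % Schematic alignment loss over loadings \lambda_{pg} and intercepts \nu_{pg} across group pairs;
    % f is a component loss that favors a few large differences over many small ones,
    % e.g. f(x) = \sqrt{\sqrt{x^2 + \epsilon}} with a small \epsilon.
    F \;=\; \sum_{p} \sum_{g < g'} w_{g,g'}\, f\!\big(\lambda_{pg} - \lambda_{pg'}\big)
        \;+\; \sum_{p} \sum_{g < g'} w_{g,g'}\, f\!\big(\nu_{pg} - \nu_{pg'}\big).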

Planned Papers

The list below contains planned manuscripts only; some have not yet been received by the Editorial Office. Papers submitted to MDPI journals are subject to peer review.

Title: Psychometric Modeling to Identify Examinee Strategy Differences Over the Course of Testing
Authors: Susan Embretson1; Clifford E. Hauenstein2
Affiliation: 1Georgia Institute of Technology; 2Johns Hopkins University
Abstract: Aptitude test scores are typically interpreted similarly for examinees with the same overall score. However, research has found evidence of strategy differences between examinees, as well as differences in examinees’ application of appropriate procedures over the course of testing. Research has shown that strategy differences can impact the correlates of test scores. Hence, the relevance of test interpretations for equivalent scores can be questionable. The purpose of this study is to present several item response theory (IRT) models that are relevant to identifying examinee differences in strategies and understanding of test-taking procedures. First, mixture item response theory models identify latent clusters of examinees with different patterns of item responses. Early mixture IRT models (e.g., Rost & von Davier, 1995; Mislevy & Wilson, 1996) identify latent classes differing in patterns of item difficulty. More recently, item response times have been combined with item accuracy in joint IRT models to identify latent clusters of examinees with different response patterns. Although mixture IRT models have long been available, they are not routinely applied. Second, more recent IRT-based models can also identify strategy shifts over the course of testing (e.g., de Boeck & Jeon, 2019; Hauenstein & Embretson, 2022; Molenaar & de Boeck, 2018); that is, within-person differences in item-specific strategies are identified. In this study, the relevant IRT models will be illustrated on tests measuring various aspects of intelligence, including items on non-verbal reasoning, spatial ability, and mathematical problem solving.

Title: Investigating Pre-knowledge and Speed Effects in an IRTree Modeling Framework
Authors: Justin L. Kern; Hahyeong Kim
Affiliation: University of Illinois at Urbana-Champaign
Abstract: Pre-knowledge in testing refers to the situation in which examinees have gained access to exam questions or answers prior to taking an exam. The items the examinees have been exposed to in this way are called compromised items. The exposure of examinees to compromised items can result in an artificial boost in exam scores, jeopardizing test validity and reliability, test security, and test fairness. Furthermore, it has been argued that pre-knowledge may result in quicker responses. A better understanding of the effects of pre-knowledge can help test creators and psychometricians overcome the problems pre-knowledge can cause. There is a growing literature in psychometrics on pre-knowledge, primarily focused on the detection of person pre-knowledge; however, the majority of this work has used data where it is unknown whether a person has had prior exposure to items. This research aims to explore the effects of pre-knowledge with experimentally obtained data using the Revised Purdue Spatial Visualization Test (PSVT:R). To collect these data, we carried out an online experiment manipulating pre-knowledge levels amongst groups of participants by exposing a varying number of compromised items to participants in a practice session prior to test administration. Recently, there has also been a growing modeling paradigm using tree-based item response theory models, called IRTree models, to embed cognitive theories into a model for responding to test items. One such application examined the role of speed on intelligence tests, positing differentiated fast and slow test-taking processes (DiTrapani et al., 2016). To investigate this, the authors proposed a two-level IRTree model with the first level controlled by speed (i.e., is the item answered quickly or slowly?) and the second level controlled by an intelligence trait. This approach allows for separate parameters at the second level depending upon whether the responses were fast or slow; these can be separate item parameters, person parameters, or both. Building on this literature, we are interested in determining whether and how item pre-knowledge impacts item properties. The effects to be studied include 1) whether pre-knowledge impacts the first-level IRTree parameters, affecting response time; 2) whether pre-knowledge impacts the second-level IRTree parameters, affecting response accuracy; and 3) whether the first-level response (i.e., fast or slow) impacts the second-level IRTree parameters. In all cases, an interesting sub-question is whether any of these effects are constant across items. Estimation of the models will be done using the mirt package in R. To determine the efficacy of the IRTree modeling approach for answering these questions, a simulation study will be run under various conditions. Factors to be included are sample size, effect size, and model. The outcomes will include empirical Type I error and power rates. The approach will then be applied to the collected pre-knowledge data.
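A two-level IRTree of the kind described in this abstract (a speed node followed by accuracy nodes with branch-specific parameters) can be sketched as follows; the parameter labels are illustrative, not the authors' final specification.

    % Level 1: is item j answered quickly or slowly (speed node with trait \eta_i)?
    P(F_{ij} = 1 \mid \eta_i) = \sigma\!\big(a_j^{S}(\eta_i - b_j^{S})\big).

    % Level 2: accuracy, with item and/or person parameters allowed to differ by branch:
    P(X_{ij} = 1 \mid \theta_i, F_{ij} = f) = \sigma\!\big(a_{jf}(\theta_i - b_{jf})\big),
    \qquad f \in \{\text{fast}, \text{slow}\}.

Pre-knowledge effects would then appear as shifts in the level-1 parameters (faster responses to compromised items), in the level-2 parameters (more accurate responses), or in both, which is what the three study questions above examine.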

Title: Bayesian Monte Carlo Simulation Studies in Psychometrics: Practice and Implications
Authors: Allison J. Ames; Brian C. Leventhal; Nnamdi C. Ezike; Kathryn S. Thompson
Affiliation: Amazon
Abstract: Data simulation and Monte Carlo simulation studies (MCSS) are important skills for researchers and practitioners of educational and psychological measurement. Harwell et al. (1996) and Feinberg and Rubright (2016) outline an eight-step process for MCSS: (1) specifying the research question(s); (2) defining and justifying conditions; (3) specifying the experimental design and outcome(s) of interest; (4) simulating data under the specified conditions; (5) estimating parameters; (6) comparing true and estimated parameters; (7) replicating the procedure a specified number of times; and (8) analyzing results based on the design and research questions. There are a few didactic resources for psychometric MCSS (e.g., Leventhal & Ames, 2020) and software demonstrations. For example, Ames et al. (2020) demonstrate how to operationalize the eight steps for IRT using SAS software, and Feinberg and Rubright (2016) demonstrate similar concepts in R. Despite these resources, there is no current accounting of MCSS practice in psychometrics. For example, there are no resources that describe the typical number of replications for MCSS (step 7), or whether this varies by outcome of interest (step 3) or number of conditions (step 2). Further, there are no resources describing how Bayesian MCSS differ from frequentist MCSS. To understand the current practice of MCSS and provide a resource for researchers using MCSS, we reviewed six journals focusing on educational and psychological measurement from 2015-2019. This review examined a total of 1004 journal articles. Across all published manuscripts in those six journals, 55.8% contained an MCSS (n=560), of which 18.8% contained Bayesian simulations (n=105). Full results of the review will be presented in the manuscript. Because there is little guidance for Bayesian MCSS, the practice of Bayesian MCSS often utilizes frequentist techniques. This fails, in our opinion, to leverage the benefits of Bayesian methodology. We examined the outcomes of interest in frequentist and Bayesian MCSS. One trend that emerged from our review is the use of Bayesian posterior point estimates alone, disregarding other aspects of the posterior distribution. Specifically, while 58.72% examined some form of bias (e.g., absolute, relative), relying upon a posterior point estimate, only 10.09% examined coverage rates, defined as the proportion of times the true (generating) value was covered by a specified posterior interval. To address the gap in information specific to Bayesian MCSS, this study focuses on current practice and Bayesian-specific decisions within the MCSS steps. Related to current practice, we ask the following: 1) What are the current practices in psychometric Bayesian MCSS across six journals during a five-year period? 2) How are the philosophical differences between frequentist and Bayesian practice operationalized in MCSS? 3) What overlap exists between the practice of MCSS in the Bayesian and frequentist frameworks? Regarding Bayesian decisions in MCSS, we ask: 4) What are the implications of differing decisions across the eight steps on common MCSS types (e.g., parameter recovery)?
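Since coverage rate is one of the under-reported Bayesian outcomes this abstract highlights, a minimal sketch of computing it over replications is shown below; this is a generic example with a normal mean and a closed-form stand-in posterior, not tied to any IRT model in the review.

    # Coverage rate of a 95% credible interval across Monte Carlo replications:
    # the proportion of replications whose interval contains the true generating value.
    import numpy as np

    rng = np.random.default_rng(42)
    true_mu, n_obs, n_reps = 0.5, 50, 1000
    covered = 0

    for _ in range(n_reps):
        y = rng.normal(loc=true_mu, scale=1.0, size=n_obs)
        # Stand-in for a posterior: with a flat prior and known unit variance, the posterior
        # for the mean is Normal(ybar, 1/n). Real studies would use MCMC draws instead.
        post_draws = rng.normal(loc=y.mean(), scale=1.0 / np.sqrt(n_obs), size=4000)
        lo, hi = np.quantile(post_draws, [0.025, 0.975])
        covered += int(lo <= true_mu <= hi)

    print(f"Empirical coverage of the 95% interval: {covered / n_reps:.3f}")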

Title: Using keystroke log data to detect non-genuine behaviors in writing assessment: A subgroup analysis
Authors: Yang Jiang; Mo Zhang; Jiangang Hao; Paul Deane
Affiliation: Educational Testing Service
Abstract: In this paper, we will explore the use of keystroke logs (recordings of every keypress) in detecting non-genuine writing behaviors in writing assessment, with a particular focus on fairness issues across different demographic subgroups. When writing assessments are delivered online and remotely, meaning the tests can be taken anywhere outside of a well-proctored and monitored testing center, threats to test security arise accordingly. While writing assessments usually require candidates to produce original text in response to a prompt, there are many possible ways to cheat, especially in at-home testing. For example, candidates may hire an imposter to write responses for them; they may memorize some concealed script or general shell text and simply apply it to whatever prompt they receive; or they may copy text, entirely or partially, directly from other sources. Therefore, predicting non-genuine writing behaviors/texts is of great interest to test developers and administrators. Deane et al. (2022) reported that, using keystroke log patterns, various machine learning prediction models produced an overall prediction accuracy between .85 and .90, with ROC analysis indicating a true positive rate of around 80% and a false negative rate of roughly 10%. In this paper, we plan to apply similar machine learning methods to predicting non-genuine writing but, in addition to prediction accuracy, we will focus on subgroup invariance. It is an important validity concern that non-genuine writing be predicted equally well across different demographic groups (e.g., race, gender, country). We will use a large-scale operational data set for this exploration.
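As a sketch of the kind of subgroup-invariance check described above (a generic illustration; the actual features, model, and groups are not reproduced here), per-group true-positive and false-positive rates of a classifier's non-genuine-writing flags can be compared as follows.

    # Compare classifier error rates across demographic subgroups.
    # y_true: 1 = non-genuine writing, 0 = genuine; y_pred: model flags; group: subgroup label.
    import numpy as np

    def subgroup_rates(y_true, y_pred, group):
        """Return {group: (true positive rate, false positive rate)}."""
        y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
        rates = {}
        for g in np.unique(group):
            m = group == g
            pos, neg = (y_true[m] == 1), (y_true[m] == 0)
            tpr = y_pred[m][pos].mean() if pos.any() else np.nan
            fpr = y_pred[m][neg].mean() if neg.any() else np.nan
            rates[g] = (tpr, fpr)
        return rates

    # Toy usage with synthetic labels and predictions (~85% accurate flags).
    rng = np.random.default_rng(7)
    y_true = rng.integers(0, 2, size=500)
    y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)
    group = rng.choice(["A", "B"], size=500)
    print(subgroup_rates(y_true, y_pred, group))

Large gaps in these rates between subgroups would signal the kind of invariance problem the study plans to examine.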
